file processing : index and hash 2015, spring pusan national university ki-joune li
File Processing : Index and Hash
2015, Spring
Pusan National University
Ki-Joune Li
STEMPNU
What is index ?
Index in a book
  Index : Keyword → Pages
  Without Index : Exhaustive search : Too Expensive
Index for a file or database
  A function or mechanism
  $F_{Index} : S_{Predicate} \rightarrow B$ (block numbers on hard disk)
  e.g. find student records where student.GPA > 4.0
Data Retrieval Time
Data retrieval on disk : Two phases
  1st phase : Search with a condition (Predicate)
  2nd phase : Data access
[Figure: Search Condition → Search (1st Phase) → { Block# } → Database on Disk (2nd Phase)]
Data Access Time depends on
  File Structure
  Disk Placement
  Clustering, etc.
Blocking Factor Bf
Blocking Factor : Number of Records in a Block
Blocking Factor and Number of Disk Accesses : $N_D = N_{record} / B_f$
By maximizing blocking factor, we reduce the number of disk accesses
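
The relation between blocking factor and disk accesses can be checked with a few lines of Python. This is only a sketch: the function name `disk_accesses` and the sample numbers are hypothetical, and rounding up to whole blocks is an assumption added here (the slide gives the plain ratio).

```python
import math

def disk_accesses(n_records: int, blocking_factor: int) -> int:
    """Blocks to read for a full scan: N_record / B_f, rounded up (assumption)."""
    return math.ceil(n_records / blocking_factor)

# Same file, larger blocking factor: fewer block reads.
print(disk_accesses(10_000, 10))   # 1000
print(disk_accesses(10_000, 100))  # 100
```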
How to Accelerate Phase 1 ?
Of course, we could accelerate phase 1 by index or by hash
Index vs. Hash
  Index : a type of data structure
    Needs additional data structures
  Hash : a type of mechanism
    May not need any additional data structure (not exactly true)
A Simple Idea on Index
Mapping Table from keywords to block numbers : Inverted File
Why inverted file is better than nothing ?
If the table is too large (to fit in main memory)
  It has to be stored on disk
  Disk Access for Index Access

Keyword   Block#
Romeo     B26
Hamlet    B22
…         …
Carmen    B212

Example search key : Juliet
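
A minimal in-memory sketch of the mapping table, assuming it fits in main memory. The names `inverted_file` and `lookup` are hypothetical; only the three rows visible in the table are used, and the elided rows are left out.

```python
# Keyword -> block number, taken from the rows shown in the table.
inverted_file = {
    "Romeo": "B26",
    "Hamlet": "B22",
    "Carmen": "B212",
}

def lookup(keyword: str):
    """Return the block number for a keyword, or None if it is not indexed."""
    return inverted_file.get(keyword)

print(lookup("Hamlet"))  # B22
print(lookup("Juliet"))  # None (not among the rows listed here)
```

One lookup in the table replaces an exhaustive scan of the data blocks. Once the table itself no longer fits in memory, each lookup costs disk accesses too, which motivates the tree organization.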
Searching Algorithms and Index
A good way to accelerate searching
  Tree : O( log n )
  Reorganize Inverted File to Tree
  Binary Search Tree : Branching Factor = 2
Tree in memory space vs. in disk space
  Memory space : Number of Comparisons
  Disk space : Number of Block Accesses
[Figure: binary search tree of (key, block number) entries: root (30, b27); children (14, b17) and (40, b26); below (40, b26): (34, b17) and (55, b26)]
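
The tree in the figure can be sketched as a plain binary search tree whose nodes carry a block number. The class and function names are hypothetical, and the insertion order is chosen here so that the resulting shape matches the figure. `search` also reports how many nodes it visited, which is the cost that matters when every node sits in its own disk block.

```python
class BSTNode:
    def __init__(self, key, block):
        self.key, self.block = key, block
        self.left = self.right = None

def insert(root, key, block):
    """Standard BST insertion; returns the (possibly new) root."""
    if root is None:
        return BSTNode(key, block)
    if key < root.key:
        root.left = insert(root.left, key, block)
    else:
        root.right = insert(root.right, key, block)
    return root

def search(root, key):
    """Return (block number or None, number of nodes visited)."""
    visited, node = 0, root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.block, visited
        node = node.left if key < node.key else node.right
    return None, visited

root = None
for key, block in [(30, "b27"), (14, "b17"), (40, "b26"), (34, "b17"), (55, "b26")]:
    root = insert(root, key, block)

print(search(root, 55))  # ('b26', 3)
print(search(root, 20))  # (None, 2)
```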
Paged Tree : m-way search tree
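[Figure: root node 57, b2734 | 103, b28 | … | 343, b14 with child nodes 1, b2944 | … | 54, b21 and 58, b1732 | … | 96, b127; each node holds the Number of delimiters followed by (Delimiter, Block number) entries]
How to determine m ?
  One Node : One Disk Page
  e.g. When 1 disk page is 4 K bytes
    4 + 4m + 8(m-1) ≤ 4096, and the largest such integer is m = 341
Very fat tree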
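
The arithmetic for m can be reproduced as follows. This is a sketch: `max_fanout` is a hypothetical name, and the byte sizes (a 4-byte header, 4 bytes for each of the m entries, 8 bytes for each of the m-1 entries) are simply read off the expression 4 + 4m + 8(m-1).

```python
def max_fanout(page_bytes: int, header: int = 4, per_m: int = 4, per_m_minus_1: int = 8) -> int:
    """Largest m with header + per_m*m + per_m_minus_1*(m-1) <= page_bytes."""
    # header + per_m*m + per_m_minus_1*m - per_m_minus_1 <= page_bytes
    return (page_bytes - header + per_m_minus_1) // (per_m + per_m_minus_1)

m = max_fanout(4096)
print(m)                         # 341
print(4 + 4 * m + 8 * (m - 1))   # 4088 bytes used, fits in 4096
```

With m = 342 the node would need 4100 bytes, so 341 is the largest fanout that fits in one 4 KB page.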
Problem of m-Way search tree
m-way search tree
  Search Performance : determined by the height
  Not balanced
    Average : O(log n)
    Worst case : n / Bf, i.e. O(n)
    Height : determined by insertion order
    e.g. insertion by ascending order
How to make it balanced ?
  Balanced m-Way search tree : B-tree
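
The effect of insertion order on height can be seen with the simplest case, m = 2 (a binary search tree without rebalancing). This is an illustrative sketch; the list-based node layout and the function names are hypothetical.

```python
def insert(tree, key):
    """tree is None or a list [key, left, right]; no rebalancing is done."""
    if tree is None:
        return [key, None, None]
    side = 1 if key < tree[0] else 2
    tree[side] = insert(tree[side], key)
    return tree

def height(tree):
    return 0 if tree is None else 1 + max(height(tree[1]), height(tree[2]))

def build(keys):
    tree = None
    for k in keys:
        tree = insert(tree, k)
    return tree

print(height(build([1, 2, 3, 4, 5, 6, 7])))  # 7 : ascending order gives a chain
print(height(build([4, 2, 6, 1, 3, 5, 7])))  # 3 : same keys, favorable order
```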
B-tree
B-tree : Balanced m-way search Tree
  Root Node : no child node, or more than one child node
  Internal Node : m/2 ~ m child nodes (block number)
  External Node : data block number instead of child node
  Balanced
    Upward split instead of downward split (as in a Binary Tree)
Upward Split
Suppose m=3
Insert 10, 20 : single node [10 20]
Insert 30 : [10 20 30] overflows; upward split moves 20 into a new root with child nodes [10] and [30]
Insert 40 : root [20], child nodes [10] and [30 40]
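Upward Split
Insert 50 : [30 40 50] overflows; 40 moves up; root [20 40], child nodes [10] [30] [50]
Insert 70 : root [20 40], child nodes [10] [30] [50 70]
Insert 60 : [50 60 70] overflows; 60 moves up; root [20 40 60] overflows in turn; 40 moves up
Result : root [40], internal nodes [20] and [60], external nodes [10] [30] [50] [70]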
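
The insertion sequence above can be replayed with a small B-tree insertion sketch. It is an illustration under assumptions, not the lecture's own code: the names `Node`, `insert` and `_insert` are hypothetical, a node overflows when it holds more than m-1 keys, the middle key is the one that moves up, and keys are assumed to be distinct.

```python
import bisect

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list: external node

def _insert(node, key, m):
    """Insert key under node. Returns (median, right_node) if node split, else None."""
    i = bisect.bisect_left(node.keys, key)
    if node.children:
        promoted = _insert(node.children[i], key, m)
        if promoted is None:
            return None
        median, right = promoted
        node.keys.insert(i, median)          # key pushed up from the child
        node.children.insert(i + 1, right)
    else:
        node.keys.insert(i, key)
    if len(node.keys) <= m - 1:              # no overflow
        return None
    mid = len(node.keys) // 2                # upward split around the middle key
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right

def insert(root, key, m):
    promoted = _insert(root, key, m)
    if promoted is None:
        return root
    median, right = promoted
    return Node([median], [root, right])     # the tree grows at the root

root = Node()
for k in [10, 20, 30, 40, 50, 70, 60]:
    root = insert(root, k, m=3)

print(root.keys)                                  # [40]
print([c.keys for c in root.children])            # [[20], [60]]
print([g.keys for c in root.children for g in c.children])
# [[10], [30], [50], [70]]
```

Because a new level is only ever created above the old root, every external node ends up at the same depth.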
Meaning of Upward Split
Always Balanced
Not so much influenced by the order of insertions
Internal Nodes : m/2 ~ m child nodes (block number)
[Figure: B-tree with Root Node [40], Internal Nodes [20] and [60], External Nodes [10] [30] [50] [70]]
Search by B-tree
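[Figure: search for key 45 in the B-tree with root [40], internal nodes [20] and [60], external nodes [10] [30] [50] [70]. 45 > 40 : go right; 45 < 60 : go left; external node [50] does not hold 45 : Not Found]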
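
The search for 45 can be traced with a short sketch. The tuple-based node layout `(keys, children)` and the names `leaf` and `search` are hypothetical; the tree is the one from the figure, written out by hand.

```python
import bisect

def leaf(*keys):
    return (list(keys), [])          # external node: no children

# (keys, children) nodes of the example tree
tree = ([40], [([20], [leaf(10), leaf(30)]),
               ([60], [leaf(50), leaf(70)])])

def search(node, key):
    """Return (found, list of key lists of the nodes visited)."""
    path = []
    while True:
        keys, children = node
        path.append(keys)
        i = bisect.bisect_left(keys, key)    # position of key among the delimiters
        if i < len(keys) and keys[i] == key:
            return True, path
        if not children:                     # reached an external node
            return False, path
        node = children[i]

print(search(tree, 45))  # (False, [[40], [60], [50]])
print(search(tree, 60))  # (True, [[40], [60]])
```

The number of nodes on the path is bounded by the depth of the tree, which is the number of page accesses.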
Performance of B-tree
Number of Comparisons within a node : Trivial
Number of Nodes to visit : Depth
Problem of B-tree
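Types of Search
  Exact Match Search
  Range Search : e.g. find students where 25 < student.GPA < 50
B-tree
  Good for Exact match search
  Bad for range search (neighboring keys are not linked, so a range scan must move up and down the tree)
[Figure: B-tree with root [40], internal nodes [20] and [60], external nodes [10] [30] [50] [70]]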
B+-tree
A Variant of B-tree
  Duplicate all elements at leaf nodes (external nodes)
  Linked List of Leaf Nodes
Performance
  Exact Match Search and Insertion : A small fraction of performance sacrifice
  Range Search : much more powerful than B-tree
B+-tree : Example
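[Figure: step-by-step insertion of 10, 20, 30, 40, 50, 60 into a B+-tree, with one leaf overflow. In the final tree the upper node holds 20 and 40, both keys also appear in the leaf nodes (Duplication), and the leaf nodes are chained (Linked List)]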
Range Search with B+-tree
Find students where GPA>3.5
[Figure: the search key 35 is followed from the upper node (20, 40) down to a leaf node, and the linked list of leaf nodes (10 20 30 40 50 60) is then scanned to the right]
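
A range search over linked leaf nodes can be sketched as follows. Everything here is an assumption made for illustration: the names `Leaf`, `build` and `range_search`, the grouping of the keys into three leaves, and the use of a flat list of separator keys in place of the upper levels of the tree.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None             # link to the next leaf node

def build(leaf_keys):
    """Chain the leaves and derive one separator (first key) per leaf after the first."""
    leaves = [Leaf(k) for k in leaf_keys]
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    separators = [lf.keys[0] for lf in leaves[1:]]
    return separators, leaves

def range_search(separators, leaves, low, high=float("inf")):
    """Keys k with low <= k <= high: one descent, then a scan along the leaf list."""
    lf = leaves[bisect.bisect_right(separators, low)]   # leaf that may hold `low`
    result = []
    while lf is not None:
        for k in lf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        lf = lf.next
    return result

separators, leaves = build([[10], [20, 30], [40, 50, 60]])
print(separators)                                  # [20, 40]
print(range_search(separators, leaves, 35))        # [40, 50, 60]
print(range_search(separators, leaves, 25, 50))    # [30, 40, 50]
```

After the first leaf is located, the scan never goes back up the tree, which is what makes range search cheap compared with a plain B-tree.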
Performance of B+-tree
Performance Determined by the Depth : $d = \log_{m/2} n$
Exact Match Search and Insertion (without split) : $d$ node (page) accesses
Range Search : $d + n_q / B_f$ node accesses ( $n_q$ : number of records to retrieve )
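
To get a feel for these numbers, the two expressions can be evaluated directly. This is a rough sketch: the function names and the sample values (one million keys, m = 341, 1000 matching records, 50 records per block) are hypothetical, and rounding up to whole nodes is an assumption added here.

```python
import math

def btree_depth(n: int, m: int) -> int:
    """d = log_{m/2} n, rounded up to a whole number of levels (assumption)."""
    return math.ceil(math.log(n) / math.log(m / 2))

def range_search_accesses(n: int, m: int, n_q: int, b_f: int) -> int:
    """d + n_q / B_f, with the second term rounded up to whole blocks (assumption)."""
    return btree_depth(n, m) + math.ceil(n_q / b_f)

print(btree_depth(1_000_000, 341))                      # 3
print(range_search_accesses(1_000_000, 341, 1000, 50))  # 23
```

With a fanout in the hundreds, even a million keys need only about three page accesses per exact match search.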