File Processing : Index and Hash 2015, Spring Pusan National University Ki-Joune Li


Page 1: File Processing : Index and Hash 2015, Spring Pusan National University Ki-Joune Li

File Processing : Index and Hash

2015, Spring

Pusan National University

Ki-Joune Li

Page 2

STEMPNU

What is an index ?

Index in a book
- Index : Keyword → Pages
- Without an index : exhaustive search, which is too expensive

Index for a file or database
- A function or mechanism
- F_Index : S_Predicate → B (block numbers on hard disk)
- e.g. find student records where student.GPA > 4.0

Page 3

Data Retrieval Time

Data retrieval on disk : two phases
- 1st phase : search with a condition (predicate)
- 2nd phase : data access

Figure: Search Condition → Search (1st phase) → { Block# } → Database on Disk (2nd phase)

Data access time depends on
- File structure
- Disk placement
- Clustering, etc.

Page 4

Blocking Factor Bf

Blocking factor : number of records in a block

Blocking factor and number of disk accesses
- ND = Nrecord / Bf

By maximizing the blocking factor, we reduce the number of disk accesses.
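
The relation ND = Nrecord / Bf can be tried out with a few numbers. The sketch below is only an illustration: the function name `disk_accesses` and the record counts are assumptions, not from the slides, and the result is rounded up because a partly filled block still costs one access.

```python
def disk_accesses(n_records: int, blocking_factor: int) -> int:
    """Blocks to read for a full scan: ND = Nrecord / Bf, rounded up."""
    if blocking_factor <= 0:
        raise ValueError("blocking factor must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (n_records + blocking_factor - 1) // blocking_factor


# Hypothetical file of 1,000,000 records with three blocking factors.
for bf in (10, 50, 100):
    print(f"Bf={bf:3d} -> ND={disk_accesses(1_000_000, bf)}")
```

A tenfold larger blocking factor gives a tenfold smaller number of block reads for the same scan.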

Page 5

How to Accelerate Phase 1 ?

We can accelerate phase 1 with an index or with a hash

Index vs. Hash
- Index : a type of data structure
  - Needs additional data structures
- Hash : a type of mechanism
  - May not need any additional data structure (not exactly true)

Page 6

A Simple Idea on Index

Mapping table from keywords to block numbers
- Inverted file
- Why is an inverted file better than nothing ?

If the table is too large to fit in main memory
- It has to be stored on disk
- Disk accesses are needed for index access

Keyword   Block#
Romeo     B26
Hamlet    B22
…         …
Carmen    B212

Lookup keyword : Juliet
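
The mapping table can be sketched as an in-memory dictionary. This is a minimal illustration that uses only the three rows visible on the slide; the names `inverted_file` and `lookup` are assumptions, and rows hidden behind "…" are not filled in.

```python
from typing import Dict, List

# Keyword -> block numbers, limited to the rows shown on the slide.
inverted_file: Dict[str, List[str]] = {
    "Romeo": ["B26"],
    "Hamlet": ["B22"],
    "Carmen": ["B212"],
}


def lookup(keyword: str) -> List[str]:
    """Phase 1: return the block numbers to read (empty list if no entry)."""
    return inverted_file.get(keyword, [])


print(lookup("Hamlet"))  # ['B22']
print(lookup("Juliet"))  # [] because this three-row excerpt has no such entry
```

Once the table no longer fits in main memory, each lookup itself costs disk accesses, which is the problem the following slides address.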

Page 7

Searching Algorithms and Index

A good way to accelerate searching
- Tree : O(log n)
- Reorganize the inverted file into a tree
- Binary search tree : branching factor = 2

Tree in memory space vs. in disk space
- Memory space : number of comparisons
- Disk space : number of block accesses

Figure: binary search tree of (key, block#) entries
- Root : (30, b27)
- Children of 30 : (14, b17), (40, b26)
- Children of 40 : (34, b17), (55, b26)

Page 8

Paged Tree : m-way search tree

Figure: two-level m-way search tree. Each node stores the number of delimiters, followed by (delimiter, block number) entries.
- Root delimiters : 57, 103, …, 343
- Child delimiters : 1, …, 54 and 58, …, 96

How to determine m ?
- One node : one disk page
- e.g. when 1 disk page is 4 K bytes : 4 + 4m + 8(m-1) = 4096, so m = 341
- Very fat tree
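
The page-size calculation can be checked in a few lines. The sketch assumes one reading of the formula: a 4-byte count field, 4 bytes per block number and 8 bytes per delimiter. These field sizes and the name `max_fanout` are assumptions made for illustration.

```python
def max_fanout(page_bytes: int, header_bytes: int = 4,
               pointer_bytes: int = 4, delimiter_bytes: int = 8) -> int:
    """Largest m with header + pointer*m + delimiter*(m-1) <= page_bytes."""
    # header + p*m + d*(m-1) <= page   =>   m <= (page - header + d) / (p + d)
    return (page_bytes - header_bytes + delimiter_bytes) // (
        pointer_bytes + delimiter_bytes)


m = max_fanout(4096)
print(m)                          # 341
print(4 + 4 * m + 8 * (m - 1))    # 4088 bytes used, which fits in 4096
```

With m = 342 the node would need 4100 bytes, so 341 is the largest branching factor that fits in one 4 K page under these assumptions.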

Page 9

Problem of m-Way search tree

m-way search tree
- Search performance : determined by the height
- Not balanced
  - Average : O(log n)
  - Worst case : n / Bf, i.e. O(n)
- Height : determined by the insertion order
  - e.g. insertion in ascending order

How to make it balanced ?
- Balanced m-way search tree : B-tree
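
The effect of insertion order on height can be observed with the simplest case, a binary search tree (m = 2) without any balancing. The code is a sketch for illustration; the class and function names, the key range and the random seed are all assumptions.

```python
import random
from typing import Iterable, List, Optional


class Node:
    def __init__(self, key: int) -> None:
        self.key = key
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def build(keys: Iterable[int]) -> Optional[Node]:
    """Insert keys one by one into an unbalanced binary search tree."""
    root: Optional[Node] = None
    for key in keys:
        new = Node(key)
        if root is None:
            root = new
            continue
        cur = root
        while True:
            if key < cur.key:
                if cur.left is None:
                    cur.left = new
                    break
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = new
                    break
                cur = cur.right
    return root


def height(root: Optional[Node]) -> int:
    """Number of levels, counted level by level to avoid deep recursion."""
    levels = 0
    level: List[Node] = [root] if root is not None else []
    while level:
        levels += 1
        level = [c for n in level for c in (n.left, n.right) if c is not None]
    return levels


keys = list(range(1, 201))
shuffled = keys[:]
random.Random(0).shuffle(shuffled)

print("ascending order:", height(build(keys)))      # 200, a linear chain
print("shuffled order :", height(build(shuffled)))  # far fewer levels
```

Ascending insertion degenerates the tree into a chain, so a search visits up to n nodes; this is the worst case that balancing is meant to rule out.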

Page 10

B-tree

B-tree : balanced m-way search tree
- Root node : no child node, or at least two child nodes
- Internal node : m/2 ~ m child nodes (block numbers)
- External node : data block numbers instead of child nodes
- Balanced
  - Upward split, instead of the downward split of a binary tree

Page 11

Downward Split

Suppose m=3

Insert 10, 20
- Node : [10 20]

Insert 30
- Node : [10 20 30], overflow
- Upward Split : 20 moves up, giving root [20] with children [10] and [30]

Insert 40
- Root [20] with children [10] and [30 40]

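Page 12

Downward Split

Insert 50
- [30 40 50] overflows; 40 moves up
- Root [20 40] with children [10], [30], [50]

Insert 60
- Root [20 40] with children [10], [30], [50 60]

Insert 70
- [50 60 70] overflows; 60 moves up
- Root [20 40 60] overflows; 40 moves up
- Root [40] with children [20] and [60]; [20] has children [10], [30]; [60] has children [50], [70]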
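
The insertion sequence 10, 20, …, 70 with m = 3 can be replayed with a small sketch of insertion by upward split. This is an illustrative implementation under stated assumptions, not the lecture's code: a node overflows when it holds m keys, the middle key is promoted, and the names `BNode`, `insert` and `dump` are hypothetical.

```python
from typing import List, Optional, Tuple

M = 3  # maximum number of children; a node overflows when it holds M keys


class BNode:
    def __init__(self, keys: Optional[List[int]] = None,
                 children: Optional[List["BNode"]] = None) -> None:
        self.keys: List[int] = keys or []
        self.children: List["BNode"] = children or []  # empty for a leaf


def _insert(node: BNode, key: int) -> Optional[Tuple[int, BNode]]:
    """Insert below node; return (promoted key, new right node) on a split."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if not node.children:                 # leaf: store the key here
        node.keys.insert(i, key)
    else:                                 # internal: descend into child i
        split = _insert(node.children[i], key)
        if split is not None:             # child split: absorb promoted key
            up_key, right = split
            node.keys.insert(i, up_key)
            node.children.insert(i + 1, right)
    if len(node.keys) < M:                # no overflow
        return None
    mid = len(node.keys) // 2             # upward split: middle key moves up
    up_key = node.keys[mid]
    right = BNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up_key, right


def insert(root: BNode, key: int) -> BNode:
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    up_key, right = split
    return BNode([up_key], [root, right])  # the tree grows at the root


def dump(node: BNode) -> tuple:
    """Nested (keys, children) view of the tree."""
    return (node.keys, [dump(c) for c in node.children])


root = BNode()
for k in (10, 20, 30, 40, 50, 60, 70):
    root = insert(root, k)
print(dump(root))
```

Because the tree only gains a level when the root itself splits, all leaves stay at the same depth even though the keys arrive in ascending order.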

Page 13

Meaning of Downward Split

Always balanced
- Not so much influenced by the order of insertions

Internal nodes : m/2 ~ m child nodes (block numbers)

Figure: example tree with node types
- Root node : [40]
- Internal nodes : [20], [60]
- External nodes : [10], [30], [50], [70]

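Page 14

Search by B-tree

Figure: searching for key 45 in the example tree (root [40]; internal nodes [20], [60]; external nodes [10], [30], [50], [70])
- Root [40] : 45 > 40, go right
- Internal node [60] : 45 < 60, go left
- External node [50] : 45 is not there
- Result : Not Found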
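
The search for 45 can be traced in code on the same example tree. The nested-tuple layout, the name `search` and the visit counter are assumptions chosen to keep the sketch short.

```python
from typing import List, Tuple

# A node is (keys, children); an empty children list marks an external node.
Tree = Tuple[List[int], list]

EXAMPLE: Tree = ([40], [
    ([20], [([10], []), ([30], [])]),
    ([60], [([50], []), ([70], [])]),
])


def search(node: Tree, key: int) -> Tuple[bool, int]:
    """Return (found, number of nodes visited)."""
    visited = 0
    while True:
        keys, children = node
        visited += 1
        i = 0
        while i < len(keys) and key > keys[i]:
            i += 1
        if i < len(keys) and keys[i] == key:
            return True, visited
        if not children:          # external node reached without a match
            return False, visited
        node = children[i]


print(search(EXAMPLE, 45))  # (False, 3)
print(search(EXAMPLE, 60))  # (True, 2)
```

The number of nodes visited never exceeds the depth of the tree, and on disk each visited node corresponds to one page access.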

Page 15

Performance of B-tree

Number of comparisons within a node : trivial
Number of nodes to visit : depth

Page 16

Problem of B-tree

Types of search
- Exact match search
- Range search
  - e.g. find students where 25 < student.GPA < 50

B-tree
- Good for exact match search
- Bad for range search

Figure: the example B-tree (root [40]; internal nodes [20], [60]; external nodes [10], [30], [50], [70])

Page 17

B+-tree

A variant of B-tree
- Duplicate all elements at leaf nodes (external nodes)
- Linked list of leaf nodes

Performance
- Exact match search and insertion : a small performance sacrifice
- Range search : much more powerful than B-tree

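Page 18

B+-tree : Example

Insert 10, 20, 30
- Leaf : [10 20 30]

Insert 40
- Leaf [10 20 30 40] overflows and splits
- Parent [20] with leaves [10 20] → [30 40]

Insert 50
- Parent [20] with leaves [10 20] → [30 40 50]

Insert 60
- Leaf [30 40 50 60] overflows and splits
- Parent [20 40] with leaves [10 20] → [30 40] → [50 60]

Linked list : the leaves are chained from left to right
Duplication : 20 and 40 appear both in the parent and in the leaves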

Page 19

Range Search with B+-tree

Find students where GPA>3.5

Figure: the search key 35 descends from the root [20 40] to the leaf [30 40]; the scan then follows the linked list of leaves to [50 60].
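
The range scan can be sketched on a tiny B+-tree. The layout is an assumption read from the example figure: root keys [20 40], leaves [10 20], [30 40], [50 60], with each root key equal to the largest key of the leaf on its left. The names `Leaf` and `range_search` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


class Leaf:
    def __init__(self, keys: List[int]) -> None:
        self.keys = keys
        self.next: Optional["Leaf"] = None  # link to the right sibling


# Assumed layout: one root above three linked leaves.
leaves = [Leaf([10, 20]), Leaf([30, 40]), Leaf([50, 60])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root_keys = [20, 40]


def range_search(low: int) -> List[int]:
    """Return every key greater than low, in ascending order."""
    # One descent: pick the leaf whose key range can contain `low`.
    leaf: Optional[Leaf] = leaves[bisect_left(root_keys, low)]
    result: List[int] = []
    # Then follow the leaf links instead of going back up the tree.
    while leaf is not None:
        result.extend(k for k in leaf.keys if k > low)
        leaf = leaf.next
    return result


print(range_search(35))  # [40, 50, 60]
```

After the single descent, the scan touches only leaf pages, which is why the linked list makes range search cheaper than in a plain B-tree.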

Page 20

Performance of B+-tree

Performance is determined by the depth d
- $d = \log_{m/2} n$

Exact match search and insertion (without split)
- d node (page) accesses

Range search
- $d + n_q / B_f$ node accesses ($n_q$ : number of records to retrieve)
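
The two cost formulas can be evaluated for concrete sizes. The sketch rounds both terms up to whole pages; the function names and the sample values (one million keys, m = 341, 1000 matching records, blocking factor 100) are assumptions for illustration.

```python
import math


def depth(n: int, m: int) -> int:
    """Depth bound when every node has at least m/2 children: log_{m/2} n."""
    return max(1, math.ceil(math.log(n) / math.log(m / 2)))


def exact_match_cost(n: int, m: int) -> int:
    """Exact match search (or insertion without split): d page accesses."""
    return depth(n, m)


def range_search_cost(n: int, m: int, n_q: int, bf: int) -> int:
    """Range search: d accesses to reach the first leaf, plus n_q / Bf pages."""
    return depth(n, m) + math.ceil(n_q / bf)


# Hypothetical sizes: 1,000,000 keys, fan-out 341, 1000 hits, Bf = 100.
print(exact_match_cost(1_000_000, 341))
print(range_search_cost(1_000_000, 341, n_q=1000, bf=100))
```

With a fan-out in the hundreds the depth stays very small, so the n_q / B_f term dominates the cost of large range queries.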