1 cs232a: database system principles indexing. 2 given condition on attribute find qualified records...

137
1 CS232A: Database System Principles INDEXING

Upload: karen-hancock

Post on 02-Jan-2016

233 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

1

CS232A: Database System Principles

INDEXING

Page 2: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

2

Given condition on attribute find qualified records

Attr = value

Condition may also be • Attr>value• Attr>=value

Indexing

value

Qualified records

valuevalue

Page 3: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Indexing• Data Stuctures used for quickly locating tuples

that meet a specific type of condition– Equality condition: find Movie tuples where

Director=X– Other conditions possible, eg, range conditions: find

Employee tuples where Salary>40 AND Salary<50

• Many types of indexes. Evaluate them on– Access time– Insertion time– Deletion time– Disk Space needed (esp. as it effects access time)

Page 4: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

4

Topics

• Conventional indexes• B-trees• Hashing schemes

Page 5: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Terms and Distinctions• Primary index

– the index on the attribute (a.k.a. search key) that determines the sequencing of the table

• Secondary index– index on any other attribute

• Dense index– every value of the indexed

attribute appears in the index

• Sparse index– many values do not appear

10203040

1020

3040

5070

8090

100120

50708090

A Dense Primary Index

100120140150

SequentialFile

Page 6: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Dense and Sparse Primary Indexes

10203040

1020

3040

5070

8090

100120

50708090

Dense Primary Index

100120140150

Sparse Primary Index

10305080100140160200

1020

3040

5070

8090

100120

Find the index record with largestvalue that is less or equal to thevalue we are looking.

+ can tell if a value exists without accessing file (consider projection)+ better access to overflow records

+ less index space

more + and - in a while

Page 7: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

7

Sparse vs. Dense Tradeoff

• Sparse: Less index space per record can keep more of

index in memory• Dense: Can tell if any record exists

without accessing file

(Later: – sparse better for insertions– dense needed for secondary indexes)

Page 8: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Multi-Level Indexes

• Treat the index as a file and build an index on it

• “Two levels are usually sufficient. More than three levels are rare.”

• Q: Can we build a dense second level index for a dense index ?

10305080100140160200

1020

3040

5070

8090

100120

10100250400

250270300350400460500550

6007509201000

Page 9: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

A Note on Pointers

• Record pointers consist of block pointer and position of record in the block

• Using the block pointer only saves space at no extra disk accesses cost

Page 10: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Representation of Duplicate Values in

Primary Indexes• Index may point

to first instance of each value only

104070100

1010

1040

4070

7070

100120

Page 11: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Deletion from Dense Index

102030

1020

30

5070

90

100120

5070

90

Delete 40, 80

HeaderHeader

Lists of available entries

• Deletion from dense primary index file with no duplicate values is handled in the same way with deletion from a sequential file

• Q: What about deletion from dense primary index with duplicates

Page 12: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Deletion from Sparse Index

• if the deleted entry does not appear in the index do nothing

10305080100140160200

1020

30

5070

8090

100120

HeaderDelete 40

Page 13: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Deletion from Sparse Index (cont’d)

• if the deleted entry does not appear in the index do nothing

• if the deleted entry appears in the index replace it with the next search-key value– comment: we could leave

the deleted value in the index assuming that no part of the system may assume it still exists without checking the block

Delete 30

10405080100140160200

1020

40

5070

8090

100120

Header

Page 14: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Deletion from Sparse Index (cont’d)

• if the deleted entry does not appear in the index do nothing

• if the deleted entry appears in the index replace it with the next search-key value

• unless the next search key value has its own index entry. In this case delete the entry

Delete 40, then 30

10

5080100140160200

1020

5070

8090

100120

HeaderHeader

Page 15: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Insertion in Sparse Index

• if no new block is created then do nothing

10305080100140160200

1020

30 35

5070

8090

100120

HeaderInsert 35

Page 16: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

16

Insertion in Sparse Index

• if no new block is created then do nothing

• else create overflow record– Reorganize periodically– Could we claim space

of next block?– How often do we

reorganize and how much expensive it is?

– B-trees offer convincing answers

10 30 50 80

100140160200

1020

30

5070

8090

100120

HeaderInsert 15

Page 17: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

17

Secondary indexesSequencefield

5030

7020

4080

10100

6090

File not sorted on secondary search key

Page 18: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

18

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Sparse index

302080

100

90...

does not make sense!

Page 19: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

19

Secondary indexesSequencefield

5030

7020

4080

10100

6090

• Dense index10203040

506070...

105090...

sparsehighlevel

First level has to be dense,next levels are sparse (as usual)

Page 20: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

20

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

Page 21: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

21

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10101020

20304040

4040...

one option...

Problem:excess overhead!

• disk space• search time

Page 22: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

22

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10

another option: lists of pointers

4030

20Problem:variable sizerecords inindex!

Page 23: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

23

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10203040

5060...

Yet another idea :Chain records with same key?

Problems:• Need to add fields to records, messes up maintenance• Need to follow chain to know records

Page 24: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

24

Duplicate values & secondary indexes

1020

4020

4010

4010

4030

10203040

5060...

buckets

Page 25: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

25

Why “bucket” idea is useful

Indexes RecordsName: primary EMP

(name,dept,year,...)

Dept: secondaryYear: secondary

• Enables the processing of queries workingwith pointers only.

• Very common technique in Information Retrieval

Page 26: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Advantage of Buckets: Process Queries Using

Pointers OnlyFind employees of the Toys dept with 4 years in the companySELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4

ToysPCsPensSuits

Dept IndexAaron Suits 4Helen Pens 3

Jack PCs 4Jim Toys 4

Joe Toys 3Nick PCs 2

Walt Toys 5Yannis Pens 1

1234

Year Index

Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

Page 27: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

27

This idea used in text information retrieval

Documents

...the cat is fat ...

...my cat and my dog like each other...

...Fido the dog ...Buckets known as

Inverted lists

cat

dog

Page 28: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

28

Information Retrieval (IR) Queries

• Find articles with “cat” and “dog”– Intersect inverted lists

• Find articles with “cat” or “dog”– Union inverted lists

• Find articles with “cat” and not “dog”– Subtract list of dog pointers from list of cat

pointers

• Find articles with “cat” in title• Find articles with “cat” and “dog”

within 5 words

Page 29: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

29

Common technique: more info in inverted

list

catTitle 5

Title 100

Author 10

Abstract 57

Title 12

d3d2

d1

dog

typeposit

ion

locatio

n

Page 30: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

30

Posting: an entry in inverted list.Represents occurrence

ofterm in article

Size of a list: 1 Rare words or (in postings) mis-spellings

106 Common words

Size of a posting: 10-15 bits (compressed)

Page 31: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

31

Vector space model

w1 w2 w3 w4 w5 w6 w7 …

DOC = <1 0 0 1 1 0 0 …>

Query= <0 0 1 1 0 0 0 …>

PRODUCT = 1 + ……. = score

Page 32: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

32

• Tricks to weigh scores + normalize

e.g.: Match on common word not asuseful as match on rare

words...

Page 33: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Summary of Indexing So Far• Basic topics in conventional indexes

– multiple levels– sparse/dense– duplicate keys and buckets– deletion/insertion similar to sequential files

• Advantages– simple algorithms– index is sequential file

• Disadvantages– eventually sequentiality is lost because of overflows,

reorganizations are needed

Page 34: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

34

Example Index (sequential)

continuous

free space

102030

405060

708090

39313536

323834

33

overflow area(not sequential)

Page 35: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

35

Outline:

• Conventional indexes• B-Trees NEXT• Hashing schemes

Page 36: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

36

• NEXT: Another type of index– Give up on sequentiality of index– Try to get “balance”

Page 37: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

37

Root

B+Tree Example n=3

100

120

150

180

30

3 5 11

30

35

100

101

110

120

130

150

156

179

180

200

Page 38: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

38

Sample non-leaf

to keys to keys to keys to keys

< 57 57 k<81 81k<95 95

57

81

95

Page 39: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

39

Sample leaf node:

From non-leaf node

to next leafin

sequence5

7

81

95

To r

eco

rd

wit

h k

ey 5

7

To r

eco

rd

wit

h k

ey 8

1

To r

eco

rd

wit

h k

ey 8

5

Page 40: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

40

In textbook’s notationn=3

Leaf:

Non-leaf:

30

35

30

30 35

30

Page 41: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

41

Size of nodes: n+1 pointersn keys (fixed)

Page 42: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

42

Non-root nodes have to be at least half-full

• Use at least

Non-leaf: (n+1)/2pointers

Leaf: (n+1)/2 pointers to data

Page 43: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

43

Full nodemin. node

Non-leaf

Leaf

n=3

12

01

50

18

0

30

3 5 11

30

35

Page 44: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

44

B+tree rules tree of order n

(1) All leaves at same lowest level(balanced tree)

(2) Pointers in leaves point to records except for “sequence pointer”

Page 45: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

45

(3) Number of pointers/keys for B+tree

Non-leaf(non-root) n+1 n (n+1)/2 (n+1)/2- 1

Leaf(non-root) n+1 n

Root n+1 n 1 1

Max Max Min Min ptrs keys ptrsdata keys

(n+1)/2 (n+1)/2

Page 46: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

46

Insert into B+tree

(a) simple case– space available in leaf

(b) leaf overflow(c) non-leaf overflow(d) new root

Page 47: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

47

(a) Insert key = 32 n=33 5 11

30

31

30

100

32

Page 48: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

48

(a) Insert key = 7 n=3

3 5 11

30

31

30

100

3 5

7

7

Page 49: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

49

(c) Insert key = 160 n=3

10

0

120

150

180

150

156

179

180

200

160

18

0

160

179

Page 50: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

50

(d) New root, insert 45 n=3

10

20

30

1 2 3 10

12

20

25

30

32

40

40

45

40

30new root

Page 51: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

51

(a) Simple case - no example

(b) Coalesce with neighbor (sibling)

(c) Re-distribute keys(d) Cases (b) or (c) at non-leaf

Deletion from B+tree

Page 52: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

52

(b) Coalesce with sibling– Delete 50

10

40

100

10

20

30

40

50

n=4

40

Page 53: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

53

(c) Redistribute keys– Delete 50

10

40

100

10

20

30

35

40

50

n=4

35

35

Page 54: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

54

40

45

30

37

25

26

20

22

10

141 3

10

20

30

40

(d) Non-leaf coalese– Delete 37

n=4

40

30

25

25

new root

Page 55: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

55

B+tree deletions in practice

– Often, coalescing is not implemented– Too hard and not worth it!

Page 56: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

56

Comparison: B-trees vs. static indexed sequential

file

Ref #1: Held & Stonebraker“B-Trees Re-examined”CACM, Feb. 1978

Page 57: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

57

Ref # 1 claims:- Concurrency control harder in B-Trees

- B-tree consumes more space

For their comparison:block = 512 byteskey = pointer = 4 bytes4 data records per block

Page 58: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

58

Example: 1 block static index

127 keys

(127+1)4 = 512 Bytes-> pointers in index implicit! up to

127 contigous blocks

k1

k2

k3

k1

k2

k3

1 datablock

Page 59: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

59

Example: 1 block B-tree

63 keys

63x(4+4)+8 = 512 Bytes-> pointers needed in B-tree up to 63

blocks because index and data blocksare not contiguous

k1

k2

...

k63

k1

k2

k3

1 datablock

next

-

Page 60: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

60

Size comparison Ref. #1Size comparison Ref. #1

Static Index B-tree# data # datablocks height blocks

height

2 -> 127 2 2 -> 63 2128 -> 16,129 3 64 -> 3968 316,130 -> 2,048,383 4 3969 -> 250,047 4

250,048 -> 15,752,961 5

Page 61: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

61

Ref. #1 analysis claims

• For an 8,000 block file,after 32,000 inserts

after 16,000 lookups Static index saves enough accesses

to allow for reorganization

Ref. #1 conclusion Static index better!!

Page 62: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

62

Ref #2: M. Stonebraker, “Retrospective on a database

system,” TODS, June 1980

Ref. #2 conclusion B-trees better!!

Page 63: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

63

• DBA does not know when to reorganize– Self-administration is important target

• DBA does not know how full to loadpages of new index

Ref. #2 conclusion B-trees better!!

Page 64: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

64

• Buffering– B-tree: has fixed buffer requirements– Static index: large & variable size

buffers needed due to overflow

Ref. #2 conclusion B-trees better!!

Page 65: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

65

• Speaking of buffering… Is LRU a good policy for B+tree

buffers? Of course not! Should try to keep root in memory

at all times(and perhaps some nodes from second level)

Page 66: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

66

Interesting problem:

For B+tree, how large should n be?

n is number of keys / node

Page 67: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Assumptions

• You have the right to set the disk page size for the disk where a B-tree will reside.

• Compute the optimum page size n assuming that– The items are 4 bytes long and the pointers are

also 4 bytes long.– Time to read a node from disk is 12+.003n– Time to process a block in memory is unimportant– B+tree is full (I.e., every page has the maximum

number of items and pointers

Page 68: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

68

Can get: f(n) = time to find a record

f(n)

nopt n

Page 69: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

69

FIND nopt by f’(n) = 0

Answer should be nopt = “few hundred”

What happens to nopt as

• Disk gets faster?• CPU get faster?

Page 70: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

70

Variation on B+tree: B-tree (no +)• Idea:

– Avoid duplicate keys– Have record pointers in non-leaf nodes

Page 71: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

71

to record to record to record with K1 with K2 with K3

to keys to keys to keys to keys < K1 K1<x<K2 K2<x<k3 >k3

K1 P1 K2 P2 K3 P3

Page 72: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

72

B-tree example n=2

65

125

145

165

85

105

25

45

10

20

30

40

110

120

90

100

70

80

170

180

50

60

130

140

150

160

• sequence pointers not useful now! (but keep space for simplicity)

Page 73: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

73

So, for B-trees:

MAX MINTree Rec Keys Tree Rec KeysPtrs Ptrs Ptrs Ptrs

Non-leafnon-root n+1 n n (n+1)/2 (n+1)/2-1

(n+1)/2-1Leafnon-root 1 n n 1 (n+1)/2

(n+1)/2

Rootnon-leafn+1 n n 2 1 1

RootLeaf 1 n n 1 1 1

Page 74: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

74

Tradeoffs:

B-trees have marginally faster average lookup than B+trees (assuming the height does not change)

in B-tree, non-leaf & leaf different sizesSmaller fan-out in B-tree, deletion more complicated

B+trees preferred!

Page 75: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

75

Example:- Pointers 4 bytes- Keys 4 bytes- Blocks 100 bytes (just example)- Look at full 2 level tree

Page 76: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

76

Root has 8 keys + 8 record pointers+ 9 son pointers

= 8x4 + 8x4 + 9x4 = 100 bytes

B-tree:

Each of 9 sons: 12 rec. pointers (+12 keys)= 12x(4+4) + 4 = 100 bytes

2-level B-tree, Max # records =12x9 + 8 = 116

Page 77: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

77

Root has 12 keys + 13 son pointers= 12x4 + 13x4 = 100 bytes

B+tree:

Each of 13 sons: 12 rec. ptrs (+12 keys)= 12x(4 +4) + 4 = 100 bytes

2-level B+tree, Max # records= 13x12 = 156

Page 78: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

78

So...

ooooooooooooo ooooooooo 156 records 108 records

Total = 116

B+ B

8 records

• Conclusion:– For fixed block size,– B+ tree is better because it is bushier

Page 79: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

79

Outline/summary

• Conventional Indexes• Sparse vs. dense• Primary vs. secondary

• B trees• B+trees vs. B-trees• B+trees vs. indexed sequential

• Hashing schemes --> Next

Page 80: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

Hashing

• hash function h(key) returns address of bucket or record

• for secondary index buckets are required

• if the keys for a specific hash value do not fit into one page the bucket is a linked list of pages

key h(key)

key h(key)

Records

Buckets Records

key

Page 81: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

81

Example hash function

• Key = ‘x1 x2 … xn’ n byte character string

• Have b buckets• h: add x1 + x2 + ….. xn

– compute sum modulo b

Page 82: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

82

This may not be best function … Read Knuth Vol. 3 if you really

need to select a good function.

Good hash Expected number of function: keys/bucket is the

same for all buckets

Page 83: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

83

Within a bucket:

• Do we keep keys sorted?

• Yes, if CPU time critical & Inserts/Deletes not too frequent

Page 84: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

84

Next: example to illustrateinserts, overflows,

deletes

h(K)

Page 85: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

85

EXAMPLE 2 records/bucket

INSERT:h(a) = 1h(b) = 2h(c) = 1h(d) = 0

0

1

2

3

d

ac

b

h(e) = 1

e

Page 86: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

86

0

1

2

3

a

bc

e

d

EXAMPLE: deletion

Delete:ef

fg

maybe move“g” up

cd

Page 87: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

87

Rule of thumb:• Try to keep space utilization

between 50% and 80% Utilization = # keys used

total # keys that fit

• If < 50%, wasting space• If > 80%, overflows significant

depends on how good hashfunction is & on # keys/bucket

Page 88: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

88

How do we cope with growth?

• Overflows and reorganizations• Dynamic hashing

• Extensible• Linear

Page 89: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

89

Extensible hashing: two ideas

(a) Use i of b bits output by hash function

b h(K)

use i grows over time….

00110101

Page 90: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

90

(b) Use directory

h(K)[0-i ] to bucket

.

.

.

.

Page 91: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

91

Example: h(k) is 4 bits; 2 keys/bucket

i = 1

1

1

0001

1001

1100

Insert 1010

11100

1010

New directory

200

01

10

11

i =

2

2

Page 92: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

92

10001

21001

1010

21100

Insert:

0111

0000

00

01

10

11

2i =

Example continued

0111

0000

0111

0001

2

2

Page 93: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

93

00

01

10

11

2i =

21001

1010

21100

20111

20000

0001

Insert:

1001

Example continued

1001

1001

1010

000

001

010

011

100

101

110

111

3i =

3

3

Page 94: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

94

Extensible hashing: deletion

• No merging of blocks• Merge blocks

and cut directory if possible(Reverse insert procedure)

Page 95: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

95

Deletion example:

• Run thru insert example in reverse!

Page 96: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

96

Extensible hashing

Can handle growing files- with less wasted space- with no full reorganizations

Summary

+

Indirection(Not bad if directory in

memory)

Directory doubles in size(Now it fits, now it does not)

-

-

Page 97: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

97

Linear hashing

• Another dynamic hashing scheme

Two ideas:(a) Use i low order bits of hash

01110101grows

b

i

(b) File grows linearly

Page 98: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

98

Example b=4 bits, i =2, 2 keys/bucket

00 01 10 11

0101

1111

0000

1010

m = 01 (max used block)

Futuregrowthbuckets

If h(k)[i ] m, then look at bucket h(k)[i ]

else, look at bucket h(k)[i ] - 2i -1

Rule

0101• can have overflow chains!

• insert 0101

Page 99: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

99

Example b=4 bits, i =2, 2 keys/bucket

00 01 10 11

0101

1111

0000

1010

m = 01 (max used block)

Futuregrowthbuckets

10

1010

0101 • insert 0101

11

11110101

Page 100: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

100

Example Continued: How to grow beyond this?

00 01 10 11

111110100101

0101

0000

m = 11 (max used block)

i = 2

0 0 0 0100 101 110 111

3

. . .

100

100

101

101

0101

0101

Page 101: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

101

• If U > threshold then increase m(and maybe i )

When do we expand file?

• Keep track of: # used slots total # of slots = U

Page 102: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

102

Linear Hashing

Can handle growing files- with less wasted space- with no full reorganizations

No indirection like extensible hashing

Summary

+

+

Can still have overflow chains-

Page 103: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

103

Example: BAD CASE

Very full

Very empty Need to move

m here…Would wastespace...

Page 104: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

104

Hashing- How it works- Dynamic hashing

- Extensible- Linear

Summary

Page 105: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

105

Next:

• Indexing vs Hashing• Index definition in SQL• Multiple key access

Page 106: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

106

• Hashing good for probes given keye.g., SELECT …

FROM RWHERE R.A = 5

Indexing vs Hashing

Page 107: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

107

• INDEXING (Including B Trees) good for

Range Searches:e.g., SELECT

FROM RWHERE R.A > 5

Indexing vs Hashing

Page 108: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

108

Index definition in SQL

• Create index name on rel (attr)• Create unique index name on rel

(attr)defines candidate key

• Drop INDEX name

Page 109: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

109

CANNOT SPECIFY TYPE OF INDEX

(e.g. B-tree, Hashing, …)

OR PARAMETERS(e.g. Load Factor, Size of

Hash,...)

... at least in SQL...

Note

Page 110: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

110

ATTRIBUTE LIST MULTIKEY INDEX

(next) e.g., CREATE INDEX foo ON

R(A,B,C)

Note

Page 111: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

111

Motivation: Find records where DEPT = “Toy” AND SAL >

50k

Multi-key Index

Page 112: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

112

Strategy I:

• Use one index, say Dept.• Get all Dept = “Toy” records

and check their salary

I1

Page 113: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

113

• Use 2 Indexes; Manipulate Pointers

Toy Sal>

50k

Strategy II:

Page 114: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

114

• Multiple Key Index

One idea:

Strategy III:

I1

I2

I3

Page 115: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

115

Example

ExampleRecord

DeptIndex

SalaryIndex

Name=JoeDEPT=SalesSAL=15k

ArtSalesToy

10k15k17k21k

12k15k15k19k

Page 116: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

116

For which queries is this index good?

Find RECs Dept = “Sales” SAL=20kFind RECs Dept = “Sales” SAL > 20kFind RECs Dept = “Sales”Find RECs SAL = 20k

Page 117: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

117

Interesting application:

• Geographic Data

DATA:

<X1,Y1, Attributes> <X2,Y2, Attributes>

x

y

. .

.

Page 118: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

118

Queries:

• What city is at <Xi,Yi>?• What is within 5 miles from

<Xi,Yi>?• Which is closest point to <Xi,Yi>?

Page 119: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

119

h

nb

ia

co

d

10 20

10 20

Examplee

g

f

m

l

kj25 15 35 20

40

30

20

10

h i a bcd efg

n omlj k

• Search points near f• Search points near b

5

15 15

Page 120: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

120

Queries

• Find points with Yi > 20• Find points with Xi < 5• Find points “close” to i = <12,38>• Find points “close” to b = <7,24>

Page 121: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

121

• Many types of geographic index structures have been suggested

• Quad Trees• R Trees

Page 122: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

122

Two more types of multi key indexes

• Grid• Partitioned hash

Page 123: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

123

Grid Index Key 2

X1 X2 …… Xn V1 V2

Key 1

Vn

To records with key1=V3, key2=X2

Page 124: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

124

CLAIM

• Can quickly find records with– key 1 = Vi Key 2 = Xj

– key 1 = Vi

– key 2 = Xj

• And also ranges….– E.g., key 1 Vi key 2 < Xj

Page 125: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

125

But there is a catch with Grid Indexes!

• How is Grid Index stored on disk?

LikeArray... X

1X

2X

3X

4

X1

X2

X3

X4

X1

X2

X3

X4

V1 V2 V3

Problem:• Need regularity so we can compute

position of <Vi,Xj> entry

Page 126: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

126

Solution: Use Indirection

BucketsV1V2

V3 *Grid onlyV4 contains

pointers tobuckets

Buckets------

------

------

------

------

X1 X2 X3

Page 127: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

127

With indirection:

• Grid can be regular without wasting space

• We do have price of indirection

Page 128: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

128

Can also index grid on value ranges

Salary Grid

Linear Scale1 2 3

Toy SalesPersonnel

0-20K 1

20K-50K 2

50K- 38

Page 129: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

129

Grid files

Good for multiple-key searchSpace, management overhead (nothing is free)

Need partitioning ranges that evenly split keys

+

-

-

Page 130: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

130

Idea:

Key1 Key2

Partitioned hash function

h1 h2

010110 1110010

Page 131: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

131

h1(toy) =0 000h1(sales) =1 001h1(art)=1 010

. 011

.h2(10k) =01 100h2(20k) =11 101h2(30k) =01 110h2(40k) =00 111

.

.

<Fred,toy,10k>,<Joe,sales,10k><Sally,art,30k>

EX:

Insert

<Joe><Sally>

<Fred>

Page 132: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

132

h1(toy) =0 000h1(sales) =1 001h1(art)=1 010

. 011

.h2(10k) =01 100h2(20k) =11 101h2(30k) =01 110h2(40k) =00 111

.

.• Find Emp. with Dept. = Sales

Sal=40k

<Fred><Joe><Jan>

<Mary>

<Sally>

<Tom><Bill><Andy>

Page 133: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

133

h1(toy) =0 000h1(sales) =1 001h1(art)=1 010

. 011

.h2(10k) =01 100h2(20k) =11 101h2(30k) =01 110h2(40k) =00 111

.

.• Find Emp. with Sal=30k

<Fred><Joe><Jan>

<Mary>

<Sally>

<Tom><Bill><Andy>

look here

Page 134: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

134

h1(toy) =0 000h1(sales) =1 001h1(art)=1 010

. 011

.h2(10k) =01 100h2(20k) =11 101h2(30k) =01 110h2(40k) =00 111

.

.• Find Emp. with Dept. = Sales

<Fred><Joe><Jan>

<Mary>

<Sally>

<Tom><Bill><Andy>

look here

Page 135: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

135

Post hashing discussion:- Indexing vs. Hashing

- SQL Index Definition- Multiple Key Access

- Multi Key IndexVariations: Grid, Geo Data

- Partitioned Hash

Summary

Page 136: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

136

Reading Chapter 5

• Skim the following sections:– 5.3.6, 5.3.7, 5.3.8– 5.4.2, 5.4.3, 5.4.4

• Read the rest

Page 137: 1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value

137

The BIG picture….

• Chapters 2 & 3: Storage, records, blocks...• Chapter 4 & 5: Access Mechanisms -

Indexes- B trees- Hashing- Multi key

• Chapter 6 & 7: Query ProcessingNEXT