
CS4411/9538 Set 8, Storage, Indexing, Execution

Set 8

Storage, Indexing, and Execution Strategies

(part 2 is about costs; part 3 contains XML storage and techniques)

Sylvia Osborn


Outline of notes

◼ Set 1: Introduction ✔

◼ Set 2: Architecture ✔

❑ Centralized Relational

❑ Distributed DBMS

❑ Object-Oriented DBMS

◼ Set 3: Database Design ✔



◼ Set 4: Data Modeling ✔

◼ Set 5: Querying ✔

◼ Set 6: XML Model and Querying ✔

◼ Set 7: Algebraic Query

Optimization ✔




◼ Set 8: Storage, Indexing, and

Execution Strategies

◼ Set 8, Part 2: Costs

◼ Set 8, Part 3: XML Implementation

Issues

◼ Set 9: Transactions and

Concurrency Control


◼ Set 9, Part 2



◼ Set 10: Recovery



◼ Set 11: Database Security


Outline of this set of notes

1. storage of values, tuples and objects

2. disc sorting

3. indexing

4. execution of relational algebra operators

◼ Goals:

❑ to minimize page fetches from disc to main memory or vice versa.

❑ (possibly) to try to get related stuff together on one page

◼ Note:

❑ stuff comes into main memory in page/block size chunks, of some size fixed by the operating system



1. (Disc) Storage of Atomic Data Types

◼ integers and reals are represented as they are in main memory (fixed

length, typically 2, 4 or 8 bytes)

◼ enumerated types: look at an example: if a type is declared for paint

colours, say with 10 values ("sky blue", "forest green", ...) we do not

store the actual strings in every record. We introduce a code

(probably the numbers 0 to 9) and store them as binary numbers from

0 to 9, with a lookup somewhere to get the actual values. We need 4

bits to represent 0 to 9. So this value could be represented in one

byte.

Another example: sex - would have 2 values, M and F, and possibly a

"don't have the data" missing value. This only needs at most 2 bits.

We usually do not use the rest of this byte, as it is too costly to

decode more than one value within a byte.


(Disc) Storage, cont’d

◼ dates: probably use yyyy mm dd, i.e. store 8 bytes per date value.

◼ fixed length character strings: say CHAR(5) - just use the number of bytes given by the fixed length.

◼ variable length character strings: say VARCHAR(255) - the idea here is that the whole space may not be used up. There are 2 common ways to do this:

❑ length plus content: uses n+1 bytes, where the first one gives the length and the rest are the actual string.

This only works if n ≤ 255.

❑ with a null terminated string: store the n bytes of the string and then put a special byte which is not a valid character for strings in this language.
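The two variable-length string layouts can be sketched as follows. This is only an illustration: the function names are made up for these notes, ASCII is assumed as the character set, and the zero byte is assumed to be the "not a valid character" terminator.

```python
TERMINATOR = b"\x00"  # assumed terminator byte, never valid inside a string


def encode_length_prefixed(s: str) -> bytes:
    """Length plus content: one length byte followed by the n string bytes."""
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("a single length byte can only describe n <= 255")
    return bytes([len(data)]) + data


def decode_length_prefixed(buf: bytes) -> str:
    """Read the length byte, then exactly that many content bytes."""
    n = buf[0]
    return buf[1:1 + n].decode("ascii")


def encode_terminated(s: str) -> bytes:
    """Terminated string: the n string bytes followed by the terminator."""
    data = s.encode("ascii")
    if TERMINATOR in data:
        raise ValueError("the terminator byte may not appear inside the string")
    return data + TERMINATOR


def decode_terminated(buf: bytes) -> str:
    """Scan forward to the first terminator byte and return what precedes it."""
    end = buf.index(TERMINATOR)
    return buf[:end].decode("ascii")
```

Both layouts cost n+1 bytes for an n-byte string. The length-prefixed form lets the reader skip over the field without scanning it; the terminated form has no 255-byte limit but must be scanned to find its end.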


(Disc) Storage, cont’d

◼ other variable length fields:

❑ if the String type has no upper limit, it needs to

be stored, probably as length plus content where

length can be greater than 255.

❑ in an OODB, when we have set-valued attributes,

we basically have a 1-D array of Object IDs, or

perhaps an array of literals, and the size of this

array can vary as the object is updated

❑ Similarly for list-valued attribute values in an

OODB


Storage of Tuples

◼ given the relation scheme for a relational database,

the system will store each tuple in consecutive bytes on the disc.

◼ given the object type definition for tuples in an OODB, we will store the atomic parts of each object in consecutive bytes on the disc.

◼ let's call this thing which goes onto the disc a disc record:

❑ it is probably not fixed length.

❑ there will be a storage record schema which says in which order the fields (attributes) appear within the consecutive bytes.

❑ these records have to go into disc blocks/pages, which are fixed length.


Storage of Tuples, cont’d

◼ within each record, there will be a header with

such information as:

❑ the record type (there may be records of more than one

type mixed on a given disc page.)

❑ the overall length right now

❑ an offset to the beginning of each variable length field

◼ after the header, the fixed length fields are stored,

and then the variable length fields

◼ tuples get packed onto disc pages


Storage of Relations in System R

(the original IBM relational DB prototype)

◼ System R handled its own segmentation and paging.

◼ A page was a fixed-size unit of I/O.

◼ A segment was a logical address space whose size

(no. of pages) varied dynamically.

◼ Each base table was stored completely within one

segment.

CREATE TABLE P

( ............)

[ IN SEGMENT segment-name ]


Storage of Relations in System R, cont’d

◼ There were three types of segments:

1. public - recoverable and sharable

2. private - recoverable but not sharable

3. temporary - neither recoverable nor sharable

◼ A segment's type was fixed at system installation time and

could not be changed.

◼ if the IN SEGMENT clause was omitted in the CREATE TABLE

statement, the table would be stored in a private segment

belonging to the user who issued the CREATE TABLE.

◼ a given segment could contain more than one base table

(relation).


System R, cont’d

◼ System R kept its own page maps to tell it the

physical location on disc of each page of a segment -

this was used for the shadow paging recovery

management technique which we will talk about later

in the course.

◼ tuples occupied contiguous bytes on a page.

◼ tuples in a relation could have variable length fields.

◼ the tuple prefix contained:

❑ relation ID

❑ number of fields

❑ something to tell it the actual length of the variable length

fields for this tuple


System R Page Structure


System R, cont’d

◼ insertions, deletions and updates to variable length fields require tuples to be moved around on the page to optimize the use of storage. This can happen when the page is in main memory.

◼ when this happens, only the pointer from the slot to the tuple needs to change. Tuple IDs (TIDs) do not change.

◼ if a page overflows, and a tuple has to be moved to an overflow page, the pointer in the slot points to the new location on the overflow page. The TID does not change.

◼ If an overflow page overflows, the pointer from the original slot referenced by the TID points directly to the new location of the tuple.

◼ Thus there are at most 2 disc accesses ever required to find a tuple given its tuple ID.
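The slot indirection can be sketched as below. This is a toy model, not System R's actual page format: the class names, the `("here", ...)` / `("moved", ...)` slot encoding and the read counter are all assumptions made for illustration. A TID is modelled as a (page number, slot number) pair.

```python
class Page:
    def __init__(self):
        # slot number -> ("here", tuple) or ("moved", (page_no, slot_no))
        self.slots = {}


class Store:
    def __init__(self):
        self.pages = {}
        self.reads = 0  # counts simulated disc page reads

    def read_page(self, page_no):
        self.reads += 1
        return self.pages[page_no]

    def insert(self, page_no, slot_no, tup):
        """Store a tuple on its home page and return its TID."""
        self.pages.setdefault(page_no, Page()).slots[slot_no] = ("here", tup)
        return (page_no, slot_no)

    def move(self, tid, new_page_no, new_slot_no):
        """Move a tuple to an overflow page; the home slot is re-pointed."""
        home_page, home_slot = tid
        kind, value = self.pages[home_page].slots[home_slot]
        if kind == "here":
            tup = value
        else:
            # Already on an overflow page: take it from there, so the home
            # slot will again point directly at the newest location.
            old_page, old_slot = value
            tup = self.pages[old_page].slots.pop(old_slot)[1]
        self.pages.setdefault(new_page_no, Page()).slots[new_slot_no] = ("here", tup)
        self.pages[home_page].slots[home_slot] = ("moved", (new_page_no, new_slot_no))

    def fetch(self, tid):
        """Follow the TID; at most one forwarding pointer is ever followed."""
        page_no, slot_no = tid
        kind, value = self.read_page(page_no).slots[slot_no]
        if kind == "moved":
            fwd_page, fwd_slot = value
            kind, value = self.read_page(fwd_page).slots[fwd_slot]
        return value
```

Because `move` always rewrites the home slot rather than chaining from one overflow page to the next, `fetch` never needs more than two page reads, however often the tuple has moved.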


Indexing in System R

◼ the user can define indexes on stored relations.

◼ each index is implemented as a B+-tree (keys in

internal nodes, data on leaves)

◼ the index is stored in the same segment as the

relation it indexes

◼ index pages are separate from data pages.

◼ indexes can be defined to be unique or non-unique.

For example an index on a primary key field would be

a unique index. An index for what is sometimes called

a secondary key, some attribute which is not unique,

like hair colour, year in school, etc. is non-unique.


Indexing, cont’d

◼ leaf pages of the index

❑ contain single TIDs for unique indexes,

❑ or a sequence of TIDs for non-unique indexes.

◼ Current relational systems, like DB2, allow

the declaration of a primary key in the

CREATE TABLE command. This automatically

generates a unique index for this (these)

attribute(s).


Possible Access Paths for a relation in System R

1. System Sequence

• every data page in the segment containing the relation in

question is examined.

• every TID slot for each page is followed and the tuple prefix

is checked to see if the tuple belongs to the desired relation.

If so, the tuple is then processed by the query.

2. Using an Index

• The system keeps a catalogue of all indexes currently present

and for which relations and which attributes. Indexes may be

used to access all tuples of a relation in the order established

by the index (Note: this could involve more page fetches than

system sequence).


Access Paths cont’d

• If the query has a predicate: attr = value, and the

attribute has an index, then this index will definitely

be used to fetch only those pages having tuples with

this value. We will talk more about this when we talk

about executing selection.

You should be convinced that every query can be

answered.

Whether or not it can be answered efficiently depends on

whether or not the appropriate indexes have been

defined.


Extremely Large Values (Blobs)

◼ there are some values which are too large for a single

disk page.

◼ first of all, in its attribute spot, there will be a

pointer. The user is not aware of this.

◼ the large value can be fetched with some kind of

stream I/O interface. This would be suitable for a

long text value, e.g. the contents of a book chapter.

◼ can have demand paging on these large values.


(Blobs), cont’d

◼ one scheme, introduced in a research prototype

called Exodus from U. of Wisconsin, is to view the

large object as a B+ -tree, where the index value is

the byte offset from the beginning of the object.

If a large number of bytes is inserted in the middle

of the value, the B+ -tree algorithms look after

index reorganization and page splits, without

having to move data bytes around unless they are

on a splitting leaf page.


Object formats for OO Databases

(i.e. records with Object IDs and Inheritance)

◼ each (nested) object has its own disc record

◼ the object ID needs to be mapped to the location of this disc record (more on this later)

◼ The type system compiler figures out which attributes are inherited

◼ the contiguous bytes part of the stored record includes all literal attributes and also includes all inherited literal attributes.

◼ if there is multiple inheritance, the name conflict resolution is all made by the compiler before any objects need to be stored.


Clustering Objects Together

◼ in a relational system, if you know that 2 relations are

frequently being joined together, you might want to

cluster, say, the employees in the computer science

department with the computer science department record,

on the same page

◼ this won’t be possible if the primary key index is deciding

what page the tuple lives on

◼ this was contemplated in System R, but is probably more

relevant for object-oriented databases

◼ the database designer may know that employee and

department objects are always accessed together, and want

a way to have them put close together on the disk,

preferably on the same page.


Clustering, cont’d

◼ in Objectivity, for example, when you create a

new object, you can give an existing OID as a

parameter and the system will try to locate the

new object near the existing one.

◼ in O2, you can give a path within the aggregation

hierarchy and have it cluster the objects based on

this path. If the path includes nested objects, then

the nested objects will be clustered in this way.

(see O2 system administrator's guide)

2. Sorting data which is on disc

◼ the algorithm is based on a k-way balanced merge sort

(shown here with k = 2)


[Figure: 2-way balanced merge. Two input streams (input 1, input 2) feed the sorting logic in main memory, which writes to two output streams (output 1, output 2).]

Some details for sorting

◼ on the first pass, distribute data alternately to output runs, each of which is “already sorted”

◼ if these sorted runs are of length 1, then after 1 merge pass (with 2 input streams and 2 output streams), sorted runs are of length 2

◼ after 2 passes, sorted runs are of length 4

◼ after 3 passes, sorted runs are of length 8

◼ ...

◼ after log_2 n passes, the sorted run is of length n (i.e., all done!)

◼ overall time complexity is O(n log_2 n), because each of the log_2 n passes reads and writes n records from/to the disc
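The doubling of run lengths can be watched in a small in-memory simulation. This is a sketch only: Python lists stand in for the disc streams, and the function names are invented for these notes.

```python
from heapq import merge


def merge_pass(runs):
    """One pass: merge neighbouring pairs of sorted runs into longer runs."""
    merged = []
    for i in range(0, len(runs), 2):
        # runs[i:i + 2] holds two runs, or one leftover run at the end
        merged.append(list(merge(*runs[i:i + 2])))
    return merged


def balanced_merge_sort(records):
    """Return the sorted records and the number of merge passes used."""
    runs = [[r] for r in records]  # runs of length 1 are trivially sorted
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs)
        passes += 1
    return (runs[0] if runs else []), passes
```

With 8 input records the run count goes 8, 4, 2, 1, so 3 passes are needed, matching log_2 8 = 3. Every pass touches every record once, which is where the n log_2 n total comes from.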


3. Indexing (based on B+-tree)

◼ basic idea is internal nodes are between half full and full

◼ the root node can be less than half full

◼ in original B-tree proposal, tree nodes contained records and pointers

◼ an improvement is to have only keys (and pointers) in the internal nodes

– this allows for more pointers per node, thus making the trees shorter

and fatter

◼ insertion and search take O(depth of the tree), which with k-way

branching is log_k n if there are n keys in the tree

◼ this improvement is commonly called a B+-tree, but not always

◼ tree nodes are kept between half full and full by splitting if necessary

during insertion and coalescing during deletion

◼ generally assumed that an index node corresponds to a disc page, so

searching to the bottom of the tree takes O(log_k n) disc accesses.
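To get a feel for how shallow such a tree is, the number of levels can be estimated with a few lines of arithmetic. This sketch assumes every node is completely full with the same fanout, which a real tree (nodes between half full and full) only approximates; the function name is made up.

```python
def levels_needed(n_keys: int, fanout: int) -> int:
    """Smallest number of levels d such that fanout**d >= n_keys."""
    levels, reachable = 1, fanout
    while reachable < n_keys:
        reachable *= fanout
        levels += 1
    return levels
```

For example, with a fanout of 100, three levels are enough to reach 1 000 000 keys, so a search costs about three page reads under these assumptions.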


General structure of a B+-tree


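[Figure: general structure of a B+-tree. The root index level (keys 100, 200, ..., max) points to the internal index level (keys 10, 20, ...), which points to the leaf pages of the index (keys 1, ...), which point to the data pages holding each key together with the rest of its tuple.]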


Advantage of Clustering Indexes

◼ If it is on a non-unique attribute, when the data page is fetched into memory, a lot of tuples are fetched which might be relevant to the query.

Disadvantage of Clustering Indexes

◼ Can only do this for one attribute or attribute combination per relation


A Non-clustering Index on a unique attribute

This is called a Dense Index because there is a

pointer for every tuple in the index leaf pages


Non-clustering Indexes or Dense Indexes, cont’d

◼ Note: both clustering and dense indexes could be for a

unique attribute or a non-unique attribute. If non-unique,

then the data pages for clustering have all the tuples with

that value together. If non-unique and dense, then the

leaf index pages have a set of pointers to data pages

rather than a single one.

◼ Each commercial system has its own way of allowing the

database administrator to specify that a clustering index

is to be built.

4. Execution of relational operators

◼ Goals (reminder):

❑ to minimize page fetches from disc to main memory

or vice versa.

❑ take advantage of structures which have related stuff

together on one page

◼ Note:

❑ stuff comes into main memory in page/block size

chunks, of some size fixed by the operating system

❑ there are indexes which might help with certain

operators



Executing Selection (σ)

◼ A brute force method would be to scan the relation

and apply the predicate to each tuple.

◼ If the query asks for all tuples in relation R with attr =

value

❑ if the attr is a primary key, the answer is one tuple.

❑ if some other non-unique attribute, the answer could be a

very large fraction of the tuples.

❑ the basic heuristic is: if there is an index on the attribute,

use it. It will probably avoid having to examine every tuple.


Executing Selection (σ), cont’d

◼ Conjunctive query: of the form

attr1 = value1 ^ attr2 = value2

❑ if there is an index on only one attribute, say attr1, use the

index to get those tuples, and when you have them in main

memory, examine every tuple to see if attr2 = value2.

❑ if one of the attributes is the primary key, and there is an

index on the primary key, use the index. There will be one

tuple fetched, and can test the rest of the query then.

❑ if one of the attributes, (say attr1) has a clustering index, use

that index to fetch (a few pages of) tuples and then examine

them for attr2 in memory. Do this even if attr2 has an index.


Executing Selection (σ), cont’d

◼ if there are no relevant clustering indexes, but

both attributes have a non-clustering index, then

❑ follow both indexes to the bottom index page and

build two lists of TIDs

❑ sort the two lists by physical address

❑ intersect the two TID lists

❑ fetch the resulting pages and extract the tuples in the

answer

◼ and on and on and on (have to consider attr < value,

attr > value, predicates connected by “or”, etc.)
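The TID-list strategy for two non-clustering indexes can be sketched as follows. Each index is modelled as a plain dictionary from attribute value to a list of TIDs, and a TID as a (page number, slot number) pair; both are simplifying assumptions, as are the function names.

```python
def tids_for_conjunction(index1, value1, index2, value2):
    """TIDs of tuples with attr1 = value1 and attr2 = value2."""
    # Sorting (page, slot) pairs orders the TIDs by physical address.
    tids1 = sorted(index1.get(value1, []))
    tids2 = sorted(index2.get(value2, []))
    # Merge-style intersection of the two sorted lists.
    i = j = 0
    result = []
    while i < len(tids1) and j < len(tids2):
        if tids1[i] == tids2[j]:
            result.append(tids1[i])
            i += 1
            j += 1
        elif tids1[i] < tids2[j]:
            i += 1
        else:
            j += 1
    return result


def pages_to_fetch(tids):
    """Distinct data pages that hold the qualifying tuples, in address order."""
    return sorted({page for page, _slot in tids})
```

Only the pages returned by `pages_to_fetch` have to be read from disc; tuples that satisfy just one of the two predicates never cause a data page fetch.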


Indexes DO NOT HELP with the following predicates

◼ A1 op A2, where A1 and A2 are from the same relation,

◼ ¬ predicate

◼ predicates joined by or

◼ A1 op arithmetic expression

◼ A1 like pattern

where the pattern starts with a wild card symbol

(Note: A1 like pattern allows for character string

matching. There are index types that help with this,

but these are of no use if a wild card comes at the

beginning.)


Execution Strategies for Join

◼ want to compute rbig ⋈ rsmall

◼ Suppose rbig has 10 000 tuples

and rsmall has 200 tuples

There are several algorithms


1. Naive Algorithm

for each tuple t1 in rbig do
    fetch t1
    for each tuple t2 in rsmall do
        fetch t2 and see if they generate a tuple in the answer

If each tuple fetch causes a disc read, then

read rbig once: 10 000 disc reads

read rsmall 10 000 times x 200 reads: 2 000 000 disc reads

total disc reads: 2 010 000

or O(nbig * nsmall ) where nbig is the number of tuples in rbig, and nsmall is the number in rsmall.


2. Block- (Page-) Based Algorithm

Suppose the tuples are clustered so that we get 20 tuples/disc page. rbig is 500 pages and rsmall takes up 10 pages.

for each page in rbig do
    fetch the page
    for each page in rsmall do
        fetch it, compare all tuples to do the join

read rbig in pages: 500 page reads

rsmall read 500 times @ 10 pages each: 5 000 page reads

total: 5 500 page reads

Put rsmall in the outer loop:

rsmall takes 10 page reads

read rbig 10 times takes 5 000 page reads

total: 5 010 page reads

run time is O(# pages of rbig * # pages of rsmall), better with rsmall in the outer loop.
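The read counts for the naive and the page-based nested loops follow the same formula, just with a different unit (tuples versus pages). A minimal sketch, with a made-up function name:

```python
def nested_loop_reads(outer_units: int, inner_units: int) -> int:
    """Reads for a nested loop: the outer input is read once, and the
    whole inner input is re-read once for every unit of the outer input."""
    return outer_units + outer_units * inner_units


# Tuple at a time, rbig (10 000 tuples) outer, rsmall (200 tuples) inner
naive = nested_loop_reads(10_000, 200)

# Page at a time, rbig (500 pages) outer, rsmall (10 pages) inner
page_big_outer = nested_loop_reads(500, 10)

# Page at a time, rsmall outer, rbig inner
page_small_outer = nested_loop_reads(10, 500)
```

The product term is the same whichever relation is outer; only the "read once" term changes, which is why the smaller relation belongs in the outer loop.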


3. Clever Buffering Algorithm

Suppose there are enough buffer pages in main memory for m disc pages.

Read the first m-1 pages of rbig into the first m-1 buffer pages in memory
Repeat
    read the first page of rsmall into the mth buffer page
    Repeat
        do the join on the tuples in memory
        read the next page of rsmall into the mth buffer page (+)
    Until rsmall is all read
    read the next m-1 pages of rbig into memory
Until rbig is all read

(+) the least recently used page replacement algorithm must not be used here. It would put the next disc page into the first buffer spot.


Clever Buffering, analysis

Say m = 20.

read rbig once: 500 page reads

read rsmall ⌈500/19⌉ = 27 times, each is 10 pages: 270 page reads

total: 770 page reads

or O(pages of rbig + c * pages of rsmall)

To bypass the least recently used page replacement

algorithm, the database system has to “fix” these m-1

pages in memory until it is done with them.
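The buffering scheme can be simulated with lists of pages, counting a read each time a page is brought in. This is a sketch under assumptions: a page is a Python list of tuples, `match` is a hypothetical join predicate supplied by the caller, and holding a chunk in a local variable stands in for "fixing" the m-1 pages in memory.

```python
def buffered_join(big_pages, small_pages, m, match):
    """Join using m buffer pages: m-1 for a chunk of the big relation,
    1 for the current page of the small relation.
    Returns (joined pairs, number of page reads)."""
    reads = 0
    out = []
    for start in range(0, len(big_pages), m - 1):
        chunk = big_pages[start:start + m - 1]  # these pages stay fixed
        reads += len(chunk)
        for small_page in small_pages:          # reuses the mth buffer page
            reads += 1
            for big_page in chunk:
                for t1 in big_page:
                    for t2 in small_page:
                        if match(t1, t2):
                            out.append((t1, t2))
    return out, reads
```

With 500 pages for rbig, 10 pages for rsmall and m = 20, the outer loop runs 27 times, so the simulation counts 500 + 27 * 10 = 770 page reads, the same as the analysis above.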


4. Join with an Index

Suppose there is an index on the join attribute for r2, and r1 is stored tightly packed according to some index (i.e. clustered, on anything).

for each page of tuples in r1 do

look up the matching tuples in r2, using the index, and perform the join.

Can assume that, with a high branching factor in the index, each index lookup takes about 3 or 4 disc accesses.

So, the cost in disc accesses is:

(depth of index on r2) * n1/(blocking factor for r1 )

where n1 is the number of tuples in r1


Join with an Index, cont’d

If there is no clustering and there are no indexes, it

may be cheaper to build a B+-tree index on the fly

than to do the join some other way.

That takes O(n log_b n), where n is the size of r2

and b is the branching factor.

All of the join techniques so far are variations on

what some systems call the nested loops algorithm.


5. Merge Join (Sort-Merge Join)

If both relations happen to be in order on the intersecting attributes, can do

a straight one-pass merge.

This takes O(n1 + n2) page accesses (assuming there are not too many tuples

with a single value to fit into memory at one time.)

If the two relations are not in order, sort them first and then do a merge.

Run time for a disc-based sorting algorithm is

O(n * log_t n)

where t is the number of runs merged in each pass. With a disc sort, t can be quite a large number.

This method is always better than the naive algorithm.

In the processing of a query where select and project operations may have

reduced the size of both inputs to the join from the original stored

relations, the sorting phase could be the output phase from the select or

project, and then the join can be done using this algorithm on the

presorted temporary relations.
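The merge step can be sketched on two in-memory lists that are already sorted on the join attribute. The function name and the `key1` / `key2` accessor parameters are assumptions for illustration; tuples with the same join value in r2 are assumed to fit in memory together, as noted above.

```python
def merge_join(r1, r2, key1, key2):
    """Join two lists of tuples, each sorted on its join key."""
    i = j = 0
    out = []
    while i < len(r1) and j < len(r2):
        k1, k2 = key1(r1[i]), key2(r2[j])
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Find the group of r2 tuples sharing this join value.
            j_end = j
            while j_end < len(r2) and key2(r2[j_end]) == k1:
                j_end += 1
            # Pair every r1 tuple with this value against that group.
            while i < len(r1) and key1(r1[i]) == k1:
                for t2 in r2[j:j_end]:
                    out.append(r1[i] + t2)
                i += 1
            j = j_end
    return out
```

Each input is scanned once from front to back, which is the O(n1 + n2) behaviour; only the group of equal-valued r2 tuples is revisited.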


6. Hash Joins

Hash both relations using the same technique at the

same time, on the overlapping attributes, into one set

of buckets (or two parallel sets of buckets).

that requires reading them both once,

i.e. O(n1 + n2)

Read back each bucket and produce the output. This takes

another O(n1 + n2), as long as no one bucket is too big

to fit in memory at one time. If there are some large

buckets, use the nested loops algorithm on the large

buckets.
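A minimal in-memory sketch of the two phases, using two parallel sets of buckets. The function name, the accessor parameters and the default of 8 buckets are assumptions; on disc each bucket would be a file rather than a list.

```python
from collections import defaultdict


def hash_join(r1, r2, key1, key2, n_buckets=8):
    """Partition both inputs with the same hash function, then join
    bucket by bucket."""
    buckets1 = defaultdict(list)
    buckets2 = defaultdict(list)
    # Phase 1: read each relation once and distribute it into buckets.
    for t in r1:
        buckets1[hash(key1(t)) % n_buckets].append(t)
    for t in r2:
        buckets2[hash(key2(t)) % n_buckets].append(t)
    # Phase 2: read back matching bucket pairs and produce the output.
    out = []
    for b in range(n_buckets):
        lookup = defaultdict(list)
        for t1 in buckets1[b]:
            lookup[key1(t1)].append(t1)
        for t2 in buckets2[b]:
            for t1 in lookup.get(key2(t2), []):
                out.append(t1 + t2)
    return out
```

Tuples that can join always land in the same bucket number, so bucket b of one relation only ever has to be compared with bucket b of the other. The order of the output depends on the hash values, so it is not sorted.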


7. Parallel Joins

From a discussion of this in the Garcia-Molina, Ullman

and Widom book. Here we assume the shared nothing

architecture, which is apparently the one used for

“database machines”, with i processors.

◼ at the source processor, hash the two relations into i

buckets, applying the same hash function the same

way for the overlapping attributes.

◼ ship bucket j to processor j (for j = 1, ..., i) as the outputs from the

hash get big enough to be worth shipping.

◼ perform the join on the i buckets in parallel.


Transforming Nested Queries into Joins

Select *

From R

Where A1 = 5 and A2 in

(Select A3

From S

Where A4 = 12)

◼ this query could be executed as suggested by its form, i.e. do the inner query and create a temporary relation, and then execute the outer query, probably using a Merge Join.

◼ Note that a tuple of R will be in the answer only once, if it satisfies the second predicate.


Nested Queries, cont’d

◼ The above query is equivalent to the following “join

form”:

Select distinct R.*

From R, S

Where R.A1 = 5 and S.A4 = 12 and R.A2 = S.A3

◼ Transforming queries into this form reduces the

number of cases to be considered when programming

the execution of σ.

◼ It also allows the query optimizer to consider other

execution methods for the join.



Consider this example:

Select Distinct R.*

From R

Where A1 = 5 and

A2 in (Select A4 From S

Where A5 = 6 and A6 > R.A3)

Looks like the nested loops algorithm is the only way to execute this.

However, it is equivalent to

Select Distinct R.*

From R, S

Where R.A1 = 5 and R.A2 = S.A4 and S.A5 = 6

and S.A6 > R.A3

With this version, a merge join also seems possible.



◼ DB2 performs this transformation only under the

following conditions:

❑ the subquery target list is a single column, guaranteed by a

unique index to have unique values.

❑ The comparison operator connecting the outer query to the

subquery is either IN or =ANY (which have the same

meaning).

◼ In fact, in general, DB2 generates more efficient

execution plans for the join version of a given query,

rather than the nested version.


How to do Distinct Projection

◼ In general, the result is too big to fit in main memory. We have to put the result on disc. The problem is, how do you check for duplicates in a disc file?

◼ one technique for main memory would be to insert the results in some order, checking for duplicates as you insert. This takes O(n^2), but on disc this means traversing the whole file to find the position of the new tuple, so it ends up being O(n^2) disc reads. Very bad.

◼ sort the result, i.e. filter out the unwanted attributes (doing the projection) on the first pass, and then sort the result, checking for duplicate neighbours on the last pass. This would be an O(n log_t n) sort, where again n is the number of disc blocks in the relation being sorted.

◼ Construct a B+-tree on the fly, inserting the tuples as they are generated, and checking for duplicates. Again, this is an O(n log_m n) algorithm.
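The sort-based technique can be sketched in memory as below. Python's built-in sort stands in for the disc sort, and the function name and the use of column positions are assumptions for illustration.

```python
def distinct_projection(tuples, columns):
    """Project onto the given column positions and remove duplicates
    by sorting and comparing neighbours."""
    # First pass: drop the unwanted attributes.
    projected = [tuple(t[c] for c in columns) for t in tuples]
    # Sort so that duplicates become neighbours.
    projected.sort()
    # Last pass: keep a row only if it differs from the one before it.
    result = []
    for row in projected:
        if not result or result[-1] != row:
            result.append(row)
    return result
```

After sorting, a duplicate can only sit directly next to its twin, so a single comparison per row replaces the search through the whole file.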


◼ Non-Distinct Projection?

◼ Under what circumstances (with distinct

projection) are you guaranteed there will be

no duplicates in the result?


Other Relational Operators

◼ commercial database systems have to have algorithms for set union, bag union, set

intersect, bag intersect, set difference, bag difference, grouping and aggregation

functions (sum, max, min and count), and sorting to implement the ORDER BY clause in

SQL.

◼ need the bag versions (where duplicates might exist) because that is what SQL produces if

you do not say SELECT DISTINCT but just SELECT...

◼ union, intersect and minus can be done in one pass of a balanced merge, if the inputs are

sorted.

◼ if the relations fit in main memory, many of these can be done with hashing, or a

balanced tree data structure.

◼ the aggregate functions on a whole relation can be done in one pass.

◼ the aggregates on groups can be done by hashing the relation into buckets corresponding

to the groups, and then applying the aggregate operation to each group.
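Grouping by hashing can be sketched with a dictionary, whose keys play the role of the buckets. The function name and the accessor parameters are made up, and the whole relation is assumed to fit in memory.

```python
from collections import defaultdict


def group_aggregate(tuples, group_key, value):
    """Hash tuples into one bucket per group, then aggregate each bucket."""
    groups = defaultdict(list)
    for t in tuples:
        groups[group_key(t)].append(value(t))
    return {
        g: {"count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
        for g, v in groups.items()
    }
```

For sum, count, min and max the per-group lists are not really needed: a running value per group could be updated as each tuple is hashed, which keeps memory use proportional to the number of groups.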


Distributed Join Execution Methods

◼ With distributed algorithms, the operation to count, or the most

expensive thing to do is ship things from one site to another over

the network. So the algorithms for distributed databases try to

minimize the amount of data shipped from one site to another.

1. Ship Whole

❑ we are assuming that relations R and S are at different sites.

❑ to perform R ⋈ S, ship all of S to the site of R, or vice-versa.

2. Ship as Needed (nested loops, remote version)

for each join value of relation R

ship tuples of S with this value to the site of R


Distributed Joins, cont’d

3. Semijoin Algorithm – where A represents the

intersecting attributes:

Compute S ⋈ (R ⋉ πA(S)) as follows:

Do πA(S) at the site of S

ship πA(S) to the site of R

do the ⋉ at the site of R

ship the ⋉ to the site of S

do the join finally at the site of S

cost is based on cost of the ship steps
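The steps of the semijoin algorithm can be traced in a single-process sketch that counts what would cross the network. The function name and accessor parameters are invented, and the cost unit (one per shipped join value or tuple) is an assumption; a real system would count bytes or messages.

```python
def semijoin_plan(r, s, a_r, a_s):
    """Compute S join (R semijoin proj_A(S)).
    Returns (result tuples, number of items shipped)."""
    # At the site of S: project S onto the join attribute A.
    s_keys = {a_s(t) for t in s}
    shipped = len(s_keys)            # ship the projection to the site of R
    # At the site of R: keep only the R tuples that will find a partner.
    r_reduced = [t for t in r if a_r(t) in s_keys]
    shipped += len(r_reduced)        # ship the semijoin result to the site of S
    # At the site of S: do the final join.
    result = [t_s + t_r for t_s in s for t_r in r_reduced
              if a_s(t_s) == a_r(t_r)]
    return result, shipped
```

The scheme pays off when few tuples of R have a partner in S: the small projection and the reduced R are shipped instead of a whole relation.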
