review for midterm 2

29
Review for Midterm 2 Review for Midterm 2 Shahram Ghandeharizadeh Shahram Ghandeharizadeh

Upload: hiroko

Post on 09-Feb-2016

51 views

Category:

Documents


0 download

DESCRIPTION

Review for Midterm 2. Shahram Ghandeharizadeh. Midterm 2. Scheduled for April 30 th 4 papers Variant indexes. Access path selection. Overview of query optimization. Mining Association Rules. Paper on cache management is not included because it was covered by your project. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Review for Midterm 2

Review for Midterm 2Review for Midterm 2

Shahram GhandeharizadehShahram Ghandeharizadeh

Page 2: Review for Midterm 2

Midterm 2Midterm 2 Scheduled for April 30Scheduled for April 30thth

4 papers4 papers Variant indexes.Variant indexes. Access path selection.Access path selection. Overview of query optimization.Overview of query optimization. Mining Association Rules.Mining Association Rules. Paper on cache management is not included Paper on cache management is not included

because it was covered by your project.because it was covered by your project.

Midterm 2 is worth 35% of your grade.Midterm 2 is worth 35% of your grade.

Page 3: Review for Midterm 2

Variant IndexesVariant Indexes A read-mostly database that is updated A read-mostly database that is updated

infrequently.infrequently. Complex indexes to speedup queries.Complex indexes to speedup queries. Focuses on physical designs to enhance Focuses on physical designs to enhance

performance.performance.

Page 4: Review for Midterm 2

Example Data WarehouseExample Data Warehouse Key Observations:Key Observations:

A handful of products, a A handful of products, a PROD table with tens of PROD table with tens of rows.rows.

Many millions of rows Many millions of rows for SALES tables.for SALES tables.

CidCidPidPidDayDayAmtAmtdollar_costdollar_costUnit_salesUnit_sales

SALESSALES

PidPidNameNameSizeSizeWeightWeightPackage_typePackage_type

PRODPRODDayDayWeekWeekMonthMonthYearYearHollidayHollidayWeekdayWeekday

TIMETIME

Page 5: Review for Midterm 2

A B+-Tree on Major HolidaysA B+-Tree on Major Holidays A B+-tree index on different holidays of the A B+-tree index on different holidays of the

SALES table.SALES table.

Joe, Big Mac, Lab day, …Joe, Big Mac, Lab day, …

Mary, Fries, Pres day, …Mary, Fries, Pres day, …

Harry, Big Mac, Pres day, …Harry, Big Mac, Pres day, …

Henry, Big Mac, Pres day, …Henry, Big Mac, Pres day, …

Jane, Happy Meal, Pres day, …Jane, Happy Meal, Pres day, …

Shideh, Happy Meal, Pres day, …Shideh, Happy Meal, Pres day, …

Kam, Happy Meal, Pres day, …Kam, Happy Meal, Pres day, …

Bob, Big Mac, Pres day, …Bob, Big Mac, Pres day, …

(Pres day, (1,2), (1, 3), (1, 4), (2,1), ….(Pres day, (1,2), (1, 3), (1, 4), (2,1), …. B+-treeB+-treeLeaf pageLeaf page

Value ListValue List

Page 6: Review for Midterm 2

A B+-Tree on Major HolidaysA B+-Tree on Major Holidays A B+-tree index on different holidays of the A B+-tree index on different holidays of the

SALES table.SALES table.

Joe, Big Mac, Lab day, …Joe, Big Mac, Lab day, …

Mary, Fries, Pres day, …Mary, Fries, Pres day, …

Harry, Big Mac, Pres day, …Harry, Big Mac, Pres day, …

Henry, Big Mac, Pres day, …Henry, Big Mac, Pres day, …

Jane, Happy Meal, Pres day, …Jane, Happy Meal, Pres day, …

Shideh, Happy Meal, Pres day, …Shideh, Happy Meal, Pres day, …

Kam, Happy Meal, Pres day, …Kam, Happy Meal, Pres day, …

Bob, Big Mac, Pres day, …Bob, Big Mac, Pres day, …

(Pres day, (1,2), (1, 3), (1, 4), (2,1), ….(Pres day, (1,2), (1, 3), (1, 4), (2,1), …. B+-treeB+-treeLeaf pageLeaf page

Value ListValue List RID ListRID List

Page 7: Review for Midterm 2

Conjunctive QueriesConjunctive Queries Count number of Big Mac Sales on Count number of Big Mac Sales on

“President’s Day” assuming a B+-tree on “President’s Day” assuming a B+-tree on product (pid) and day of SALESproduct (pid) and day of SALES

With RID-ListsWith RID-Lists Get the Value-List for “Big Mac” using the B+-Get the Value-List for “Big Mac” using the B+-

tree, obtain RID-List1.tree, obtain RID-List1. Get the Value-List for “President’s Day” using Get the Value-List for “President’s Day” using

the B+-tree, obtain RID-List2.the B+-tree, obtain RID-List2. Compute set-intersect of RID-List1 and RID-List2Compute set-intersect of RID-List1 and RID-List2 Count the number of RIDs in the intersection set.Count the number of RIDs in the intersection set.

Is there a better way? Is there a better way? Yes, use bit-maps and logical bit-wise operands.Yes, use bit-maps and logical bit-wise operands.

Page 8: Review for Midterm 2

Bitmap IndexesBitmap Indexes Use a bitmap to represent the existence of a Use a bitmap to represent the existence of a

record with a certain attribute value.record with a certain attribute value. Example: If a record has the indexed Example: If a record has the indexed

attribute value “Big Mac” then its attribute value “Big Mac” then its corresponding entry in the bitmap is set to corresponding entry in the bitmap is set to one. Otherwise, it is a zero.one. Otherwise, it is a zero.

Page 9: Review for Midterm 2

A BitmapA Bitmap A Bitmap B is defined on T as a sequence of A Bitmap B is defined on T as a sequence of

M bits.M bits. For each row r with row number j that has For each row r with row number j that has

the property P, we set bit j in B to one; all the property P, we set bit j in B to one; all other bits are set to zero.other bits are set to zero.

Assuming fix sized disk pages that hold p Assuming fix sized disk pages that hold p records, RID of record j is (j/p, j%p). Page is records, RID of record j is (j/p, j%p). Page is j/p, slot number is j%p.j/p, slot number is j%p.

Pres Day, 0100001100111111110000011001…..Pres Day, 0100001100111111110000011001…..

Record 0Record 0

Page 10: Review for Midterm 2

A BitmapA Bitmap A Bitmap B is defined on T as a sequence of A Bitmap B is defined on T as a sequence of

M bits.M bits. For each row r with row number j that has For each row r with row number j that has

the property P, we set bit j in B to one; all the property P, we set bit j in B to one; all other bits are set to zero.other bits are set to zero.

Assuming fix sized disk pages that hold p Assuming fix sized disk pages that hold p records, RID of record j is (j/p, j%p). Page is records, RID of record j is (j/p, j%p). Page is j/p, slot number is j%p.j/p, slot number is j%p.

Pres Day, 0100001100111111110000011001…..Pres Day, 0100001100111111110000011001…..

Record 1Record 1

Page 11: Review for Midterm 2

A BitmapA Bitmap A Bitmap B is defined on T as a sequence of A Bitmap B is defined on T as a sequence of

M bits.M bits. For each row r with row number j that has For each row r with row number j that has

the property P, we set bit j in B to one; all the property P, we set bit j in B to one; all other bits are set to zero.other bits are set to zero.

Assuming fix sized disk pages that hold p Assuming fix sized disk pages that hold p records, RID of record j is (j/p, j%p). Page is records, RID of record j is (j/p, j%p). Page is j/p, slot number is j%p.j/p, slot number is j%p.

Pres Day, 0100001100111111110000011001…..Pres Day, 0100001100111111110000011001…..

Record 2Record 2

Page 12: Review for Midterm 2

A B+-Tree on Major HolidaysA B+-Tree on Major Holidays A B+-tree index on different holidays of the A B+-tree index on different holidays of the

SALES table.SALES table.

Joe, Big Mac, Lab day, …Joe, Big Mac, Lab day, …

Mary, Fries, Pres day, …Mary, Fries, Pres day, …

Harry, Big Mac, Pres day, …Harry, Big Mac, Pres day, …

Henry, Big Mac, Pres day, …Henry, Big Mac, Pres day, …

Jane, Happy Meal, Pres day, …Jane, Happy Meal, Pres day, …

Shideh, Happy Meal, Pres day, …Shideh, Happy Meal, Pres day, …

Kam, Happy Meal, Pres day, …Kam, Happy Meal, Pres day, …

Bob, Big Mac, Pres day, …Bob, Big Mac, Pres day, …

(Pres day, 01111111….(Pres day, 01111111…. B+-treeB+-treeLeaf pageLeaf page

Page 13: Review for Midterm 2

Conjunctive QueriesConjunctive Queries Count number of Big Mac Sales on “President’s Count number of Big Mac Sales on “President’s

Day” assuming a B+-tree on product (pid) and day of Day” assuming a B+-tree on product (pid) and day of SALESSALES

With RIDWith RID Get the Value-List for “Big Mac” using the B+-tree, obtain Get the Value-List for “Big Mac” using the B+-tree, obtain

RID-List1.RID-List1. Get the Value-List for “President’s Day” using the B+-tree, Get the Value-List for “President’s Day” using the B+-tree,

obtain RID-List2.obtain RID-List2. Compute set-intersect of RID-List1 and RID-List2Compute set-intersect of RID-List1 and RID-List2 Count the number of RIDs in the intersection set.Count the number of RIDs in the intersection set.

With bit mapsWith bit maps Get the Value-List for “Big Mac” using the B+-tree, obtain Get the Value-List for “Big Mac” using the B+-tree, obtain

bit-map1.bit-map1. Get the Value-List for “President’s Day” using the B+-tree, Get the Value-List for “President’s Day” using the B+-tree,

obtain bit-map2.obtain bit-map2. Recall Existence Bitmap (EBM) identify rows that exist.Recall Existence Bitmap (EBM) identify rows that exist. Let RES = logical AND of bit-map1, bit-map2, and EBM.Let RES = logical AND of bit-map1, bit-map2, and EBM. Count the number of bits set to one to identify how many Count the number of bits set to one to identify how many

Big Macs were sold on “President’s Day”.Big Macs were sold on “President’s Day”.

Page 14: Review for Midterm 2

Variant IndexesVariant Indexes Midterm 2 ignores:Midterm 2 ignores:

MEDIAN, N-TILE, Column-Product as aggregates.MEDIAN, N-TILE, Column-Product as aggregates. Section 5.Section 5.

Page 15: Review for Midterm 2

Access Path SelectionAccess Path Selection Formulates a cost prediction for each access Formulates a cost prediction for each access

plan, using the following cost formula:plan, using the following cost formula:COST = Page fetches + W * (RSI Calls)COST = Page fetches + W * (RSI Calls)

W is an adjustable weighting factor between W is an adjustable weighting factor between I/O and CPU.I/O and CPU.

RSI calls is an approximation for CPU RSI calls is an approximation for CPU utilization.utilization.

Assumptions:Assumptions: WHERE tree is considered to be in conjunctive WHERE tree is considered to be in conjunctive

normal form,normal form, Every disjunct is called a boolean factor.Every disjunct is called a boolean factor.

Page 16: Review for Midterm 2

How?How? Enumerating the different execution plans,Enumerating the different execution plans, Estimate the cost of performing each plan,Estimate the cost of performing each plan, Pick the cheapest plan.Pick the cheapest plan.

Definition of cost is as follows:Definition of cost is as follows:COST = Page fetches + W * (RSI Calls)COST = Page fetches + W * (RSI Calls)

Page 17: Review for Midterm 2

Clustered BClustered B++-Tree-Tree A B+-tree on the gpa attributeA B+-tree on the gpa attribute

Bob, 21, 3.7, CSMary, 24, 3, ECE

Tom, 20, 3.2, EE

Kathy, 18, 3.8, LS

Kane, 19, 3.8, MELam, 22, 2.8, MEChang, 18, 2.5, CS Vera, 17, 3.9, EE

Louis, 32, 4, LS

Martha, 29, 3.8, CS

James, 24, 3.1, ME

Pat, 19, 2.8, EE

Chris, 22, 3.9, CSChad, 28, 2.3, LS

Leila, 20, 3.5, LS Shideh, 16, 4, CS

(3.7, (3, 1))

(3.8, (3,2))

(3.8, (3,3))

(3.9, (4,2))

(4, (4,3))

(3.8, (3,4))

(3.9, (4,1))

(4, (4,4))

(2.3, (1, 1))

(2.5, (1,2))

(2.8, (1,3))

(3.1, (2,2))

(3.2, (2,3)

(2.8, (1,4))

(3, (2,1))

(3.5, (2,4))

3.6

Page 18: Review for Midterm 2

Non-Clustered BNon-Clustered B++-Tree-Tree A random I/O for every qualifying recordA random I/O for every qualifying record

Bob, 21, 3.7, CS

Mary, 24, 3, ECE

Tom, 20, 3.2, EE

Kathy, 18, 3.8, LS

Kane, 19, 3.8, ME

Lam, 22, 2.8, ME

Chang, 18, 2.5, CS

Vera, 17, 3.9, EE

Louis, 32, 4, LS

Martha, 29, 3.8, CS

James, 24, 3.1, ME

Pat, 19, 2.8, EE

Chris, 22, 3.9, CS

Chad, 28, 2.3, LS

Leila, 20, 3.5, LS

Shideh, 16, 4, CS

(3.7, (1, 1))

(3.8, (3,2))

(3.8, (2,1))

(3.9, (2,4))

(4, (3,1))

(3.8, (1,4))

(3.9, (4,1))

(4, (4,4))

(2.3, (4, 2))

(2.5, (2,3))

(2.8, (2,2))

(3.1, (3,3))

(3.2, (1,3)

(2.8, (3,4))

(3, (1,2))

(3.5, (4,3))

3.6

Page 19: Review for Midterm 2

QuestionsQuestions How are relations and segments related? If How are relations and segments related? If

slide #4, you state segments may contain slide #4, you state segments may contain more than one relation, and then the next more than one relation, and then the next bullet says “at most one relation per bullet says “at most one relation per segment.”. What is going on?segment.”. What is going on?

Page 20: Review for Midterm 2

QuestionsQuestions How are relations and segments related? If How are relations and segments related? If

slide #4, you state segments may contain slide #4, you state segments may contain more than one relation, and then the next more than one relation, and then the next bullet says “at most one relation per bullet says “at most one relation per segment.”. What is going on?segment.”. What is going on?

Best clarified with an example:Best clarified with an example: Segment 1 may contain the Emp, Dept, and Segment 1 may contain the Emp, Dept, and

Revenues tables/relations.Revenues tables/relations. The Emp relation can be assigned to Segment 1 The Emp relation can be assigned to Segment 1

only. It may NOT be assigned to both Segments only. It may NOT be assigned to both Segments 1 and 2.1 and 2.

Page 21: Review for Midterm 2

QuestionsQuestions The cost of retrieving a range of records The cost of retrieving a range of records

from a clustered B+-tree: Should not this be from a clustered B+-tree: Should not this be (depth_of_B+-tree + F(pred) * TCARD) or (depth_of_B+-tree + F(pred) * TCARD) or something related to the depth of the tree something related to the depth of the tree rather than NINDX, since you only have to rather than NINDX, since you only have to navigate through to the leaf nodes once and navigate through to the leaf nodes once and do a record scan once you’ve reached the do a record scan once you’ve reached the correct leaf node?correct leaf node?

•NINDX(I), the number of pages in index I.NINDX(I), the number of pages in index I.•TCARD(T), the number of pages in the TCARD(T), the number of pages in the segment that hold tuples of relation T.segment that hold tuples of relation T.

Page 22: Review for Midterm 2

QuestionQuestion Should not the cost for a non-clustered B+-Should not the cost for a non-clustered B+-

tree also involve the depth of the tree rather tree also involve the depth of the tree rather than NINDX?than NINDX?

•NINDX(I), the number of pages in index I.NINDX(I), the number of pages in index I.•TCARD(T), the number of pages in the TCARD(T), the number of pages in the segment that hold tuples of relation T.segment that hold tuples of relation T.

Page 23: Review for Midterm 2

QuestionQuestion Should not the cost for a non-clustered B+-Should not the cost for a non-clustered B+-

tree also involve the depth of the tree rather tree also involve the depth of the tree rather than NINDX?than NINDX? It should include the depth of the tree.It should include the depth of the tree. NINDX must be included because the leaf pages NINDX must be included because the leaf pages

of the B+-tree must be visited for the qualifying of the B+-tree must be visited for the qualifying records.records.

•NINDX(I), the number of pages in index I.NINDX(I), the number of pages in index I.•TCARD(T), the number of pages in the TCARD(T), the number of pages in the segment that hold tuples of relation T.segment that hold tuples of relation T.

Page 24: Review for Midterm 2

QuestionsQuestions Why is the cost of a merge-scan NINDX(R) + Why is the cost of a merge-scan NINDX(R) +

NINDX(S) rather than the sum of segment NINDX(S) rather than the sum of segment scans of R and S, since you still have to visit scans of R and S, since you still have to visit every page that contains tuples of R and S?every page that contains tuples of R and S?

•NINDX(I), the number of pages in index I.NINDX(I), the number of pages in index I.•TCARD(T), the number of pages in the TCARD(T), the number of pages in the segment that hold tuples of relation T.segment that hold tuples of relation T.

Page 25: Review for Midterm 2

QuestionsQuestions Why is the cost of a merge-scan NINDX(R) + Why is the cost of a merge-scan NINDX(R) +

NINDX(S) rather than the sum of segment NINDX(S) rather than the sum of segment scans of R and S, since you still have to visit scans of R and S, since you still have to visit every page that contains tuples of R and S?every page that contains tuples of R and S? The merge-scan employs the sorted order of the The merge-scan employs the sorted order of the

entries in the leaf pages of the B+-tree index entries in the leaf pages of the B+-tree index structures.structures.

•NINDX(I), the number of pages in index I.NINDX(I), the number of pages in index I.•TCARD(T), the number of pages in the TCARD(T), the number of pages in the segment that hold tuples of relation T.segment that hold tuples of relation T.

Page 26: Review for Midterm 2

Overview of Query OptimizationOverview of Query Optimization Extends discussion to:Extends discussion to:

Correlation queries, use of outer-join to “flatten” Correlation queries, use of outer-join to “flatten” nested queries.nested queries.

Raises more questions than providing answers.Raises more questions than providing answers. A good starting point for:A good starting point for:

A practitioner who wants to build an optimizer for a A practitioner who wants to build an optimizer for a relational DBMS.relational DBMS.

A Ph.D. student interested in writing a dissertation in the A Ph.D. student interested in writing a dissertation in the area of query optimization techniques.area of query optimization techniques.

Page 27: Review for Midterm 2

Mining Association RulesMining Association Rules Objective: Discover association Rule over Objective: Discover association Rule over

basket data.basket data. Motivation: valuable for cross-marketing Motivation: valuable for cross-marketing

and attached mailing applications.and attached mailing applications. Example: 98% of customers who purchase tires Example: 98% of customers who purchase tires

and auto accessories also get automotive and auto accessories also get automotive services done.services done.

Key contributions:Key contributions: Fast algorithms: Fast algorithms:

Apriori, AprioriTid, and AprioriHybridApriori, AprioriTid, and AprioriHybrid Pay attention to terminology, definitions, and Pay attention to terminology, definitions, and

the general framework.the general framework.

Page 28: Review for Midterm 2

Database SystemsDatabase Systems This course has introduced you to:This course has introduced you to:

A storage manager and its use, BDB.A storage manager and its use, BDB. Spatial indexing, R-Trees.Spatial indexing, R-Trees. Parallel DBMS.Parallel DBMS. Alternative technologies for applications that do Alternative technologies for applications that do

not require ACID transactions.not require ACID transactions. Google FS, MapReduce, etc.Google FS, MapReduce, etc.

Optimization techniques for relational DBMSs.Optimization techniques for relational DBMSs. Knowledge Discovery.Knowledge Discovery. Efficient query processing techniques.Efficient query processing techniques. RAID and Use of flash memory in enterprises:RAID and Use of flash memory in enterprises:

Steve Kleiman’s lecture.Steve Kleiman’s lecture.

Page 29: Review for Midterm 2

Database SystemsDatabase Systems Many important topics remain:Many important topics remain:

Data mining, Data cubes, Data visualization Data mining, Data cubes, Data visualization techniques.techniques.

Papers are from 1990s: Neumerous follow-Papers are from 1990s: Neumerous follow-on papers!on papers!