final review - computer science | drexel ccijulia/cs500/documents/lectures/lecture... · final...
Julia Stoyanovich
Final exam logistics
• When: August 29th @ 9am through September 1 @ 9pm
• The same format as the midterm: electronic, open book / open notes
• 3 hours in length
• The exam is cumulative: it will include material from the first half of the term, but will likely focus more on the material from the second half of the term (homeworks 3 and 4)
Topics not on the final
• Database application development (JDBC and such)
• MapReduce and Spark
• Data, Responsibly
Data mining (HW4)
Imagine there are 100 baskets, numbered 1, 2, ..., 100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}.
Describe all the association rules that have 100% confidence. Which of the following rules have 100% confidence?
$\mathrm{conf}(\{4,6\} \to 12) = \frac{\mathrm{supp}(\{4,6,12\})}{\mathrm{supp}(\{4,6\})}$
A rule has 100% confidence if, for any basket b, whenever all items on the left divide b, the item on the right also divides b. Equivalently, when the left-hand side has nonzero support, the item on the right must divide the least common multiple (LCM) of the items on the left.
• {4,6} → 12: since 4 and 6 are in b, b = 2 * 2 * 3 * c, where c is some natural number. In this case, is 12 guaranteed to divide b? Yes.
• {3,5} → 1
• {8,10} → 20: b = 2 * 2 * 2 * 5 * c
• {3,4,5} → 30: b = 2 * 2 * 3 * 5 * c
• {1,2} → 4
• {2,3,5} → 45
• {3,6} → 9
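The confidence of each rule can also be checked by brute force over the 100 baskets. The sketch below is only an illustration; the helper names `basket`, `support` and `confidence` are made up for this example.

```python
def basket(j):
    """Items in basket j: every i that divides j evenly."""
    return {i for i in range(1, j + 1) if j % i == 0}

BASKETS = {j: basket(j) for j in range(1, 101)}

def support(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for items in BASKETS.values() if set(itemset) <= items)

def confidence(lhs, rhs):
    """conf(lhs -> rhs) = supp(lhs + {rhs}) / supp(lhs)."""
    return support(set(lhs) | {rhs}) / support(lhs)

rules = [({4, 6}, 12), ({3, 5}, 1), ({8, 10}, 20), ({3, 4, 5}, 30),
         ({1, 2}, 4), ({2, 3, 5}, 45), ({3, 6}, 9)]
for lhs, rhs in rules:
    print(sorted(lhs), "->", rhs, "confidence:", confidence(lhs, rhs))
```

A rule whose printed confidence is 1.0 holds in every basket that contains its left-hand side.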
Data mining (HW4)
Same baskets and items as before: item i is in basket j if and only if i divides j evenly.
An itemset S is closed if no proper superset of S has the same support.
Describe all closed itemsets. Which of the following itemsets are closed?
An itemset S is closed if the least common multiple (LCM) of its numbers increases when any other number between 1 and 100 is added to the itemset. That is, S consists of some integer j (the LCM) and all divisors of j.
• {1,5,7,35}: What is the support set of this itemset, i.e., in which baskets can all of its items be found? Those where b = 5 * 7 * c, i.e., baskets 35 and 70.
• {1,3,4,12}: In which baskets can all of its items be found? Those where b = 2 * 2 * 3 * c, i.e., baskets 12, 24, 36, 48, 60, 72, 84, 96. But all of these are also divisible by 2, so adding item 2 does not change the support and the itemset is not closed.
• {1,2,3,6}
• {1,5,25}
• {1,2,17,34}
• {1,2,3,4,8}
• {1,2,3,5}
• {1,3,5,30}
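Closedness can be tested directly from the definition. In this sketch (function names are hypothetical), it is enough to try adding one item at a time: if some larger superset keeps the same support, so does each single-item extension inside it.

```python
def support_set(itemset):
    """Baskets 1..100 containing every item, i.e. numbers divisible by all items."""
    return {j for j in range(1, 101) if all(j % i == 0 for i in itemset)}

def is_closed(itemset):
    """Closed: adding any missing item strictly shrinks the support set."""
    base = support_set(itemset)
    for extra in range(1, 101):
        if extra not in itemset and support_set(itemset | {extra}) == base:
            return False
    return True

for s in [{1, 5, 7, 35}, {1, 3, 4, 12}, {1, 2, 3, 6}, {1, 5, 25},
          {1, 2, 17, 34}, {1, 2, 3, 4, 8}, {1, 2, 3, 5}, {1, 3, 5, 30}]:
    print(sorted(s), "closed" if is_closed(s) else "not closed")
```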
Data mining (HW4)
Imagine there are 100 baskets, numbered 1, 2, ..., 100, and 100 items, similarly numbered. Item i is in basket j if and only if j divides i evenly. For example, basket 24 is the set of items {24, 48, 72, 96}.
A frequent itemset S is maximal if no proper superset of S is frequent.
Describe all maximal itemsets that have support exactly 4.
Under the new definition of the item-to-basket mapping, an itemset is in the basket that is the greatest common divisor (GCD) of its items, and in the baskets that correspond to all divisors of the GCD. All itemsets are in basket b = 1.
To have support 4, the GCD must have exactly 4 distinct divisors (3 divisors in addition to the number 1). For this, the GCD must be a product of 2 distinct primes, a * b, or a cube of a prime, a * a * a.
• {27,54,81}: In which baskets can all of its items be found? Those where b divides 27, 54 and 81. That's baskets 27, 9, 3, 1.
• {15,30,45,60,75,90}
• {25,50,75,100}
• {6,36}
• {22,44,66,88}
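The same kind of brute-force check works for this variant. The sketch below assumes that "frequent" means support of at least 4, so a maximal itemset with support exactly 4 is one whose every single-item extension has support below 4; the function names are illustrative only.

```python
def baskets_containing(itemset):
    """Basket j holds the multiples of j, so it contains the itemset
    exactly when j divides every item."""
    return [j for j in range(1, 101) if all(i % j == 0 for i in itemset)]

def is_maximal_with_support_4(itemset):
    """Support is exactly 4 and every one-item extension drops below 4
    (assumed support threshold: 4)."""
    if len(baskets_containing(itemset)) != 4:
        return False
    return all(len(baskets_containing(itemset | {extra})) < 4
               for extra in range(1, 101) if extra not in itemset)

for s in [{27, 54, 81}, {15, 30, 45, 60, 75, 90}, {25, 50, 75, 100},
          {6, 36}, {22, 44, 66, 88}]:
    print(sorted(s), baskets_containing(s), is_maximal_with_support_4(s))
```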
User-defined types
Videos (vid: int, vfile: video_file), N = 100,000
• Violence: returns true if the given video contains violence, and false otherwise. On average it takes c1 = 0.4 sec to evaluate this method on a video, and we estimate that this method returns true for r1 = 20% of the videos in the relation.
• StrongLanguage: returns true if the given video contains strong language, and false otherwise. On average it takes c2 = 0.3 sec to evaluate this method on a video, and we estimate that this method returns true for r2 = 10% of the videos in the relation.
• Nudity: returns true if the given video contains nudity, and false otherwise. On average it takes c3 = 0.2 sec to evaluate this method on a video, and we estimate that this method returns true for r3 = 20% of the videos in the relation.
SELECT * FROM Videos WHERE Violence(vfile) AND StrongLanguage(vfile) AND NOT Nudity(vfile)
$\mathrm{rank} = \frac{\mathrm{reductionFactor} - 1}{\mathrm{cost}}$
User-defined types
• Violence: c1 = 0.4 sec, r1 = 20%
• StrongLanguage: c2 = 0.3 sec, r2 = 10%
• Nudity: c3 = 0.2 sec, r3 = 20%
• NOT Nudity: c3 = 0.2 sec, r3' = 80%
• rank(Violence) = (r1 - 1) / c1 = -0.8 / 0.4 = -2
• rank(StrongLanguage) = (r2 - 1) / c2 = -0.9 / 0.3 = -3
• rank(NOT Nudity) = (r3' - 1) / c3 = -0.2 / 0.2 = -1
In the most efficient query evaluation plan, conditions are evaluated in increasing order of their rank. The best query plan is σ_NOT Nudity(σ_Violence(σ_StrongLanguage(Videos))).
User-defined types
Let us see how this query is evaluated step by step.
1. N = 100,000 tuples are processed by StrongLanguage, which costs 100,000 * 0.3 sec; 10% of the tuples are passed along.
2. N' = 10,000 tuples are processed by Violence, which costs 10,000 * 0.4 sec; 20% of the tuples are passed along.
3. N'' = 2,000 tuples are processed by Nudity, which costs 2,000 * 0.2 sec. Those failing the condition are returned.
Total cost = 100,000 * 0.3 + 10,000 * 0.4 + 2,000 * 0.2 = 34,400 sec
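The rank ordering and the pipeline cost can be reproduced with a few lines of code. This is a minimal sketch using the costs and reduction factors given for the three predicates; `rank` and `pipeline_cost` are names chosen for the example, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction

N = 100_000
# (name, cost per tuple in seconds, fraction of tuples that pass)
predicates = [
    ("Violence", Fraction(4, 10), Fraction(2, 10)),
    ("StrongLanguage", Fraction(3, 10), Fraction(1, 10)),
    ("NOT Nudity", Fraction(2, 10), Fraction(8, 10)),
]

def rank(pred):
    """rank = (reductionFactor - 1) / cost"""
    _, cost, reduction = pred
    return (reduction - 1) / cost

def pipeline_cost(n, plan):
    """Each predicate is applied only to the tuples that survived the earlier ones."""
    total, tuples = Fraction(0), Fraction(n)
    for _, cost, reduction in plan:
        total += tuples * cost
        tuples *= reduction
    return total

plan = sorted(predicates, key=rank)          # increasing order of rank
print([name for name, _, _ in plan])
print(float(pipeline_cost(N, plan)), "sec")
```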
User-defined types
In addition to the information on costs and reduction factors given in (a), you are now told that an estimated 90% of the videos for which StrongLanguage is true also have Violence evaluate to true. Should you modify your plan from (a) in light of this information?
In light of this new information, r1 = 90% in any plan that evaluates Violence after StrongLanguage, as in σ_NOT Nudity(σ_Violence(σ_StrongLanguage(Videos))). The cost of the best plan in (a) is modified as follows:
1. N = 100,000 tuples are processed by StrongLanguage, which costs 100,000 * 0.3 sec. 10% of the tuples are passed along to the next operator in the pipeline.
2. N' = 10,000 tuples are processed by Violence, which costs 10,000 * 0.4 sec. 90% of the tuples are passed along to the next operator in the pipeline.
3. N'' = 9,000 tuples are processed by Nudity, which costs 9,000 * 0.2 sec. Those failing the condition are returned.
Total cost = 100,000 * 0.3 + 10,000 * 0.4 + 9,000 * 0.2 = 35,800 sec
User-defined types
Going back to reasoning about ranks, we now have rank(Violence) = (0.9 - 1) / c1 = -0.1 / 0.4 = -0.25. This is the highest rank, so this operator should be evaluated last. A different plan is now the most efficient: σ_Violence(σ_NOT Nudity(σ_StrongLanguage(Videos))).
Total cost = 100,000 * 0.3 + 10,000 * 0.2 + 8,000 * 0.4 = 35,200 sec, compared to 35,800 sec for the plan from (a).
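The two plans can be compared numerically. The sketch below assumes that NOT Nudity keeps its 80% pass rate wherever it is placed (the slides give no correlation for it) and that Violence passes 90% of the tuples once StrongLanguage has been applied; `plan_cost` is a hypothetical helper.

```python
from fractions import Fraction as F

def plan_cost(n, steps):
    """steps: list of (cost per tuple, fraction passed on), in evaluation order."""
    total, tuples = F(0), F(n)
    for cost, passed in steps:
        total += tuples * cost
        tuples *= passed
    return total

strong_language = (F(3, 10), F(1, 10))
violence_after_sl = (F(4, 10), F(9, 10))   # 90% pass once StrongLanguage holds
not_nudity = (F(2, 10), F(8, 10))          # assumed independent of the others

plan_a = plan_cost(100_000, [strong_language, violence_after_sl, not_nudity])
plan_b = plan_cost(100_000, [strong_language, not_nudity, violence_after_sl])
print("Violence before NOT Nudity:", plan_a, "sec")
print("NOT Nudity before Violence:", plan_b, "sec")
```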
Basic file organization
• Heap files: good for full file scans or frequent updates
  - unordered files
  - insert at the end of file
  - assumes equality selection on key, exactly one match (why?)
• Sorted files: good for range queries on sort field(s)
  - need external sort to keep sorted
  - compacted after deletion
  - assumes selection on sort field(s)
• Hashed files: good for selection on equality
  - collection of buckets with primary & overflow pages
  - hashing function h(r) = bucket for record r
  - each bucket is a heap file
Cost of operations
p(T) - number of data pages in table T
r(T) - number of records in table T
D - time to read or write a disk page

Operation         Heap File     Sorted File                             Hashed File *
Scan all recs     p(T) D        p(T) D                                  1.25 p(T) D
Equality Search   p(T) D / 2    D log2 p(T)                             D
Range Search      p(T) D        D log2 p(T) + (# pages with matches)    1.25 p(T) D
Insert            2D            Search + p(T) D                         2D
Delete            Search + D    Search + p(T) D                         2D

* assuming no overflow bucket, 80% page occupancy
Clustered vs. unclustered index
[Figure: two index structures side by side, CLUSTERED and UNCLUSTERED. In each, data entries in the index file point to data records in the data file.]
B+ tree search
• Start at root, use key comparisons to navigate to a leaf
• Search for 5*, 15*, all data entries >= 24*
Root: 13 | 17 | 24 | 30
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
How many disk I/Os to answer a point query?
How many disk I/Os to answer a range query?
Access paths
• An access path is a method of retrieving tuples: file scan, or index that matches a selection in the query
• An index matches a conjunction of terms if it can be used to retrieve all data values that match this conjunction of terms.
• A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.
  - e.g., tree index <a,b,c> matches the selection a=5 AND b=3; it also matches a=5 AND b>4; it does not match b=3.
• A hash index matches a conjunction of terms that has a term attribute=value for every attribute in the search key of the index.
  - e.g., hash index on <a,b,c> matches a=5 AND b=3 AND c=5; it does not match b=3; or a=5 AND b=5; or a>5 AND b=3 AND c=5
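The two matching rules can be written as small predicates. This is a sketch under one stated assumption: "attributes in a prefix of the search key" is read as "the attributes used are exactly those of some non-empty prefix". A conjunction is represented as a dict from attribute to comparison operator; both function names are hypothetical.

```python
def tree_index_matches(search_key, terms):
    """search_key: list of attributes, e.g. ["a", "b", "c"].
    terms: dict attribute -> operator ("=", "<", ">", ...) for a conjunction.
    Matches if the attributes used form a non-empty prefix of the search key."""
    used = set(terms)
    if not used:
        return False
    return set(search_key[:len(used)]) == used

def hash_index_matches(search_key, terms):
    """Matches only if every search-key attribute has an equality term."""
    return all(terms.get(attr) == "=" for attr in search_key)

key = ["a", "b", "c"]
print(tree_index_matches(key, {"a": "=", "b": "="}))            # a=5 AND b=3
print(tree_index_matches(key, {"a": "=", "b": ">"}))            # a=5 AND b>4
print(tree_index_matches(key, {"b": "="}))                      # b=3
print(hash_index_matches(key, {"a": "=", "b": "=", "c": "="}))  # a=5 AND b=3 AND c=5
print(hash_index_matches(key, {"a": ">", "b": "=", "c": "="}))  # a>5 AND b=3 AND c=5
```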
One approach to selection
• Find the most selective access path, retrieve tuples using it, and apply the remaining terms that do not match the index.
• The most selective access path: an index or file scan (!) that we estimate will require the fewest page I/Os.
• Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples / pages fetched.
• Example: day < 1/1/2011 AND bid=5 AND sid=3
  - option 1: use a B+ tree index on day, then check bid=5 and sid=3 for each retrieved tuple
  - option 2: use a hash index on <bid, sid>, then check day < 1/1/2011 for each retrieved tuple
Once again, we are interested in quantifying the I/O-based cost.
Index-only evaluation
• Many DBMSs implement index-only query plans: the query is answered using only the information in the search key of the index, without going to the data records on disk
• Important because typically only 1 index is clustered, and so using other indexes will potentially trigger several random I/Os
• Works well only with unclustered indexes
Using an index for selection
Sailors (sid: int, sname: string, rating: int, age: real)
Reserves (sid: int, bid: int, day: date, rname: string)
Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages
SELECT * FROM Reserves R WHERE R.rname < 'C%'
• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large)
• Example: assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10,000 tuples).
  - with a clustered index, cost is little more than 100 I/Os
  - with an unclustered index, cost is up to 10,000 I/Os
Access paths: another example
Employees (ssn, name, salary, age, did)
100,000 employees, 10 employee records per disk page.
Stored on disk in a sorted file (alternative 1), with did as the sort key.
Salaries from 0 to $100K; ages from 20 to 80; 50 employees per department. Uniform, uncorrelated values.
For each query: (1) List indexes that match the query. (2) What index would you build? (3) What is the cost of using that index to answer this query?
Q1. Compute the number of employees whose salary is $35K and who work in department 177.
Q2. List name, age, salary of employee with eid=12357.
Q3. Compute the number of employees who are between 30 and 35 years old.
2-way merge-sort
[Figure: example with N = 7 pages, 2 values per page]
Input file:             3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):   2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):   2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):   1,2 2,3 3,4 4,5 6,6 7,8 9
2-way merge-sort
Suppose the input occupies N = 2^k disk pages.
• What is the cost of this algorithm?
• In each pass, we read each page, process it, and write it out: 2 disk I/Os per page, per pass
• There are log2 N + 1 = k + 1 passes
• The overall cost is 2N (log2 N + 1) I/Os
[Figure: main memory buffers INPUT 1, INPUT 2 and OUTPUT, sitting between the disk that is read and the disk that is written]
Generalization: external merge-sort
B: block size; M: main memory size; N: input size (blocks); R: size of 1 record
[Figure: N records, divided into NR / M sorted runs of M / R records each, are merged into the final sorted result]
Cost of external merge-sort
B: block size; M: main memory size; N: input size (pages); R: size of 1 record
Given B = 4KB, M = 64MB, R = 0.1KB:
• Pass 1: runs of 40 * 16 * 1024 = about 640,000 records
• Pass 2: runs increase by a factor of M/B - 1 ≈ 16,000; sorted runs of 10,240,000,000 records
• Pass 3: runs increase by a factor of M/B - 1 ≈ 16,000; sorted runs of 10^14 records
With a modest memory size, we can sort everything in 2-3 passes!
$\mathrm{Cost} = 2N\left(\log_{M-1}\frac{N}{M} + 1\right)$, with N and M both measured in pages
External merge-sort example
A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).
What is the number of passes, and the cost, of 2-way external merge-sort?
In this dataset, there are ceil(10,000 / 64) = 157 pages that must be sorted. In two-way external merge-sort, we use 1 memory block in pass 0 (each 64-record block is sorted), and 3 memory blocks in subsequent passes (pairs of adjacent sorted runs are merged).
To sort 157 pages, we will need 1 + ceil(log2 157) = 9 passes.
Each page is read and written once on each pass (2 I/Os per page per pass). Thus, the total cost of two-way external merge-sort on this dataset is 2 * 157 * 9 = 2,826 I/Os.
External merge-sort example
A file with 10,000 records, each record is 1KB. Size of a page/block is 64KB (i.e., 64 records / block).
With memory size of 320KB, how many passes for generalized external merge-sort? What is the cost?
Memory fits 320 / 64 = 5 pages. All are used for sorting in pass 0. All but 1 are used for merging in subsequent passes; the remaining page is used for output.
In pass 0 of generalized external merge-sort, we read in and sort 320KB (5 blocks' worth) at a time, creating ceil(157/5) = 32 sorted runs of 5 blocks each (the last run is shorter).
Then in subsequent passes we merge 5 - 1 = 4 neighboring runs at a time. We need ceil(log4 32) = 3 merge passes to complete sorting (32 → 8 → 2 → 1 runs). That's a total of 4 passes, with 2 I/Os per page per pass, for a total of 2 * 157 * 4 = 1,256 I/Os, a significant reduction compared to (a).
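Pass counts are easy to get wrong with logarithms, so it can help to count them by simulating how the number of runs shrinks. The function below is a sketch, not part of the course material; `external_sort_cost` and its parameters are names chosen for this example. With 5 buffer pages, the 32 initial runs shrink to 8, then 2, then 1.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return (a + b - 1) // b

def external_sort_cost(num_pages, buffer_pages, two_way=False):
    """Return (passes, total I/Os).

    two_way=True models the 2-way variant: pass 0 sorts one page at a time
    and every later pass merges 2 runs. Otherwise pass 0 fills all buffer
    pages and later passes merge buffer_pages - 1 runs at a time.
    """
    if two_way:
        runs, fan_in = num_pages, 2
    else:
        runs, fan_in = ceil_div(num_pages, buffer_pages), buffer_pages - 1
    passes = 1                       # pass 0 creates the initial sorted runs
    while runs > 1:
        runs = ceil_div(runs, fan_in)
        passes += 1
    return passes, 2 * num_pages * passes

pages = ceil_div(10_000, 64)         # 64 records of 1KB fit in a 64KB page
print(pages)
print(external_sort_cost(pages, 3, two_way=True))
print(external_sort_cost(pages, 5))
```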
Datalog
Buys(p,g) :- Likes(p,g)
Buys(p,g) :- Follows(p,f), Likes(f,g), ¬Hates(p,g)

Likes(A, 'Skirts')
Likes(A, 'Stilettos')
Likes(B, 'Shorts')
Likes(B, 'Sneakers')
Hates(A, 'Sneakers')
Follows(A,B)
Follows(B,A)
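Because this program is not recursive, each rule can be evaluated once as a set comprehension. The following sketch encodes the facts as Python sets of tuples (an encoding chosen for illustration) and derives the Buys relation.

```python
likes = {("A", "Skirts"), ("A", "Stilettos"), ("B", "Shorts"), ("B", "Sneakers")}
hates = {("A", "Sneakers")}
follows = {("A", "B"), ("B", "A")}

# Rule 1: Buys(p,g) :- Likes(p,g)
buys = set(likes)

# Rule 2: Buys(p,g) :- Follows(p,f), Likes(f,g), not Hates(p,g)
buys |= {(p, g)
         for (p, f) in follows
         for (f2, g) in likes
         if f == f2 and (p, g) not in hates}

for fact in sorted(buys):
    print("Buys", fact)
```

A follows B, so A would pick up B's Sneakers, but the negated Hates atom filters that fact out.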
Datalog
Path(x,y) :- Edge(x,y)
Path(x,y) :- Edge(x,z), Path(z,y)

Path(x,y) :- Edge(x,y)
Path(x,y) :- Path(x,z), Path(z,y)
[Figure: a graph on nodes 0 through 11]
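Both Path programs can be evaluated by naive fixpoint iteration: keep applying the recursive rule until no new facts appear. The sketch below uses a small hypothetical edge set, since the edges of the graph in the figure are not available in the text; the two function names are also made up for this example.

```python
def path_linear(edges):
    """Path(x,y) :- Edge(x,y).  Path(x,y) :- Edge(x,z), Path(z,y)."""
    path = set(edges)
    while True:
        new = {(x, y) for (x, z) in edges for (z2, y) in path if z == z2} - path
        if not new:
            return path
        path |= new

def path_nonlinear(edges):
    """Path(x,y) :- Edge(x,y).  Path(x,y) :- Path(x,z), Path(z,y)."""
    path = set(edges)
    while True:
        new = {(x, y) for (x, z) in path for (z2, y) in path if z == z2} - path
        if not new:
            return path
        path |= new

# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4 (not the graph from the slide).
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
print(sorted(path_linear(edges)))
print(path_linear(edges) == path_nonlinear(edges))
```

Both programs compute the same transitive closure; the non-linear version typically needs fewer iterations because each round can double the path length it reaches.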
Normalization: important to know
• Closures, keys
  - Computing the closure of a set of attributes
  - Identifying candidate keys of a relation
  - Identifying whether an FD follows from a set of FDs
  - Minimal basis of a set of FDs
• Normal forms and decompositions
  - Determining whether a relation is in BCNF, in 3NF
  - Decomposition into BCNF
  - Determining whether a decomposition into BCNF is dependency-preserving
  - Decomposition into 3NF (synthesis)
Closure of a set of attributes
Suppose A = {A1, …, An} is a set of attributes and S is a set of FDs.
The closure of A under the FDs in S is the set of attributes B s.t. every relation that satisfies all the FDs in S also satisfies A → B.
We denote the closure of {A1, A2, …, An} by {A1, A2, …, An}+.
Note that {A1, A2, …, An} ⊆ {A1, A2, …, An}+.
Computing the closure of a set of attributes
Algorithm AttributeClosure
Input: a set of attributes {A1, A2, …, An} and a set of FDs S
Output: the closure {A1, A2, …, An}+
1. Split the FDs of S using the splitting rule, so that each FD has one attribute on the right
2. Initialize {A1, A2, …, An}+ ← {A1, A2, …, An}
3. Repeatedly search for some FD B1, B2, …, Bm → C such that {B1, B2, …, Bm} ⊆ {A1, A2, …, An}+ ∧ C ∉ {A1, A2, …, An}+, and add C to {A1, A2, …, An}+
4. Stop when no more attributes can be added to {A1, A2, …, An}+
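The four steps translate almost line by line into code. This is a minimal sketch that assumes single-letter attribute names, so that an FD such as BD → C can be written as the pair ("BD", "C"); the function name and the sample FDs are chosen for illustration.

```python
def attribute_closure(attrs, fds):
    """attrs: iterable of attribute names; fds: list of (lhs, rhs) strings."""
    # Step 1: split right-hand sides so each FD has one attribute on the right.
    split = [(frozenset(lhs), c) for lhs, rhs in fds for c in rhs]
    closure = set(attrs)                 # Step 2: start from the given attributes
    changed = True
    while changed:                       # Steps 3-4: repeat until nothing is added
        changed = False
        for lhs, c in split:
            if lhs <= closure and c not in closure:
                closure.add(c)
                changed = True
    return closure

fds = [("BD", "C"), ("AB", "D"), ("AC", "B"), ("BD", "A")]
print(sorted(attribute_closure("AB", fds)))
print(sorted(attribute_closure("AD", fds)))
```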
Closures and keys
Q: How can we tell if a set of attributes A1A2…An is a candidate key or a superkey of a relation R?
A: If {A1A2…An}+ = all the attributes in R, then A1A2…An is a superkey; it is a candidate key if, in addition, no proper subset of it is a superkey.
Q: How can we compute the candidate keys for R?
A: Find all sets of attributes that functionally determine all other attributes and make sure these sets are minimal.
Example
R(ABCD): BD → C ; AB → D ; AC → B ; BD → A
Find all candidate keys of the given set of FDs.
Example
R(ABCD): ABD → C ; A → B ; AB → C ; B → A
Find all candidate keys of the given set of FDs.
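For a schema this small, candidate keys can be found by exhaustive search: try attribute sets from smallest to largest, and keep those whose closure is the whole schema and that contain no smaller key. The sketch below assumes single-letter attribute names; `closure` and `candidate_keys` are illustrative names.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(key <= attrs for key in keys):
                continue                 # contains a smaller key, so not minimal
            if closure(attrs, fds) == set(schema):
                keys.append(attrs)
    return keys

fds1 = [("BD", "C"), ("AB", "D"), ("AC", "B"), ("BD", "A")]
fds2 = [("ABD", "C"), ("A", "B"), ("AB", "C"), ("B", "A")]
print([sorted(k) for k in candidate_keys("ABCD", fds1)])
print([sorted(k) for k in candidate_keys("ABCD", fds2)])
```

The search is exponential in the number of attributes, which is fine for exam-sized schemas but not meant as a general-purpose method.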
Minimal basis of a set of FDs
• For a given relation R, there may exist several sets of FDs that are equivalent:
  - they give rise to the same closures of all subsets of R's attributes
  - the same sets of FDs follow from them
  - all such equivalent sets of FDs are called bases for S in R
• A minimal basis B is a set of FDs that satisfies 3 conditions
  1. All FDs in B have 1 attribute on the right
  2. If any FD is removed from B, the result is no longer a basis
  3. If for any FD in B we remove 1 attribute on the left, the result is no longer a basis
Example
R(ABCD): C → B ; BC → A ; A → C ; BD → A
Find all candidate keys.
Check whether the following are minimal bases of the set of FDs.
{AC → D, D → B}
{D → A, D → B, D → C}
Computing a projection of a set of FDs
Suppose relation R is given, with its corresponding set of FDs S. If we take a projection of R onto a set of attributes L, what can we say about the FDs of π_L(R)?
Why not simply take S and project each FD?
Algorithm ProjectFDs
Input: Relations R and R1 = π_L(R). A set of FDs S that hold in R.
Output: The set of FDs T that hold in R1.
1. Compute the closure of each subset of attributes of R1 in S. Add to T all non-trivial FDs X → A s.t. A is both in X+ and an attribute of R1.
2. Remove from T all FDs that involve attributes not in L (on either side).
3. Optionally compute the minimal basis of T, remove FDs from T that do not belong to the minimal basis.
Example
38
Compute a projection of the set of FDs when R (ABCD) is projected onto ACD.
R(ABCD) A → B ; B → C ; C → D
π ACD (R)
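Projecting FDs means taking closures of subsets of the target attributes, not dropping FDs that mention a removed attribute. The sketch below covers the closure-based steps only and skips the optional minimal-basis step; it assumes single-letter attribute names, and `project_fds` is a name chosen for this example.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def project_fds(fds, target):
    """Non-trivial FDs X -> A with X and A inside `target`."""
    target = sorted(target)
    projected = []
    for size in range(1, len(target) + 1):
        for combo in combinations(target, size):
            x = set(combo)
            for a in sorted(closure(x, fds)):
                if a in target and a not in x:
                    projected.append(("".join(combo), a))
    return projected

fds = [("A", "B"), ("B", "C"), ("C", "D")]
for lhs, rhs in project_fds(fds, "ACD"):
    print(lhs, "->", rhs)
```

Note that A → C appears in the output even though no original FD has that form: it follows from A → B and B → C, which is why projecting each FD separately would lose it.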
Boyce-Codd Normal Form (BCNF)
Let R be a relation schema, S be the set of FDs given to hold over R.
R is in BCNF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:
1. The FD is trivial: {B1, B2, …, Bm} ⊆ {A1, A2, …, An}
2. A1A2…An is a candidate key of R
3. A1A2…An is a superkey of R
In a BCNF relation, the only sets of attributes that determine values for other attributes are superkeys.
Third Normal Form (3NF)
Let R be a relation schema, S be the set of FDs given to hold over R.
R is in 3NF if, for every FD A1A2…An → B1B2…Bm, one of the following statements is true:
1. The FD is trivial: {B1, B2, …, Bm} ⊆ {A1, A2, …, An}
2. A1A2…An is a candidate key of R
3. A1A2…An is a superkey of R
4. Each Bi is part of some candidate key of R
(Conditions 1-3 are the same as for BCNF.)
In contrast to BCNF, some redundancy is possible with 3NF. This normal form is a compromise, needed when no dependency-preserving decomposition into BCNF exists.
Example: are these relations in BCNF? In 3NF?
R(ABCD): A → B ; B → A ; A → D ; D → B
R(ABCD): AB → C ; BCD → A ; D → A ; B → C
R(ABCD): AC → D ; D → A ; D → C ; D → B
Decomposition into BCNF
Let R be a relation schema, S be the set of FDs given to hold over R. We decompose R by considering FDs that violate BCNF.
Algorithm BCNFDecomposition
Input: Relation R, a set of FDs S that hold in R.
Output: A decomposition of R into a set of relations, all of which are in BCNF.
1. Check whether R is in BCNF. If so, return {R}.
2. Otherwise, let A1A2…An → B1B2…Bm be an FD that violates BCNF.
  2.1. Use AttributeClosure to compute {A1, A2, …, An}+
  2.2. Decompose R into R1 = {A1, A2, …, An}+ and R2 = R − {B1, B2, …, Bm}
  2.3. Use ProjectFDs to compute FDs of R1 and R2
  2.4. Recursively decompose R1 and R2 using BCNFDecomposition
Normalization: more examples
R(ABCD): AB → C ; D → B ; AC → D
(a) list candidate keys of R
(b) does the FD AD → B follow from the set of FDs above?
(c) is R in BCNF? is it in 3NF?
Normalization: more examples
R(ABCD): A → B ; B → D ; AD → C ; BC → A
Decompose R into BCNF. Show keys, projected FDs.
2 candidate keys: A and BC; thus BCNF is violated by B → D
R(ABCD): Keys: A, BC. FDs: A → B ; B → D ; AD → C ; BC → A
R1(ABC): Keys: A, BC. FDs: A → BC ; BC → A
R2(BD): Key: B. FD: B → D
Is the decomposition dependency preserving? - yes
Normalization: more examples
45
R(ABCD)
Compute the minimal basis of the original set of FDs
A→ B B→ D AD→C BC→ A
2 candidate keys: A and BC; thus BCNF is violated by B→ D
A→ B ; B→ D ; A→C ; BC→ A
Decompose R into 3NF. Clearly mark all candidate keys.
R1(AB), R2(AC), R3(BD), R4(BCA).!
Decompose into BCNF
R(ABCD): A → C ; B → A ; A → D ; AD → C
B is the only key. Thus, A → C ; A → D ; AD → C all violate BCNF. Let's decompose on A → C.
We end up with R1(AC) and R2(ABD). R1 is in BCNF, since the only FD that holds there is A → C, the FD on which we decomposed. To see whether R2 is in BCNF we need to project FDs of R onto R2, compute the keys of R2 and see whether any FDs violate BCNF.
To project FDs of R onto R2, we compute closures of all subsets of ABD w.r.t. the FDs of R. This gives: {A}+ = {ACD} = {AD}+, {B}+ = {ABCD}, {D}+ = {D}. There is no need to check any supersets of B, since B is already a candidate key. Now, given these closures, we see that R2 has the following non-trivial FDs: A → D, B → A, B → D. So, R2 is not in BCNF: the FD A → D violates BCNF.
Decomposing R2(ABD) on A → D gives R3(AB) and R4(AD). Both are in BCNF. The final decomposition is as follows:
R1(AC), with FD A → C; R3(AB), with FD B → A; R4(AD), with FD A → D.
Decomposition into 3NF
Let R be a relation schema, S be the set of FDs given to hold over R. We decompose R by considering FDs that violate 3NF.
Algorithm 3NFSynthesisDecomposition
Input: Relation R, a set of FDs S that hold in R.
Output: A decomposition of R into a set of relations, all of which are in 3NF.
1. Check whether R is in 3NF. If so, return {R}.
2. Find a minimal basis for S, say T.
3. For each FD in T of the form A1A2…An → B1B2…Bm, create a relation A1A2…AnB1B2…Bm and add it to the decomposition
4. If none of the relations from Step 3 is a superkey for R, add another relation to the decomposition, whose schema is a key for R
Normalization: more examples
R(ABCD): C → B ; A → B ; CD → A ; BCD → A
Decompose R into 3NF. Show keys.
First, we compute candidate keys for R. Since no FDs have either C or D on the right, both these attributes must be part of a candidate key. In fact, {CD} is the only candidate key of R, since {CD}+ = {ABCD}. R is not in 3NF, since FDs C → B and A → B violate this normal form.
To find a 3NF decomposition, we compute a minimal basis of the set of FDs. To do this, we observe that the last FD, with BCD on the left, can be dropped, since it is redundant with the FD that has CD on the left.
We create a 3NF decomposition with relations R1(CB), R2(AB) and R3(CDA). Since R3 is a superkey for R, we don't need to add any more relations to the decomposition, and we are done.
ER modeling
Draw an ER diagram that encodes the following business rules. Clearly mark all key and participation constraints.
Chefs work at restaurants. A chef is uniquely identified by an SSN, and is also described by a name and a cuisine in which she specializes. A restaurant is uniquely identified by a combination of name and city. Each chef works in at least one restaurant, and each restaurant must have at least one chef working at it. Some chefs own restaurants, and if a chef owns a restaurant, she is its sole owner.
ER to relational
[Figure: ER diagram with entity sets CHEFS (ssn, name, cuisine) and RESTAURANTS (name, city), connected by relationship sets work_at and own]
ER to relational
[Figure: ER diagram with entity sets PRESIDENTS (name) and VICE_PRESIDENTS (name), relationship set running_mate, and an attribute party]
Binary vs. ternary relationship sets
[Figure: two ER diagrams. In one, running_mate relates PRESIDENT (name), VICE_PRESIDENT (name) and Party (name). In the other, running_mate relates PRESIDENT (name) and VICE_PRESIDENT (name), with party as an attribute.]
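And now with constraints
[Figure: the same two ER diagrams, now with constraints marked. One has running_mate between PRESIDENT (name) and VICE_PRESIDENT (name) with party as an attribute; the other has running_mate relating PRESIDENT (name), VICE_PRESIDENT (name) and Party (name).]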
Candidate keys, superkeys
Consider a relation schema and business rules below.
Dancers (name: string, dob: date, stage_name: string, company: string)
• No two dancers have the same combination of name and date of birth (dob).
• No two dancers have the same combination of stage name and company.
• A name, a dob and a stage name have to be specified for each dancer, but not all dancers belong to a company.
What are the candidate keys?
Which of these would be appropriate for a primary key?
Which are not appropriate for a primary key?
What are the superkeys?
Write a valid create table statement.
Relational algebra and SQL
Flights (flno: int, origin: string, destination: string, dist: int, departs: date, arrives: date)
Aircraft (aid: int, aname: string, range: int)
Employees (eid: int, ename: string, salary: int)
Certified (eid: int, aid: int)
(a) List eids of pilots certified to fly Boeing.
(b) List names of pilots certified to fly Boeing.
(c) List names of aircraft that can be used on non-stop flights from Bonn to Madras.
(d) Find names of pilots who can operate planes with a range greater than 3,000 miles but are not certified on any Boeing aircraft.
SQL
(e) List eids of pilots certified to fly exactly 3 aircraft.
(f) List aids of aircraft that can be used on flight AF007, along with an average salary of pilots who are certified to operate these aircraft.
When writing queries
• For relational algebra, do worry about efficiency: avoid Cartesian product whenever possible, push selections
• For SQL, do worry about efficiency and readability:
  - avoid nested queries if your query can be expressed with a join
  - use group by / having as appropriate, not a subquery and a where clause in the outer query
  - use standard notation, like we covered in class, e.g., no need to write "inner join", and do write your queries by hand
• For both SQL and relational algebra: do not join with relations unnecessarily. You should have exactly the right number of tables in the from clause of a SQL query, no more, no less