MIS 335 - Database Systems Schema Refinement and Normal Forms Ahmet Onur Durahim http://www.mis.boun.edu.tr/durahim/


Post on 13-Mar-2018



Learning Objectives

• Anomalies

• Functional Dependencies

• Normal Forms

– 1stNF, 2ndNF, 3rdNF, Boyce-Codd (BCNF)

• Decompositions

– Lossless-Join Decompositions

– Dependency Preserving Decompositions

credit: Yücel Saygın

Schema Refinement

• Redundant (useless) storage of information is the root cause of problems

– Storing the same information redundantly => in more than one place within a database

• Refinement approach based on decompositions

– A relation with redundancy is refined by decomposition => replacing it with smaller relations that contain the same information without redundancy

Anomalies

• Modification (Update) anomalies

– If one copy of the repeated data is updated, an inconsistency is created unless all copies are updated similarly

• Insertion anomalies

– It may not be possible to store certain information unless some other, unrelated, information is stored

• Deletion anomalies

– It may not be possible to delete certain information without losing some other, unrelated, information as well

Anomalies

• Insertion anomalies

– Cannot record filmType without starName

• Deletion anomalies

– If we delete the last starName, we also lose the movie info.

title          year  length  filmType   studioName  starName
Star Wars      1977  124     Action     Fox         Carrie Fisher
Star Wars      1977  124     Action     Fox         Mark Hamill
Star Wars      1977  124     Action     Fox         Harrison Ford
Mighty Ducks   1991  104     Animation  Disney      Emilio Estevez
Wayne’s World  1992  95      Comedy     Paramount   Dana Carvey
Wayne’s World  1992  95      Comedy     Paramount   Mike Meyers

Decomposition

title          year  length  filmType   studioName  starName
Star Wars      1977  124     Action     Fox         Carrie Fisher
Star Wars      1977  124     Action     Fox         Mark Hamill
Star Wars      1977  124     Action     Fox         Harrison Ford
Mighty Ducks   1991  104     Animation  Disney      Emilio Estevez
Wayne’s World  1992  95      Comedy     Paramount   Dana Carvey
Wayne’s World  1992  95      Comedy     Paramount   Mike Meyers

title          year  starName
Star Wars      1977  Carrie Fisher
Star Wars      1977  Mark Hamill
Star Wars      1977  Harrison Ford
Mighty Ducks   1991  Emilio Estevez
Wayne’s World  1992  Dana Carvey
Wayne’s World  1992  Mike Meyers

title          year  length  filmType   studioName
Star Wars      1977  124     Action     Fox
Mighty Ducks   1991  104     Animation  Disney
Wayne’s World  1992  95      Comedy     Paramount
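The movie decomposition can be checked mechanically. The following is a minimal Python sketch, assuming each row is stored as a plain tuple; the helper names `project` and `rejoin` are illustrative and not part of the lecture.

```python
# Movie rows as (title, year, length, filmType, studioName, starName).
movies = {
    ("Star Wars", 1977, 124, "Action", "Fox", "Carrie Fisher"),
    ("Star Wars", 1977, 124, "Action", "Fox", "Mark Hamill"),
    ("Star Wars", 1977, 124, "Action", "Fox", "Harrison Ford"),
    ("Mighty Ducks", 1991, 104, "Animation", "Disney", "Emilio Estevez"),
    ("Wayne's World", 1992, 95, "Comedy", "Paramount", "Dana Carvey"),
    ("Wayne's World", 1992, 95, "Comedy", "Paramount", "Mike Meyers"),
}


def project(rows, positions):
    """Keep only the given column positions; duplicates collapse in the set."""
    return {tuple(row[i] for i in positions) for row in rows}


stars = project(movies, (0, 1, 5))       # title, year, starName
info = project(movies, (0, 1, 2, 3, 4))  # title, year, length, filmType, studioName


def rejoin(info_rows, star_rows):
    """Natural join on (title, year), the attributes shared by both relations."""
    return {
        i + (s[2],)
        for i in info_rows
        for s in star_rows
        if i[:2] == s[:2]
    }


print(len(info))                      # 3: each movie's details are stored once
print(rejoin(info, stars) == movies)  # True: no information was lost
```

The movie details shrink from six rows to three, and joining the two smaller relations on (title, year) gives back exactly the original rows.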

Anomalies

• Modification (Update) anomalies

– To update the address of a student who occurs two or more times in a table, we have to update the S_Address column in all of those rows. Otherwise, the data becomes inconsistent

• Insertion anomalies

– Suppose that for a new admission we have the S_id, name, and address of a student. If the student has not opted for any subjects yet, we have to insert NULL for the subject

• Deletion anomalies

– If the student with S_id 401 has only one subject and temporarily drops it, deleting that row deletes the entire student record along with it

Schema Refinement

• Functional dependencies (FDs) can be used to identify schemas with problems and to suggest refinements

– A functional dependency is a kind of IC (Integrity Constraint) that generalizes the concept of a key

– An instance r of a relational schema R satisfies the FD X → Y, where X and Y are non-empty sets of attributes in R, if:

• t1.X = t2.X implies t1.Y = t2.Y

– t1 and t2 are any two tuples of r

– t1.X: projection of tuple t1 onto the attributes in X

• Decomposition is used for schema refinement

Functional Dependency Example

• Database of beverage drinkers

NAME       ADDR        BEVERAGE LIKED  MANUF     FAV BEVERAGE
John Doe   NY, Soho    CocaCola Light  CocaCola  CocaCola
John Doe   NY, Soho    CocaCola        CocaCola  CocaCola
Elisa Day  DC, Dupont  Pepsi Light     Pepsi     Pepsi Max
Elisa Day  DC, Dupont  Fanta           CocaCola  Pepsi Max

Functional Dependency Example

• title, year → length

• title, year → filmType

• title, year → studioName

• title, year → length, filmType, studioName

title          year  length  filmType   studioName  starName
Star Wars      1977  124     Action     Fox         Carrie Fisher
Star Wars      1977  124     Action     Fox         Mark Hamill
Star Wars      1977  124     Action     Fox         Harrison Ford
Mighty Ducks   1991  104     Animation  Disney      Emilio Estevez
Wayne’s World  1992  95      Comedy     Paramount   Dana Carvey
Wayne’s World  1992  95      Comedy     Paramount   Mike Meyers

Functional Dependencies (FDs)

• A functional dependency X → Y holds over relation R if, for every allowable instance r of R:

– t1 ∈ r, t2 ∈ r, 𝜋X(t1) = 𝜋X(t2) implies 𝜋Y(t1) = 𝜋Y(t2)

– i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes)

X Y Z

1 a p

2 b q

1 a r

2 b p

t1

t2

Functional Dependencies (FDs)

• Does the following relation instance satisfy X → Y?

X Y Z

1 a p

2 b q

1 a r

2 c p
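Checking whether a single instance satisfies an FD is a direct translation of the definition. The sketch below is illustrative only: the function name `satisfies_fd` and the dictionary representation of tuples are assumptions, not something the slides prescribe. It is applied to the instance above and to the earlier instance whose last row is (2, b, p).

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, value) != value:
            return False
    return True


def make_rows(triples):
    """Turn (X, Y, Z) triples into dictionaries keyed by attribute name."""
    return [dict(zip("XYZ", t)) for t in triples]


first = make_rows([(1, "a", "p"), (2, "b", "q"), (1, "a", "r"), (2, "b", "p")])
second = make_rows([(1, "a", "p"), (2, "b", "q"), (1, "a", "r"), (2, "c", "p")])

print(satisfies_fd(first, ["X"], ["Y"]))   # True
print(satisfies_fd(second, ["X"], ["Y"]))  # False: X = 2 appears with both b and c
```

A `True` result only says that this one instance does not violate the FD; it does not establish that the FD holds over the relation schema.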

Functional Dependencies (FDs)

• A functional dependency X → Y holds over relation R if, for every allowable instance r of R:

– t1 ∈ r, t2 ∈ r, 𝜋X(t1) = 𝜋X(t2) implies 𝜋Y(t1) = 𝜋Y(t2)

– i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes)

• An FD is a statement about all allowable relations

– Must be identified based on the semantics of the application

– Given some allowable instance r1 of R, we can check whether it violates some FD f, but we cannot tell whether f holds over R

• K is a candidate key for R means that K → R

– However, K → R by itself does not require K to be minimal

– K being a primary key is a special case of an FD

Functional Dependencies (FDs)

• Does the following relation instance satisfy X → Y?

X Y Z

1 a p

2 b q

1 a r

3 b p

Functional Dependencies (FDs)

• If X is a candidate key, then X → YZ?

X Y Z

1 a p

2 b q

1 a r

3 b p

Functional Dependencies (FDs)

• If YZ → X, can we say that YZ is a candidate key?

X Y Z

1 a p

2 b q

1 a r

3 b p

Constraints on Entity Sets

• Consider the relation obtained from Hourly_Emps:

– Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

• Notation: We will denote this relation schema by listing the attributes: SNLRWH

– This is really the set of attributes {S,N,L,R,W,H}.

– Sometimes, we will refer to all attributes of a relation by using the relation name

• e.g., Hourly_Emps for SNLRWH

Hourly_Emps: S N L R W H

Constraints on Entity Sets

• Some FDs on Hourly_Emps:

– ssn is the key: S → SNLRWH

– rating determines hrly_wages: R → W

• Did you notice anything wrong with the following instance?

Hourly_Emps

S  N  L  R  W    H
         1  100
         2  200
         3  250
         2  300

Constraints on Entity Sets

• Some FDs on Hourly_Emps:

– ssn is the key: S → SNLRWH

– rating determines hrly_wages: R → W

Hourly_Emps

S  N  L  R  W    H
         1  100
         2  200
         3  250
         2  200

• Salary should be the same for a given rating!

Example

• Problems due to R → W:

– Redundant storage: The rating value 8 corresponds to the hourly wage 10 (this association is repeated 3 times)

– Update anomaly: Can we change W in just the 1st tuple of SNLRWH?

– Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating?

– Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5

S N L R W H

123-22-3666 Attishoo 48 8 10 40

231-31-5368 Smiley 22 8 10 30

131-24-3650 Smethurst 35 5 7 30

434-26-3751 Guldu 35 5 7 32

612-67-4134 Madayan 35 8 10 40

Hourly_Emps

S            N          L   R  W   H
123-22-3666  Attishoo   48  8  10  40
231-31-5368  Smiley     22  8  10  30
131-24-3650  Smethurst  35  5  7   30
434-26-3751  Guldu      35  5  7   32
612-67-4134  Madayan    35  8  10  40

Hourly_Emps2

S            N          L   R  H
123-22-3666  Attishoo   48  8  40
231-31-5368  Smiley     22  8  30
131-24-3650  Smethurst  35  5  30
434-26-3751  Guldu      35  5  32
612-67-4134  Madayan    35  8  40

Wages

R  W
8  10
5  7

Reasoning About FDs

• Given some FDs, we can usually infer additional FDs:

– ssn → did, did → lot implies ssn → lot

• An FD f is implied by a given set F of FDs if f holds whenever all FDs in F hold

– F+ = closure of F is the set of all FDs that are implied by F

• Armstrong’s Axioms (X, Y, Z are sets of attributes):

– Reflexivity: If X ⊆ Y, then Y → X (a trivial FD)

– Augmentation: If X → Y, then XZ → YZ for any Z

– Transitivity: If X → Y and Y → Z, then X → Z

• These are sound and complete inference rules for FDs

– Sound: they generate only FDs in F+. Complete: they generate all FDs in F+

Reasoning About FDs

• In the following schema

– SN → S is a trivial FD (by reflexivity)

– since {S,N} is a superset of {S}

S N L R W H

Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

Reasoning About FDs

• In the following schema

– If SN → RW, then SNL → RWL (by augmentation)

S N L R W H

Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)

Reasoning About FDs

• Couple of additional rules (that follow from AA):

– Union: If X → Y and X → Z, then X → YZ

• Proof:

– From X → Y, we have XX → XY (by augmentation)

– Note that XX is X, therefore X → XY

– From X → Z, we have XY → YZ (by augmentation)

– From X → XY and XY → YZ, we have X → YZ (by transitivity)

Reasoning About FDs

• Couple of additional rules (that follow from AA):

– Decomposition: If X → YZ, then X → Y and X → Z

• Try to prove it at home/dorm/ICs
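For readers who want to check their own attempt, here is one possible derivation of the decomposition rule from Armstrong's axioms. The layout and step labels are ours, not the lecture's.

```latex
\begin{align*}
&X \to YZ && \text{(given)} \\
&YZ \to Y && \text{(reflexivity, since } Y \subseteq YZ\text{)} \\
&X \to Y  && \text{(transitivity on the two lines above)} \\
&YZ \to Z && \text{(reflexivity, since } Z \subseteq YZ\text{)} \\
&X \to Z  && \text{(transitivity on } X \to YZ \text{ and } YZ \to Z\text{)}
\end{align*}
```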

Reasoning About FDs

• Example: Contracts(Cid,Sid,prJid,Did,Pid,Qty,Value), where we denote the schema as CSJDPQV and:

– C is the key: C → CSJDPQV

– A project purchases each part using a single contract: JP → C

– A department purchases at most one part from a supplier: SD → P

• JP → C and C → CSJDPQV imply JP → CSJDPQV

• SD → P implies SDJ → JP

• SDJ → JP and JP → CSJDPQV imply SDJ → CSJDPQV

• We cannot conclude that SD → CSDPQV by cancelling J from both sides of SDJ → CSJDPQV!

Example

• Suppose that we are given;

– a relation scheme R = (A,B,C,G,H,I)

– the set of functional dependencies F:

• F = {A → B, A → C, CG → H, CG → I, B → H}

• Is A → H logically implied by F?

• Is AG → I logically implied by F?

Reasoning About FDs

• Computing the closure of a set of FDs (F+) can be expensive

– Size of closure is exponential in the number of attributes

• Example:

– A database with 4 attributes (A,B,C,D)

– F = {A → B, B → C}

– Find the closure of F, denoted by F+ (listed here with a single attribute on each right hand side)

– A → A, A → B, A → C, B → B, B → C, C → C, D → D,

– AB → A, AB → B, AB → C, AC → A, AC → B, AC → C, AD → A, AD → B, AD → C, AD → D, BC → B, BC → C, BD → B, BD → C, BD → D, CD → C, CD → D,

– ABC → A, ABC → B, ABC → C, ABD → A, ABD → B, ABD → C, ABD → D, ACD → A, ACD → B, ACD → C, ACD → D, BCD → B, BCD → C, BCD → D,

– ABCD → A, ABCD → B, ABCD → C, ABCD → D

Reasoning About FDs

• Computing the closure of a set of FDs (F+) can be expensive

– Size of closure is exponential in the number of attributes

• Typically, we just want to check if a given FD, X → Y, is in the closure of a set of FDs F

• An efficient check:

– Compute the attribute closure of X (denoted X+) wrt F:

• Set of all attributes A such that X → A is in F+

• There is a linear time algorithm to compute this:

– Start with X+ = X; for each FD Y → Z in F, if X+ is a superset of Y then add Z to X+; repeat until X+ no longer changes

– X → Y is in F+ exactly when Y is a subset of X+

Reasoning About FDs

• Does F = {A → B, B → C, CD → E} imply A → E?

– i.e., is A → E in the closure F+?

– Equivalently, is E in A+?

• Let’s compute A+

– Initialize A+ to {A}: A+ = {A}

– From A → B, we can add B to A+: A+ = {A, B}

– From B → C, we can add C to A+: A+ = {A, B, C}

– We cannot add any more attributes, and A+ does not contain E

• Therefore A → E is not implied by F
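The attribute-closure computation above can be sketched in a few lines of Python. This is a minimal illustration, assuming single-letter attribute names so that a string such as "CD" stands for the set {C, D}; the function name `closure` is ours. It uses a simple repeat-until-stable loop rather than the linear-time algorithm the slides mention.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds; each FD is a (lhs, rhs) pair of strings."""
    result = set(attrs)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


F = [("A", "B"), ("B", "C"), ("CD", "E")]
a_plus = closure("A", F)
print(sorted(a_plus))  # ['A', 'B', 'C']
print("E" in a_plus)   # False, so A -> E is not implied by F

# The earlier exercise: F = {A -> B, A -> C, CG -> H, CG -> I, B -> H}
G = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]
print("H" in closure("A", G))   # True, so A -> H is implied
print("I" in closure("AG", G))  # True, so AG -> I is implied
```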

DB Design Guidelines

• Design a relation schema with clearly defined semantics

• Design the relation schemas so that there are no insertion, deletion, or modification anomalies

– If there may be anomalies, state them clearly

• As much as possible, avoid attributes that may frequently have null values

• Make sure that relations can be combined by key-foreign key links

Normal Forms

• Normal forms are standards for a good DB schema (introduced by Codd in 1972)

• If a relation is in a certain normal form (such as BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized.

• Normal forms help us decide if decomposing a relation helps

Normal Forms: 1NF

• First Normal Form: A relation is in 1NF if every field contains only atomic values

– No set valued attributes (no lists or sets)

sid  name   phones
1    ali    {5332344568, 2165533561}
2    veli   …
3    ayse   …
4    fatma  …

First Normal Form

Student – Not in 1NF

Student – in 1NF

Normal Forms

• Role of FDs in detecting redundancy;

• Consider a relation R with 3 attributes, ABC

– No FDs hold: There is no redundancy here

– Given an FD A → B: Several tuples could have the same A value, and if so, they’ll all have the same B value

• This potential redundancy can be predicted using this FD information

Normal Forms: 2NF

• Second Normal Form: Every non-prime (non-key) attribute should be fully functionally dependent on every key (no partial dependency)

– i.e., on every candidate key

• In other words: “No non-prime attribute in the table is functionally dependent on a proper subset of any candidate key”

– Prime attribute: any attribute that is part of a key

– Non-prime attributes: rest of the attributes

• Ex: If AB is a key and C is a non-prime attribute, then if A → C holds, A partially determines C

– there is a partial functional dependency on a key

2nd Normal Form

Student – Not in 2NF (candidate key is {Student, Subject})

Student  Age  Subject
Adam     15   Biology
Adam     15   Math
Alex     14   Math
Stuart   17   Math

Student – in 2NF

Student  Age
Adam     15
Alex     14
Stuart   17

Student Subject – in 2NF

Student  Subject
Adam     Biology
Adam     Math
Alex     Math
Stuart   Math
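A partial dependency can be detected by computing the closure of every proper subset of each candidate key. The sketch below assumes the candidate keys are given, and that the only FD is Student → Age (our reading of the example). The names `closure` and `partial_dependencies` are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(attrs, keys, fds):
    """List (subset, dependents) pairs where non-prime attributes of the
    relation depend on a proper subset of one of its candidate keys."""
    prime = set().union(*keys)
    found = []
    for key in keys:
        for size in range(1, len(key)):
            for subset in combinations(sorted(key), size):
                dependents = (closure(subset, fds) & attrs) - prime
                if dependents:
                    found.append((set(subset), dependents))
    return found


fds = [({"Student"}, {"Age"})]  # assumed FD for this example

original = partial_dependencies(
    {"Student", "Age", "Subject"}, [{"Student", "Subject"}], fds)
print(original)  # [({'Student'}, {'Age'})]: Age depends on part of the key

# After the decomposition, neither relation has a partial dependency.
print(partial_dependencies({"Student", "Age"}, [{"Student"}], fds))                  # []
print(partial_dependencies({"Student", "Subject"}, [{"Student", "Subject"}], fds))   # []
```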

2nd Normal Form

• Composite primary key is [Customer ID, Store ID]

• The non-key attribute is [Purchase Location]

• Not in 2nd normal form

– [Purchase Location] only depends on [Store ID]

– [Store ID] is only part of the primary key

PURCHASE_DETAIL

PURCHASE STORE

2nd Normal Form

• Composite primary key is [Employee, Skill]

• The non-key attribute is [Current Work Location]

• Not in 2nd normal form

– [Current Work Location] only depends on [Employee]

– [Employee] is only part of the primary key

Normal Forms: 3NF

• Relation R with FDs F is in Third Normal Form if, for all X → A in F+ (Zaniolo’s def.)

– A ∈ X (called a trivial FD) (=> X contains A), or

– X contains a key for R (=> X is a superkey), or

– A is part of some key for R (=> every element of A-X is a prime attribute, i.e., contained in some candidate key)

• R is in 2NF & there is no transitive functional dependency (Codd’s def.)

– B is functionally dependent on A, and C is functionally dependent on B. Therefore, C is transitively dependent on A via B

• If R is in 3NF, some redundancy is possible

What Does 3NF Achieve?

• If 3NF is violated by X → A, one of the following holds:

– X is a proper subset of some key K (partial dependency)

• We store (X, A) pairs redundantly

– X is not a proper subset of any key (transitive dependency)

• There is a chain of FDs K → X → A, which means that we cannot associate an X value with a K value unless we also associate an A value with an X value

• But: even if a relation is in 3NF, these problems could arise

– e.g., Reserves SBDC (C: Credit Card) with S → C and C → S is in 3NF (SBD & CBD are keys)

• With only S → C: a sailor uses a unique credit card to pay for reservations. The only key is SBD; S is not a key and C is not part of a key

– Hence not in 3NF (redundantly stored SC pairs)

• If also C → S: credit cards uniquely identify the owner (which means CBD is also a key)

– Hence in 3NF

– but for each reservation of sailor S, the same (S, C) pair is stored

• There is a stricter normal form (BCNF)

Partial/Transitive Dependencies

• Partial Dependency

– KEY, Attributes X, Attribute A. Case 1: A not in KEY (violates 3NF)

• Transitive Dependencies

– KEY, Attributes X, Attribute A. Case 1: A not in KEY (violates 3NF)

– KEY, Attributes X, Attribute A. Case 2: A is in KEY (does not violate 3NF)

3rd Normal Form

• [Book ID] (key) determines [Genre ID]

• [Genre ID] determines [Genre Type]

• Not in 3rd normal form

– [Book ID] determines [Genre Type] via [Genre ID]

– There is a transitive functional dependency

BOOK DETAIL

GENREBOOK

=> non-key attribute

3rd Normal Form

• [Tournament, Year] is a minimal set of attributes guaranteed to uniquely identify a row

– candidate key for the table

• Not in 3rd normal form

– [Tournament, Year] determines [Date of Birth] via [Winner]

– Non-prime attribute [Date of Birth] is transitively dependent on the candidate key

Tournament Winners

Winner Dates of Birth

Tournament Winners

=> non-key attribute

Boyce-Codd Normal Form (BCNF)

• Relation R with FDs F is in BCNF if, for all X → A in F+

– A ∈ X (called a trivial FD), or

– X contains a key for R. (i.e., X is a superkey)

• In other words, R is in BCNF if the only non-trivial FDs that hold over R are key constraints

Key    Nonkey Attr1    Nonkey Attr2    Nonkey AttrK

FDs in a BCNF Relation

X Y A

x y1 a

x y2 ?

Boyce-Codd Normal Form (BCNF)

• BCNF ensures that no redundancy in R can be predicted using FDs alone

– if a relation is in BCNF, every field of every tuple records a piece of information that cannot be inferred (using only FDs) from the values in all other fields in the relation instance

• If we are shown two tuples that agree upon the X value, we cannot infer the A value in one tuple from the A value in the other

• If the example relation is in BCNF (where X → A), the 2 tuples must be identical (X is a key since R is in BCNF); this situation cannot arise in relational DBs

BCNF

• FDs:

– GPA → rank

– cid → cname, cInstructor

– sid → sname, address, GPA

• Keys:

– {sid, cid}

• Not in BCNF

– Not every LHS of the FDs contains a key

– None of the LHSs of the FDs contains a key

Student Course

sid  sname  address  GPA  cid  cname     cInstructor  rank
111  Onur   123 st.  3.8  335  DB sys    Durahim      1
222  Ahmet  999 st.  2.9  335  DB sys    Durahim      12
111  Onur   123 st.  3.8  413  Info sys  Jackman      1

• Not even in 2nd NF

– 2nd and 3rd FDs lead to partial dependencies

• Splitting off cid → cname, cInstructor:

sid  sname  address  GPA  cid  rank
111  Onur   123 st.  3.8  335  1
222  Ahmet  999 st.  2.9  335  12
111  Onur   123 st.  3.8  413  1

cid  cname     cInstructor
335  DB sys    Durahim
413  Info sys  Jackman

• Splitting off GPA → rank as well:

sid  sname  address  GPA  cid
111  Onur   123 st.  3.8  335
222  Ahmet  999 st.  2.9  335
111  Onur   123 st.  3.8  413

cid  cname     cInstructor
335  DB sys    Durahim
413  Info sys  Jackman

GPA  rank
3.8  1
2.9  12

• Splitting off sid → sname, address, GPA as well:

sid  sname  address  GPA
111  Onur   123 st.  3.8
222  Ahmet  999 st.  2.9

cid  cname     cInstructor
335  DB sys    Durahim
413  Info sys  Jackman

GPA  rank
3.8  1
2.9  12

sid  cid
111  335
222  335
111  413

• All of these four tables are now in BCNF

BCNF

sid  instrid  CourseCode  OffHourAppnt
111  123      MIS335      12.10.2014
111  999      MIS413      13.10.2014
222  123      MIS335      14.10.2014
222  999      MIS413      15.10.2014

• FDs:

– sid, instrid → CourseCode, OffHourAppnt

– CourseCode → instrid

• In 3NF BUT NOT in BCNF

– No partial key or transitive key dependencies

– CourseCode is not a superkey

sid  CourseCode  OffHourAppnt
111  MIS335      12.10.2014
111  MIS413      13.10.2014
222  MIS335      14.10.2014
222  MIS413      15.10.2014

instrid  CourseCode
123      MIS335
999      MIS413
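The claim "in 3NF but not in BCNF" for the office-hour relation can be verified with a short sketch. It is illustrative only: candidate keys are found by brute force, and only the FDs listed in F are tested, which is the usual shortcut when checking the original relation. All function names are ours.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Brute-force search, smallest subsets first, so every key found is minimal."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            candidate = set(combo)
            if any(k <= candidate for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) >= attrs:
                keys.append(candidate)
    return keys


def is_bcnf(attrs, fds):
    """Every FD must be trivial or have a superkey on the left."""
    return all(rhs <= lhs or closure(lhs, fds) >= attrs for lhs, rhs in fds)


def is_3nf(attrs, fds):
    """As BCNF, but an FD is also allowed if its new attributes are all prime."""
    prime = set().union(*candidate_keys(attrs, fds))
    return all(
        rhs <= lhs or closure(lhs, fds) >= attrs or (rhs - lhs) <= prime
        for lhs, rhs in fds
    )


R = {"sid", "instrid", "CourseCode", "OffHourAppnt"}
F = [
    ({"sid", "instrid"}, {"CourseCode", "OffHourAppnt"}),
    ({"CourseCode"}, {"instrid"}),
]

print(sorted(sorted(k) for k in candidate_keys(R, F)))
# [['CourseCode', 'sid'], ['instrid', 'sid']]
print(is_3nf(R, F))   # True: instrid is part of a key
print(is_bcnf(R, F))  # False: CourseCode is not a superkey
```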

Normal Form Shortcuts

• All attributes are prime

– At least in 3NF

• Singleton keys

– At least in 2NF

Decomposition of a Relation Scheme

• Suppose that relation R contains attributes A1, ..., An

• A decomposition of R consists of replacing R by two or more relations such that:

– Each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), and

– Every attribute of R appears as an attribute of one of the new relations

• Intuitively, decomposing R means we will store instances of the relation schemes produced by the decomposition, instead of instances of R

– e.g., we can decompose SNLRWH into SNLRH and RW

Decomposition of a Relation Scheme

• We can decompose SNLRWH into SNL and RWH

S N L R W H

S N L R W H

Example Decomposition

• SNLRWH has FDs {S → SNLRWH, R → W}

– Is this in 3NF?

– R → W violates 3NF

• W values repeatedly associated with R values

• In order to fix the problem, we need to create a relation RW to store the R → W associations, and to remove W from the main schema:

– i.e., we decompose SNLRWH into SNLRH and RW

S N L R H    R W

Problems with Decompositions

• There are three potential problems to consider:

– Some queries become more expensive (performance loss due to required joins)

• e.g., How much did sailor Joe earn? (salary = W*H)

– Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation

• Fortunately, not in the SNLRWH example.

– Checking some dependencies may require joining the instances of the decomposed relations

• Fortunately, not in the SNLRWH example.

• Tradeoff: Must consider these issues vs. redundancy

S N L R H    R W

Problems with Decompositions

What problems does a given decomposition cause, if any?

• Lossless-join property

– Enables us to recover any instance of the decomposed relation from corresponding instances of the smaller relations

• Without it, given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation!

• Dependency-preservation property

– Enables us to enforce any constraint on the original relation by simply enforcing some constraints on each of the smaller relations

– Without it, checking some dependencies may require joining the instances of the decomposed relations

Lossless Join Decompositions

• Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if, for every instance r that satisfies F:

– 𝜋X(r) ⋈ 𝜋Y(r) = r

• It is always true that r ⊆ 𝜋X(r) ⋈ 𝜋Y(r)

– In general, the other direction does not hold!

– If it does, the decomposition is lossless-join.

• Definition extended to decomposition into 3 or more relations in a straightforward way

• It is essential that all decompositions used to deal with redundancy be lossless

Lossless Join

• The decomposition of R into X and Y is lossless-join wrt FDs F if and only if the closure of F (F+) contains:

– X ⋂ Y → X, or

– X ⋂ Y → Y

• The attributes common to X and Y must contain a key for either X or Y

• If an FD U → V holds over R and U ⋂ V is empty, the decomposition of R into R – V and UV is lossless

r

A  B  C
1  2  3
4  5  6
7  2  8

𝜋AB(r)

A  B
1  2
4  5
7  2

𝜋BC(r)

B  C
2  3
5  6
2  8

𝜋AB(r) ⋈ 𝜋BC(r)

A  B  C
1  2  3
4  5  6
7  2  8
1  2  8
7  2  3
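The ABC instance shows a lossy decomposition: joining the projections onto AB and BC yields tuples that were never in the original. A small Python sketch of that computation, with tuples written as (A, B, C) and variable names chosen for illustration:

```python
r = {(1, 2, 3), (4, 5, 6), (7, 2, 8)}  # tuples are (A, B, C)

ab = {(a, b) for a, b, _ in r}  # projection onto AB
bc = {(b, c) for _, b, c in r}  # projection onto BC

# Natural join on the shared attribute B.
joined = {(a, b, c) for a, b in ab for b2, c in bc if b == b2}

print(r <= joined)         # True: every original tuple is recovered
print(sorted(joined - r))  # [(1, 2, 8), (7, 2, 3)]: spurious tuples
```

The join always contains the original instance; the decomposition is lossy because it also contains extra tuples, so the original can no longer be told apart from the join result.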

• Person(SSN, Name, Address, Hobby)

• F = {SSN, Hobby → Name, Address; SSN → Name, Address}

Person

SSN     Name        Address     Hobby
111111  Celalettin  Sabanci D.  Stamps
111111  Celalettin  Sabanci D.  Coins
555555  Elif        Mutlukent   Skating
555555  Elif        Mutlukent   Surfing
666666  Sercan      Esentepe    Math

Hobby

SSN     Hobby
111111  Stamps
111111  Coins
555555  Skating
555555  Surfing
666666  Math

Person1

SSN     Name        Address
111111  Celalettin  Sabanci D.
555555  Elif        Mutlukent
666666  Sercan      Esentepe
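The FD-based test for a lossless join (the common attributes must determine all of X or all of Y) can be sketched as follows. The function names are illustrative. The second call uses a separate case, relation ABC with only A → B split into AB and BC, to show a lossy result.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_lossless(x, y, fds):
    """True if the attributes shared by x and y determine all of x or all of y."""
    reachable = closure(x & y, fds)
    return x <= reachable or y <= reachable


person_fds = [
    ({"SSN", "Hobby"}, {"Name", "Address"}),
    ({"SSN"}, {"Name", "Address"}),
]
hobby = {"SSN", "Hobby"}
person1 = {"SSN", "Name", "Address"}

print(is_lossless(hobby, person1, person_fds))  # True: SSN determines all of Person1
print(is_lossless({"A", "B"}, {"B", "C"}, [({"A"}, {"B"})]))  # False: B determines neither side
```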


Problems with Decompositions (Contd.)

• Checking some dependencies may require joining the instances of the decomposed relations

Dependency Preserving Decomposition

• Consider CSJDPQV, C is key, JP → C and SD → P

– SD does not contain a key, thus SD → P causes a violation of BCNF

– BCNF decomposition: CSJDQV and SDP

– Problem: Checking JP → C for each insertion requires a join (expensive!) => decomposition is not dependency-preserving

• Dependency preserving decomposition:

– A dependency X → Y that appears in F should either appear in one of the sub-relations or be inferable from the dependencies in the sub-relations

• Projection of a set of FDs F: If R is decomposed into X, ..., the projection of F onto X (denoted FX) is the set of FDs U → V in F+ (closure of F) such that U, V are in X

– Ex: R = ABC, F = {A → B, B → C, C → A}

• F+ includes FDs {A → B, B → C, C → A, B → A, A → C, C → B}

• FAB = {A → B, B → A}, FAC = {C → A, A → C}

Dependency Preserving Decomposition

• Decomposition of R into schemas with attribute sets X and Y is dependency preserving if (FX ⋃ FY)+ = F+

– take the dependencies in FX and FY and

– compute the closure of their union

– we get back all dependencies in the closure of F

• therefore, we need to enforce only the dependencies in FX and FY; then all FDs in F+ are sure to be satisfied

• Important to consider F+, not F, in this definition:

– ABC, {A → B, B → C, C → A}, decomposed into AB and BC

– Is this dependency preserving? Is C → A preserved?

• F+ includes FDs {A → B, B → C, C → A, B → A, A → C, C → B}

• FAB = {A → B, B → A}, FBC = {B → C, C → B}

• FAB ⋃ FBC = {A → B, B → A, B → C, C → B}

• Does the closure of FAB ⋃ FBC imply C → A?
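Whether a single FD is preserved can be tested without writing out the projections FX and FY. The sketch below uses the standard iterative test: repeatedly close the part of the current attribute set that lies in each sub-relation, and keep only what lands inside that sub-relation. Attribute sets are written as strings of single-letter names, and the function names are ours.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(lhs, rhs, parts, fds):
    """True if lhs -> rhs follows from the FDs projected onto the given parts."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for part in parts:
            part = set(part)
            gained = closure(result & part, fds) & part
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


abc = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_preserved("C", "A", ["AB", "BC"], abc))  # True: C -> B in BC, then B -> A in AB

contracts = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(is_preserved("JP", "C", ["CSJDQV", "SDP"], contracts))  # False
```

The first result answers the question above: C → A is preserved because FBC contains C → B and FAB contains B → A. The second matches the earlier observation that checking JP → C requires a join.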

Dependency Preserving Decomposition

• Dependency preserving does not imply lossless join:

– Ex: ABC, A → B, decomposed into AB and BC, is a lossy decomposition

• And vice-versa!

– Ex: CSJDPQV, {C is key, JP → C and SD → P}, decomposed into CSJDQV and SDP, is lossless but not dependency preserving

Normalization

• Converting relations to BCNF

– Possible to obtain a lossless-join decomposition into a collection of BCNF relation schemas

– But, there may be no dependency-preserving decomposition into a collection of BCNF relation schemas

• Converting relations to 3NF

– There is always a dependency-preserving, lossless-join decomposition into a collection of 3NF relation schemas

Decomposition into BCNF

• Consider relation R (ABCD) with FDs F (AB is key)

• If X → Y (A → CD) violates BCNF, decompose R into R − Y (ABCD − CD = AB) and XY (ACD)

– In the general statement, Y is a single attribute that is not in X

• Repeated application of this idea will give us a collection of relations that are in BCNF

– the result is a lossless-join decomposition, and the process is guaranteed to terminate

• In general, several dependencies may cause violation of BCNF

– The order in which we “deal with” them could lead to very different sets of relations!

BCNF decomposition

• Given a relation R and FDs F for R

• Compute keys for R (using FDs)

• Repeat until all relations are in BCNF

– Pick any R’ with A → B that violates BCNF

– Decompose R’ into R1(A,B) and R2(A, rest)

– Compute FDs for R1 and R2

– Compute keys for R1 and R2

Example: Decomposition into BCNF

• R = ABCDEFGH with FDs

– ABH → C, A → DE, BGH → F

– F → ADH, BH → GE

• Is R in BCNF?

• Which FD violates BCNF?

– ABH → C?

• No, since ABH is a superkey

– A → DE violates BCNF

• Since the attribute closure of A is ADE, A is not a superkey

• Decompose R = ABCDEFGH into R1 = ADE and R2 = ABCFGH

Example: Decomposition into BCNF

• R = ABCDEFGH with FDs

– ABH → C, A → DE, BGH → F

– F → ADH, BH → GE

• R1 = ADE, F1 = {A → DE}

• R2 = ABCFGH, F2 = {ABH → C, BGH → F, F → AH, BH → G}

– New FDs are obtained by projecting the original FDs on the attributes in the new relations

– For example: BH → GE is decomposed into {BH → G, BH → E}; BH → E is not included in F1 or F2, while BH → G is included in F2

– Is the decomposition of R into R1 and R2 dependency preserving?

• R1 is in BCNF, but we need to apply the algorithm on R2 since it is not in BCNF
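The repeated decomposition can be sketched in Python. This is one common variant, not necessarily the exact procedure intended in the lecture. Instead of projecting FDs explicitly, it searches each relation for an attribute set X whose closure (restricted to that relation) is neither X itself nor the whole relation, then splits off that closure. Attribute names are single letters and the function names are ours. The result depends on the order in which violations are found.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure; each FD is a (lhs, rhs) pair of strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(rel, fds):
    """Return (X, closure of X within rel) for some X violating BCNF, or None."""
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            x = set(combo)
            reach = closure(x, fds) & rel
            if reach != x and reach != rel:  # non-trivial, and X is not a superkey
                return x, reach
    return None


def bcnf_decompose(rel, fds):
    """Split rel until no sub-relation has a BCNF violation."""
    violation = find_violation(rel, fds)
    if violation is None:
        return [rel]
    x, reach = violation
    return bcnf_decompose(reach, fds) + bcnf_decompose((rel - reach) | x, fds)


F = [("ABH", "C"), ("A", "DE"), ("BGH", "F"), ("F", "ADH"), ("BH", "GE")]
parts = bcnf_decompose(set("ABCDEFGH"), F)
print(["".join(sorted(p)) for p in parts])  # ['ADE', 'AFH', 'BCFG']
```

The first split reproduces R1 = ADE and R2 = ABCFGH from the example. R2 is then split again on F, whose closure within R2 is AFH.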

BCNF and Dependency Preservation

• In general, there may not be a dependency preserving decomposition into BCNF

– e.g., SBD with SB → D, D → B

– Can’t decompose while preserving the 1st FD; not in BCNF

• Similarly, decomposition of CSJDPQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs JP → C, SD → P and J → S)

– However, it is a lossless join decomposition

– In this case, adding JPC to the collection of relations gives us a dependency preserving decomposition

• JPC tuples stored only for checking FD! (Redundancy!)

– there is no such redundancy within a single BCNF relation

• This example shows that redundancy can still occur across relations, even though there is no redundancy within a relation

Decomposition into 3NF

• The algorithm for lossless-join decomposition into BCNF can be used to obtain a lossless join decomposition into 3NF (typically, can stop earlier)

– But this approach does not ensure dependency-preservation

• To ensure dependency preservation, one idea:

– If X → Y is not preserved, add relation XY

– Problem is that XY may violate 3NF!

• e.g., consider the addition of CJP to “preserve” JP → C. What if we also have J → C?

• Refinement: Instead of the given set of FDs F, use a minimal cover for F

Minimal Cover for a Set of FDs

• A minimal cover for a set of FDs F is a set of FDs G s.t.:

– The closure of F (F+) = the closure of G (G+)

– The right hand side of each FD in G is a single attribute

– If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes

• Intuitively, every FD in G is “needed”, and “as small as possible” in order to get the same closure as F

– every dependency in it is required for the closure to be equal to F+

– each attribute on the left side is necessary

– the right side is a single attribute

• e.g., A → B, ABCD → E, EF → GH, ACDF → EG has the following minimal cover:

– A → B, ACD → E, EF → G and EF → H

Minimal Cover for a Set of FDs

• A → B, ABCD → E, EF → GH, ACDF → EG is reduced to its minimal cover as follows:

– ACDF → EG => ACDF → E and ACDF → G

– EF → GH => EF → G and EF → H

– ABCD → E can be replaced by ACD → E since A → B holds

– ACDF → G is implied by A → B, ABCD → E, EF → GH

• A → B => A → AB => AC → ABC => ACD → ABCD => ACD → E => ACDF → EF => ACDF → GH => ACDF → G (can be deleted)

– ACDF → E is implied in the same way

• A → B => A → AB => AC → ABC => ACD → ABCD => ACD → E => ACDF → EF => ACDF → E (can be deleted)

– Result: A → B, ACD → E, EF → G and EF → H

Obtaining the Minimal Cover

• Algorithm Steps:

1. Put the FDs in standard form

• single attribute on the right hand side

2. Minimize the left hand side of each FD

• For each FD check if an attribute in LHS can be deleted while preserving equivalence to closure of F

3. Delete the redundant FDs

• It is necessary to minimize the left sides of FDs before checking for redundant FDs

Obtaining the Minimal Cover

• Example: F = {ABCD → E, E → D, A → B, AC → D}

– Notice that the right hand sides have a single attribute

• if not, we would have to decompose the right hand sides first

• Can we remove B from the left hand side of ABCD → E?

– Check if ACD → E is implied by F

• In order to do this, find the attribute closure of ACD wrt F

– If B is in the attribute closure, then ACD → E is implied by F, and therefore we can replace ABCD → E with ACD → E

• note that given ACD → E, we have ABCD → E

– A → B => A → AB => ACD → ABCD => ACD → E

• Can we remove D from ACD → E?

– Check if AC → E is implied by F’

• F’ is obtained by replacing ABCD → E in F with ACD → E

• F’’ = {AC → E, E → D, A → B, AC → D}

– Can we drop any FDs in F’’?

– Could we drop any FDs in F before minimizing the left hand sides?
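The three steps can be sketched as a short Python function. This is an illustrative implementation under the assumption of single-letter attribute names, and `minimal_cover` is our name for it. A set of FDs can have more than one minimal cover; which one comes out depends on the order in which attributes and FDs are tried.

```python
def closure(attrs, fds):
    """Attribute closure; each FD is (lhs, rhs) with single-letter attribute names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    # Step 1: standard form, one attribute on each right hand side.
    g = [(frozenset(lhs), a) for lhs, rhs in fds for a in rhs]
    # Step 2: drop left-hand-side attributes that are not needed.
    for i, (lhs, a) in enumerate(g):
        for attr in sorted(lhs):
            reduced = lhs - {attr}
            if reduced and a in closure(reduced, g):
                lhs = reduced
                g[i] = (lhs, a)
    g = list(dict.fromkeys(g))  # step 2 can create duplicate FDs
    # Step 3: drop FDs that are implied by the remaining ones.
    for fd in list(g):
        rest = [f for f in g if f != fd]
        if fd[1] in closure(fd[0], rest):
            g = rest
    return g


def show(cover):
    return sorted("".join(sorted(lhs)) + "->" + a for lhs, a in cover)


print(show(minimal_cover([("ABCD", "E"), ("E", "D"), ("A", "B"), ("AC", "D")])))
# ['A->B', 'AC->E', 'E->D']
print(show(minimal_cover([("A", "B"), ("ABCD", "E"), ("EF", "GH"), ("ACDF", "EG")])))
# ['A->B', 'ACD->E', 'EF->G', 'EF->H']
```

For the first example the sketch drops AC → D in step 3, since it follows from AC → E and E → D. The second call reproduces the minimal cover from the earlier slide.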

Dependency Preserving Decomposition into 3rdNF

• Let R be the relation to be decomposed into 3rdNF and F be the set of FDs, given as a minimal cover

• Algorithm Steps

– Perform a lossless-join decomposition of R into R1, R2, …, Rn

– Project the FDs in F into F1, F2, …, Fn

• that correspond to R1, R2, …, Rn

– Identify the set of FDs that are not preserved

• i.e., that are not in the closure of the union of F1, F2, …, Fn

– For each FD X → A that is not preserved, create a relation schema XA and add it to the decomposition

Example

• Consider the relation R, Contracts(CSJDPQV) with FDs (C is a key):

– JP → C, SD → P, J → S

• Decomposed into R1(SDP), R2(CSJDQV)

– R1 in BCNF, R2 not in 3NF

• Decompose R2(CSJDQV) into R3(JS), R4(CJDQV)

– Both R3 and R4 in 3NF (in BCNF also)

• Decomposition of R into R1, R3, R4 is lossless-join

• But not dependency-preserving (JP → C is not preserved)

• Add R5(CJP) to the decomposition

• Resulting decomposition is CSJDPQV into SDP, JS, CJDQV, CJP

Synthesis Approach

• Consider the relation R, Contracts(CSJDPQV) with FDs (C is a key):

– JP → C, SD → P, J → S

• Find minimal cover

– Split C → CSJDPQV into C → S, C → J, C → D, C → P, C → Q, C → V

• C → S is implied by C → J and J → S

• C → P is implied by C → S, C → D and SD → P

– Final set is (C → J, C → D, C → Q, C → V, JP → C, SD → P, J → S)

• So add corresponding schemas for all the FDs in the minimal cover

– CJ, CD, CQ, CV, CJP, SDP, JS

• Improve this set by combining relations for which C is the key into CDJPQV

– CDJPQV, SDP, JS