normalization

Database Management System

Normalization Process

This Lecture

Schema Refinement

Normalization

Schema Refinement - Review

Conceptual Modeling is a subjective process

Therefore, the schema after the logical database design phase may not be very good (contain redundancies)

However, there are formalisms to ensure that the schema is good.

This process is called Normalization

Schema Refinement Review (contd.)

Relational database schema = set of relations

Relation = set of attributes

How we group the attributes to relations is very important


Too many attributes in a relation Waste space

Anomalies

Decomposing the relation into too smaller set of relations Loss-less join property

Dependency preserving property


Too many attributes

For example,

LECTURER(id, name, address, salary, deptno,dname building)


Insertion Anomaly

1. Inserting a new lecturer to the LECTURER table

- Department information is repeated (ensure that correct department information is inserted).


2. Inserting a department with no employees

(Impossible b/c null values for id is not allowed)


Deletion Anomalies

Deleting the last lecturer from the department will lose information about the department


Update Anomalies

Updating the departments building needs to be done for all lecturers working for that department

Schema Refinement Review (contd.).

When redundancies exists, we should decompose the relations to smaller relations


Decomposing the relation into too smaller relations

Loss-less join property: we might lose information if we decompose relations

Dependency-preserving property: The set of dependencies in S can be verified by a set of dependencies in R1 and R2


Loss-less join property:

For example,

S P D

S1 P1 D1

S2 P2 D2

S3 P1 D3

S P

S1 P1

S2 P2

S3 P1

P D

P1 D1

P2 D2

P1 D3

S R1 R2


Joining them together, we get spurious tuples

S P D

S1 P1 D1

S1 P1 D3

S2 P2 D2

S3 P1 D1

S3 P1 D3

R1 R2


To avoid the above mentioned issues in the relational schema, we can apply a formal process called Normalization

Normalization is based on functional dependencies


Key points:

Redundancy is based on functional dependencies

Therefore, normalization is based on functional dependencies


Given some FDs, we can usually infer additional FDs: A B, B C implies A C

An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold. F+ = closure of F is the set of all FDs that are implied by F.

How can we get F+?


Armstrongs Axioms (X, Y, Z are sets of attributes):

Reflexivity: If X Y, then Y X

Augmentation: If X Y, then XZ YZ for any Z

Transitivity: If X Y and Y Z, then X Z

These are sound and complete inference rules for FDs!


Couple of additional rules (that follow from AA):

Union: If X Y and X Z, then X YZ

Decomposition: If X YZ, then X Y and X Z

Example: Contracts(cid,sid,jid,did,pid,qty,value), and:

C is the key: C CSJDPQV

Project purchases each part using single contract: JP C

Dept purchases at most one part from a supplier: SD P

JP C, C CSJDPQV imply JP CSJDPQV

SD P implies SDJ JP

SDJ JP, JP CSJDPQV imply SDJ CSJDPQV


Why is F+ important?

X RHS in relation R

X is a subset of attributes in relation R. If RHS contains all attributes of R, then X is a superkey.

If X is not a superkey, then values for X can repeat in different tuples resulting in redundancy!!!

So determining F+ can help us find superkeys and check for any redundancy.


Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)

Typically, we just want to check if a given FD X Y is in the closure of a set of FDs F+. An efficient check:

Compute attribute closure of X (denoted X+) wrt F:

Set of all attributes A such that X A is in F+

There is a linear time algorithm to compute this.

Check if Y is in X+


Algorithm to find X+:

closure = X;

repeat until there is no change: {

If there is an FD U V in F such that U closure

then set closure = closure V

}

Does F = {A B, B C, CD E } imply A E?

i.e, is A E in the closure F+? Equivalently, is E in A+?

We can use the attribute closure to find out keys of the relation. If X+ contains all attributes of the relation, then X is a superkey.


Schema Refinement Steps:

Determine F for relation R

Find all keys in F using attribute closure

Normalize


There are many Normal Forms proposed to reduce redundancies

Some of the well-known ones are:

1st Normal Form

2nd Normal Form

3rd Normal Form

Boyce-Codd Normal Form


Review of some terms Candidate Key: Each key of a relation is called a

candidate key

Primary Key: A candidate key is chosen to be the primary key

Prime Attribute: an attribute which is a member of a candidate key

Nonprime Attribute: An attribute which is not prime


1st Normal Form

A relation R is in first normal form (1NF) if domains of all attributes in the relation are atomic (simple & indivisible).


2nd Normal Form:

A relation R is in second normal form (2NF) if every nonprime attribute A in R is not partially dependent on any key of R


Example

EMP_PROJ

NIC PNUM HOURS ENAME PNAME LOC

FD1

FD2

FD3


NIC PNUM HOURS

NIC ENAME

PNUM PNAME PLOC

EP1

EP2

EP3


3rd Normal Form:

A relation R is in 3rd normal form (3NF) if every

R is in 2NF, and

No nonprime attribute is transitively dependent on any key


Example,

ENAME SSN BDATE ADD DNUM DNAME DMGR

EMP_DEPT


ED1

ED2

ENAME SSN BDATE ADD DNUM

DNUM DNAME DMGR


Boyce-Codd Normal Form (BCNF):

A relation schema is in Boyce-Codd Normal Form

If every nontrivial functional dependency XA hold in R, then X is a superkey of R


Keys: PropertyID, (County_Name, Lot#)

PROPERTY_ID

COUNTY_NAME

LOT# AREA PRICE TAX_RATE

FD1

FD2

FD3

FD4

FD5


Decomposition into BCNF:

Consider relation R with FDs F. If X Y violates BCNF, decompose R into R - Y and XY.

Repeated application of this idea will give us a collection of relations that are in BCNF; lossless join decomposition, and guaranteed to terminate.

e.g., CSJDPQV, key C, JP C, SD P, J S

To deal with SD P, decompose into SDP, CSJDQV.

To deal with J S, decompose CSJDQV into JS and CJDQV

In general, several dependencies may cause violation of BCNF. The order in which we deal with them could lead to very different sets of relations!


In general, there may not be a dependency preserving decomposition into BCNF.

e.g., CSZ, CS Z, Z C

Cant decompose while preserving 1st FD; not in BCNF.

Similarly, decomposition of CSJDQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs JP C, SD P and J S).

However, it is a lossless join decomposition.

In this case, adding JPC to the collection of relations gives us a dependency preserving decomposition.

JPC tuples stored only for checking FD! (Redundancy!)


Obviously, the algorithm for lossless join decomp into BCNF can be used to obtain a lossless join decomp into 3NF (typically, can stop earlier).

To ensure dependency preservation, one idea: If X Y is not preserved, add relation XY.

Problem is that XY may violate 3NF! e.g., consider the addition of CJP to `preserve JP C. What if we also have J C ?

Refinement: Instead of the given set of FDs F, use a minimal cover for F.


Minimal cover G for a set of FDs F: Closure of F = closure of G.

Right hand side of each FD in G is a single attribute.

If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes.

General alg. to obtain minimal cover:

Put the FDs in a standard form (i.e. single attribute in RHS).

Minimize the Left side of each FD. For each FD, check if we can delete attributes in LHS while preserving equivalence to F+.

Delete any redundant FDs.


Intuitively, every FD in G is needed, and as small as possible in order to get the same closure as F.

e.g., A B, ABCD E, EF GH, ACDF EG has the following minimal cover:

A B, ACD E, EF G and EF H

Dependency Preserving 3NF decomposition:

Let R1, R2, , Rn be a lossless-join decomposition of R with a minimal cover F

Let N be dependencies of F which are not preserved

For each FD, X A in N, add XA to the decomposition of R


1st diagram translated: Workers(S,N,L,D,S) Departments(D,M,B)

Lots associated with workers.

Suppose all workers in a dept are assigned the same lot: D L

Redundancy; fixed by: Workers2(S,N,D,S) Dept_Lots(D,L)

Can fine-tune this: Workers2(S,N,D,S) Departments(D,M,B,L)

lot

dname

budget did

since name

Works_In Departments Employees

ssn

lot

dname

budget

did

since name

Works_In Departments Employees

ssn

Before:

After:

Refining an ER Diagram

Exercise 1. Consider the following two sets of functional

dependencies

F= {A ->C, AC ->D,E ->AD, E ->H}

and

G = {A ->CD, E ->AH}

Check whether or not they are equivalent.

To show equivalence, we prove that G is covered by F and F is covered by G.

Proof that G is covered by F:

{A} + = {A, C, D} (with respect to F),

which covers A ->CD in G

{E} + = {E, A, D, H, C} (with respect to F),

which covers E ->AH in G

Proof that F is covered by G:

{A} + = {A, C, D} (with respect to G),

which covers A ->C in F

{A, C} + = {A, C, D} (with respect to G),

which covers AC ->D in F

{E} + = {E, A, H, C, D} (with respect to G),

which covers E ->AD and E ->H in F

2. Consider the relation schema EMP_DEPT and the following set F of functional dependencies on EMP_DEPT:

F = {SSN ->{ENAME, BDATE,ADD, DNUM} , DNUM ->{DNAME, DMGR} }

Calculate the closures {SSN} + and {DNUM} + with respect to F.

ENAME SSN BDATE ADD DNUM DNAME DMGR

EMP_DEPT

Answer: {SSN} + ={SSN, ENAME, BDATE, ADD, DNUM, DNAME, DMGR} {DNUM} + ={DNUM, DNAME, DMGR}

3. Is the set of functional dependencies F in Exercise 2 minimal? If not, try to find an minimal set of functional dependencies that is equivalent to F. Prove that your set is equivalent to F.

Answer:

The set F of functional dependencies in Exercise 2 is not minimal, because it violates rule 1 of minimality (every FD has a single attribute for its right hand side).

The set G is an equivalent minimal set:

G= {SSN ->{ENAME}, SSN ->{BDATE},

SSN->{ADD}, SSN ->{DNUM} ,

DNUM ->{DNAME}, DNUM->{DMGR}}

To show equivalence, we prove that F is covered by G and G is covered by F.

Proof that F is covered by G: {SSN}+={SSN, ENAME, BDATE, ADD, DNUM, DNAME, MGR} (with respect to G), which covers SSN ->{ENAME, BDATE, ADDRESS, DNUMBER} in F {DNUM} + ={DNUM, DNAME, DMGR} (with respect to G), which covers DNUM ->{DNAME, DMGR} in F

Proof that G is covered by F:

{SSN}+={SSN, ENAME, BDATE, ADD, DNUM, DNAME, DMGR}

(with respect to F), which covers

SSN ->{ENAME}, SSN ->{BDATE}, SSN ->{ADD}, and SSN ->{DNUM} in G

{DNUM} + ={DNUM, DNAME, DMGR}

(with respect to F), which covers DNUM ->{DNAME} and DNUM->{DMGR} in G

normalization

Documents

relational schema

relational database

y x augmentation

x z example

set of fds f

normalization normalization

set of relations relation

set of attributes