normalization

Upload: lasitha-lakmal-karunarathna

Post on 12-Oct-2015

12 views

Category:

Documents


0 download

DESCRIPTION

Normalizing DBMS.

TRANSCRIPT

  • Database Management System

    Normalization Process

  • This Lecture

    Schema Refinement

    Normalization

  • Schema Refinement - Review

    Conceptual Modeling is a subjective process

    Therefore, the schema after the logical database design phase may not be very good (contain redundancies)

    However, there are formalisms to ensure that the schema is good.

    This process is called Normalization

  • Schema Refinement Review (contd.)

    Relational database schema = set of relations

    Relation = set of attributes

    How we group the attributes to relations is very important

  • Schema Refinement Review (contd.)

    Too many attributes in a relation Waste space

    Anomalies

    Decomposing the relation into too smaller set of relations Loss-less join property

    Dependency preserving property

  • Schema Refinement Review (contd.)

    Too many attributes

    For example,

    LECTURER(id, name, address, salary, deptno,dname building)

  • Schema Refinement Review (contd.)

    Insertion Anomaly

    1. Inserting a new lecturer to the LECTURER table

    - Department information is repeated (ensure that correct department information is inserted).

  • Schema Refinement Review (contd.)

    2. Inserting a department with no employees

    (Impossible b/c null values for id is not allowed)

  • Schema Refinement Review (contd.)

    Deletion Anomalies

    Deleting the last lecturer from the department will lose information about the department

  • Schema Refinement Review (contd.)

    Update Anomalies

    Updating the departments building needs to be done for all lecturers working for that department

  • Schema Refinement Review (contd.).

    When redundancies exists, we should decompose the relations to smaller relations

  • Schema Refinement Review (contd.)

    Decomposing the relation into too smaller relations

    Loss-less join property: we might lose information if we decompose relations

    Dependency-preserving property: The set of dependencies in S can be verified by a set of dependencies in R1 and R2

  • Schema Refinement Review (contd.)

    Loss-less join property:

    For example,

    S P D

    S1 P1 D1

    S2 P2 D2

    S3 P1 D3

    S P

    S1 P1

    S2 P2

    S3 P1

    P D

    P1 D1

    P2 D2

    P1 D3

    S R1 R2

  • Schema Refinement Review (contd.)

    Joining them together, we get spurious tuples

    S P D

    S1 P1 D1

    S1 P1 D3

    S2 P2 D2

    S3 P1 D1

    S3 P1 D3

    R1 R2

  • Schema Refinement Review (contd.)

    To avoid the above mentioned issues in the relational schema, we can apply a formal process called Normalization

    Normalization is based on functional dependencies

  • Schema Refinement Review (contd.)

    Key points:

    Redundancy is based on functional dependencies

    Therefore, normalization is based on functional dependencies

  • Schema Refinement Review (contd.)

    Given some FDs, we can usually infer additional FDs: A B, B C implies A C

    An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold. F+ = closure of F is the set of all FDs that are implied by F.

    How can we get F+?

  • Schema Refinement Review (contd.)

    Armstrongs Axioms (X, Y, Z are sets of attributes):

    Reflexivity: If X Y, then Y X

    Augmentation: If X Y, then XZ YZ for any Z

    Transitivity: If X Y and Y Z, then X Z

    These are sound and complete inference rules for FDs!

  • Schema Refinement Review (contd.)

    Couple of additional rules (that follow from AA):

    Union: If X Y and X Z, then X YZ

    Decomposition: If X YZ, then X Y and X Z

    Example: Contracts(cid,sid,jid,did,pid,qty,value), and:

    C is the key: C CSJDPQV

    Project purchases each part using single contract: JP C

    Dept purchases at most one part from a supplier: SD P

    JP C, C CSJDPQV imply JP CSJDPQV

    SD P implies SDJ JP

    SDJ JP, JP CSJDPQV imply SDJ CSJDPQV

  • Schema Refinement Review (contd.)

    Why is F+ important?

    X RHS in relation R

    X is a subset of attributes in relation R. If RHS contains all attributes of R, then X is a superkey.

    If X is not a superkey, then values for X can repeat in different tuples resulting in redundancy!!!

    So determining F+ can help us find superkeys and check for any redundancy.

  • Schema Refinement Review (contd.)

    Computing the closure of a set of FDs can be expensive. (Size of closure is exponential in # attrs!)

    Typically, we just want to check if a given FD X Y is in the closure of a set of FDs F+. An efficient check:

    Compute attribute closure of X (denoted X+) wrt F:

    Set of all attributes A such that X A is in F+

    There is a linear time algorithm to compute this.

    Check if Y is in X+

  • Schema Refinement Review (contd.)

    Algorithm to find X+:

    closure = X;

    repeat until there is no change: {

    If there is an FD U V in F such that U closure

    then set closure = closure V

    }

    Does F = {A B, B C, CD E } imply A E?

    i.e, is A E in the closure F+? Equivalently, is E in A+?

    We can use the attribute closure to find out keys of the relation. If X+ contains all attributes of the relation, then X is a superkey.

  • Schema Refinement Review (contd.)

    Schema Refinement Steps:

    Determine F for relation R

    Find all keys in F using attribute closure

    Normalize

  • Schema Refinement Review (contd.)

    There are many Normal Forms proposed to reduce redundancies

    Some of the well-known ones are:

    1st Normal Form

    2nd Normal Form

    3rd Normal Form

    Boyce-Codd Normal Form

  • Schema Refinement Review (contd.)

    Review of some terms Candidate Key: Each key of a relation is called a

    candidate key

    Primary Key: A candidate key is chosen to be the primary key

    Prime Attribute: an attribute which is a member of a candidate key

    Nonprime Attribute: An attribute which is not prime

  • Schema Refinement Review (contd.)

    1st Normal Form

    A relation R is in first normal form (1NF) if domains of all attributes in the relation are atomic (simple & indivisible).

  • Schema Refinement Review (contd.)

    2nd Normal Form:

    A relation R is in second normal form (2NF) if every nonprime attribute A in R is not partially dependent on any key of R

  • Schema Refinement Review (contd.)

    Example

    EMP_PROJ

    NIC PNUM HOURS ENAME PNAME LOC

    FD1

    FD2

    FD3

  • Schema Refinement Review (contd.)

    NIC PNUM HOURS

    NIC ENAME

    PNUM PNAME PLOC

    EP1

    EP2

    EP3

  • Schema Refinement Review (contd.)

    3rd Normal Form:

    A relation R is in 3rd normal form (3NF) if every

    R is in 2NF, and

    No nonprime attribute is transitively dependent on any key

  • Schema Refinement Review (contd.)

    Example,

    ENAME SSN BDATE ADD DNUM DNAME DMGR

    EMP_DEPT

  • Schema Refinement Review (contd.)

    ED1

    ED2

    ENAME SSN BDATE ADD DNUM

    DNUM DNAME DMGR

  • Schema Refinement Review (contd.)

    Boyce-Codd Normal Form (BCNF):

    A relation schema is in Boyce-Codd Normal Form

    If every nontrivial functional dependency XA hold in R, then X is a superkey of R

  • Schema Refinement Review (contd.)

    Keys: PropertyID, (County_Name, Lot#)

    PROPERTY_ID

    COUNTY_NAME

    LOT# AREA PRICE TAX_RATE

    FD1

    FD2

    FD3

    FD4

    FD5

  • Schema Refinement Review (contd.)

    Decomposition into BCNF:

    Consider relation R with FDs F. If X Y violates BCNF, decompose R into R - Y and XY.

    Repeated application of this idea will give us a collection of relations that are in BCNF; lossless join decomposition, and guaranteed to terminate.

    e.g., CSJDPQV, key C, JP C, SD P, J S

    To deal with SD P, decompose into SDP, CSJDQV.

    To deal with J S, decompose CSJDQV into JS and CJDQV

    In general, several dependencies may cause violation of BCNF. The order in which we deal with them could lead to very different sets of relations!

  • Schema Refinement Review (contd.)

    In general, there may not be a dependency preserving decomposition into BCNF.

    e.g., CSZ, CS Z, Z C

    Cant decompose while preserving 1st FD; not in BCNF.

    Similarly, decomposition of CSJDQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs JP C, SD P and J S).

    However, it is a lossless join decomposition.

    In this case, adding JPC to the collection of relations gives us a dependency preserving decomposition.

    JPC tuples stored only for checking FD! (Redundancy!)

  • Schema Refinement Review (contd.)

    Obviously, the algorithm for lossless join decomp into BCNF can be used to obtain a lossless join decomp into 3NF (typically, can stop earlier).

    To ensure dependency preservation, one idea: If X Y is not preserved, add relation XY.

    Problem is that XY may violate 3NF! e.g., consider the addition of CJP to `preserve JP C. What if we also have J C ?

    Refinement: Instead of the given set of FDs F, use a minimal cover for F.

  • Schema Refinement Review (contd.)

    Minimal cover G for a set of FDs F: Closure of F = closure of G.

    Right hand side of each FD in G is a single attribute.

    If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes.

    General alg. to obtain minimal cover:

    Put the FDs in a standard form (i.e. single attribute in RHS).

    Minimize the Left side of each FD. For each FD, check if we can delete attributes in LHS while preserving equivalence to F+.

    Delete any redundant FDs.

  • Schema Refinement Review (contd.)

    Intuitively, every FD in G is needed, and as small as possible in order to get the same closure as F.

    e.g., A B, ABCD E, EF GH, ACDF EG has the following minimal cover:

    A B, ACD E, EF G and EF H

    Dependency Preserving 3NF decomposition:

    Let R1, R2, , Rn be a lossless-join decomposition of R with a minimal cover F

    Let N be dependencies of F which are not preserved

    For each FD, X A in N, add XA to the decomposition of R

  • Schema Refinement Review (contd.)

    1st diagram translated: Workers(S,N,L,D,S) Departments(D,M,B)

    Lots associated with workers.

    Suppose all workers in a dept are assigned the same lot: D L

    Redundancy; fixed by: Workers2(S,N,D,S) Dept_Lots(D,L)

    Can fine-tune this: Workers2(S,N,D,S) Departments(D,M,B,L)

    lot

    dname

    budget did

    since name

    Works_In Departments Employees

    ssn

    lot

    dname

    budget

    did

    since name

    Works_In Departments Employees

    ssn

    Before:

    After:

    Refining an ER Diagram

  • Exercise 1. Consider the following two sets of functional

    dependencies

    F= {A ->C, AC ->D,E ->AD, E ->H}

    and

    G = {A ->CD, E ->AH}

    Check whether or not they are equivalent.

  • To show equivalence, we prove that G is covered by F and F is covered by G.

    Proof that G is covered by F:

    {A} + = {A, C, D} (with respect to F),

    which covers A ->CD in G

    {E} + = {E, A, D, H, C} (with respect to F),

    which covers E ->AH in G

    Proof that F is covered by G:

    {A} + = {A, C, D} (with respect to G),

    which covers A ->C in F

    {A, C} + = {A, C, D} (with respect to G),

    which covers AC ->D in F

    {E} + = {E, A, H, C, D} (with respect to G),

    which covers E ->AD and E ->H in F

  • 2. Consider the relation schema EMP_DEPT and the following set F of functional dependencies on EMP_DEPT:

    F = {SSN ->{ENAME, BDATE,ADD, DNUM} , DNUM ->{DNAME, DMGR} }

    Calculate the closures {SSN} + and {DNUM} + with respect to F.

    ENAME SSN BDATE ADD DNUM DNAME DMGR

    EMP_DEPT

    Answer: {SSN} + ={SSN, ENAME, BDATE, ADD, DNUM, DNAME, DMGR} {DNUM} + ={DNUM, DNAME, DMGR}

  • 3. Is the set of functional dependencies F in Exercise 2 minimal? If not, try to find an minimal set of functional dependencies that is equivalent to F. Prove that your set is equivalent to F.

    Answer:

    The set F of functional dependencies in Exercise 2 is not minimal, because it violates rule 1 of minimality (every FD has a single attribute for its right hand side).

    The set G is an equivalent minimal set:

    G= {SSN ->{ENAME}, SSN ->{BDATE},

    SSN->{ADD}, SSN ->{DNUM} ,

    DNUM ->{DNAME}, DNUM->{DMGR}}

  • To show equivalence, we prove that F is covered by G and G is covered by F.

    Proof that F is covered by G: {SSN}+={SSN, ENAME, BDATE, ADD, DNUM, DNAME, MGR} (with respect to G), which covers SSN ->{ENAME, BDATE, ADDRESS, DNUMBER} in F {DNUM} + ={DNUM, DNAME, DMGR} (with respect to G), which covers DNUM ->{DNAME, DMGR} in F

  • Proof that G is covered by F:

    {SSN}+={SSN, ENAME, BDATE, ADD, DNUM, DNAME, DMGR}

    (with respect to F), which covers

    SSN ->{ENAME}, SSN ->{BDATE}, SSN ->{ADD}, and SSN ->{DNUM} in G

    {DNUM} + ={DNUM, DNAME, DMGR}

    (with respect to F), which covers DNUM ->{DNAME} and DNUM->{DMGR} in G