database normalization revised


Upload: sudipta30saha

Post on 03-Jul-2015


Page 1: Database Normalization Revised

Sudipta Saha Page 1 4/12/2023

Database Normalization Theory (Up to BCNF)

The goal of a relational-database design is to generate a set of relation schemas that allows us to store information without unnecessary redundancy, yet also allows us to retrieve information easily. One approach is to design schemas that are in an appropriate normal form.

Definition: Database normalization (sometimes referred to as canonical synthesis) is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.

For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind. Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an Automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application).

Pitfalls in Relational-Database Design

(Why is normalization required? / Why is normalization adopted for database design? / What are the advantages of normalized relations over non-normalized relations?)

A table that is not sufficiently normalized is a bad design. It shows the following inconveniences.

Repetition of information, which wastes space. For example, we can compare the space used by the two relations EMPLOYEE and DEPARTMENT in Figure 1 with the space for an EMP_DEPT relation in Figure 2. In EMP_DEPT, the attribute values pertaining to a particular department (DNUMBER, DNAME, DMGR_SSN) are repeated for every employee who works for that department, which wastes space. In contrast, each department's information appears only once in the DEPARTMENT relation in Figure 1. Only the department number (DNUMBER) is repeated in the EMPLOYEE relation for each employee who works in that department.

Figure 1

EMPLOYEE
Ename                  Ssn        Bdate       Address                   Dnumber
Smith, John B          123456789  1965-01-09  731 Fondren, Houston, TX  5
Wong, Franklin T.      333445555  1955-12-08  638 Voss, Houston, TX     5
Zelyala, Aliciya J.    999887777  1968-07-19  3321 Castle, Spring, TX   4
Wallance, Jennifer S.  987654321  1941-06-20  291 Berry, Belliaire, TX  4
Narayan, Ramesh K.     666884444  1962-09-15  975 Fire Oak, Humble, TX  5
English, Joyce A.      453453453  1972-07-31  5631 Rice, Houston, TX    5
Jabbar, Ahamed V.      987987987  1969-03-29  980 Dallas, Houston, TX   4
Borg, James E          888665555  1937-11-10  450, Stone, Houston, TX   1

DEPARTMENT
Dname          Dnumber  Dmgr_ssn
Research       5        333445555
Administrator  4        987654321
Headquarters   1        888665555

Figure 2

EMP_DEPT
Ename                  Ssn        Bdate       Address                   Dnumber  Dname          Dmgr_ssn
Smith, John B          123456789  1965-01-09  731 Fondren, Houston, TX  5        Research       333445555
Wong, Franklin T.      333445555  1955-12-08  638 Voss, Houston, TX     5        Research       333445555
Zelyala, Aliciya J.    999887777  1968-07-19  3321 Castle, Spring, TX   4        Administrator  987654321
Wallance, Jennifer S.  987654321  1941-06-20  291 Berry, Belliaire, TX  4        Administrator  987654321
Narayan, Ramesh K.     666884444  1962-09-15  975 Fire Oak, Humble, TX  5        Research       333445555
English, Joyce A.      453453453  1972-07-31  5631 Rice, Houston, TX    5        Research       333445555
Jabbar, Ahamed V.      987987987  1969-03-29  980 Dallas, Houston, TX   4        Administrator  987654321
Borg, James E          888665555  1937-11-10  450, Stone, Houston, TX   1        Headquarters   888665555

Update anomalies: logical inconsistencies of various types, called update anomalies, result from data operations on such a table. We can see the following update anomalies:

♦Insertion anomaly – There are circumstances in which certain facts cannot be recorded at all. The insertion anomaly occurs when we want to insert a new record into the relation: the user cannot insert a fact about one entity until he has an additional fact about another entity.

• To insert a new employee tuple into EMP_DEPT, we must include either the attribute values for the department that the employee works for, or nulls (if the employee does not work for a department as yet). For example, to insert a new tuple for an employee who works in department number 5, we must enter the attribute values of department 5 correctly so that they are consistent with values for department 5 in other tuples in EMP_DEPT. In the design of Figure 1 we do not have to worry about this consistency problem because we enter only the department number in the employee tuple; all other attribute values of department 5 are recorded only once in the database, as a single tuple in the DEPARTMENT relation.

• It is difficult to insert a new department that has no employees as yet in the EMP_DEPT relation. The only way to do this is to place null values in the attributes for employee. This causes a problem because SSN is the primary key of EMP_DEPT, and each tuple is supposed to represent an employee entity—not a department entity. Moreover, when the first employee is assigned to that department, we do not need the tuple with null values any more. This problem does not occur in the design of Figure 1, because a department is entered in the DEPARTMENT relation whether or not any employees work for it, and whenever an employee is assigned to that department, a corresponding tuple is inserted in EMPLOYEE.


♦Deletion anomaly – The deletion anomaly occurs when a record is deleted from the relation. In this anomaly, deleting the facts about one entity automatically deletes the facts about another entity.

For example, if we delete from EMP_DEPT an employee tuple that happens to represent the last employee working for a particular department, the information concerning that department is lost from the database. This problem does not occur in the database of Figure 1 because DEPARTMENT tuples are stored separately.

♦Modification anomaly – The modification anomaly occurs when a record is modified in the relation. In this anomaly, changing the value of a specific attribute requires changing every record in which that value occurs.

The same information can be expressed in multiple records; therefore updates to the table may result in logical inconsistencies. If an update is not carried through to every such record, the table is left in an inconsistent state.

For example, in EMP_DEPT, if we change the value of one of the attributes of a particular department—say, the manager of department 5—we must update the tuples of all employees who work in that department; otherwise, the database will become inconsistent. If we fail to update some tuples, the same department will be shown to have two different values for manager in different employee tuples, which should not be the case.
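The modification anomaly described above can be reproduced with a minimal Python sketch. The rows are a reduced projection of Figure 2 (only Ssn, Dnumber and Dmgr_ssn), the helper names are hypothetical, and the new manager value written by the partial update is an assumption chosen only for illustration.

```python
# EMP_DEPT reduced to (Ssn, Dnumber, Dmgr_ssn); values taken from Figure 2.
emp_dept = [
    {"Ssn": "123456789", "Dnumber": 5, "Dmgr_ssn": "333445555"},
    {"Ssn": "333445555", "Dnumber": 5, "Dmgr_ssn": "333445555"},
    {"Ssn": "666884444", "Dnumber": 5, "Dmgr_ssn": "333445555"},
    {"Ssn": "999887777", "Dnumber": 4, "Dmgr_ssn": "987654321"},
]


def managers_per_department(rows):
    """Map each Dnumber to the set of manager SSNs recorded for it."""
    managers = {}
    for row in rows:
        managers.setdefault(row["Dnumber"], set()).add(row["Dmgr_ssn"])
    return managers


def inconsistent_departments(rows):
    """Departments that appear with more than one manager value."""
    found = managers_per_department(rows)
    return sorted(d for d, m in found.items() if len(m) > 1)


# A partial update: the (hypothetical) new manager of department 5
# is written to one employee tuple only, not to all of them.
emp_dept[0]["Dmgr_ssn"] = "666884444"

print(inconsistent_departments(emp_dept))  # [5]
```

In the design of Figure 1 the manager is stored once, in DEPARTMENT, so a single-row update cannot leave two different values behind.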

Ideally, a relational database table should be designed in such a way as to exclude the possibility of update, insertion, and deletion anomalies. The normal forms of relational database theory provide guidelines for deciding whether a particular design will be vulnerable to such anomalies. It is possible to correct an un-normalized design so that it adheres to the demands of the normal forms; this is called normalization. Removing the redundancies typically splits a table into several tables, with referential integrity constraints between them.

Define 1NF with example

1NF - A relation is in first normal form (1NF) if and only if the domain of every attribute includes only atomic (simple, indivisible) values and the value of any attribute in a tuple is a single value from the domain of that attribute. Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple.

Consider the DEPARTMENT relation schema shown in figure 3, whose primary key is DNUMBER. Each department can have a number of locations. As we can see, the DEPARTMENT relation is not in 1NF because DLOCATIONS is not a single-valued attribute, as illustrated by the first tuple in the following figure.

Figure 3

DEPARTMENT
DNAME           DNUMBER  DMGRENO    DLOCATIONS
Research        5        333445555  {Bellarie, Sugarland, Houston}
Administration  4        987654321  {Stafford}
Headquarters    1        888665555  {Houston}

There are three main techniques to achieve first normal form for such a relation:

1. Removing the attribute DLOCATIONS that violates 1NF and placing it in a separate relation DEPT_LOCATIONS along with the primary key DNUMBER of DEPARTMENT. The primary key of this relation will be the combination {DNUMBER, DLOCATION}, shown in figure 4. A distinct tuple in DEPT_LOCATIONS exists for each location of a department. This decomposes the non-1NF relation into two 1NF relations.

Figure 4

DEPARTMENT
DNAME           DNUMBER  DMGRENO
Research        5        333445555
Administration  4        987654321
Headquarters    1        888665555

DEPT_LOCATIONS
DNUMBER  DLOCATION
5        Bangalore
5        New Delhi
5        Hyderabad
4        Chennai
1        Hyderabad

2. Expanding the key so that there will be a separate tuple in the original DEPARTMENT relation for each location of a department, as shown in figure 5. In this case, the primary key becomes the combination {DNUMBER, DLOCATION}. This solution has the disadvantage of introducing redundancy in the relation.

Figure 5 - DEPARTMENT

DNAME           DNUMBER  DMGRENO    DLOCATION
Research        5        333445555  Bangalore
Research        5        333445555  New Delhi
Research        5        333445555  Hyderabad
Administration  4        987654321  Chennai
Headquarters    1        888665555  Hyderabad

3. If a maximum number of values is known for the attribute (for example, if it is known that at most three locations can exist for a department), replacing the DLOCATIONS attribute by three atomic attributes DLOCATION1, DLOCATION2, and DLOCATION3, as shown in figure 6. This solution has the disadvantage of introducing null values if most departments have fewer than three locations, which wastes space.

Figure 6 - DEPARTMENT

DNAME           DNUMBER  DMGRENO    DLOCATION1  DLOCATION2  DLOCATION3
Research        5        333445555  Bangalore   New Delhi   Hyderabad
Administration  4        987654321  Chennai     Null        Null
Headquarters    1        888665555  Hyderabad   Null        Null

Of the three solutions above, the first is generally considered best because it does not suffer from redundancy and is completely general, having no limit placed on a maximum number of values.
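As a rough illustration of the first technique, the following Python sketch splits the set-valued DLOCATIONS attribute of Figure 3 into a separate DEPT_LOCATIONS relation. The function name to_1nf and the dictionary representation of tuples are assumptions made for this sketch, not part of the notes.

```python
# Non-1NF DEPARTMENT from Figure 3: DLOCATIONS holds a set of values.
department = [
    {"DNAME": "Research", "DNUMBER": 5, "DMGRENO": "333445555",
     "DLOCATIONS": {"Bellarie", "Sugarland", "Houston"}},
    {"DNAME": "Administration", "DNUMBER": 4, "DMGRENO": "987654321",
     "DLOCATIONS": {"Stafford"}},
    {"DNAME": "Headquarters", "DNUMBER": 1, "DMGRENO": "888665555",
     "DLOCATIONS": {"Houston"}},
]


def to_1nf(rows, key, multivalued, single):
    """Move a multivalued attribute into its own relation.

    Returns (base, detail): base is the original relation without the
    multivalued attribute; detail has one tuple per (key, value) pair,
    so its primary key is the combination {key, single}.
    """
    base = [{k: v for k, v in row.items() if k != multivalued} for row in rows]
    pairs = sorted((row[key], value) for row in rows for value in row[multivalued])
    detail = [{key: k, single: v} for k, v in pairs]
    return base, detail


department_1nf, dept_locations = to_1nf(
    department, "DNUMBER", "DLOCATIONS", "DLOCATION"
)
```

Every attribute value in both results is now atomic, and no limit is placed on the number of locations per department.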

First normal form also disallows multivalued attributes that are themselves composite. These are called nested relations because each tuple can have a relation within it.

Figure 7 shows how the EMP_PROJ relation could appear if nesting is allowed. Each tuple represents an employee entity, and a relation PROJS (PNUMBER, HOURS) within each tuple represents the employee's projects and the hours per week that employee works on each project. The schema of this EMP_PROJ relation can be represented as follows

EMP_PROJ (SSN, ENAME, {PROJS (PNUMBER, HOURS)}). The set braces {} identify the attribute PROJS as multivalued, and we list the component attributes that form PROJS between parentheses ( ).

Figure 7 - EMP_PROJ

Ssn        Ename                  PROJS
                                  Pnumber  Hours
123456789  Smith, John B          1        32.5
                                  2        7.5
666884444  Narayan, Ramesh K.     3        40.0
453453453  English, Joyce A.      1        20.0
                                  2        20.0
333445555  Wong, Franklin T.      2        10.0
                                  3        10.0
                                  10       10.0
                                  20       10.0
999887777  Zelyala, Aliciya J.    30       30.0
                                  10       10.0
987987987  Jabbar, Ahamed V.      10       35.5
                                  30       5.5
987654321  Wallance, Jennifer S.  30       20.0
                                  20       15.5
888665555  Borg, James E          20       NULL

Ssn is the primary key of the EMP_PROJ relation, while PNUMBER is the partial key of the nested relation; that is, within each tuple, the nested relation must have unique values of PNUMBER. To normalize this into 1NF, we remove the nested relation attributes into a new relation and propagate the primary key into it; the primary key of the new relation will combine the partial key with the primary key of the original relation. Decomposition and primary key propagation yield the schemas EMP_PROJ1 and EMP_PROJ2 shown in figure 8.

Figure 8

EMP_PROJ1
SSN  ENAME

EMP_PROJ2
SSN  PNUMBER  HOURS

This procedure can be applied recursively to a relation with multiple-level nesting to unnest the relation into a set of 1NF relations.
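The decomposition with primary key propagation can be sketched in Python for a single level of nesting. Only three of the Figure 7 tuples are used, and the function name unnest and the list-of-dictionaries layout are assumptions of this sketch.

```python
# A few EMP_PROJ tuples from Figure 7, with PROJS as a nested relation.
emp_proj = [
    {"SSN": "123456789", "ENAME": "Smith, John B",
     "PROJS": [{"PNUMBER": 1, "HOURS": 32.5}, {"PNUMBER": 2, "HOURS": 7.5}]},
    {"SSN": "666884444", "ENAME": "Narayan, Ramesh K.",
     "PROJS": [{"PNUMBER": 3, "HOURS": 40.0}]},
    {"SSN": "888665555", "ENAME": "Borg, James E",
     "PROJS": [{"PNUMBER": 20, "HOURS": None}]},
]


def unnest(rows, key, nested):
    """Split one nested relation out of rows.

    outer keeps the non-nested attributes; inner holds one tuple per
    nested tuple, with the outer primary key propagated into it.
    """
    outer = [{k: v for k, v in row.items() if k != nested} for row in rows]
    inner = [{key: row[key], **item} for row in rows for item in row[nested]]
    return outer, inner


emp_proj1, emp_proj2 = unnest(emp_proj, "SSN", "PROJS")
```

Here emp_proj1 has the schema (SSN, ENAME) and emp_proj2 has the schema (SSN, PNUMBER, HOURS), whose key is the combination of SSN and PNUMBER.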

Functional Dependencies

A functional dependency (FD) is a constraint between two sets of attributes in a relation from a database. Functional dependencies play a key role in differentiating good database designs from bad database designs.

Definition of Functional Dependency - A functional dependency, denoted by X →Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], we must also have t1[Y] = t2[Y].

This means that the values of the Y component of a tuple in r depend on, or are determined by, the values of the X component; or alternatively, the values of the X component of a tuple uniquely (or functionally) determine the values of the Y component. We also say that there is a functional dependency from X to Y or that Y is functionally dependent on X. The abbreviation for functional dependency is FD or f.d. The set of attributes X is called the left-hand side of the FD, and Y is called the right-hand side.

As an example, consider the following schema: Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount). We know that a bank branch has a unique value of assets, so given a branch name we can uniquely identify the assets value. On the other hand, we know that a branch may make many loans, so given a branch name we cannot uniquely determine a loan number. In other words, we say that the functional dependency branch-name -> assets holds on Lending-schema, but we do not expect the functional dependency branch-name -> loan-number to hold. The fact that a branch has a particular value of assets and the fact that a branch makes a loan are independent.

Some important concepts regarding FD

• Relation extensions r(R) that satisfy the functional dependency constraints are called legal extensions (or legal relation states) of R, because they obey the functional dependency constraints. A functional dependency is a property of the relation schema (intension) R, not of a particular legal relation state (extension) r of R. Hence, an FD cannot be inferred automatically from a given relation extension r but must be defined explicitly by someone who knows the semantics of the attributes of R.

Figure 9 - TEACH

Teacher  Course           Text
Smith    Data Structure   Bartram
Smith    Data Management  Martin
Hall     Compilers        Hoffman
Brown    Data Structure   Horowitz

For example, Figure 9 shows a particular state of the TEACH relation schema. Although at first glance we may think that TEXT → COURSE, we cannot confirm this unless we know that it is true for all possible legal states of TEACH.

• A functional dependency is a type of constraint that is a generalization of the notion of key.


Definition of Superkey - Let R be a relation schema. A subset K of R is a superkey of R if, in any legal relation r(R), for all pairs t1 and t2 of tuples in r such that t1 ≠ t2, we have t1[K] ≠ t2[K]. That is, no two tuples in any legal relation r(R) may have the same value on attribute set K.

Functional dependency X→ Y says that, X functionally determines Y in a relation schema R if and only if, whenever two tuples of r(R) agree on their X-value, they must necessarily agree on their Y-value. Using the functional-dependency notation, we say that X is a superkey of R if X-> R. That is, X is a superkey if, whenever t1[X] = t2[X], it is also the case that t1 [R] = t2 [R] (that is, t1 = t2).

Similarly, we can say if a constraint on R states that there cannot be more than one tuple with a given X-value in any relation instance r(R)—that is, X is a candidate key of R—this implies that X →Y for any subset of attributes Y of R (because the key constraint implies that no two tuples in any legal state r(R) will have the same value of X).

• If X →Y in R, this does not say whether or not Y →X in R.

Uses of functional dependencies:

1. To test relations to see whether they are legal under a given set of functional dependencies. If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.
2. To specify constraints on the set of legal relations.

Example: consider the sample relation r below.

A   B   C   D
a1  b1  c1  d1
a1  b2  c1  d2
a2  b2  c2  d2
a2  b3  c2  d3
a3  b3  c2  d4

Here, A -> C is satisfied. There are two tuples that have an A value of a1. These tuples have the same C value—namely, c1. Similarly, the two tuples with an A value of a2 have the same C value, c2. There are no other pairs of distinct tuples that have the same A value.

The functional dependency C -> A is not satisfied, however. To see that it is not, consider the tuples t1 = (a2, b3, c2, d3) and t2 = (a3, b3, c2, d4). These two tuples have the same C value, c2, but they have different A values, a2 and a3, respectively. Thus, we have found a pair of tuples t1 and t2 such that t1[C] = t2[C], but t1[A] ≠ t2[A].
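The check carried out by hand above can be sketched in Python: a relation instance satisfies X -> Y when no pair of tuples agrees on X and differs on Y. The function names violations and satisfies are hypothetical, and tuples are represented as dictionaries for this sketch.

```python
from itertools import combinations

# The sample relation r from the example.
r = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d2"},
    {"A": "a2", "B": "b2", "C": "c2", "D": "d2"},
    {"A": "a2", "B": "b3", "C": "c2", "D": "d3"},
    {"A": "a3", "B": "b3", "C": "c2", "D": "d4"},
]


def project(t, attrs):
    """Values of tuple t on the attribute list attrs, i.e. t[attrs]."""
    return tuple(t[a] for a in attrs)


def violations(rows, lhs, rhs):
    """Pairs (t1, t2) with t1[lhs] = t2[lhs] but t1[rhs] != t2[rhs]."""
    return [
        (t1, t2)
        for t1, t2 in combinations(rows, 2)
        if project(t1, lhs) == project(t2, lhs)
        and project(t1, rhs) != project(t2, rhs)
    ]


def satisfies(rows, lhs, rhs):
    """True if this relation instance satisfies lhs -> rhs."""
    return not violations(rows, lhs, rhs)


print(satisfies(r, ["A"], ["C"]))  # True
print(satisfies(r, ["C"], ["A"]))  # False
```

Such a test only says whether one particular instance satisfies the dependency; whether the dependency holds on the schema still has to come from the semantics of the attributes.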

Trivial Functional Dependency

Some functional dependencies are said to be trivial because all relations satisfy them. For example, A -> A is satisfied by all relations involving attribute A. Similarly, AB -> A is satisfied by all relations involving attributes A and B. In general, a functional dependency of the form α -> β is trivial if β ⊆ α, where α and β are sets of attributes of a relation R.

There is a difference between the concepts of a relation satisfying a dependency and a dependency holding on a schema. If we consider the customer relation in following figure, we see that customer-street-> customer-city is satisfied.

Customer relation

customer-name  customer-street  customer-city
Jones          Main             Harrison
Smith          North            Rye
Hayes          Main             Harrison
Curry          North Park       Rye
Lindsay        Putnam           Pittsfield

However, in the real world, two cities can have streets with the same name. Thus, it is possible, at some time; to have an instance of the customer relation in which customer-street -> customer-city is not satisfied. So, we would not include customer-street -> customer-city in the set of functional dependencies that hold on Customer-schema.

In the loan relation of the following figure, the dependency loan-number -> amount is satisfied. In contrast to the case of customer-city and customer-street in Customer-schema, the real-world enterprise requires each loan to have only one amount. Therefore, we want to require that loan-number -> amount be satisfied by the loan relation at all times. In other words, the constraint loan-number -> amount holds on Loan-schema.

Loan relation

loan-number  branch-name  amount
L-17         Downtown     1000
L-23         Redwood      2000
L-15         Perryridge   1500
L-14         Downtown     1500
L-93         Mianus       500

**** Closure of a set of Functional Dependencies (Important)

It is not sufficient to consider the given set of functional dependencies. Rather, we need to consider all functional dependencies that hold. For a given set F of functional dependencies, it can be proved that certain other functional dependencies hold. We say that such functional dependencies are "logically implied" by F.

Logically Implied (or inferred) Functional Dependency - For a given relational schema R, a functional dependency f on R is logically implied by a set of functional dependencies F on R if every relation instance r(R) that satisfies F also satisfies f.

Example: - Consider a relation schema R = (A, B, C, G, H, I) and the set of functional dependencies

A -> B
A -> C
CG -> H
CG -> I
B -> H

The functional dependency

A -> H

is logically implied. That is, whenever the given set of functional dependencies holds on a relation, A -> H must also hold on the relation.

Proof: Suppose that t1 and t2 are tuples such that

t1[A] = t2[A]

Since we are given that A -> B, it follows from the definition of functional dependency that

t1[B] = t2[B]

Then, since we are given that B -> H, it follows from the definition of functional dependency that

t1[H] = t2[H]

Therefore, we have shown that, whenever t1 and t2 are tuples such that t1[A] = t2[A], it must be that t1[H] = t2[H]. But that is exactly the definition of A -> H.

Definition of Closure - Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F; it contains F itself.

Axioms - Given F, we can compute F+ directly from the formal definition of functional dependency. If F were large, this process would be lengthy and difficult. Axioms, or rules of inference, provide a simpler technique for reasoning about functional dependencies.

Armstrong's Axioms-

We can use the following three rules to find logically implied functional dependencies. By applying these rules repeatedly, we can find all of F+, given F. This collection of rules is called Armstrong's axioms in honor of the person who first proposed it.

Reflexivity rule - If α is a set of attributes and β ⊆ α, then α -> β holds.

Augmentation rule - If α -> β holds and γ is a set of attributes, then γα -> γβ holds.

Transitivity rule - If α -> β holds and β -> γ holds, then α -> γ holds.

Armstrong's axioms are sound, because they do not generate any incorrect functional dependencies. They are complete, because, for a given set F of functional dependencies, they allow us to generate all F + .

Although Armstrong's axioms are complete, it is tiresome to use them directly for the computation of F+. To simplify matters further, we list additional rules. It is possible to use Armstrong's axioms to prove that these rules are correct.

Union rule: If α -> β holds and α -> γ holds, then α -> βγ holds.

Decomposition rule: If α -> βγ holds, then α -> β holds and α -> γ holds.

Pseudo-transitivity rule: If α -> β holds and γβ -> δ holds, then αγ -> δ holds.


Proof of Axioms (inference rules)

Each of the preceding inference rules can be proved from the definition of functional dependency, either by direct proof or by contradiction. A proof by contradiction assumes that the rule does not hold and shows that this is not possible.

Proof of Reflexivity rule

Suppose that X ⊇ Y and that two tuples t1 and t2 exist in some relation instance r of R such that t1[X] = t2[X]. Then t1[Y] = t2[Y] because X ⊇ Y; hence, X → Y must hold in r.

Proof of Augmentation rule

Let us assume that X → Y holds in a relation instance r of R but that XZ → YZ does not hold. Then there must exist two tuples t1 and t2 in r such that (1) t1[X] = t2[X], (2) t1[Y] = t2[Y], (3) t1[XZ] = t2[XZ], and (4) t1[YZ] ≠ t2[YZ]. This is not possible because from (1) and (3) we deduce (5) t1[Z] = t2[Z], and from (2) and (5) we deduce (6) t1[YZ] = t2[YZ], contradicting (4).

Proof of Transitivity rule

Let us assume that (1) X → Y and (2) Y → Z both hold in a relation r. Then for any two tuples t1 and t2 in r such that t1[X] = t2[X], we must have (3) t1[Y] = t2[Y], from assumption (1); hence we must also have (4) t1[Z] = t2[Z], from (3) and assumption (2); hence X → Z must hold in r.

Proof of Decomposition rule (using Armstrong's axioms)

1. X → YZ (given).
2. YZ → Y (using the reflexivity rule, knowing that YZ ⊇ Y).
3. X → Y (using the transitivity rule on 1 and 2).

The same steps with YZ → Z in step 2 give X → Z.

Proof of Union rule (using Armstrong's axioms)

1. X → Y (given).
2. X → Z (given).
3. X → XY (using the augmentation rule on 1 by augmenting with X; notice that XX = X).
4. XY → YZ (using the augmentation rule on 2 by augmenting with Y).
5. X → YZ (using the transitivity rule on 3 and 4).

Proof of Pseudo-transitivity rule (using Armstrong's axioms)

1. X → Y (given).
2. WY → Z (given).
3. WX → WY (using the augmentation rule on 1 by augmenting with W).
4. WX → Z (using the transitivity rule on 3 and 2).

A procedure to compute F+

F+ = F
repeat
    for each functional dependency f in F+
        apply reflexivity and augmentation rules on f
        add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
        if f1 and f2 can be combined using transitivity
            add the resulting functional dependency to F+
until F+ does not change any further

Let us apply our rules to the example of schema R (A, B, C, G, H, I) and the set F of functional dependencies {A-> B, A-> C, CG-> H, CG-> I, B-> H}. We list several members of F+ here:

• A -> H. Since A -> B and B -> H hold, we apply the transitivity rule.

• CG -> HI. Since CG -> H and CG -> I, the union rule implies that CG -> HI.

• AG -> I. Since A -> C and CG -> I, the pseudo-transitivity rule implies that AG -> I holds. Another way of finding that AG -> I holds is as follows: we use the augmentation rule on A -> C to infer AG -> CG. Applying the transitivity rule to this dependency and CG -> I, we infer AG -> I.

The left-hand and right-hand sides of a functional dependency are both subsets of R. Since a set of size n has 2^n subsets, there are a total of 2^n x 2^n = 2^(2n) possible functional dependencies, where n is the number of attributes in R.

Typically, database designers first specify the set of functional dependencies F that can easily be determined from the semantics of the attributes of R; then reflexivity, augmentation, and transitivity are used to infer additional functional dependencies that will also hold on R. A systematic way to determine these additional functional dependencies is first to determine each set of attributes α that appears as a left-hand side of some functional dependency in F and then to determine the set of all attributes that are dependent on α. Thus, for each such set of attributes α, we determine the set of attributes that are functionally determined by α based on F; this set is called the closure of α under F.

***** Definition of Closure of Attribute Sets (Important)

Let α be a set of attributes. The set of all attributes functionally determined by α under a set F of functional dependencies is called the closure of α under F; we denote it by α+.

An algorithm to compute α+, the closure of α under F (attribute closure algorithm):

result := α;
while (changes to result) do
    for each functional dependency β -> γ in F do
    begin
        if β ⊆ result then result := result ∪ γ;
    end

Here the input is a set F of functional dependencies and the set α of attributes. The output is stored in the variable result.

To illustrate how the algorithm works, we shall use it to compute (AG)+ with the set F of functional dependencies {A -> B, A -> C, CG -> H, CG -> I, B -> H}.

We start with result = AG. The first time that we execute the while loop to test each functional dependency, we find that

• A -> B causes us to include B in result. To see this fact, we observe that A -> B is in F and A ⊆ result (which is AG), and so result := result ∪ B.

• A -> C causes result to become ABCG.

• CG -> H causes result to become ABCGH.

• CG -> I causes result to become ABCGHI.

The second time that we execute the while loop, no new attributes are added to result, and the algorithm terminates.

Another example to illustrate Closure of Attribute Sets

Consider the following relation:

EMP_PROJ
SSN  PNUMBER  HOURS  ENAME  PNAME  PLOCATIONS

Here

SSN → ENAME
PNUMBER → PNAME, PLOCATIONS
SSN, PNUMBER → HOURS

Hence,

{SSN}+ = {SSN, ENAME}

{PNUMBER}+ = {PNUMBER, PNAME, PLOCATIONS}

{SSN, PNUMBER}+ = {SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATIONS}
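The attribute closure algorithm and both worked examples can be sketched in Python as follows. The representation is an assumption of this sketch: a functional dependency is a pair (left side, right side), and each side is any iterable of attribute names, so a string such as "CG" stands for the attribute set {C, G}.

```python
def closure(attrs, fds):
    """Return the closure of the attribute set attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:                      # "while (changes to result)"
        changed = False
        for lhs, rhs in fds:
            # if lhs is contained in result, add rhs to result
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# First example: single-letter attributes, so strings act as attribute sets.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]
print("".join(sorted(closure("AG", F))))  # ABCGHI

# Second example: EMP_PROJ, with attribute names given as lists.
G = [
    (["SSN"], ["ENAME"]),
    (["PNUMBER"], ["PNAME", "PLOCATIONS"]),
    (["SSN", "PNUMBER"], ["HOURS"]),
]
print(sorted(closure(["SSN"], G)))  # ['ENAME', 'SSN']
```

The outer loop repeats only while some dependency added a new attribute, which mirrors the termination condition of the algorithm.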

There are several uses of the attribute closure algorithm:

i) To test if α is a superkey, we compute α+ and check if α+ contains all attributes of R.

ii) We can check if a functional dependency α -> β holds (or, in other words, is in F+) by checking if β ⊆ α+. That is, we compute α+ by using attribute closure, and then check if it contains β.

iii) It gives us an alternative way to compute F+: for each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+ we output a functional dependency γ -> S.
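The three uses listed above can be sketched in Python on the schema R = (A, B, C, G, H, I). The function names are hypothetical, the closure helper is repeated so that the block is self-contained, and single-letter attribute names are assumed so that a string can stand for an attribute set.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of the attribute set attrs under the FDs in fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    """Use (i): attrs is a superkey if its closure contains all of R."""
    return closure(attrs, fds) >= set(schema)


def implies(fds, lhs, rhs):
    """Use (ii): lhs -> rhs is in F+ if rhs is contained in lhs+."""
    return set(rhs) <= closure(lhs, fds)


def subsets(attrs):
    """All subsets of attrs, as frozensets (including the empty set)."""
    items = sorted(attrs)
    return [frozenset(c) for n in range(len(items) + 1)
            for c in combinations(items, n)]


def fd_closure(schema, fds):
    """Use (iii): F+ as a set of (lhs, rhs) pairs of frozensets."""
    return {(g, s) for g in subsets(schema) for s in subsets(closure(g, fds))}


R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(is_superkey("AG", R, F))   # True
print(implies(F, "CG", "HI"))    # True
print(implies(F, "B", "A"))      # False
```

Enumerating F+ this way touches every subset of R, so it is practical only for small schemas such as this one.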

Canonical Cover

Suppose that we have a set of functional dependencies F on a relation schema. Whenever a user performs an update on the relation, the database system must ensure that the update does not violate any functional dependencies, that is, all the functional dependencies in F are satisfied in the new database state. The system must roll back the update if it violates any functional dependencies in the set F.

We can reduce the effort spent in checking for violations by testing a simplified set of functional dependencies that has the same closure as the given set. Any database that satisfies the simplified set of functional dependencies will also satisfy the original set, and vice versa, since the two sets have the same closure. However, the simplified set is easier to test.


Extraneous Attributes of a Functional Dependency - An attribute of a functional dependency is said to be extraneous if we can remove it without changing the closure of the set of functional dependencies. The formal definition of extraneous attributes is as follows. Consider a set F of functional dependencies and the functional dependency α -> β in F.

Attribute A is extraneous in α if A ∈ α, and F logically implies (F - {α -> β}) ∪ {(α - A) -> β}.

Attribute A is extraneous in β if A ∈ β, and the set of functional dependencies (F - {α -> β}) ∪ {α -> (β - A)} logically implies F.

Testing extraneous attributes

If A ∈ β, to check whether A is extraneous, consider the set F´ = (F - {α -> β}) ∪ {α -> (β - A)} and check whether α -> A can be inferred from F´. To do so, compute α+ (the closure of α) under F´; if α+ includes A, then A is extraneous in β.

If A ∈ α, to check whether A is extraneous, let γ = α - {A} and check whether γ -> β can be inferred from F. To do so, compute γ+ (the closure of γ) under F; if γ+ includes all attributes in β, then A is extraneous in α.
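The two tests above can be sketched as follows. This is an illustrative sketch only: the helper names (closure, extraneous_in_lhs, extraneous_in_rhs, fd) are assumptions, attributes are single letters, and the sample sets F1 and F2 are the ones used in Explanation 1 and Explanation 2 of these notes.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build a dependency from two strings of single-letter attributes."""
    return (frozenset(lhs), frozenset(rhs))


def extraneous_in_lhs(fds, i, a):
    """A is extraneous in the left side of fds[i] if (lhs - A)+ under F covers rhs."""
    lhs, rhs = fds[i]
    return a in lhs and rhs <= closure(lhs - {a}, fds)


def extraneous_in_rhs(fds, i, a):
    """A is extraneous in the right side of fds[i] if lhs+ under F' still contains A."""
    lhs, rhs = fds[i]
    f_prime = fds[:i] + [(lhs, rhs - {a})] + fds[i + 1:]
    return a in rhs and a in closure(lhs, f_prime)


F1 = [fd("ABC", "XYZ"), fd("BC", "XYZ")]
F2 = [fd("ABC", "XYZ"), fd("ABC", "YZ"), fd("YZ", "X")]

print(extraneous_in_lhs(F1, 0, "A"))  # True
print(extraneous_in_lhs(F1, 0, "B"))  # False
print(extraneous_in_rhs(F2, 0, "X"))  # True
```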


Explanation 1: Suppose F contains
1) ABC -> XYZ (say α = ABC and β = XYZ)
2) BC -> XYZ
Then A is extraneous in α.

F logically implies (F - {ABC -> XYZ}) ∪ {BC -> XYZ}, because BC -> XYZ is already in F. Nothing is lost by the replacement either: by the augmentation rule, BC -> XYZ gives ABC -> AXYZ, and by the decomposition rule, ABC -> AXYZ gives ABC -> XYZ.

Explanation 2: Suppose F contains
1) ABC -> XYZ (say α = ABC and β = XYZ)
2) ABC -> YZ
3) YZ -> X
Then X is extraneous in β.

The set (F - {ABC -> XYZ}) ∪ {ABC -> YZ} logically implies F: by the transitivity rule, ABC -> YZ and YZ -> X give ABC -> X, and by the union rule, ABC -> YZ and ABC -> X give ABC -> XYZ.


Canonical Cover - A canonical cover Fc for F is a set of dependencies such that F logically implies all dependencies in Fc, and Fc logically implies all dependencies in F. Furthermore, Fc must have the following properties:

i) No functional dependency in Fc contains an extraneous attribute.

ii) Each left side of a functional dependency in Fc is unique. That is, there are no two dependencies α1 -> β1 and α2 -> β2 in Fc such that α1 = α2.

The canonical cover Fc of F can be shown to have the same closure as F; hence, testing whether Fc is satisfied is equivalent to testing whether F is satisfied.

Is the canonical cover unique?

No, the canonical cover of a set of functional dependencies is not unique. A canonical cover of a set of functional dependencies F is a minimal set of dependencies that is equivalent to F, and there can be several canonical covers for the same F.

Computing canonical cover

Fc = F
repeat
    Use the union rule to replace any dependencies in Fc of the form α1 -> β1 and α1 -> β2 with α1 -> β1 β2
    Find a functional dependency α -> β in Fc with an extraneous attribute either in α or in β
        /* Note: the test for extraneous attributes is done using Fc, not F */
    If an extraneous attribute is found, delete it from α -> β
until Fc does not change.

Consider the following set F of functional dependencies on schema (A, B, C):
A -> BC
B -> C
A -> B
AB -> C
Let us compute the canonical cover for F.

There are two functional dependencies with the same set of attributes on the left side of the arrow:
A -> BC
A -> B
We combine these functional dependencies into A -> BC.

A is extraneous in AB -> C because F logically implies (F - {AB -> C}) ∪ {B -> C}. This assertion is true because B -> C is already in our set of functional dependencies.

C is extraneous in A -> BC, since A -> BC is logically implied by A -> B and B -> C.

Thus, our canonical cover is
A -> B
B -> C
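The repeat/until procedure and the worked example above can be sketched in Python as follows. This is one possible rendering under stated assumptions: the names closure, fd and canonical_cover are illustrative, attributes are single letters, and attributes are tried in alphabetical order. Because a canonical cover is not unique, a different order of removal may produce a different but equally valid cover.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def fd(lhs, rhs):
    """Build a dependency from two strings of single-letter attributes."""
    return (frozenset(lhs), frozenset(rhs))


def remove_one_extraneous(fc):
    """Delete one extraneous attribute from fc (tested against fc itself)."""
    for i, (lhs, rhs) in enumerate(fc):
        for a in sorted(lhs):
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, fc):
                fc[i] = (lhs - {a}, rhs)
                return True
        for a in sorted(rhs):
            rest = fc[:i] + [(lhs, rhs - {a})] + fc[i + 1:]
            if a in closure(lhs, rest):
                fc[i] = (lhs, rhs - {a})
                return True
    return False


def canonical_cover(fds):
    fc = list(fds)
    changed = True
    while changed:
        # Union rule: merge dependencies that share a left side.
        merged = {}
        for lhs, rhs in fc:
            merged[lhs] = merged.get(lhs, frozenset()) | rhs
        changed = len(merged) != len(fc)
        fc = list(merged.items())
        if remove_one_extraneous(fc):
            changed = True
        # Drop dependencies whose right side became empty.
        fc = [(lhs, rhs) for lhs, rhs in fc if rhs]
    return set(fc)


F = [fd("A", "BC"), fd("B", "C"), fd("A", "B"), fd("AB", "C")]
print(canonical_cover(F) == {fd("A", "B"), fd("B", "C")})  # True
```

With this attribute order the sketch reproduces the result derived by hand, Fc = {A -> B, B -> C}.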


Another example:

Consider the following set F of functional dependencies on schema (A, B, D):
B -> A
D -> A
AB -> D
Let us compute the canonical cover for F.

Step 1: Fc = F, hence Fc = {B -> A, D -> A, AB -> D}.
Step 2: No two functional dependencies have the same set of attributes on the left side of the arrow.
Step 3: A is extraneous in AB -> D because Fc logically implies (Fc - {AB -> D}) ∪ {B -> D}. This assertion is true because B -> A is already in Fc. Augmenting both sides with B gives BB -> AB, that is, B -> AB (i). AB -> D (ii) is in Fc, so the transitivity rule applied to (i) and (ii) gives B -> D (iii). So AB -> D is replaced by B -> D in Fc. No further reduction of left-hand sides is possible, since all the FDs now have a single attribute on the left-hand side.
Step 4: Fc has changed. Now Fc = {B -> A, D -> A, B -> D}.
Step 5: Two functional dependencies have the same set of attributes on the left side of the arrow: B -> D and B -> A. We combine them into B -> AD (iv).
Step 6: A is extraneous in B -> AD, since B -> AD is logically implied by B -> D (v) and D -> A (vi).
Step 7: Fc has changed. Now Fc = {D -> A, B -> D}.
Step 8: No two functional dependencies have the same set of attributes on the left side of the arrow.
Step 9: No reduction is possible: every FD has a single attribute on each side, and neither FD can be inferred from the other.
Step 10: Fc does not change. So the canonical cover is Fc = {D -> A, B -> D}.

Define 2NF with example

2NF - Second normal form (2NF) is based on the concept of full functional dependency. A functional dependency X -> Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more; that is, for any attribute A ∈ X, (X - {A}) does not functionally determine Y.

A functional dependency X -> Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds; that is, for some A ∈ X, (X - {A}) -> Y.

2NF - A relation schema R is in 2NF with respect to a set F of functional dependencies if it satisfies 1NF and every nonprime attribute A in R is fully functionally dependent on the primary key of R.

The general definition is: a relation schema R is in second normal form (2NF) with respect to a set F of functional dependencies if it satisfies 1NF and no nonprime attribute A in R is partially dependent on any candidate key of R.


The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the primary key. If the primary key contains a single attribute, the test need not be applied at all.

Example 1

Consider the relation in Figure 10:

Figure 10 - EMP_PROJ(SSN, PNUMBER, HOURS, ENAME, PNAME, PLOCATION), with primary key {SSN, PNUMBER} and functional dependencies
FD1: {SSN, PNUMBER} -> HOURS
FD2: SSN -> ENAME
FD3: PNUMBER -> {PNAME, PLOCATION}

The EMP_PROJ relation in the above figure is in 1NF but is not in 2NF. The nonprime attribute ENAME violates 2NF because of FD2, as do the nonprime attributes PNAME and PLOCATION because of FD3. The functional dependencies FD2 and FD3 make ENAME, PNAME, and PLOCATION partially dependent on the primary key {SSN, PNUMBER} of EMP_PROJ, thus violating the 2NF test.
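The 2NF test on EMP_PROJ can be sketched as a search for nonprime attributes determined by a proper subset of a candidate key. The sketch below is illustrative only: the function names are assumptions, the candidate keys are supplied by hand rather than computed, and the dependencies are the three that the discussion above attributes to EMP_PROJ.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(keys, fds):
    """Return (part_of_key, nonprime_attributes) pairs that violate 2NF."""
    prime = set().union(*keys)
    found = []
    for key in keys:
        # Every nonempty proper subset of a candidate key is a possible culprit.
        for size in range(1, len(key)):
            for part in combinations(sorted(key), size):
                dependents = closure(part, fds) - prime
                if dependents:
                    found.append((set(part), dependents))
    return found


emp_proj_fds = [
    ({"SSN", "PNUMBER"}, {"HOURS"}),
    ({"SSN"}, {"ENAME"}),
    ({"PNUMBER"}, {"PNAME", "PLOCATION"}),
]
violations = partial_dependencies([{"SSN", "PNUMBER"}], emp_proj_fds)
for part, dependents in violations:
    print(sorted(part), "->", sorted(dependents))
```

An empty result would mean that no nonprime attribute depends on part of a candidate key, so the schema passes the 2NF test.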

If a relation schema is in a lower normal form, we bring it to a higher normal form by decomposing it into several schemas with fewer attributes. Here we decompose EMP_PROJ into the three relation schemas shown in Figure 11.

Figure 11 - Decomposition of EMP_PROJ: EP1(SSN, PNUMBER, HOURS) with FD1; EP2(SSN, ENAME) with FD2; and a third schema (PNUMBER, PNAME, PLOCATION) with FD3.

Partial and full functional dependencies will now be considered with respect to all candidate keys of a relation.

Example 2

Consider the relation schema LOTS shown in following figure which describes parcels of land for sale in various counties of a state. Suppose that there are two candidate keys: PROPERTY_ID# and {COUNTY_NAME, LOT#}; that is, lot numbers are unique only within each county, but PROPERTY-ID numbers are unique across counties for the entire state.

Based on the two candidate keys PROPERTY_ID# and {COUNTY_NAME, LOT#}, we know that the functional dependencies FD1 and FD2 of the figure hold.

We choose PROPERTY_ID# as the primary key, so it is underlined, but no special consideration will be given to this key over the other candidate key. Suppose that the following two additional functional dependencies hold in LOTS:

FD3: COUNTY_NAME -> TAX_RATE
FD4: AREA -> PRICE

In words, the dependency FD3 says that the tax rate is fixed for a given county (it does not vary lot by lot within the same county), while FD4 says that the price of a lot is determined by its area regardless of which county it is in.

LOTS(PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE, TAX_RATE), with functional dependencies FD1, FD2, FD3, and FD4.

The LOTS relation schema violates the general definition of 2NF because TAX_RATE is partially dependent on the candidate key {COUNTY_NAME, LOT#}, due to FD3. To normalize LOTS into 2NF, we decompose it into the two relations LOTS1 and LOTS2, shown below. We construct LOTS1 by removing the attribute TAX_RATE that violates 2NF from LOTS and placing it with COUNTY_NAME (the left-hand side of FD3 that causes the partial dependency) into another relation LOTS2. Both LOTS1 and LOTS2 are in 2NF. Notice that FD4 does not violate 2NF and is carried over to LOTS1.

LOTS1(PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE), with FD1, FD2, and FD4.

LOTS2(COUNTY_NAME, TAX_RATE), with FD3.

**The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of a candidate key. If every candidate key contains a single attribute, the test need not be applied at all.

Define 3NF with example

Third normal form (3NF) is based on the concept of transitive dependency. A functional dependency X -> Y in a relation schema R is a transitive dependency if there is a set of attributes Z that is neither a candidate key nor a subset of any key of R, and both X -> Z and Z -> Y hold.

3NF - According to Codd's original definition, a relation schema R is in 3NF with respect to a set F of functional dependencies if it satisfies 2NF and no nonprime attribute of R is transitively dependent on the primary key. Informally, a 3NF relation should not have a nonkey attribute functionally determined by another nonkey attribute (or by a set of nonkey attributes).

General definition: A relation schema R is in third normal form (3NF) with respect to a set F of functional dependencies if, whenever a nontrivial functional dependency X -> A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R.

Example 1: Consider the relation schema EMP_DEPT in Figure 12.

Figure 12 - EMP_DEPT(ENAME, SSN, DOB, ADDRESS, DNUMBER, DNAME, DMGRNO)

The above relation schema is in 2NF, since no partial dependencies on a key exist. However, EMP_DEPT is not in 3NF because of the transitive dependency of DMGRNO (and also DNAME) on SSN via DNUMBER. We can normalize EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2.

ED1(ENAME, SSN, DOB, ADDRESS, DNUMBER)

ED2(DNUMBER, DNAME, DMGRNO)

Intuitively, we see that ED1 and ED2 represent independent entity facts about employees and departments.
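The general definition of 3NF can be checked mechanically once the candidate keys are known. The following sketch is illustrative only: the function names are assumptions, the candidate key {SSN} is supplied by hand, and the two dependencies are the ones implied by the discussion of EMP_DEPT (SSN determines the employee attributes and DNUMBER; DNUMBER determines DNAME and DMGRNO).

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violations_3nf(schema, keys, fds):
    """List (X, A) where X -> A is nontrivial, X is no superkey and A is nonprime."""
    prime = set().union(*keys)
    bad = []
    for lhs, rhs in fds:
        lhs_is_superkey = closure(lhs, fds) >= set(schema)
        for a in sorted(rhs - lhs):  # rhs - lhs skips the trivial part
            if not lhs_is_superkey and a not in prime:
                bad.append((sorted(lhs), a))
    return bad


emp_dept = {"ENAME", "SSN", "DOB", "ADDRESS", "DNUMBER", "DNAME", "DMGRNO"}
emp_dept_fds = [
    ({"SSN"}, {"ENAME", "DOB", "ADDRESS", "DNUMBER"}),
    ({"DNUMBER"}, {"DNAME", "DMGRNO"}),
]
print(violations_3nf(emp_dept, [{"SSN"}], emp_dept_fds))
# [(['DNUMBER'], 'DMGRNO'), (['DNUMBER'], 'DNAME')]

ed2 = {"DNUMBER", "DNAME", "DMGRNO"}
print(violations_3nf(ed2, [{"DNUMBER"}], emp_dept_fds[1:]))  # []
```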

More generally, any functional dependency in which the left-hand side is part (a proper subset) of a candidate key, or any functional dependency in which the left-hand side is a nonkey attribute, is a problematic FD. 2NF and 3NF normalization remove these problem FDs by decomposing the original relation into new relations. In terms of the normalization process, it is not necessary to remove the partial dependencies before the transitive dependencies, but historically, 3NF has been defined with the assumption that a relation is tested for 2NF first before it is tested for 3NF.

Example 2

Consider the following two relation schemas:

LOTS1(PROPERTY_ID#, COUNTY_NAME, LOT#, AREA, PRICE), with FD1, FD2, and FD4.

LOTS2(COUNTY_NAME, TAX_RATE), with FD3.

Here LOTS2 is in 3NF. However, FD4 in LOTS1 violates 3NF because AREA is not a superkey and PRICE is not a prime attribute in LOTS1. To normalize LOTS1 into 3NF, we decompose it into the relation schemas LOTS1A and LOTS1B shown below. We construct LOTS1A by removing the attribute PRICE that violates 3NF from LOTS1 and placing it with AREA (the left-hand side of FD4 that causes the transitive dependency) into another relation LOTS1B. Both LOTS1A and LOTS1B are in 3NF.

LOTS1A(PROPERTY_ID#, COUNTY_NAME, LOT#, AREA), with FD1 and FD2.

LOTS1B(AREA, PRICE), with FD4.

Interpreting the General Definition of Third Normal Form

A relation schema R violates the general definition of 3NF if a functional dependency X->A holds in R that violates both conditions (a) and (b). Violating (b) means that A is a nonprime attribute. Violating (a) means that X is not a superset of any key of R; hence, X could be nonprime or it could be a proper subset of a key of R. If X is nonprime, we typically have a transitive dependency that violates 3NF, whereas if X is a proper subset of a key of R, we have a partial dependency that violates 3NF (and also 2NF).

Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was found to be stricter than 3NF. That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.

BOYCE-CODD NORMAL FORM (BCNF)-A relation schema R is in BCNF with respect to a set F of functional dependencies if whenever a nontrivial functional dependency X -> A holds in R, then X is a superkey of R.

The formal definition of BCNF differs slightly from the definition of 3NF. The only difference between definitions of BCNF and 3NF is that condition (b) of 3NF, which allows A to be prime, is absent from BCNF.

Consider the following relation schema:

LOTS1A(PROPERTY_ID#, COUNTY_NAME, LOT#, AREA), with functional dependencies FD1, FD2, and FD5: AREA -> COUNTY_NAME.

Here FD5 violates BCNF in LOTS1A because AREA is not a superkey of LOTS1A. Note that FD5 satisfies 3NF in LOTS1A because COUNTY_NAME is a prime attribute (condition b), but this condition does not exist in the definition of BCNF. We can decompose LOTS1A into two BCNF relations LOTS1AX and LOTS1AY, shown below. This decomposition loses the functional dependency FD2 because its attributes no longer coexist in the same relation after decomposition.

LOTS1AX(PROPERTY_ID#, AREA, LOT#)

LOTS1AY(AREA, COUNTY_NAME)

In practice, most relation schemas that are in 3NF are also in BCNF. Only if X ->A holds in a relation schema R with X not being a superkey and A being a prime attribute will R be in 3NF but not in BCNF.
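The BCNF condition is the 3NF check without the prime-attribute escape clause, so it needs only the superkey test. The sketch below is an illustration under stated assumptions: the function names are invented for the example, the '#' is dropped from attribute names, and FD1 and FD2 are written out as the two candidate keys of LOTS1A determining the remaining attributes.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violations_bcnf(schema, fds):
    """Nontrivial dependencies whose left side is not a superkey of the schema."""
    return [
        (sorted(lhs), sorted(rhs))
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= set(schema)
    ]


lots1a = {"PROPERTY_ID", "COUNTY_NAME", "LOT", "AREA"}
lots1a_fds = [
    ({"PROPERTY_ID"}, {"COUNTY_NAME", "LOT", "AREA"}),   # FD1 (assumed form)
    ({"COUNTY_NAME", "LOT"}, {"PROPERTY_ID", "AREA"}),   # FD2 (assumed form)
    ({"AREA"}, {"COUNTY_NAME"}),                         # FD5
]
print(violations_bcnf(lots1a, lots1a_fds))
# [(['AREA'], ['COUNTY_NAME'])]

lots1ay = {"AREA", "COUNTY_NAME"}
print(violations_bcnf(lots1ay, [({"AREA"}, {"COUNTY_NAME"})]))  # []
```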

Decomposition

If a relation schema is in a lower normal form, we bring it to a higher normal form by decomposing it into several schemas with fewer attributes. Careless decomposition, however, may lead to another form of bad design. We will discuss some required properties of relational decomposition.

1) We must make sure that each attribute in schema R, which is decomposed into D = {R1, R2, ..., Rm}, appears in at least one relation schema Ri in the decomposition so that no attributes are "lost"; formally, R1 ∪ R2 ∪ ... ∪ Rm = R.

This is called the attribute preservation condition of decomposition.

2) We must make sure that the decomposition has the lossless (nonadditive) join property.

Definition of lossless (nonadditive) join decomposition - Let F be a set of functional dependencies on relation schema R. A decomposition D = {R1, R2, ..., Rm} of R is a lossless-join decomposition if, for every relation state r on schema R that is legal under F, the following holds:

r = ΠR1(r) ⋈ ΠR2(r) ⋈ ... ⋈ ΠRm(r)

3) We must make sure that the decomposition is dependency preserving.

We say that a decomposition of relation schema R into D = {R1, R2, ..., Rm} is dependency preserving with respect to F if the union of the projections of F on each Ri in D is equivalent to F, that is, (ΠR1(F) ∪ ΠR2(F) ∪ ... ∪ ΠRm(F))+ = F+.

Given a set of dependencies F on R, the projection of F on Ri, denoted by ΠRi(F), where Ri is a subset of R, is the set of dependencies X -> Y in F+ such that the attributes in X ∪ Y are all contained in Ri. Hence the projection of F on each relation schema Ri in the decomposition D is the set of functional dependencies in F+, the closure of F, whose left-hand and right-hand side attributes are all in Ri.

We will explain lossless-join decomposition using the following example.

Lending

Branch-Name   Branch-City   Assets    Customer-Name   Loan-Number   Amount
Downtown      Brooklyn      9000000   Jones           L-17          1000
Redwood       Palo Alto     2100000   Smith           L-23          2000
Perryridge    Horseneck     1700000   Hayes           L-15          1500
Downtown      Brooklyn      9000000   Jackson         L-14          1500
Mianus        Horseneck     400000    Jones           L-93          500
Round Hill    Horseneck     8000000   Turner          L-11          900
Pownal        Bennington    300000    Williams        L-29          1200
North Town    Rye           3700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Johnson         L-18          2000
Perryridge    Horseneck     1700000   Glenn           L-25          2500
Brighton      Brooklyn      7100000   Brooks          L-10          2200

Due to the presence of FD2 (a transitive dependency: Loan-Number -> Branch-Name and Branch-Name -> Branch-City, Assets, hence Loan-Number -> Branch-City, Assets), Lending is not in 3NF.

We decompose Lending-schema into the following two schemas:

Branch-customer-schema = (branch-name, branch-city, assets, customer-name)
Customer-loan-schema = (customer-name, loan-number, amount)

The corresponding relations are projections of Lending:

i. Branch-customer = Π branch-name, branch-city, assets, customer-name (Lending)
ii. Customer-loan = Π customer-name, loan-number, amount (Lending)

The resulting Branch-customer and Customer-loan relations are shown below:

Branch-customer

Branch-Name   Branch-City   Assets    Customer-Name
Downtown      Brooklyn      9000000   Jones
Redwood       Palo Alto     2100000   Smith
Perryridge    Horseneck     1700000   Hayes
Downtown      Brooklyn      9000000   Jackson
Mianus        Horseneck     400000    Jones
Round Hill    Horseneck     8000000   Turner
Pownal        Bennington    300000    Williams
North Town    Rye           3700000   Hayes
Downtown      Brooklyn      9000000   Johnson
Perryridge    Horseneck     1700000   Glenn
Brighton      Brooklyn      7100000   Brooks

Customer-loan

Customer-Name   Loan-Number   Amount
Jones           L-17          1000
Smith           L-23          2000
Hayes           L-15          1500
Jackson         L-14          1500
Jones           L-93          500
Turner          L-11          900
Williams        L-29          1200
Hayes           L-16          1300
Johnson         L-18          2000
Glenn           L-25          2500
Brooks          L-10          2200

Of course, there are cases in which we need to reconstruct the lending relation. For example, suppose that we wish to find all branches that have loans with amounts less than $1000. No single relation in our alternative database contains these data, so we reconstruct the lending relation by computing Branch-customer ⋈ Customer-loan. The result is the following relation.

Branch-customer ⋈ Customer-loan

Branch-Name   Branch-City   Assets    Customer-Name   Loan-Number   Amount
Downtown      Brooklyn      9000000   Jones           L-17          1000
Downtown      Brooklyn      9000000   Jones           L-93          500
Redwood       Palo Alto     2100000   Smith           L-23          2000
Perryridge    Horseneck     1700000   Hayes           L-15          1500
Perryridge    Horseneck     1700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Jackson         L-14          1500
Mianus        Horseneck     400000    Jones           L-17          1000
Mianus        Horseneck     400000    Jones           L-93          500
Round Hill    Horseneck     8000000   Turner          L-11          900
Pownal        Bennington    300000    Williams        L-29          1200
North Town    Rye           3700000   Hayes           L-15          1500
North Town    Rye           3700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Johnson         L-18          2000
Perryridge    Horseneck     1700000   Glenn           L-25          2500
Brighton      Brooklyn      7100000   Brooks          L-10          2200

When we compare Branch-customer ⋈ Customer-loan and the lending relation with which we started, we notice a difference: although every tuple that appears in the lending relation appears in Branch-customer ⋈ Customer-loan, there are tuples in Branch-customer ⋈ Customer-loan that are not in lending. In our example, Branch-customer ⋈ Customer-loan has the following additional tuples:

(Downtown, Brooklyn, 9000000, Jones, L-93, 500)
(Perryridge, Horseneck, 1700000, Hayes, L-16, 1300)
(Mianus, Horseneck, 400000, Jones, L-17, 1000)
(North Town, Rye, 3700000, Hayes, L-15, 1500)

Now we consider the query, "Find all bank branches that have made a loan in an amount less than $1000." If we look back at the Lending relation, we see that the only branches with loan amounts less than $1000 are Mianus and Round Hill. However, when we apply the query to Branch-customer ⋈ Customer-loan, we obtain three branch names: Mianus, Round Hill, and Downtown.

A closer examination of this example shows why. If a customer happens to have several loans from different branches, we cannot tell which loan belongs to which branch from branch-customer and customer-loan. Thus, when we join branch-customer and customer-loan, we obtain not only the tuples we had originally in lending, but also several additional tuples.

Although we have more tuples in Branch-customer ⋈ Customer-loan, we actually have less information. We are no longer able, in general, to represent in the database information about which loans are taken from which branch. Because of this loss of information, we call the decomposition of Lending-schema into Branch-customer-schema and Customer-loan-schema a lossy decomposition, or a lossy-join decomposition.

A decomposition that is not a lossy-join decomposition is a lossless-join decomposition. It should be clear from our example that a lossy-join decomposition is, in general, a bad database design.
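The effect can be reproduced on a small scale. The sketch below is a simplified illustration, not the full example: it uses only three of the lending tuples and drops the branch-city and assets columns for brevity, and the helper names project and natural_join are assumptions made for the example.

```python
def project(rows, attrs):
    """Projection with duplicate elimination; each row is a dict."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]


def natural_join(left, right):
    """Combine every pair of rows that agree on all shared attributes."""
    joined = []
    for l in left:
        for r in right:
            common = l.keys() & r.keys()
            if all(l[a] == r[a] for a in common):
                joined.append({**l, **r})
    return joined


# Three tuples taken from the lending relation (two columns omitted).
lending = [
    {"branch": "Downtown", "customer": "Jones", "loan": "L-17", "amount": 1000},
    {"branch": "Mianus", "customer": "Jones", "loan": "L-93", "amount": 500},
    {"branch": "Round Hill", "customer": "Turner", "loan": "L-11", "amount": 900},
]

branch_customer = project(lending, ["branch", "customer"])
customer_loan = project(lending, ["customer", "loan", "amount"])
rejoined = natural_join(branch_customer, customer_loan)

# Tuples produced by the join that were never in the original relation.
spurious = [row for row in rejoined if row not in lending]
small_loans = sorted({row["branch"] for row in rejoined if row["amount"] < 1000})

print(len(rejoined), len(spurious))  # 5 2
print(small_loans)                   # ['Downtown', 'Mianus', 'Round Hill']
```

Jones has loans at two branches, so the join pairs each of his branches with each of his loans, and Downtown wrongly appears among the branches with a loan under $1000.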

***The word loss in lossy refers to loss of information, not to loss of tuples. If a decomposition does not have the lossless join property, we may get additional spurious tuples after the NATURAL JOIN (⋈) operations are applied; these additional tuples represent erroneous information. We prefer the term nonadditive join because it describes the situation more accurately; if the property holds on a decomposition, we are guaranteed that no spurious tuples bearing wrong information are added to the result after the NATURAL JOIN operations are applied.

Why is the decomposition lossy?

There is one attribute in common between Branch-customer-schema and Customer-loan-schema:

Branch-customer-schema ∩ Customer-loan-schema = {customer-name}

The only way that we can represent a relationship between, for example, loan-number and branch-name is through customer-name. This representation is not adequate because a customer may have several loans (customer-name is not a superkey of Customer-loan), yet these loans are not necessarily obtained from the same branch (customer-name is not a superkey of Branch-customer).

Let us consider another alternative design, in which we decompose Lending-schema into the following two schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)

The resulting Branch and Loan-info relations are shown below:

Branch

Branch-Name   Branch-City   Assets
Downtown      Brooklyn      9000000
Redwood       Palo Alto     2100000
Perryridge    Horseneck     1700000
Mianus        Horseneck     400000
Round Hill    Horseneck     8000000
Pownal        Bennington    300000
North Town    Rye           3700000
Brighton      Brooklyn      7100000

Loan-info

Branch-Name   Customer-Name   Loan-Number   Amount
Downtown      Jones           L-17          1000
Redwood       Smith           L-23          2000
Perryridge    Hayes           L-15          1500
Downtown      Jackson         L-14          1500
Mianus        Jones           L-93          500
Round Hill    Turner          L-11          900
Pownal        Williams        L-29          1200
North Town    Hayes           L-16          1300
Downtown      Johnson         L-18          2000
Perryridge    Glenn           L-25          2500
Brighton      Brooks          L-10          2200

If we reconstruct the lending relation, it looks like the following relation.

Branch ⋈ Loan-info

Branch-Name   Branch-City   Assets    Customer-Name   Loan-Number   Amount
Downtown      Brooklyn      9000000   Jones           L-17          1000
Redwood       Palo Alto     2100000   Smith           L-23          2000
Perryridge    Horseneck     1700000   Hayes           L-15          1500
Downtown      Brooklyn      9000000   Jackson         L-14          1500
Mianus        Horseneck     400000    Jones           L-93          500
Round Hill    Horseneck     8000000   Turner          L-11          900
Pownal        Bennington    300000    Williams        L-29          1200
North Town    Rye           3700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Johnson         L-18          2000
Perryridge    Horseneck     1700000   Glenn           L-25          2500
Brighton      Brooklyn      7100000   Brooks          L-10          2200

When we compare Branch ⋈ Loan-info and the lending relation with which we started, we notice no difference.

There is one attribute in common between these two schemas:

Branch-schema ∩ Loan-info-schema = {branch-name}

Thus, the only way that we can represent a relationship between, for example, customer-name and assets is through branch-name.

The difference between this example and the preceding one is that for a given branch-name, there is exactly one assets value and exactly one branch-city (branch-name is superkey of Branch); whereas a similar statement cannot be made for customer-name. That is, the functional dependency branch-name -> assets, branch-city holds.

Testing for Lossless-join decomposition

If a relation is decomposed into two relations

Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:

R1 ∩ R2 -> R1
R1 ∩ R2 -> R2

In other words, if R1 ∩ R2 forms a superkey of either R1 or R2, the decomposition of R is a lossless-join decomposition.
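The two-relation test reduces to one attribute-closure computation. The sketch below applies it to both decompositions of Lending-schema discussed earlier. It is an illustration under stated assumptions: the function names are invented for the example, and the set F is assumed to consist of branch-name -> branch-city, assets (stated in the notes) and loan-number -> amount, branch-name (the second part of which is stated in the notes; the amount part is an assumption).

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def lossless_binary(r1, r2, fds):
    """True if the common attributes functionally determine all of r1 or all of r2."""
    common_closure = closure(set(r1) & set(r2), fds)
    return common_closure >= set(r1) or common_closure >= set(r2)


F = [
    ({"branch-name"}, {"branch-city", "assets"}),
    ({"loan-number"}, {"amount", "branch-name"}),  # 'amount' is an assumption
]

branch_customer = {"branch-name", "branch-city", "assets", "customer-name"}
customer_loan = {"customer-name", "loan-number", "amount"}
branch = {"branch-name", "branch-city", "assets"}
loan_info = {"branch-name", "customer-name", "loan-number", "amount"}

print(lossless_binary(branch_customer, customer_loan, F))  # False
print(lossless_binary(branch, loan_info, F))               # True
```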

If a relation is decomposed into more than two relations


There is an algorithm for testing the lossless join property of a decomposition in which a relation is decomposed into two or more relations.

Input of the algorithm is: - A universal relation R, a decomposition of R into D={R1, R2, R3, ....,Rm} and a set F of functional dependencies.

Algorithm is as follows:

1. Create an initial matrix S with one row i for each relation Ri in D, and one column j for each attribute Aj in R.

2. Set S(i, j) := bij for all matrix entries.

3. For each row i representing relation schema Ri
   { for each column j representing attribute Aj
     { if (relation Ri includes attribute Aj) then set S(i, j) := aj; }; };

4. Repeat the following loop until a complete loop execution results in no changes to S
   { for each functional dependency X -> Y in F
     { for all rows in S which have the same symbols in the columns corresponding to attributes in X
       { make the symbols in each column that corresponds to an attribute in Y be the same in all these rows as follows: if any of the rows has an "a" symbol for the column, set the other rows to that same "a" symbol in the column. If no "a" symbol exists for the attribute in any of the rows, choose one of the "b" symbols that appear in one of the rows for the attribute and set the other rows to that same "b" symbol in the column; };
       if a row is made up entirely of "a" symbols, go to step 5; }; };

5. If a row is made up entirely of "a" symbols, then the decomposition has the lossless join property; otherwise it does not.
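Steps 1 to 5 can be sketched as follows. This is an illustrative rendering, not a definitive implementation: the function name is_lossless and the tuple encoding of the symbols are assumptions, and where the algorithm says "choose one of the b symbols" the sketch always picks the smallest one so that the loop is deterministic and terminates. The early exit in step 4 is omitted; the check is done once at the end.

```python
def is_lossless(attributes, decomposition, fds):
    """Matrix test for the lossless (nonadditive) join property."""
    col = {a: j for j, a in enumerate(attributes)}
    # Steps 1-3: ("a", j) where Ri contains Aj, otherwise ("b", i, j).
    S = [
        [("a", j) if a in ri else ("b", i, j) for j, a in enumerate(attributes)]
        for i, ri in enumerate(decomposition)
    ]
    changed = True
    while changed:  # Step 4: repeat until a full pass changes nothing.
        changed = False
        for lhs, rhs in fds:
            # Group the rows that agree on every column of X.
            groups = {}
            for row in S:
                key = tuple(row[col[x]] for x in lhs)
                groups.setdefault(key, []).append(row)
            for rows in groups.values():
                for y in rhs:
                    j = col[y]
                    # ("a", j) sorts before any ("b", i, j), so min prefers "a".
                    target = min(row[j] for row in rows)
                    for row in rows:
                        if row[j] != target:
                            row[j] = target
                            changed = True
    # Step 5: lossless if some row consists only of "a" symbols.
    return any(all(symbol[0] == "a" for symbol in row) for row in S)


# The decomposition of R = ABCDE into AD, AB, BE, CDE, AE.
R = list("ABCDE")
D = [set("AD"), set("AB"), set("BE"), set("CDE"), set("AE")]
F = [("A", "C"), ("B", "C"), ("C", "D"), ("DE", "C"), ("CE", "A")]
print(is_lossless(R, D, F))  # True
```

Attribute names in the dependencies are passed as sequences (here strings of single letters); for multi-character names such as SSN, tuples like ("SSN",) would be used instead.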

Example 1

Let EMP_PROJ = {SSN, ENAME, PNUMBER, PNAME, PLOCATIONS, HOURS} be decomposed into
EMP_LOCS = {ENAME, PLOCATIONS}
EMP_PROJ1 = {SSN, PNUMBER, HOURS, PNAME, PLOCATIONS}
Set of functional dependencies F = {SSN -> ENAME, PNUMBER -> (PNAME, PLOCATIONS), (SSN, PNUMBER) -> HOURS}

Show whether the decomposition is lossless (has the nonadditive join property) or not.

Answer: The initial matrix S looks as follows:

            SSN   ENAME   PNUMBER   PNAME   PLOCATIONS   HOURS
EMP_LOCS    b11   a2      b13       b14     a5           b16
EMP_PROJ1   a1    b22     a3        a4      a5           a6

Then we apply the functional dependencies SSN -> ENAME, PNUMBER -> (PNAME, PLOCATIONS), and (SSN, PNUMBER) -> HOURS one by one according to step 4.

The loop in step 4 of the algorithm cannot change any "b" symbols to "a" symbols, because the two rows never agree on the left-hand side of any dependency. Hence the resulting matrix S does not have a row with all "a" symbols, and so the decomposition does not have the lossless join property.

Example 2

Let EMP_PROJ = {SSN, ENAME, PNUMBER, PNAME, PLOCATIONS, HOURS} be decomposed into
EMP = {SSN, ENAME}
PROJ = {PNAME, PNUMBER, PLOCATIONS}
WORKS_ON = {SSN, PNUMBER, HOURS}

Set of functional dependencies F = {SSN -> ENAME, PNUMBER -> (PNAME, PLOCATIONS), (SSN, PNUMBER) -> HOURS}

Show whether the decomposition is lossless (has the nonadditive join property) or not.

Answer: The initial matrix S looks as follows:

            SSN   ENAME   PNUMBER   PNAME   PLOCATIONS   HOURS
EMP         a1    a2      b13       b14     b15          b16
PROJ        b21   b22     a3        a4      a5           b26
WORKS_ON    a1    b32     a3        b34     b35          a6

After applying SSN -> ENAME, the matrix S will be (b32 is changed to a2):

            SSN   ENAME   PNUMBER   PNAME   PLOCATIONS   HOURS
EMP         a1    a2      b13       b14     b15          b16
PROJ        b21   b22     a3        a4      a5           b26
WORKS_ON    a1    a2      a3        b34     b35          a6

After applying PNUMBER -> (PNAME, PLOCATIONS), the matrix S will be (b34 and b35 are changed to a4 and a5):

            SSN   ENAME   PNUMBER   PNAME   PLOCATIONS   HOURS
EMP         a1    a2      b13       b14     b15          b16
PROJ        b21   b22     a3        a4      a5           b26
WORKS_ON    a1    a2      a3        a4      a5           a6

Now one row of matrix S is made up entirely of "a" symbols, hence the decomposition has the lossless join property.

Example 3

Let R = ABCDE, R1 = AD, R2 = AB, R3 = BE, R4 = CDE, and R5 = AE. Let the functional dependencies be: A -> C, B -> C, C -> D, DE -> C, CE -> A. Test if the decomposition of R into {R1, ..., R5} is a lossless join decomposition.

Answer: The initial matrix S looks as follows:

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b23   b24   b25
R3    b31   a2    b33   b34   a5
R4    b41   b42   a3    a4    a5
R5    a1    b52   b53   b54   a5

After applying A -> C, the matrix S will be (b23 and b53 are changed to b13):

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b13   b24   b25
R3    b31   a2    b33   b34   a5
R4    b41   b42   a3    a4    a5
R5    a1    b52   b13   b54   a5

After applying B -> C, the matrix S will be (b33 is changed to b13):

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b13   b24   b25
R3    b31   a2    b13   b34   a5
R4    b41   b42   a3    a4    a5
R5    a1    b52   b13   b54   a5

After applying C -> D, the matrix S will be (b24, b34, and b54 are changed to a4):

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b13   a4    b25
R3    b31   a2    b13   a4    a5
R4    b41   b42   a3    a4    a5
R5    a1    b52   b13   a4    a5

After applying DE -> C, the matrix S will be (b13 is changed to a3 in R3 and R5):

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b13   a4    b25
R3    b31   a2    a3    a4    a5
R4    b41   b42   a3    a4    a5
R5    a1    b52   a3    a4    a5

After applying CE -> A, the matrix S will be (b31 and b41 are changed to a1):

      A     B     C     D     E
R1    a1    b12   b13   a4    b15
R2    a1    a2    b13   a4    b25
R3    a1    a2    a3    a4    a5
R4    a1    b42   a3    a4    a5
R5    a1    b52   a3    a4    a5

Now one row of matrix S (row R3) is made up entirely of "a" symbols, hence the decomposition has the lossless join property.

Dependency Preservation

There is another goal in relational-database design: dependency preservation. When an update is made to the database, the system should be able to check that the update will not create an illegal relation—that is, one that does not satisfy all the given functional dependencies.

If we want to check updates efficiently, we should design relational-database schemas that allow update validation without the computation of joins. (In other words, we should design relational-database schemas in such a way that dependencies are preserved.)

To decide whether joins must be computed to check an update, we need to determine what functional dependencies can be tested by checking each relation individually.

Let F be a set of functional dependencies on a schema R, and let R1, R2, ..., Rm be a decomposition of R. The projection (also called the restriction) of F to Ri, denoted by ΠRi(F), is the set of all functional dependencies in F+ that include only attributes of Ri. Since all functional dependencies in a projection involve attributes of only one relation schema, it is possible to test such a dependency for satisfaction by checking only one relation.

Note that the definition of the restriction uses all dependencies in F+, not just those in F. For instance, suppose F = {A -> B, B -> C}, and we have a decomposition into AC and AB. The restriction of F to AC is then A -> C, since A -> C is in F+, even though it is not in F.

The set of projections ΠR1(F), ΠR2(F), ..., ΠRm(F) is the set of dependencies that can be checked efficiently. We now must ask whether testing only the restrictions is sufficient. Let F' = ΠR1(F) ∪ ΠR2(F) ∪ ... ∪ ΠRm(F). F' is a set of functional dependencies on schema R, but, in general, F' ≠ F. However, even if F' ≠ F, it may be that F'+ = F+. If the latter is true, then every dependency in F is logically implied by F', and, if we verify that F' is satisfied, we have verified that F is satisfied. We say that a decomposition having the property F'+ = F+ is a dependency-preserving decomposition.

An algorithm to test dependency preservation without computing closure F + of set of functional dependencies F

Checking the dependency preservation i.e., is weather ((ΠR1(F)) (ΠR2(F)) … (ΠRm(F)))+ = F+ is expensive, sine it requires computation of F+ and F´+.

We now give a more efficient test for dependency preservation, which avoids computing F+. The idea is to test each functional dependency α → β in F by using a modified form of attribute closure to see if it is preserved by the decomposition. We apply the following procedure to each α → β in F.

result = α
while (changes to result) do
    for each Ri in the decomposition
        t = (result ∩ Ri)+ ∩ Ri
        result = result ∪ t

Here the attribute closure (result ∩ Ri)+ is computed with respect to F.

If result contains all attributes in β, then the functional dependency α → β is preserved. The decomposition is dependency preserving if and only if all the dependencies in F are preserved.
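The procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names closure, is_preserved and preserves_dependencies are hypothetical, and dependencies are represented as (left-hand side, right-hand side) pairs of frozensets.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def is_preserved(alpha, beta, fds, decomposition):
    """Test whether alpha -> beta is preserved, without computing F+."""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            # t = (result ∩ Ri)+ ∩ Ri, with the closure taken under F
            t = closure(result & ri, fds) & ri
            if not t <= result:
                result |= t
                changed = True
    return set(beta) <= result


def preserves_dependencies(fds, decomposition):
    """A decomposition is dependency preserving iff every FD in F is preserved."""
    return all(is_preserved(a, b, fds, decomposition) for a, b in fds)


F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
AB, AC, BC = frozenset("AB"), frozenset("AC"), frozenset("BC")

print(preserves_dependencies(F, [AB, AC]))  # prints False: B -> C is lost
print(preserves_dependencies(F, [AB, BC]))  # prints True
```

Each call to closure runs in time polynomial in the size of F, so the whole test avoids the exponential cost of enumerating F+.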

Example 1: Suppose R (A, B, C) is a relation schema with a set of functional dependencies F = {A->B, B->C}. R (A, B, C) is decomposed into R1 = {A, B} and R2 = {A, C}. Check whether the decomposition is dependency preserving.

Answer: First we compute F+.

Since A->B and B->C, by applying the transitivity rule we get A->C.
Omitting trivial dependencies and those obtained by augmentation, F+ = {A->B, B->C, A->C, A->BC}.

The projection of F on R1 is ΠR1(F) = {A->B}.
The projection of F on R2 is ΠR2(F) = {A->C}.

F' = ΠR1(F) ∪ ΠR2(F) = {A->B, A->C}
F'+ = {A->B, A->C, A->BC}

Since B->C is in F+ but not in F'+, we have F+ ≠ F'+, so the decomposition of R (A, B, C) into R1 = {A, B} and R2 = {A, C} is not dependency preserving.

Example 2: Suppose R (A, B, C) is a relation schema with a set of functional dependencies F = {A->B, B->C}. R (A, B, C) is decomposed into R1 = {A, B} and R2 = {B, C}. Check whether the decomposition is dependency preserving.

Answer: First we compute F+.

Since A->B and B->C, by applying the transitivity rule we get A->C.
Omitting trivial dependencies and those obtained by augmentation, F+ = {A->B, B->C, A->C, A->BC}.

The projection of F on R1 is ΠR1(F) = {A->B}.
The projection of F on R2 is ΠR2(F) = {B->C}.

F' = ΠR1(F) ∪ ΠR2(F) = {A->B, B->C}
Since A->B and B->C, by applying the transitivity rule we get A->C.
F'+ = {A->B, B->C, A->C, A->BC}

Since F+ = F'+, the decomposition of R (A, B, C) into R1 = {A, B} and R2 = {B, C} is dependency preserving.

Not every BCNF decomposition is dependency preserving, but it is always possible to find a 3NF decomposition that is dependency preserving.

Example: Suppose R (branch_name, customer_name, banker_name) is a relation schema with a set of functional dependencies F = {banker_name -> branch_name, (branch_name, customer_name) -> banker_name}.

Here the candidate keys are (branch_name, customer_name) and (banker_name, customer_name); we take (branch_name, customer_name) as the primary key. The non-trivial functional dependency banker_name -> branch_name holds, banker_name is not a superkey, and branch_name is a prime attribute, so R (branch_name, customer_name, banker_name) is not in BCNF but it is in 3NF. If we want to make it BCNF, we have to decompose it into R1 = {banker_name, branch_name} and R2 = {banker_name, customer_name}.

Now we compute F+. Omitting trivial dependencies and those obtained by augmentation, F+ = {banker_name -> branch_name, (branch_name, customer_name) -> banker_name}.

Projection of F on R1: ΠR1(F) = {banker_name -> branch_name}
Projection of F on R2: ΠR2(F) = ∅ (only trivial dependencies hold on R2)
F' = ΠR1(F) ∪ ΠR2(F) = {banker_name -> branch_name}
F'+ = {banker_name -> branch_name}

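Since F+ ≠ F'+, the decomposition of R (branch_name, customer_name, banker_name) into R1 = {banker_name, branch_name} and R2 = {banker_name, customer_name} is not dependency preserving: the dependency (branch_name, customer_name) -> banker_name cannot be checked without computing a join.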

So, from this we can say that not every BCNF decomposition is dependency preserving. On the other hand, R (branch_name, customer_name, banker_name) with the set of functional dependencies F = {banker_name -> branch_name, (branch_name, customer_name) -> banker_name} is already in 3NF.
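The claim that this schema is in 3NF but not in BCNF can be checked mechanically. The sketch below is illustrative only: the helper names closure, candidate_keys and violations are hypothetical, candidate keys are found by brute-force enumeration (fine for three attributes, exponential in general), and only the dependencies listed in F are tested, which is sufficient when checking a schema that has not been decomposed.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    schema = frozenset(schema)
    keys = []
    # Sizes are visited in increasing order, so any superkey that contains
    # an already found key is not minimal and is skipped.
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = frozenset(combo)
            if closure(attrs, fds) >= schema and not any(k <= attrs for k in keys):
                keys.append(attrs)
    return keys


def violations(schema, fds):
    """Return (BCNF violations, 3NF violations) among the dependencies in fds."""
    schema = frozenset(schema)
    prime = set().union(*candidate_keys(schema, fds))
    bcnf, third = [], []
    for lhs, rhs in fds:
        if rhs <= lhs or closure(lhs, fds) >= schema:
            continue  # trivial, or left-hand side is a superkey
        bcnf.append((lhs, rhs))
        if not (rhs - lhs) <= prime:
            third.append((lhs, rhs))  # some right-hand attribute is non-prime
    return bcnf, third


R = {"branch_name", "customer_name", "banker_name"}
F = [
    (frozenset({"banker_name"}), frozenset({"branch_name"})),
    (frozenset({"branch_name", "customer_name"}), frozenset({"banker_name"})),
]
keys = candidate_keys(R, F)
bcnf_violations, third_nf_violations = violations(R, F)
```

For this schema, bcnf_violations holds the single dependency banker_name -> branch_name, while third_nf_violations is empty, which matches the conclusion reached by hand.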

Comparison between 3NF and BCNF, with example

Definition

3NF: A relation schema R is in third normal form (3NF) with respect to a set F of functional dependencies if, whenever a nontrivial functional dependency X -> A holds in R, either (a) X is a superkey of R, or (b) A is a prime attribute of R.

BCNF: A relation schema R is in BCNF with respect to a set F of functional dependencies if, whenever a nontrivial functional dependency X -> A holds in R, X is a superkey of R.

Example

3NF: R (branch_name, customer_name, banker_name) with the set of functional dependencies F = {banker_name -> branch_name, (branch_name, customer_name) -> banker_name} is in 3NF, because (branch_name, customer_name) is a candidate key, so the second dependency has a superkey on its left-hand side, and the right-hand side of the nontrivial functional dependency banker_name -> branch_name is a prime attribute.

BCNF: The same schema R (branch_name, customer_name, banker_name) with the same set F is not in BCNF, because banker_name -> branch_name is a nontrivial functional dependency whose left-hand side, banker_name, is not a superkey.

Which is stricter

3NF: 3NF is less strict than BCNF; a schema can be in 3NF without being in BCNF.

BCNF: BCNF is stricter than 3NF; every schema in BCNF is also in 3NF.

Whether the lossless-join property and dependency preservation can both be achieved

3NF: It is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation.

BCNF: It is always possible to obtain a BCNF design without sacrificing a lossless join, but it is not always possible to do so without sacrificing dependency preservation.

Disadvantages

3NF: After achieving 3NF, since not all transitive dependencies are eliminated, we may have to use null values to represent some of the possible meaningful relationships among data items, and there is the problem of repetition of information.

BCNF: After achieving BCNF, since all transitive dependencies are eliminated, we do not have to use null values to represent some of the possible meaningful relationships among data items, and there is no problem of repetition of information caused by functional dependencies. The disadvantage is that dependency preservation may be lost.

Which is preferable

Since it is not always possible to satisfy both BCNF and dependency preservation, we may be forced to choose between BCNF and dependency preservation with 3NF.

3NF: If we choose dependency preservation with 3NF, the application programmer needs to worry about writing code to keep redundant data consistent on updates.

BCNF: If we choose BCNF and a dependency is not preserved, we have to consider each dependency in a minimum cover Fc that is not preserved in the decomposition. For each such dependency α→β, we define a materialized view that computes a join of all relations in the decomposition and projects the result on αβ. The functional dependency can then be easily tested on this materialized view.
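The materialized-view idea can be illustrated with a small in-memory sketch in Python. Everything here is an assumption made for illustration: relations are lists of dictionaries, the function names natural_join and satisfies_fd are hypothetical, the sample rows are invented, and the "view" is recomputed on demand, whereas a real database system would maintain it incrementally.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dicts."""
    joined = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                joined.append({**a, **b})
    return joined


def satisfies_fd(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# BCNF decomposition of the banker schema, with invented sample rows.
r1 = [
    {"banker_name": "Smith", "branch_name": "Downtown"},
    {"banker_name": "Jones", "branch_name": "Downtown"},
]
r2 = [
    {"banker_name": "Smith", "customer_name": "Lee"},
    {"banker_name": "Jones", "customer_name": "Lee"},
]

alpha = ["branch_name", "customer_name"]
beta = ["banker_name"]

# View: join all relations in the decomposition, then project on alpha and beta.
view = [{a: row[a] for a in alpha + beta} for row in natural_join(r1, r2)]

print(satisfies_fd(r1, ["banker_name"], ["branch_name"]))  # prints True
print(satisfies_fd(view, alpha, beta))                      # prints False
```

Each relation is individually consistent, yet the view shows that customer Lee has two bankers at the Downtown branch, so the unpreserved dependency (branch_name, customer_name) -> banker_name is violated. This is exactly the kind of violation that cannot be detected without the join.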
