Normalization
Information Systems II
Ioan Despi
Informal approach
Building a database structure :
• A process of examining the data that is useful and necessary for an application
• Then breaking it down into a relatively simple row-and-column format
There are two points to understand about tables and columns that are the essence of any database:
1. Tables store data about an entity
An entity may be a person, a part in a machine, a book, or
any other tangible or intangible object, but the primary
consideration is that a table contains data about only one
thing
2. Columns contain the attributes of an entity
Just as a table contains data about a single entity, each
column should only contain one item of data about that
entity
Personal tricks:
1. If (for example) you’re creating a table of addresses, there is no point in having a single column contain the city, state and postal code when it is just as easy to create three columns and record each attribute separately.
2. I use a plural form of a noun for table names (Authors, Books, Orders, and so on) and a noun or a noun and adjective for column names (FirstName, City).
3. If I’m coming up with names that require the word “and” or two nouns, it’s an indication that I haven’t gone far enough in breaking down the data.
An ugly table
StudentName     AdvisorName    CourseID1  CourseDescription1        CourseInstructorName1
--------------  -------------  ---------  ------------------------  ---------------------
Al Gore         Bill Clinton   VB1        Intro to Visual Basic     Bruce Lee
Dan Quayle      George Bush    DAO1       Intro to DAO Programming  Joe Killy
George Bush     Ronald Ragan   API1       API Programming           Dan Ciuhan
Walter Mondale  Jimmy Carter   VB1        Intro to Visual Basic     Bruce Lee
Problems with this structure:
1. Repeating Groups
The CourseID, CourseDescription and CourseInstructorName columns are repeated for each class.
If a student needs a second or a third class, you have to go back and modify the table design in order to record it.
Additionally, adding all those fields when most students would never use them is a waste of storage.
2. Delete anomalies
If you no longer wish to track Joe Killy’s Intro to DAO class, you would need to delete a student, an advisor and an instructor in order to do it.
3. Insert anomalies
Perhaps the department head wishes to add a new class, “Intro to C++”, but hasn’t yet set up a schedule or even an instructor. What would you enter for the student, advisor and instructor names?
4. Inconsistent data
If, after entering these rows, you discover that Bruce Lee’s course
is actually “Intro to Advanced Visual Basic”, you would need to
examine all the rows and change each one individually in order to
reflect this change.
This introduces the potential for errors if one of the changes is
omitted or done incorrectly.
As you can see, this single, simple flat table introduces a number of problems, all of which can be solved by normalizing the table design.
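The inconsistent-data problem can be sketched in a few lines of Python. This is only an illustration: the rows are read off the sample table, while the helper `rename_course` and its `limit` parameter are hypothetical names invented here to simulate an update that misses a row.

```python
# Flat rows read off the sample table (one dict per row).
flat_rows = [
    {"StudentName": "Al Gore", "AdvisorName": "Bill Clinton", "CourseID1": "VB1",
     "CourseDescription1": "Intro to Visual Basic", "CourseInstructorName1": "Bruce Lee"},
    {"StudentName": "Dan Quayle", "AdvisorName": "George Bush", "CourseID1": "DAO1",
     "CourseDescription1": "Intro to DAO Programming", "CourseInstructorName1": "Joe Killy"},
    {"StudentName": "George Bush", "AdvisorName": "Ronald Ragan", "CourseID1": "API1",
     "CourseDescription1": "API Programming", "CourseInstructorName1": "Dan Ciuhan"},
    {"StudentName": "Walter Mondale", "AdvisorName": "Jimmy Carter", "CourseID1": "VB1",
     "CourseDescription1": "Intro to Visual Basic", "CourseInstructorName1": "Bruce Lee"},
]


def rename_course(rows, course_id, new_description, limit=None):
    """Change the description on every row for course_id.

    `limit` (hypothetical) stops after that many rows, simulating a
    manual edit in which one of the changes is omitted.
    """
    changed = 0
    for row in rows:
        if row["CourseID1"] == course_id:
            if limit is not None and changed >= limit:
                break
            row["CourseDescription1"] = new_description
            changed += 1
    return changed


# Only the first VB1 row gets updated; the second one is "forgotten".
changed = rename_course(flat_rows, "VB1", "Intro to Advanced Visual Basic", limit=1)
descriptions = {r["CourseDescription1"] for r in flat_rows if r["CourseID1"] == "VB1"}
print(changed, sorted(descriptions))
```

After the incomplete update, the same course ID carries two different descriptions, which is exactly the inconsistency a flat design invites.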
Normalization = the process of taking a wide table with lots of columns but few rows and redesigning it as several narrow tables with fewer columns but more rows.
A properly normalized design allows you to:
1. Use storage space efficiently
2. Eliminate redundant data
3. Reduce or eliminate inconsistent data
4. Ease the data maintenance burden
The rule:
you must be able to reconstruct the original flat view of the data
Relational database theorists have divided normalization into several rules, called normal forms:
First normal form (1NF):
No repeating groups
Second normal form (2NF): 1NF +
No nonkey attributes depend on a portion of the primary key
Third normal form (3NF): 2NF +
No attributes depend on other nonkey attributes
1NF
A repeating group:
StudentName, AdvisorName,
CourseID1, CourseDescription1, CourseInstructorName1,
CourseID2, CourseDescription2, CourseInstructorName2,
CourseID3, CourseDescription3, CourseInstructorName3
Columns for course information have been duplicated to allow the student to take 3 courses.
The problem occurs when the student wants to take 4 courses or more.
The proper solution is to remove the repeating group of columns to another table
Ugly (StudentName, AdvisorName, CourseID1, CourseDescription1, CourseInstructorName1)
Students (StudentID, StudentName, AdvisorName)
StudentCourses (SCStudentID, SCCourseID, SCCourseDescription, SCCourseInstructorName)
The primary keys are StudentID in Students and the combination (SCStudentID, SCCourseID) in StudentCourses. The new field SCStudentID is a foreign key to the Students table.
We’ve divided the table so that a student can now take as many courses as he or she wants, by removing the course information from the original table and creating two tables: one for the student information and one for the course list. The repeating group of columns in the original table is gone. We can still reconstruct the original table by matching the StudentID and SCStudentID columns of the two new tables.
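A minimal sketch of this 1NF split, using Python's built-in sqlite3 module with an in-memory database. The table and column names come from the slides; the StudentID value and Al Gore's second enrollment (DAO1) are assumptions added only to show that a student can now take more than one course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (
    StudentID   INTEGER PRIMARY KEY,
    StudentName TEXT NOT NULL,
    AdvisorName TEXT
);
CREATE TABLE StudentCourses (
    SCStudentID            INTEGER NOT NULL REFERENCES Students(StudentID),
    SCCourseID             TEXT NOT NULL,
    SCCourseDescription    TEXT,
    SCCourseInstructorName TEXT,
    PRIMARY KEY (SCStudentID, SCCourseID)
);
""")

# StudentID 1 is an assumed key value; the second course row is hypothetical.
conn.execute("INSERT INTO Students VALUES (1, 'Al Gore', 'Bill Clinton')")
conn.executemany(
    "INSERT INTO StudentCourses VALUES (?, ?, ?, ?)",
    [
        (1, "VB1", "Intro to Visual Basic", "Bruce Lee"),
        (1, "DAO1", "Intro to DAO Programming", "Joe Killy"),
    ],
)

# Reconstruct the flat view by joining on StudentID = SCStudentID.
flat_view = conn.execute("""
    SELECT s.StudentName, s.AdvisorName, sc.SCCourseID,
           sc.SCCourseDescription, sc.SCCourseInstructorName
    FROM Students AS s
    JOIN StudentCourses AS sc ON sc.SCStudentID = s.StudentID
    ORDER BY sc.SCCourseID
""").fetchall()

for row in flat_view:
    print(row)
```

Adding a third or fourth course is now just another row in StudentCourses; the table design itself does not change.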
2NF :
No nonkey attributes depend on a portion of the primary key
2NF really only applies to tables where the primary key is defined by two or more columns.
The essence is that if there are columns which can be identified by only part of the primary key, they need to be in their own table.
StudentCourses (SCStudentID, SCCourseID, SCCourseDescription, SCCourseInstructorName)
The primary key is the combination: SCStudentID, SCCourseID
The columns SCCourseDescription, SCCourseInstructorName are only dependent on the SCCourseID column.
In other words, the description and instructor’s name will be the same regardless of the student.
To solve the problem, we split the table StudentCourses, obtaining three tables from the original one:
Students (StudentID, StudentName, AdvisorName)
StudentCourses (SCStudentID, SCCourseID)
Courses (CourseID, CourseDescription, CourseInstructorName)
What we’ve done is move the details of the course information to their own table, Courses.
The relationship between students and courses turns out to be a many-to-many relationship:
each student can take many courses and each course can have many students.
The StudentCourses table now contains only the two foreign keys to Students and Courses. It is also called an intersection entity.
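The three-table design can be sketched with sqlite3 as follows. The schema follows the slides; the numeric StudentID values are assumptions. The sketch also shows the payoff: a course description is stored once, so a single UPDATE fixes it for every enrolled student.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Students (
    StudentID   INTEGER PRIMARY KEY,
    StudentName TEXT NOT NULL,
    AdvisorName TEXT
);
CREATE TABLE Courses (
    CourseID             TEXT PRIMARY KEY,
    CourseDescription    TEXT NOT NULL,
    CourseInstructorName TEXT
);
-- Intersection entity: nothing but the two foreign keys.
CREATE TABLE StudentCourses (
    SCStudentID INTEGER NOT NULL REFERENCES Students(StudentID),
    SCCourseID  TEXT    NOT NULL REFERENCES Courses(CourseID),
    PRIMARY KEY (SCStudentID, SCCourseID)
);
""")

# StudentID values 1 and 2 are assumed; names come from the sample table.
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
    (1, "Al Gore", "Bill Clinton"),
    (2, "Walter Mondale", "Jimmy Carter"),
])
conn.execute("INSERT INTO Courses VALUES ('VB1', 'Intro to Visual Basic', 'Bruce Lee')")
conn.executemany("INSERT INTO StudentCourses VALUES (?, ?)", [(1, "VB1"), (2, "VB1")])

# One statement corrects the description for every student taking VB1.
conn.execute(
    "UPDATE Courses SET CourseDescription = ? WHERE CourseID = ?",
    ("Intro to Advanced Visual Basic", "VB1"),
)

rows = conn.execute("""
    SELECT s.StudentName, c.CourseDescription
    FROM StudentCourses AS sc
    JOIN Students AS s ON s.StudentID = sc.SCStudentID
    JOIN Courses  AS c ON c.CourseID  = sc.SCCourseID
    ORDER BY s.StudentID
""").fetchall()
descriptions = {description for _, description in rows}
print(rows)
```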
Let us add a little more detail to the sample tables to make them look more like the real world:
Students (StudentID, StudentName, StudentPhone, StudentAddress, StudentCity, StudentState, StudentZIP, AdvisorName, AdvisorPhone)
StudentCourses (SCStudentID, SCCourseID)
Courses (CourseID, CourseDescription, CourseInstructorName, CourseInstructorPhone)
3NF:
No attributes depend on other nonkey attributes
All the columns in the table contain data about the entity that is defined by the primary key.
The columns in the table must contain data about only one thing.
This is really an extension of 2NF: both are used to remove columns that belong in their own table.
To complete the normalization, we need to look for columns that are not dependent on the primary key of the table.
Students table:
the advisor information is not dependent on the student:
if the student leaves the school, the advisor’s name &
phone number will remain the same
Courses table:
the same logic applies to the instructor information:
the data for the instructor is not dependent on the primary key CourseID, since the instructor will be unaffected if the course is dropped from the curriculum
Students (StudentID, StudentName, StudentPhone, StudentAddress, StudentCity, StudentState, StudentZIP, StudentAdvisorID)
StudentCourses (SCStudentID, SCCourseID)
Courses (CourseID, CourseDescription, CourseInstructorID)
Advisors (AdvisorID, AdvisorName, AdvisorPhone)
Instructors (InstructorID, InstructorName, InstructorPhone)
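As a rough check that the final design still satisfies the rule of being able to reconstruct the original flat view, the five tables can be built with sqlite3 and joined back together. The schema follows the slides; all numeric ID values and the course ID "CPP1" are assumptions, and phone and address columns are simply left empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Advisors (
    AdvisorID INTEGER PRIMARY KEY, AdvisorName TEXT NOT NULL, AdvisorPhone TEXT);
CREATE TABLE Instructors (
    InstructorID INTEGER PRIMARY KEY, InstructorName TEXT NOT NULL, InstructorPhone TEXT);
CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY, StudentName TEXT NOT NULL, StudentPhone TEXT,
    StudentAddress TEXT, StudentCity TEXT, StudentState TEXT, StudentZIP TEXT,
    StudentAdvisorID INTEGER REFERENCES Advisors(AdvisorID));
CREATE TABLE Courses (
    CourseID TEXT PRIMARY KEY, CourseDescription TEXT NOT NULL,
    CourseInstructorID INTEGER REFERENCES Instructors(InstructorID));
CREATE TABLE StudentCourses (
    SCStudentID INTEGER NOT NULL REFERENCES Students(StudentID),
    SCCourseID  TEXT    NOT NULL REFERENCES Courses(CourseID),
    PRIMARY KEY (SCStudentID, SCCourseID));
""")

# Names come from the sample table; every numeric ID is an assumed key value.
conn.executemany("INSERT INTO Advisors (AdvisorID, AdvisorName) VALUES (?, ?)", [
    (1, "Bill Clinton"), (2, "George Bush"), (3, "Ronald Ragan"), (4, "Jimmy Carter")])
conn.executemany("INSERT INTO Instructors (InstructorID, InstructorName) VALUES (?, ?)", [
    (1, "Bruce Lee"), (2, "Joe Killy"), (3, "Dan Ciuhan")])
conn.executemany(
    "INSERT INTO Students (StudentID, StudentName, StudentAdvisorID) VALUES (?, ?, ?)", [
        (1, "Al Gore", 1), (2, "Dan Quayle", 2), (3, "George Bush", 3),
        (4, "Walter Mondale", 4)])
conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)", [
    ("VB1", "Intro to Visual Basic", 1),
    ("DAO1", "Intro to DAO Programming", 2),
    ("API1", "API Programming", 3),
    # Insert anomaly gone: a course with no instructor or students yet.
    ("CPP1", "Intro to C++", None),  # "CPP1" is a hypothetical course ID
])
conn.executemany("INSERT INTO StudentCourses VALUES (?, ?)", [
    (1, "VB1"), (2, "DAO1"), (3, "API1"), (4, "VB1")])

# Rebuild the original flat view from the normalized tables.
flat_view = conn.execute("""
    SELECT s.StudentName, a.AdvisorName, c.CourseID,
           c.CourseDescription, i.InstructorName
    FROM StudentCourses AS sc
    JOIN Students    AS s ON s.StudentID    = sc.SCStudentID
    JOIN Advisors    AS a ON a.AdvisorID    = s.StudentAdvisorID
    JOIN Courses     AS c ON c.CourseID     = sc.SCCourseID
    JOIN Instructors AS i ON i.InstructorID = c.CourseInstructorID
    ORDER BY s.StudentID
""").fetchall()
course_count = conn.execute("SELECT COUNT(*) FROM Courses").fetchone()[0]

for row in flat_view:
    print(row)
```

The join returns the same four rows as the original flat table, while each advisor, instructor and course description is stored only once.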