Download - 141405 - DataBase Management Systems
-
7/27/2019 141405 - DataBase Management Systems
1/60
http://
csetub
e.co
.nr/
GKMCET
Lecture Plan
Code & Subject Name: 141405 & Database Management Systems
1
Unit Number & Name: I & Introduction Period: 1 of 45
A DBMS contains information about a particular enterprise:
- A collection of interrelated data
- A set of programs to access the data
- An environment that is both convenient and efficient to use

Database Applications:
- Banking: all transactions
- Airlines: reservations, schedules
- Universities: registration, grades
- Sales: customers, products, purchases
- Online retailers: order tracking, customized recommendations
- Manufacturing: production, inventory, orders, supply chain
- Human resources: employee records, salaries, tax deductions

Databases touch all aspects of our lives.

In the early days, database applications were built directly on top of file systems.

Drawbacks of using file systems to store data:
- Data redundancy and inconsistency: multiple file formats, duplication of information in different files
- Difficulty in accessing data: need to write a new program to carry out each new task
- Data isolation: multiple files and formats
- Integrity problems:
  - Integrity constraints (e.g. account balance > 0) become buried in program code rather than being stated explicitly
  - Hard to add new constraints or change existing ones
- Atomicity of updates:
  - Failures may leave the database in an inconsistent state with partial updates carried out
  - Example: a transfer of funds from one account to another should either complete or not happen at all
- Concurrent access by multiple users:
  - Concurrent access is needed for performance
  - Uncontrolled concurrent accesses can lead to inconsistencies
  - Example: two people reading a balance and updating it at the same time
- Security problems: hard to provide user access to some, but not all, data

Database systems offer solutions to all the above problems.
Unit Number & Name: I & Introduction Period: 2 of 45
Levels of Abstraction
- Physical level: describes how a record (e.g., customer) is stored.
- Logical level: describes data stored in the database, and the relationships among the data.

    type customer = record
        customer_id : string;
        customer_name : string;
        customer_street : string;
        customer_city : string;
    end;

- View level: application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes.

View of Data
Unit Number & Name: I & Introduction Period: 3 of 45
Instances and Schemas
Similar to types and variables in programming languages.
- Schema: the logical structure of the database
  - Example: the database consists of information about a set of customers and accounts and the relationship between them
  - Analogous to type information of a variable in a program
  - Physical schema: database design at the physical level
  - Logical schema: database design at the logical level
- Instance: the actual content of the database at a particular point in time
  - Analogous to the value of a variable
- Physical data independence: the ability to modify the physical schema without changing the logical schema
  - Applications depend on the logical schema
  - In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.

Data Models
A collection of tools for describing:
- Data
- Data relationships
- Data semantics
- Data constraints

Kinds of data model:
- Relational model
- Entity-Relationship data model (mainly for database design)
- Object-based data models (object-oriented and object-relational)
- Semistructured data model (XML)
- Other older models:
  - Network model
  - Hierarchical model
Unit Number & Name: I & Introduction Period: 4 of 45
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by the appropriate data model
- DML is also known as a query language
Two classes of languages:
- Procedural: the user specifies what data is required and how to get those data
- Declarative (nonprocedural): the user specifies what data is required without specifying how to get those data
SQL is the most widely used query language.
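The difference between the two classes of languages can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module as a stand-in DBMS; the sample rows and the threshold 450 are assumptions made up for illustration and do not come from the lecture plan.

```python
import sqlite3

# Hypothetical sample rows for the account relation (illustrative values only).
rows = [("A-101", "Downtown", 500), ("A-215", "Mianus", 700), ("A-102", "Perryridge", 400)]

# Procedural style: spell out how to scan the data and which rows to keep.
procedural = []
for account_number, branch_name, balance in rows:
    if balance > 450:
        procedural.append(account_number)
procedural.sort()

# Declarative style: state what is wanted and let the DBMS decide how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number char(10), branch_name char(10), balance integer)")
conn.executemany("insert into account values (?, ?, ?)", rows)
declarative = [r[0] for r in conn.execute(
    "select account_number from account where balance > 450 order by account_number")]

print(procedural, declarative)
```

Both approaches return the same account numbers; only the declarative one leaves the access strategy to the system.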
Data Definition Language (DDL)
Specification notation for defining the database schema
Example:
    create table account (
        account_number char(10),
        branch_name char(10),
        balance integer)
The DDL compiler generates a set of tables stored in a data dictionary.
The data dictionary contains metadata (i.e., data about data):
- Database schema
- Data storage and definition language
  - Specifies the storage structure and access methods used
- Integrity constraints
  - Domain constraints
  - Referential integrity (e.g. branch_name must correspond to a valid branch in the branch table)
- Authorization
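As a rough illustration of a data dictionary, the sketch below runs the create table statement in SQLite and then reads back the metadata SQLite keeps about it. The catalog table sqlite_master and the table_info pragma are SQLite-specific; other systems expose their dictionaries differently, so treat this as one possible instance and not as the general mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table account (
    account_number char(10),
    branch_name char(10),
    balance integer)""")

# sqlite_master is SQLite's catalog table: one row of metadata per schema object.
catalog = conn.execute("select type, name from sqlite_master").fetchall()

# pragma table_info returns one row per column; the column name is at index 1.
column_names = [c[1] for c in conn.execute("pragma table_info(account)")]

print(catalog, column_names)
```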
Unit Number & Name: I & Introduction Period: 5 of 45
Overall System Structure
The architecture of a database system is greatly influenced by the underlying computer system on which the database is running:
- Centralized
- Client-server
- Parallel (multiple processors and disks)
- Distributed
Unit Number & Name: I & Introduction Period: 6 of 45
Database Users
Users are differentiated by the way they expect to interact with the system:
- Application programmers: interact with the system through DML calls
- Sophisticated users: form requests in a database query language
- Specialized users: write specialized database applications that do not fit into the traditional data processing framework
- Naive users: invoke one of the permanent application programs that have been written previously
  - Examples: people accessing a database over the web, bank tellers, clerical staff
Database Administrator
- Coordinates all the activities of the database system
- Has a good understanding of the enterprise's information resources and needs
Database administrator's duties include:
Storage structure and access method definition
Schema and physical organization modification
Granting users authority to access the database
Backing up data
Monitoring performance and responding to changes
Database tuning
Unit Number & Name: I & Introduction Period: 7 of 45
The Entity-Relationship Model
Models an enterprise as a collection of entities and relationships
- Entity: a thing or object in the enterprise that is distinguishable from other objects
  - Described by a set of attributes
- Relationship: an association among several entities
Represented diagrammatically by an entity-relationship diagram (diagram not reproduced).
Unit Number & Name: I & Introduction Period: 8 of 45
Entity Sets customer and loan
- customer: customer_id, customer_name, customer_street, customer_city
- loan: loan_number, amount
Unit Number & Name: I & Introduction Period: 9 of 45
Relational Model
Structure of Relational Databases
Fundamental Relational-Algebra-Operations
Additional Relational-Algebra-Operations
Extended Relational-Algebra-Operations
Null Values
Modification of the Database
Example of a Relation
Unit Number & Name: II & RELATIONAL MODEL Period: 10 of 45
The Relational Model - The Catalog - Types - Keys
Each attribute of a relation has a name.
The set of allowed values for each attribute is called the domain of the attribute.
Attribute values are (normally) required to be atomic; that is, indivisible.
- E.g. the value of an attribute can be an account number, but cannot be a set of account numbers
A domain is said to be atomic if all its members are atomic.
The special value null is a member of every domain.
The null value causes complications in the definition of many operations.
- We shall ignore the effect of null values in our main presentation and consider their effect later.

Keys
Let K ⊆ R.
K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R).
- By "possible r" we mean a relation r that could exist in the enterprise we are modeling.
- Example: {customer_name, customer_street} and {customer_name} are both superkeys of Customer, if no two customers can possibly have the same name.
  - In real life, an attribute such as customer_id would be used instead of customer_name to uniquely identify customers, but we omit it to keep our examples small, and instead assume customer names are unique.
K is a candidate key if K is minimal.
- Example: {customer_name} is a candidate key for Customer, since it is a superkey and no subset of it is a superkey.
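The definitions of superkey and candidate key can be explored on a single relation instance, as in the sketch below. The customer rows and both helper functions are hypothetical illustrations. Note the limitation: a key is a statement about every possible relation, so a check on one instance can refute a superkey but can never prove one.

```python
from itertools import combinations

def is_superkey_on(instance, attrs):
    """True if no two tuples of this one instance agree on all of attrs."""
    seen = set()
    for row in instance:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

def candidate_keys_on(instance, schema):
    """Minimal attribute sets that act as superkeys on this one instance."""
    keys = []
    for size in range(1, len(schema) + 1):
        for attrs in combinations(schema, size):
            minimal = not any(set(k) <= set(attrs) for k in keys)
            if minimal and is_superkey_on(instance, attrs):
                keys.append(attrs)
    return keys

schema = ("customer_name", "customer_street", "customer_city")
# Hypothetical sample tuples, for illustration only.
customer = [
    {"customer_name": "Jones", "customer_street": "Main", "customer_city": "Harrison"},
    {"customer_name": "Smith", "customer_street": "Main", "customer_city": "Rye"},
    {"customer_name": "Hayes", "customer_street": "North", "customer_city": "Harrison"},
]

print(candidate_keys_on(customer, schema))
```

On this sample, {customer_street, customer_city} also comes out as minimal and unique, which shows why keys must be declared from knowledge of the enterprise instead of being inferred from the data at hand.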
Primary key: a candidate key chosen as the principal means of identifying tuples within a relation
- Should choose an attribute whose value never, or very rarely, changes.
- E.g. email address is unique, but may change

Relational Algebra
Procedural language
Six basic operators:
- select: σ
- project: Π
- union: ∪
- set difference: −
- Cartesian product: ×
- rename: ρ
The operators take one or two relations as inputs and produce a new relation as a result.
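The six basic operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a relation is modelled as a set of tuples with a separate schema tuple, the function names are invented here, and the sample values "a" and "b" are placeholders.

```python
# Relations are modelled as Python sets of tuples; the schema is a separate tuple of names.
def select(r, pred):
    """sigma: keep the tuples that satisfy the predicate."""
    return {t for t in r if pred(t)}

def project(r, schema, attrs):
    """Pi: keep only the named columns; duplicates vanish because the result is a set."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in r}

def union(r, s):
    return r | s

def difference(r, s):
    return r - s

def product(r, s):
    """Cartesian product: concatenate every tuple of r with every tuple of s."""
    return {t + u for t in r for u in s}

def rename(schema, mapping):
    """rho: renaming only changes attribute names, not the tuples."""
    return tuple(mapping.get(a, a) for a in schema)

schema = ("A", "B")
r = {("a", 1), ("a", 2), ("b", 1)}
s = {("a", 2), ("b", 3)}

print(select(r, lambda t: t[1] == 1))
print(project(r, schema, ["A"]))
print(union(r, s), difference(r, s), len(product(r, s)))
```

Each function takes one or two relations and returns a new relation, so the operators compose freely, for example `project(select(r, ...), schema, ["A"])`.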
Select Operation Example
Relation r:
    A  B  C   D
    α  α  1   7
    α  β  5   7
    β  β  12  3
    β  β  23  10
σ A=B ∧ D>5 (r):
    A  B  C   D
    α  α  1   7
    β  β  23  10
Project Operation Example
Relation r:
    A  B   C
    α  10  1
    α  20  1
    β  30  1
    β  40  2
Π A,C (r):
    A  C         A  C
    α  1         α  1
    α  1    =    β  1
    β  1         β  2
    β  2
Union Operation Example
Relations r, s:
    r:           s:
    A  B         A  B
    α  1         α  2
    α  2         β  3
    β  1
r ∪ s:
    A  B
    α  1
    α  2
    β  1
    β  3
Set Difference Operation Example
Relations r, s:
    r:           s:
    A  B         A  B
    α  1         α  2
    α  2         β  3
    β  1
r − s:
    A  B
    α  1
    β  1
Cartesian-Product Operation Example
Relations r, s:
    r:           s:
    A  B         C  D   E
    α  1         α  10  a
    β  2         β  10  a
                 β  20  b
                 γ  10  b
r × s:
    A  B  C  D   E
    α  1  α  10  a
    α  1  β  10  a
    α  1  β  20  b
    α  1  γ  10  b
    β  2  α  10  a
    β  2  β  10  a
    β  2  β  20  b
    β  2  γ  10  b
Unit Number & Name: II & RELATIONAL MODEL Period: 12 of 45
Tuple Relational Calculus - Domain Relational Calculus

Tuple Relational Calculus
A nonprocedural query language, where each query is of the form
    {t | P(t)}
- It is the set of all tuples t such that predicate P is true for t
- t is a tuple variable; t[A] denotes the value of tuple t on attribute A
- t ∈ r denotes that tuple t is in relation r
- P is a formula similar to that of the predicate calculus

Predicate Calculus Formula
1. Set of attributes and constants
2. Set of comparison operators (e.g., <, ≤, =, ≠, >, ≥)
3. Set of connectives: and (∧), or (∨), not (¬)
4. Implication (⇒): x ⇒ y, if x is true, then y is true
       x ⇒ y ≡ ¬x ∨ y
5. Set of quantifiers:
   - ∃ t ∈ r (Q(t)): there exists a tuple t in relation r such that predicate Q(t) is true
   - ∀ t ∈ r (Q(t)): Q is true for all tuples t in relation r

Banking Example
    branch (branch_name, branch_city, assets)
    customer (customer_name, customer_street, customer_city)
    account (account_number, branch_name, balance)
    loan (loan_number, branch_name, amount)
    depositor (customer_name, account_number)
    borrower (customer_name, loan_number)

Example Queries
- Find the loan_number, branch_name, and amount for loans of over $1200:
      {t | t ∈ loan ∧ t[amount] > 1200}
- Find the loan number for each loan of an amount greater than $1200:
      {t | ∃ s ∈ loan (t[loan_number] = s[loan_number] ∧ s[amount] > 1200)}
  Notice that a relation on schema [loan_number] is implicitly defined by the query.
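Tuple relational calculus expressions map closely onto set comprehensions, which gives a quick way to experiment with them. In the sketch below the loan tuples are hypothetical sample data, and Python's any and all stand in for the quantifiers ∃ and ∀.

```python
from collections import namedtuple

Loan = namedtuple("Loan", "loan_number branch_name amount")

# Hypothetical sample instance of the loan relation.
loan = {
    Loan("L-11", "Round Hill", 900),
    Loan("L-14", "Downtown", 1500),
    Loan("L-15", "Perryridge", 1500),
    Loan("L-16", "Perryridge", 1300),
}

# {t | t in loan and t[amount] > 1200}
q1 = {t for t in loan if t.amount > 1200}

# {t | exists s in loan (t[loan_number] = s[loan_number] and s[amount] > 1200)}
# The result is a relation on the schema (loan_number) only.
q2 = {(s.loan_number,) for s in loan if s.amount > 1200}

# Quantifiers: exists t in loan (Q(t)) and forall t in loan (Q(t)).
some_large = any(t.amount > 1200 for t in loan)
all_large = all(t.amount > 1200 for t in loan)

print(sorted(q2), some_large, all_large)
```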
Domain Relational Calculus
A nonprocedural query language equivalent in power to the tuple relational calculus.
Each query is an expression of the form:
    {⟨x1, x2, …, xn⟩ | P(x1, x2, …, xn)}
- x1, x2, …, xn represent domain variables
- P represents a formula similar to that of the predicate calculus

Example Queries
- Find the loan_number, branch_name, and amount for loans of over $1200:
      {⟨l, b, a⟩ | ⟨l, b, a⟩ ∈ loan ∧ a > 1200}
- Find the names of all customers who have a loan of over $1200:
      {⟨c⟩ | ∃ l, b, a (⟨c, l⟩ ∈ borrower ∧ ⟨l, b, a⟩ ∈ loan ∧ a > 1200)}
- Find the names of all customers who have a loan from the Perryridge branch and the loan amount:
      {⟨c, a⟩ | ∃ l (⟨c, l⟩ ∈ borrower ∧ ∃ b (⟨l, b, a⟩ ∈ loan ∧ b = "Perryridge"))}
  or, equivalently,
      {⟨c, a⟩ | ∃ l (⟨c, l⟩ ∈ borrower ∧ ⟨l, "Perryridge", a⟩ ∈ loan)}
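Domain relational calculus ranges over individual attribute values instead of whole tuples, which corresponds to unpacking tuples in a comprehension. The following sketch evaluates the three example queries on hypothetical sample data; the relation contents are assumptions made for illustration.

```python
# Hypothetical sample instances: loan(loan_number, branch_name, amount),
# borrower(customer_name, loan_number).
loan = {
    ("L-11", "Round Hill", 900),
    ("L-14", "Downtown", 1500),
    ("L-15", "Perryridge", 1500),
    ("L-16", "Perryridge", 1300),
}
borrower = {("Smith", "L-11"), ("Jackson", "L-14"), ("Hayes", "L-15"), ("Adams", "L-16")}

# {<l, b, a> | <l, b, a> in loan and a > 1200}
q1 = {(l, b, a) for (l, b, a) in loan if a > 1200}

# {<c> | exists l, b, a (<c, l> in borrower and <l, b, a> in loan and a > 1200)}
q2 = {c for (c, l) in borrower
      if any(l2 == l and a > 1200 for (l2, b, a) in loan)}

# {<c, a> | exists l (<c, l> in borrower and <l, "Perryridge", a> in loan)}
q3 = {(c, a) for (c, l) in borrower
      for (l2, b, a) in loan if l2 == l and b == "Perryridge"}

print(sorted(q2), sorted(q3))
```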
Unit Number & Name: II & RELATIONAL MODEL Period: 13 of 45
Example Queries
- Find the names of all customers having a loan, an account, or both at the Perryridge branch:
      {⟨c⟩ | ∃ l (⟨c, l⟩ ∈ borrower
                 ∧ ∃ b, a (⟨l, b, a⟩ ∈ loan ∧ b = "Perryridge"))
           ∨ ∃ a (⟨c, a⟩ ∈ depositor
                 ∧ ∃ b, n (⟨a, b, n⟩ ∈ account ∧ b = "Perryridge"))}
- Find the names of all customers who have an account at all branches located in Brooklyn:
      {⟨c⟩ | ∃ s, n (⟨c, s, n⟩ ∈ customer) ∧
             ∀ x, y, z (⟨x, y, z⟩ ∈ branch ∧ y = "Brooklyn" ⇒
                 ∃ a, b (⟨a, x, b⟩ ∈ account ∧ ⟨c, a⟩ ∈ depositor))}
Unit Number & Name: II & RELATIONAL MODEL Period: 14 of 45
SQL
- Data Definition
- Basic Query Structure
- Set Operations
- Aggregate Functions
- Null Values
- Nested Subqueries
- Complex Queries
- Views
- Modification of the Database
- Joined Relations

Domain Types in SQL
- char(n): Fixed length character string, with user-specified length n.
- varchar(n): Variable length character strings, with user-specified maximum length n.
- int: Integer (a finite subset of the integers that is machine-dependent).
- smallint: Small integer (a machine-dependent subset of the integer domain type).
- numeric(p,d): Fixed point number, with user-specified precision of p digits, with d digits to the right of the decimal point.
- real, double precision: Floating point and double-precision floating point numbers, with machine-dependent precision.
- float(n): Floating point number, with user-specified precision of at least n digits.

Integrity Constraints
Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency.
- A checking account must have a balance greater than $10,000.00
- A salary of a bank employee must be at least $4.00 an hour
- A customer must have a (non-null) phone number
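Two of the constraints listed above can be stated declaratively with check and not null. The sketch below tries this in SQLite through Python's sqlite3 module; the employee table, its column names, and the sample rows are hypothetical, chosen only to mirror the wage and phone-number rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the wage rule and the non-null phone rule are stated in the schema.
conn.execute("""create table employee (
    name varchar(20) primary key,
    hourly_wage numeric(6,2) check (hourly_wage >= 4.00),
    phone varchar(15) not null)""")

def try_insert(row):
    """Attempt an insert and report whether the DBMS accepted it."""
    try:
        conn.execute("insert into employee values (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert(("Smith", 3.00, "555-0101")),  # wage below the minimum
    try_insert(("Hayes", 9.00, None)),        # missing phone number
    try_insert(("Adams", 4.00, "555-0102")),  # satisfies both constraints
]
print(results)
```

Because the rules live in the schema, every program that updates the table is subject to them, which is the point made earlier about constraints buried in application code.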
Unit Number & Name: II & RELATIONAL MODEL Period: 15 of 45
Advanced SQL
- SQL Data Types and Schemas
- Integrity Constraints
- Authorization
- Embedded SQL
- Dynamic SQL
- Functions and Procedural Constructs
- Recursive Queries
- Advanced SQL Features

Built-in Data Types in SQL
- date: Dates, containing a (4 digit) year, month and date
  - Example: date '2005-7-27'
- time: Time of day, in hours, minutes and seconds
  - Example: time '09:00:30', time '09:00:30.75'
- timestamp: date plus time of day
  - Example: timestamp '2005-7-27 09:00:30.75'
- interval: period of time
  - Example: interval '1' day
  - Subtracting a date/time/timestamp value from another gives an interval value
  - Interval values can be added to date/time/timestamp values
- Can extract values of individual fields from date/time/timestamp
  - Example: extract (year from r.starttime)
- Can cast string types to date/time/timestamp
  - Example: cast <string-valued-expression> as date
  - Example: cast <string-valued-expression> as time
Referential Integrity in SQL - Example
    create table customer
        (customer_name char(20),
         customer_street char(30),
         customer_city char(30),
         primary key (customer_name))
    create table branch
        (branch_name char(15),
         branch_city char(30),
         assets numeric(12,2),
         primary key (branch_name))
    create table account
        (account_number char(10),
         branch_name char(15),
         balance integer,
         primary key (account_number),
         foreign key (branch_name) references branch)
    create table depositor
        (customer_name char(20),
         account_number char(10),
         primary key (customer_name, account_number),
         foreign key (account_number) references account,
         foreign key (customer_name) references customer)
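To see a foreign key rejecting an inconsistent row, the branch and account definitions can be run in SQLite, as sketched below. Two assumptions apply: SQLite enforces foreign keys only after `pragma foreign_keys = on`, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite-specific: enforcement is off by default

conn.execute("""create table branch
    (branch_name char(15),
     branch_city char(30),
     assets numeric(12,2),
     primary key (branch_name))""")
conn.execute("""create table account
    (account_number char(10),
     branch_name char(15),
     balance integer,
     primary key (account_number),
     foreign key (branch_name) references branch)""")

# Hypothetical sample rows.
conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")
conn.execute("insert into account values ('A-102', 'Perryridge', 400)")

try:
    # 'Nowhere' is not a branch_name in branch, so this row violates the foreign key.
    conn.execute("insert into account values ('A-999', 'Nowhere', 100)")
    outcome = "accepted"
except sqlite3.IntegrityError:
    outcome = "rejected"

print(outcome)
```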
Privileges in SQL
- select: allows read access to a relation, or the ability to query using the view
  - Example: grant users U1, U2, and U3 select authorization on the branch relation:
        grant select on branch to U1, U2, U3
- insert: the ability to insert tuples
- update: the ability to update using the SQL update statement
- delete: the ability to delete tuples
- all privileges: used as a short form for all the allowable privileges

Revoking Authorization in SQL
The revoke statement is used to revoke authorization:
    revoke <privilege list>
    on <relation name or view name> from <user list>
Example:
    revoke select on branch from U1, U2, U3
- All privileges that depend on the privilege being revoked are also revoked.
- <privilege list> may be all to revoke all privileges the revokee may hold.
- If the same privilege was granted twice to the same user by different grantors, the user may retain the privilege after the revocation.
Unit Number & Name: II & RELATIONAL MODEL Period: 16 of 45
Embedded SQL
- The SQL standard defines embeddings of SQL in a variety of programming languages such as C, Java, and Cobol.
- A language to which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.
- The basic form of these languages follows that of the System R embedding of SQL into PL/I.
- The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
      EXEC SQL <embedded SQL statement> END_EXEC
  Note: this varies by language (for example, the Java embedding uses #SQL { ... }; )

Example: from within a host language, find the names and cities of customers with more than the variable amount dollars in some account.
- Specify the query in SQL and declare a cursor for it:
      EXEC SQL
          declare c cursor for
          select depositor.customer_name, customer_city
          from depositor, customer, account
          where depositor.customer_name = customer.customer_name
            and depositor.account_number = account.account_number
            and account.balance > :amount
      END_EXEC
- The open statement causes the query to be evaluated:
      EXEC SQL open c END_EXEC
- The fetch statement causes the values of one tuple in the query result to be placed in host language variables:
      EXEC SQL fetch c into :cn, :cc END_EXEC
  Repeated calls to fetch get successive tuples in the query result.
- A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to indicate no more data is available.
- The close statement causes the database system to delete the temporary relation that holds the result of the query:
      EXEC SQL close c END_EXEC
Note: the above details vary with language. For example, the Java embedding defines Java iterators to step through result tuples.
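The declare/open/fetch/close cycle has a close analogue in most host-language database APIs. The sketch below uses Python's sqlite3 module instead of an EXEC SQL preprocessor, so it is an analogy and not embedded SQL proper; the table contents are hypothetical, and `fetchone()` returning None plays the role of SQLSTATE '02000'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for the three relations used by the query.
conn.executescript("""
create table customer (customer_name char(20), customer_street char(30), customer_city char(30));
create table account (account_number char(10), branch_name char(15), balance integer);
create table depositor (customer_name char(20), account_number char(10));
insert into customer values ('Hayes', 'Main', 'Harrison');
insert into customer values ('Jones', 'Main', 'Harrison');
insert into account values ('A-102', 'Perryridge', 400);
insert into account values ('A-217', 'Brighton', 750);
insert into depositor values ('Hayes', 'A-102');
insert into depositor values ('Jones', 'A-217');
""")

amount = 500  # host-language variable, bound to :amount below

c = conn.cursor()
c.execute("""select depositor.customer_name, customer_city
             from depositor, customer, account
             where depositor.customer_name = customer.customer_name
               and depositor.account_number = account.account_number
               and account.balance > :amount""", {"amount": amount})  # like "open c"

found = []
while True:
    row = c.fetchone()   # like "fetch c into :cn, :cc"
    if row is None:      # no more data
        break
    cn, cc = row
    found.append((cn, cc))
c.close()                # like "close c"

print(found)
```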
Dynamic SQL
Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program:
    char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
    EXEC SQL prepare dynprog from :sqlprog;
    char account[10] = "A-101";
    EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is provided when the SQL program is executed.
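The same placeholder idea appears in most database APIs. As a minimal sketch, Python's sqlite3 module accepts the statement as a run-time string and binds the ? when the statement is executed; the account rows here are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number char(10), branch_name char(15), balance real)")
conn.executemany("insert into account values (?, ?, ?)",
                 [("A-101", "Downtown", 500.0), ("A-102", "Perryridge", 400.0)])

# The statement text is an ordinary string built at run time.
sqlprog = "update account set balance = balance * 1.05 where account_number = ?"
account = "A-101"

# The value for ? is supplied at execution time, not spliced into the string.
conn.execute(sqlprog, (account,))

balances = dict(conn.execute("select account_number, balance from account"))
print(balances)
```

Binding values through placeholders, instead of concatenating them into the statement text, also avoids quoting mistakes in the generated SQL.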
Unit Number & Name: III & DATABASE DESIGN Period: 19 of 45
Functional Dependencies
- Constraints on the set of legal relations.
- Require that the value for a certain set of attributes determines uniquely the value for another set of attributes.
- A functional dependency is a generalization of the notion of a key.

Let R be a relation schema, with
    α ⊆ R and β ⊆ R
The functional dependency
    α → β
holds on R if and only if for any legal relations r(R), whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β. That is,
    t1[α] = t2[α] ⇒ t1[β] = t2[β]
Example: Consider r(A, B) with the following instance of r (instance table not reproduced).
On this instance, A → B does NOT hold, but B → A does hold.

K is a superkey for relation schema R if and only if K → R.
K is a candidate key for R if and only if
- K → R, and
- for no α ⊂ K, α → R

Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:
    bor_loan = (customer_id, loan_number, amount)
We expect this functional dependency to hold:
    loan_number → amount
but would not expect the following to hold:
    amount → customer_name

A functional dependency is trivial if it is satisfied by all instances of a relation.
- Example:
      customer_name, loan_number → customer_name
      customer_name → customer_name
- In general, α → β is trivial if β ⊆ α
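The definition t1[α] = t2[α] ⇒ t1[β] = t2[β] translates directly into a check over one relation instance. In the sketch below the helper names are invented, and the three-tuple instance of r(A, B) is a hypothetical one, chosen so that A → B fails while B → A holds, as in the example above.

```python
def holds(instance, lhs, rhs):
    """True if the dependency lhs -> rhs is satisfied by this instance (a list of dicts)."""
    seen = {}
    for t in instance:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        # The first tuple with this lhs value fixes the rhs value every later one must match.
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_trivial(lhs, rhs):
    """lhs -> rhs is trivial when rhs is a subset of lhs."""
    return set(rhs) <= set(lhs)

# Hypothetical instance of r(A, B).
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]

print(holds(r, ["A"], ["B"]), holds(r, ["B"], ["A"]))
print(is_trivial(["customer_name", "loan_number"], ["customer_name"]))
```

As with keys, a check on one instance can show that a dependency fails, but a dependency that holds on one instance is not thereby guaranteed to hold on all legal relations.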
Unit Number & Name: III & DATABASE DESIGN Period: 21 of 45
Functional Dependencies - First, Second, Third Normal Forms

First Normal Form
- A domain is atomic if its elements are considered to be indivisible units
  - Examples of non-atomic domains:
    - Set of names, composite attributes
    - Identification numbers like CS101 that can be broken up into parts
- A relational schema R is in first normal form if the domains of all attributes of R are atomic
- Non-atomic values complicate storage and encourage redundant (repeated) storage of data
  - Example: set of accounts stored with each customer, and set of owners stored with each account
  - We assume all relations are in first normal form
- Atomicity is actually a property of how the elements of the domain are used.
  - Example: strings would normally be considered indivisible
  - Suppose that students are given roll numbers which are strings of the form CS0012 or EE1127
  - If the first two characters are extracted to find the department, the domain of roll numbers is not atomic.
  - Doing so is a bad idea: it leads to encoding of information in the application program rather than in the database.

Third Normal Form
A relation schema R is in third normal form (3NF) if for all
    α → β in F+
at least one of the following holds:
- α → β is trivial (i.e., β ⊆ α)
- α is a superkey for R
- Each attribute A in β − α is contained in a candidate key for R.
  (NOTE: each attribute may be in a different candidate key)
If a relation is in BCNF it is in 3NF (since in BCNF one of the first two conditions above must hold).
The third condition is a minimal relaxation of BCNF to ensure dependency preservation.
Unit Number & Name: III & DATABASE DESIGN Period: 22 of 45
Functional Dependencies First, Second, Third Normal Forms
Third Normal Form: Motivation
There are some situations where
- BCNF is not dependency preserving, and
- efficient checking for FD violation on updates is important
Solution: define a weaker normal form, called Third Normal Form (3NF)
- Allows some redundancy
- But functional dependencies can be checked on individual relations without computing a join.
- There is always a lossless-join, dependency-preserving decomposition into 3NF.

Example
Relation R:
    R = (J, K, L)
    F = {JK → L, L → K}
- Two candidate keys: JK and JL
- R is in 3NF
  - JK → L: JK is a superkey
  - L → K: K is contained in a candidate key

Testing for 3NF
- Optimization: need to check only FDs in F, need not check all FDs in F+.
- Use attribute closure to check, for each dependency α → β, if α is a superkey.
- If α is not a superkey, we have to verify if each attribute in β is contained in a candidate key of R
  - This test is rather more expensive, since it involves finding candidate keys
  - Testing for 3NF has been shown to be NP-hard
  - Interestingly, decomposition into third normal form (described shortly) can be done in polynomial time
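The test described above can be sketched with an attribute-closure routine and a brute-force search for candidate keys. The function names are invented for this illustration, schemas are written as strings of single-letter attributes, and the key search enumerates all attribute subsets, so it is exponential and only suitable for small classroom examples.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Brute force: minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for attrs in combinations(sorted(schema), size):
            minimal = not any(set(k) <= set(attrs) for k in keys)
            if minimal and closure(attrs, fds) == set(schema):
                keys.append(attrs)
    return keys

def is_3nf(schema, fds):
    """Check the three 3NF conditions for every dependency in F."""
    prime = {a for key in candidate_keys(schema, fds) for a in key}
    for lhs, rhs in fds:
        if set(rhs) <= set(lhs):                 # trivial dependency
            continue
        if closure(lhs, fds) == set(schema):     # lhs is a superkey
            continue
        if not (set(rhs) - set(lhs)) <= prime:   # some attribute of rhs - lhs is in no candidate key
            return False
    return True

F = [("JK", "L"), ("L", "K")]
print(candidate_keys("JKL", F), is_3nf("JKL", F))
```

On R = (J, K, L) this reports the candidate keys JK and JL and confirms 3NF, matching the example above; the schema (A, B, C) with A → B and B → C fails the test because C is in no candidate key.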
Unit Number & Name: III & DATABASE DESIGN Period: 23 of 45
Dependency Preservation
Let Fi be the set of dependencies in F+ that include only attributes in Ri.
- A decomposition is dependency preserving if
      (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
- If it is not, then checking updates for violation of functional dependencies may require computing joins, which is expensive.

Testing for Dependency Preservation
To check if a dependency α → β is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done with respect to F):
    result = α
    while (changes to result) do
        for each Ri in the decomposition
            t = (result ∩ Ri)+ ∩ Ri
            result = result ∪ t
- If result contains all attributes in β, then the functional dependency α → β is preserved.
- We apply the test on all dependencies in F to check if a decomposition is dependency preserving.
- This procedure takes polynomial time, instead of the exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+.

Example
    R = (A, B, C)
    F = {A → B, B → C}
    Key = {A}
- R is not in BCNF
- Decomposition R1 = (A, B), R2 = (B, C)
  - R1 and R2 in BCNF
  - Lossless-join decomposition
  - Dependency preserving
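The test above is short enough to transcribe almost line for line. The sketch below does so in Python, with schemas written as strings of single-letter attributes; the function names are invented, and the second decomposition, of R = (J, K, L) into (J, L) and (L, K), is added here as a contrasting case that is not dependency preserving.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserved(alpha, beta, decomposition, fds):
    """Polynomial-time test: is alpha -> beta preserved by the decomposition?"""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            ri = set(ri)
            t = closure(result & ri, fds) & ri   # t = (result ∩ Ri)+ ∩ Ri
            if not t <= result:
                result |= t                      # result = result ∪ t
                changed = True
    return set(beta) <= result

F = [("A", "B"), ("B", "C")]
ok = all(preserved(a, b, ["AB", "BC"], F) for a, b in F)

G = [("JK", "L"), ("L", "K")]
lost = [(a, b) for a, b in G if not preserved(a, b, ["JL", "LK"], G)]

print(ok, lost)
```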
Unit Number & Name: III & DATABASE DESIGN Period: 24 of 45
Boyce/Codd Normal Form
A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F+ of the form
    α → β
where α ⊆ R and β ⊆ R, at least one of the following holds:
- α → β is trivial (i.e., β ⊆ α)
- α is a superkey for R
Example schema not in BCNF:
    bor_loan = (customer_id, loan_number, amount)
because loan_number → amount holds on bor_loan but loan_number is not a superkey.

How good is BCNF?
- There are database schemas in BCNF that do not seem to be sufficiently normalized
- Consider a database
      classes (course, teacher, book)
  such that (c, t, b) ∈ classes means that t is qualified to teach c, and b is a required textbook for c
- The database is supposed to list for each course the set of teachers any one of which can be the course's instructor, and the set of books, all of which are required for the course
Unit Number & Name: III & DATABASE DESIGN Period: 25 of 45
Multi-valued Dependencies and Fourth Normal Form
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency
    α →→ β
holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:
    t1[α] = t2[α] = t3[α] = t4[α]
    t3[β] = t1[β]
    t3[R − β] = t2[R − β]
    t4[β] = t2[β]
    t4[R − β] = t1[R − β]

Tabular representation of α →→ β (table not reproduced)

Example
Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets
    Y, Z, W
We say that Y →→ Z (Y multidetermines Z) if and only if for all possible relations r(R), if
    ⟨y1, z1, w1⟩ ∈ r and ⟨y1, z2, w2⟩ ∈ r
then
    ⟨y1, z1, w2⟩ ∈ r and ⟨y1, z2, w1⟩ ∈ r
Note that since the behavior of Z and W are identical it follows that
    Y →→ Z if Y →→ W
In our example:
    course →→ teacher
    course →→ book
The above formal definition is supposed to formalize the notion that given a particular value of Y (course) it has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.
Note:
- If Y → Z then Y →→ Z
- Indeed we have (in the above notation) z1 = z2. The claim follows.

Use of Multivalued Dependencies
We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.
If a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r.
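Both uses can be tried out on the classes relation with a small sketch. The helpers below are invented for illustration: `satisfies_mvd` tests one instance against α →→ β, and `repair` builds r′ by repeatedly adding the tuples the definition requires. The teacher and book values are hypothetical sample data.

```python
def swaps(r, schema, alpha, beta):
    """All tuples t3 the definition demands: beta from t1, everything else from t2."""
    a_idx = [schema.index(x) for x in alpha]
    b_idx = {schema.index(x) for x in beta}
    out = set()
    for t1 in r:
        for t2 in r:
            if all(t1[i] == t2[i] for i in a_idx):
                out.add(tuple(t1[i] if i in b_idx else t2[i] for i in range(len(schema))))
    return out

def satisfies_mvd(r, schema, alpha, beta):
    """True if every required tuple is already present in this instance."""
    return swaps(r, schema, alpha, beta) <= set(r)

def repair(r, schema, alpha, beta):
    """Add missing tuples until the multivalued dependency holds."""
    r = set(r)
    while True:
        missing = swaps(r, schema, alpha, beta) - r
        if not missing:
            return r
        r |= missing

schema = ("course", "teacher", "book")
# Hypothetical instance: Hank is listed with only one of the two required books.
classes = {
    ("database", "Avi", "DB Concepts"),
    ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"),
}

before = satisfies_mvd(classes, schema, ["course"], ["teacher"])
fixed = repair(classes, schema, ["course"], ["teacher"])
after = satisfies_mvd(fixed, schema, ["course"], ["teacher"])
print(before, after, sorted(fixed - classes))
```

Checking every ordered pair (t1, t2) covers both t3 and t4 of the definition, since t4 is the t3 obtained when the roles of t1 and t2 are swapped.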
Fourth Normal Form
A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
- α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
- α is a superkey for schema R
If a relation is in 4NF it is in BCNF.
Unit Number & Name: III & DATABASE DESIGN Period: 26 of 45
Join Dependencies and Fifth Normal Form
Join dependencies generalize multivalued dependencies; they lead to project-join normal form (PJNF), also called fifth normal form.
A class of even more general constraints leads to a normal form called domain-key normal form.
Problem with these generalized constraints: they are hard to reason with, and no sound and complete set of inference rules exists.
Hence they are rarely used.
Unit Number & Name: IV & TRANSACTIONS Period: 28 of 45
Transaction Concepts - Transaction Recovery
A transaction is a unit of program execution that accesses and possibly updates various data items.
E.g. transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Two main issues to deal with:
- Failures of various kinds, such as hardware failures and system crashes
- Concurrent execution of multiple transactions
Example of Fund Transfer
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Atomicity requirement
- If the transaction fails after step 3 and before step 6, money will be lost, leading to an inconsistent database state.
- Failure could be due to software or hardware.
- The system should ensure that updates of a partially executed transaction are not reflected in the database.
Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist even if there are software or hardware failures.
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Consistency requirement in above example:
- The sum of A and B is unchanged by the execution of the transaction.
In general, consistency requirements include:
- Explicitly specified integrity constraints such as primary keys and foreign keys
- Implicit integrity constraints, e.g. sum of balances of all accounts, minus sum of loan amounts must equal value of cash-in-hand
A transaction must see a consistent database.
During transaction execution the database may be temporarily inconsistent.
When the transaction completes successfully the database must be consistent.
- Erroneous transaction logic can lead to inconsistency.
Isolation requirement: if between steps 3 and 6, another transaction T2 is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).

T1                      T2
1. read(A)
2. A := A - 50
3. write(A)
                        read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)

Isolation can be ensured trivially by running transactions serially, that is, one after the other.
However, executing multiple transactions concurrently has significant benefits, as we will see later.
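The isolation problem in the T1/T2 schedule can be reproduced with a small simulation. This is a sketch under assumptions: the database is a plain dict, the starting balances (100 and 200) are made up, and `observed_sum` is a hypothetical helper that lets T2 run right after a chosen step of T1.

```python
def t1_steps():
    """The six steps of the transfer transaction T1, as functions of the database."""
    local = {}  # T1's private work area
    return [
        lambda db: local.update(A=db["A"]),          # 1. read(A)
        lambda db: local.update(A=local["A"] - 50),  # 2. A := A - 50
        lambda db: db.update(A=local["A"]),          # 3. write(A)
        lambda db: local.update(B=db["B"]),          # 4. read(B)
        lambda db: local.update(B=local["B"] + 50),  # 5. B := B + 50
        lambda db: db.update(B=local["B"]),          # 6. write(B)
    ]

def t2_sum(db):
    """T2: read(A), read(B), print(A+B); here the sum is returned instead of printed."""
    return db["A"] + db["B"]

def observed_sum(run_t2_after_step):
    """Run T1 and let T2 execute immediately after the given step of T1."""
    db = {"A": 100, "B": 200}  # assumed starting balances
    seen = None
    for i, step in enumerate(t1_steps(), start=1):
        step(db)
        if i == run_t2_after_step:
            seen = t2_sum(db)
    return seen, db
```

Calling `observed_sum(3)` lets T2 see A already debited but B not yet credited, so it observes a sum that is 50 too low; `observed_sum(6)` corresponds to a serial execution and observes the correct sum.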
Unit Number & Name: IV & TRANSACTIONS Period: 29 of 45
ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data the database system must ensure:
- Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
- Consistency. Execution of a transaction in isolation preserves the consistency of the database.
- Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
  That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
- Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
System Recovery Media Recovery
- Failure Classification
- Storage Structure
- Recovery and Atomicity
- Log-Based Recovery
- Shadow Paging
- Recovery With Concurrent Transactions
- Buffer Management
- Failure with Loss of Nonvolatile Storage
- Advanced Recovery Techniques
- ARIES Recovery Algorithm
- Remote Backup Systems
Failure Classification
Transaction failure:
- Logical errors: transaction cannot complete due to some internal error condition
- System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock)
System crash: a power failure or other hardware or software failure causes the system to crash.
- Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by system crash
- Database systems have numerous integrity checks to prevent corruption of disk data
Disk failure: a head crash or similar disk failure destroys all or part of disk storage.
- Destruction is assumed to be detectable: disk drives use checksums to detect failures
Recovery Algorithms
Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures.
- Focus of this chapter
Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
Recovery and Atomicity
Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state.
Consider transaction Ti that transfers $50 from account A to account B; the goal is either to perform all database modifications made by Ti or none at all.
Several output operations may be required for Ti (to output A and B). A failure may occur after one of these modifications has been made but before all of them are made.
To ensure atomicity despite failures, we first output information describing the modifications to stable storage without modifying the database itself.
We study two approaches: log-based recovery and shadow paging.
We assume (initially) that transactions run serially, that is, one after the other.
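A minimal sketch of the log-based idea, assuming serial transactions: each modification is described by a log record written to stable storage, and after a crash the log is used to undo incomplete transactions and redo committed ones. The record layout (`"start"`, `"update"` with old and new value, `"commit"`) and the function name `recover` are assumptions made for this illustration, not a format prescribed by these notes.

```python
def recover(db, log):
    """Bring db to a state that reflects all committed and no uncommitted updates."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Undo pass: scan the log backwards and restore the old value of every
    # item written by a transaction that never committed.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, item, old, _ = rec
            db[item] = old

    # Redo pass: scan the log forwards and reapply the new value of every
    # item written by a committed transaction.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, item, _, new = rec
            db[item] = new
    return db

# Hypothetical log found on stable storage after a crash.
log = [
    ("start", "T0"),
    ("update", "T0", "A", 1000, 950),   # (kind, txn, item, old value, new value)
    ("update", "T0", "B", 2000, 2050),
    ("commit", "T0"),
    ("start", "T1"),
    ("update", "T1", "C", 700, 600),    # T1 never committed
]
# Assumed disk state at the crash: A and C were output, B was not.
disk = {"A": 950, "B": 2000, "C": 600}
recovered = recover(dict(disk), log)
```

In this example the committed transfer T0 is completed (B becomes 2050) and the partial update of T1 is reversed (C returns to 700).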
Two Phase Commit - Save Points
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item.
Data items can be locked in two modes:
1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction.
Lock requests are made to concurrency-control manager. Transaction can proceed only after request is granted.
Lock-compatibility matrix:
- A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
- Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item no other transaction may hold any lock on the item.
- If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)
Locking as above is not sufficient to guarantee serializability: if A and B get updated in between the read of A and B, the displayed sum would be wrong.
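The compatibility rules above can be written down as a small table plus a grant check. The following is a sketch only: the class name `LockManager` and its methods are hypothetical, and instead of blocking, `request` simply returns False when the transaction would have to wait.

```python
# Lock-compatibility matrix: (mode held by another transaction, requested mode).
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

class LockManager:
    def __init__(self):
        self.held = {}  # data item -> {transaction: mode}

    def request(self, txn, item, mode):
        """Grant the lock if compatible with locks held by other transactions."""
        holders = self.held.setdefault(item, {})
        for other, other_mode in holders.items():
            if other != txn and not COMPATIBLE[(other_mode, mode)]:
                return False  # the requester would have to wait
        # Record the lock; never downgrade an X-lock the requester already holds.
        if mode == "X" or txn not in holders:
            holders[txn] = mode
        return True

    def release(self, txn, item):
        self.held.get(item, {}).pop(txn, None)
```

With this sketch, two transactions can both obtain `lock-S(A)`, while a third requesting `lock-X(A)` is refused until both shared locks are released.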
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.
The Two-Phase Locking Protocol
This is a protocol which ensures conflict-serializable schedules.
Phase 1: Growing Phase
- transaction may obtain locks
- transaction may not release locks
Phase 2: Shrinking Phase
- transaction may release locks
- transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).
Two-phase locking does not ensure freedom from deadlocks.
Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.
However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
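Whether a single transaction obeys the two phases can be checked mechanically from its sequence of lock and unlock actions. The sketch below assumes a transaction is given as a list of `("lock", item)` and `("unlock", item)` pairs; the function name `lock_point` is invented for the example.

```python
def lock_point(ops):
    """Return the index of the final lock acquisition if ops is two-phase.

    Returns None if a lock is requested after some unlock (the growing phase
    was re-entered) or if the transaction acquires no locks at all.
    """
    last_lock = None
    shrinking = False
    for i, (action, _item) in enumerate(ops):
        if action == "unlock":
            shrinking = True
        elif action == "lock":
            if shrinking:
                return None  # lock requested in the shrinking phase
            last_lock = i
    return last_lock

# The display(A+B) transaction shown earlier: unlocks A before locking B.
t2 = [("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B")]
# A two-phase variant of the same transaction.
t2_two_phase = [("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")]
```

The first sequence is rejected, which matches the observation that it does not guarantee serializability; for the second, the lock point is the acquisition of the lock on B.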
Unit Number & Name: IV & TRANSACTIONS Period: 32 of 45
SQL Facilities for recovery
Transaction Definition in SQL
Data manipulation language must include a construct for specifying the set of actions that comprise a transaction.
In SQL, a transaction begins implicitly.
A transaction in SQL ends by:
- Commit work: commits current transaction and begins a new one.
- Rollback work: causes current transaction to abort.
In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully.
- Implicit commit can be turned off by a database directive
- E.g. in JDBC, connection.setAutoCommit(false);
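Commit and rollback can be tried out with Python's built-in `sqlite3` module, used here as a stand-in for the JDBC setting mentioned above. This is a sketch under assumptions: the `account` table, its CHECK constraint and the `transfer` helper are invented for the illustration, and SQLite's behaviour is not necessarily identical to that of other database systems.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so the
# transaction boundaries are exactly the BEGIN/COMMIT/ROLLBACK issued below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES ('A', 100)")
conn.execute("INSERT INTO account VALUES ('B', 200)")

def transfer(src, dst, amount):
    """Move amount from src to dst as one transaction; roll back on failure."""
    conn.execute("BEGIN")
    try:
        # Credit first, so that a failing debit leaves a partial update to undo.
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # the credit to dst is undone as well
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))
```

A call such as `transfer("A", "B", 50)` commits both updates, whereas `transfer("A", "B", 500)` violates the balance constraint on the debit and is rolled back, leaving both balances as they were before the call.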
Concurrency Need for Concurrency
- Lock-Based Protocols
- Timestamp-Based Protocols
- Validation-Based Protocols
- Multiple Granularity
- Multiversion Schemes
- Insert and Delete Operations
- Concurrency in Index Structures
Graph-Based Protocols
Graph-based protocols are an alternative to two-phase locking.
Impose a partial ordering $\to$ on the set D = {d1, d2, ..., dh} of all data items.
If $d_i \to d_j$ then any transaction accessing both di and dj must access di before accessing dj.
This implies that the set D may now be viewed as a directed acyclic graph, called a database graph.
The tree protocol is a simple kind of graph protocol.
Tree Protocol
1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently be relocked by Ti.
Graph-Based Protocols (Cont.)
The tree protocol ensures conflict serializability as well as freedom from deadlock.
Unlocking may occur earlier in the tree-locking protocol than in the two-phase locking protocol.
- shorter waiting times, and increase in concurrency
- protocol is deadlock-free, no rollbacks are required
Drawbacks
- Protocol does not guarantee recoverability or cascade freedom
  - Need to introduce commit dependencies to ensure recoverability
- Transactions may have to lock data items that they do not access.
  - increased locking overhead, and additional waiting time
  - potential decrease in concurrency
Schedules not possible under two-phase locking are possible under tree protocol, and vice versa.
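The four tree-protocol rules can be checked against a transaction's lock sequence. The sketch below makes several assumptions: the database tree is given as a child-to-parent dict, all locks are exclusive (rule 1), and the function name `follows_tree_protocol` and the example tree are invented.

```python
def follows_tree_protocol(parent, ops):
    """Check one transaction's ("lock"|"unlock", item) sequence against the rules."""
    held = set()         # items currently locked by the transaction
    ever_locked = set()  # items locked at any point so far
    for action, item in ops:
        if action == "lock":
            if item in ever_locked:
                return False  # rule 4: no relocking
            if ever_locked and parent.get(item) not in held:
                return False  # rule 2: parent must currently be locked
            held.add(item)
            ever_locked.add(item)
        elif action == "unlock":
            if item not in held:
                return False  # cannot unlock what is not held
            held.remove(item)  # rule 3: unlocking is allowed at any time
    return True

# Hypothetical database tree: A is the root, B and C its children, D and E under B.
parent = {"B": "A", "C": "A", "D": "B", "E": "B"}

# Legal under the tree protocol, but not two-phase: D is unlocked before E is locked.
early_unlock = [("lock", "B"), ("lock", "D"), ("unlock", "D"),
                ("lock", "E"), ("unlock", "B"), ("unlock", "E")]
# Illegal: E is requested after its parent B has been released.
parent_released = [("lock", "B"), ("unlock", "B"), ("lock", "E")]
```

The `early_unlock` sequence illustrates the point made above that unlocking may occur earlier than two-phase locking would permit.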
Unit Number & Name: IV & TRANSACTIONS Period: 34 of 45
Locking Protocols Two Phase Locking Intent Locking
Unit Number & Name: IV & TRANSACTIONS Period: 35 of 45
Deadlock- Serializability Recovery Isolation Levels
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies:
- Require that each transaction locks all its data items before it begins execution (predeclaration).
- Impose partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).
Deadlock Handling
Consider the following two transactions:
T1: write(X)            T2: write(Y)
    write(Y)                write(X)
Schedule with deadlock:

T1                      T2
lock-X on X
write(X)
                        lock-X on Y
                        write(Y)
                        wait for lock-X on X
wait for lock-X on Y

More Deadlock Prevention Strategies
Following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme (non-preemptive)
- older transaction may wait for younger one to release data item. Younger transactions never wait for older ones; they are rolled back instead.
- a transaction may die several times before acquiring needed data item
wound-wait scheme (preemptive)
- older transaction wounds (forces rollback of) younger transaction instead of waiting for it. Younger transactions may wait for older ones.
- may be fewer rollbacks than wait-die scheme.
Both in wait-die and in wound-wait schemes, a rolled back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
Timeout-Based Schemes:
- a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
- thus deadlocks are not possible
- simple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval.
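The two timestamp schemes differ only in what happens when a requester meets a lock holder. A minimal sketch, assuming that a smaller timestamp means an older transaction and that timestamps are distinct; the function names and return strings are made up for the illustration.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits, a younger requester dies."""
    if requester_ts < holder_ts:       # requester is older
        return "requester waits"
    return "requester is rolled back"  # requester is younger

def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds the holder, a younger one waits."""
    if requester_ts < holder_ts:       # requester is older
        return "holder is rolled back"
    return "requester waits"           # requester is younger
```

For example, with timestamps 5 (older) and 10 (younger), `wait_die(10, 5)` rolls back the younger requester, while `wound_wait(10, 5)` lets it wait. In both schemes it is always the younger transaction that is rolled back.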
Deadlock Detection
Deadlocks can be described as a wait-for graph, which consists of a pair G = (V, E):
- V is a set of vertices (all the transactions in the system)
- E is a set of edges; each element is an ordered pair $T_i \to T_j$.
If $T_i \to T_j$ is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.
When Ti requests a data item currently being held by Tj, then the edge $T_i \to T_j$ is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.
The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
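Looking for a cycle in the wait-for graph is a standard depth-first search. The sketch below assumes the graph is stored as a dict mapping each transaction to the set of transactions it waits for; the function name `find_cycle` and the transaction names are hypothetical, and the recursive form is only suitable for small graphs.

```python
def find_cycle(wait_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(t):
        color[t] = GREY
        path.append(t)
        for u in wait_for.get(t, ()):
            state = color.get(u, WHITE)
            if state == GREY:                  # back edge: u is on the current path
                return path[path.index(u):]
            if state == WHITE:
                cycle = visit(u)
                if cycle:
                    return cycle
        path.pop()
        color[t] = BLACK
        return None

    for t in list(wait_for):
        if color.get(t, WHITE) == WHITE:
            cycle = visit(t)
            if cycle:
                return cycle
    return None

# T1 waits for T2 and T3, T3 waits for T2, T2 waits for T4: no cycle.
no_cycle = {"T1": {"T2", "T3"}, "T3": {"T2"}, "T2": {"T4"}}
# Adding "T4 waits for T3" closes the cycle T2 -> T4 -> T3 -> T2.
with_cycle = dict(no_cycle, T4={"T3"})
```

The transactions returned for `with_cycle` are the candidates from which a victim would be chosen.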
Deadlock Recovery
When deadlock is detected:
- Some transaction will have to be rolled back (made a victim) to break deadlock. Select as victim the transaction that will incur minimum cost.
- Rollback: determine how far to roll back the transaction.
  - Total rollback: abort the transaction and then restart it.
  - It is more effective to roll back the transaction only as far as necessary to break deadlock.
- Starvation happens if the same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation.
[Figures: wait-for graph without a cycle; wait-for graph with a cycle]
Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 37 of 45
Overview of Physical Storage Media
Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.
- Cache: most costly and fastest form of storage. Usually very small, and managed by the computer system hardware.
- Main Memory (MM): the storage area for data available to be operated on.
  - General-purpose machine instructions operate on main memory.
  - Contents of main memory are usually lost in a power failure or "crash".
  - Usually too small (even with megabytes) and too expensive to store the entire database.
- Flash memory: EEPROM (electrically erasable programmable read-only memory).
  - Data in flash memory survives power failure.
  - Reading data from flash memory takes about 10 nanoseconds (roughly as fast as from main memory), and writing data into flash memory is more complicated: write-once takes about 4-10 microseconds.
  - To overwrite what has been written, one has to first erase the entire bank of the memory. It may support only a limited number of erase cycles.
  - It has found its popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).
- Magnetic-disk storage: primary medium for long-term storage.
  - Typically the entire database is stored on disk.
  - Data must be moved from disk to main memory in order for the data to be operated on.
  - After operations are performed, data must be copied back to disk if any changes were made.
  - Disk storage is called direct access storage as it is possible to read data on the disk in any order (unlike sequential access).
  - Disk storage usually survives power failures and system crashes.
- Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disk (for archival storage of data), and juke box (containing a few drives and numerous disks loaded on demand).
- Tape storage: used primarily for backup and archival data.
  - Cheaper, but much slower access, since tape must be read sequentially from the beginning.
  - Used as protection from disk failures.
The storage device hierarchy is presented in Figure 10.1, where the higher levels are more expensive (cost per bit) and faster (access time), but have smaller capacity.
Figure 10.1: Storage-device hierarchy
Another classification: primary, secondary, and tertiary storage.
1. Primary storage: the fastest storage media, such as cache and main memory.
2. Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.
3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke boxes.
Volatility of storage: volatile storage loses its contents when the power is removed. Without power backup, data in the volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping.
Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 38 of 45
Magnetic Disks RAID
RAID (an acronym for redundant array of independent disks; originally redundant array of inexpensive disks[1][2]) is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways called "RAID levels", depending on what level of redundancy and performance (via parallel communication) is required.

RAID is an example of storage virtualization and was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987.[3] Marketers representing industry RAID manufacturers later attempted to reinvent the term to describe a redundant array of independent disks as a means of dissociating a low-cost expectation from RAID technology.[4]

RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives. The physical drives are said to be in a RAID,[5] which is accessed by the operating system as one single drive. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). Each scheme provides a different balance between two key goals: increase data reliability and increase input/output performance.

A number of standard schemes have evolved which are referred to as levels. There were five RAID levels originally conceived, but many more variations have evolved, notably several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardised by SNIA in the Common RAID Disk Drive Format (DDF) standard.

Following is a brief textual summary of the most commonly used RAID levels.[6]

RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage but no fault tolerance. Hence simple stripe sets are normally referred to as RAID 0. Any drive failure destroys the array, and the likelihood of failure increases with more drives in the array (at a minimum, catastrophic data loss is almost twice as likely compared to single drives without RAID). A single drive failure destroys the entire array because when data is written to a RAID 0 volume, the data is broken into fragments called blocks. The number of blocks is dictated by the stripe size, which is a configuration parameter of the array. The blocks are written to their respective drives simultaneously on the same sector. This allows smaller sections of the entire chunk of data to be read off the drive in parallel, increasing bandwidth. RAID 0 does not implement error checking, so any error is uncorrectable. More drives in the array means higher bandwidth, but greater risk of data loss.
In RAID 1 (mirroring without parity or striping), data is written identically to multiple drives, thereby producing a "mirrored set"; at least 2 drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only 2; of course, it might be possible to use such a limited level 1 RAID itself as a constituent of a level 1 RAID, effectively masking the limitation. The array continues to operate as long as at least one drive is functioning. With appropriate operating system support, there can be increased read performance, and only a minimal write performance reduction; implementing RAID 1 with a separate controller for each drive in order to perform simultaneous reads (and writes) is sometimes called multiplexing (or duplexing when there are only 2 drives).

In RAID 2 (bit-level striping with dedicated Hamming-code parity), all disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive.

In RAID 3 (byte-level striping with dedicated parity), all disk spindle rotation is synchronized, and data is striped so each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive.

RAID 4 (block-level striping with dedicated parity) is identical to RAID 5 (see below), but confines all parity data to a single drive. In this setup, files may be distributed between multiple drives. Each drive operates independently, allowing I/O requests to be performed in parallel. However, the use of a dedicated parity drive could create a performance bottleneck; because the parity data must be written to a single, dedicated parity drive for each block of non-parity data, the overall write performance may depend a great deal on the performance of this parity drive.

RAID 5 (block-level striping with distributed parity) distributes parity along with the data and requires all drives but one to be present to operate; the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity such that the drive failure is masked from the end user. However, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced and the associated data rebuilt. Additionally, there is the potentially disastrous RAID 5 write hole.

RAID 6 (block-level striping with double distributed parity) provides fault tolerance of two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed drive is replaced and its data rebuilt; the larger the drive, the longer the rebuild takes. Double parity gives additional time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
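How a single-parity level can mask one failed drive is easiest to see with XOR parity. The sketch below is only an illustration of that arithmetic: it treats each drive's block as a short byte string, uses a hypothetical helper `xor_blocks`, and ignores how real arrays lay out and rotate parity across drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (made-up contents) plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails: XOR of the surviving data blocks
# and the parity block reproduces the lost block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, `rebuilt` equals the lost block `data[1]`. Losing two blocks of the same stripe cannot be repaired with a single parity block, which is the case that double parity addresses.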
Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 39 of 45
Tertiary storage File Organization Organization of Records in Files
Optical Disks
1. CD-ROM has become a popular medium for distributing software, multimedia data, and other electronically published information.
2. Capacity of CD-ROM: 500 MB. Both disks and drives are cheap to mass produce.
3. CD-ROM: much longer seek time (250 ms), lower rotation speed (400 rpm), leading to high latency and lower data-transfer rate (about 150 KB/sec). Drives that spin at the standard audio CD speed are available.
4. Recently, a new optical format, digital video disk (DVD), has become standard. These disks hold between 4.7 and 17 GB of data.
5. WORM (write-once, read-many) disks are popular for archival storage of data since they have a high capacity (about 500 MB), a longer lifetime than hard disks, and can be removed from the drive, which is good for audit trails (hard to tamper with).
Magnetic Tapes
1. Long history, slow, and limited to sequential access, and thus used for backup, storage for infrequent access, and as an off-line medium for system transfer.
2. Moving to the correct spot may take minutes, but once positioned, tape drives can write data at density and speed approaching those of disk drives.
3. The 8mm tape drive has the highest density, and we can store 5 GB of data on a 350-foot tape.
4. Popularly used for storage of large volumes of data, such as video, image, or remote sensing data.

File organization is the methodology which is applied to structured computer files. Files contain computer records which can be documents or information which is stored in a certain way for later retrieval. File organization refers primarily to the logical arrangement of data (which can itself be organized in a system of records with correlation between the fields/columns) in a file system. It should not be confused with the physical storage of the file in some types of storage media. There are certain basic types of computer file, which can include files stored as blocks of data and streams of data, where the information streams out of the file while it is being read until the end of the file is encountered.
We will look at two components of file organization here:
1. The way the internal file structure is arranged and
2. The external file as it is presented to the O/S or program that calls it. Here we will also examine the concept of file extensions.
We will examine various ways that files can be stored and organized. Files are presented to the application as a stream of bytes followed by an EOF (end of file) condition.
A program that uses a file needs to know the structure of the file and needs to interpret its contents.
Internal File Structure
Methods and Design Paradigm
It is a high-level design decision to specify a system of file organization for a computer software program or a computer system designed for a particular purpose. Performance is high on the list of priorities for this design process, depending on how the file is being used. The design of the file organization usually depends mainly on the system environment: for instance, whether the file is going to be used for transaction-oriented processes like OLTP or for data warehousing, and whether the file is shared among various processes, as in a typical distributed system, or used standalone. It must also be asked whether the file is on a network and used by a number of users, whether it may be accessed internally or remotely, and how often it is accessed.
However, all things considered, the most important considerations might be:
1. Rapid access to a record or a number of records which are related to each other.
2. The adding, modification, or deletion of records.
3. Efficiency of storage and retrieval of records.
4. Redundancy, as a means of safeguarding data integrity.
A file should be organized in such a way that the records are always available for processing with no delay. This should be done in line with the activity and volatility of the information.
Types of File Organization
Organizing a file depends on what kind of file it happens to be. A file in its simplest form can be a text file, in other words a file composed of ASCII (American Standard Code for Information Interchange) text. Files can also be created as binary or executable types (containing elements other than plain text). Files are also tagged with attributes which help determine their use by the host operating system.
Techniques of File Organization
The three techniques of file organization are:
1. Heap (unordered)
2. Sorted
   1. Sequential (SAM)
   2. Line Sequential (LSAM)
   3. Indexed Sequential (ISAM)
3. Hashed or Direct
In addition to the three techniques, there are five methods of organizing files. They are sequential, line-sequential, indexed-sequential, inverted list, and direct or hashed access organization.
Sequential Organization
A sequential file contains records organized in the order they were entered. The order of the records is fixed. The records are stored in physically contiguous blocks, and within each block the records are in sequence.
Records in these files can only be read or written sequentially.
Once stored in the file, a record cannot be made shorter or longer, or deleted. However, the record can be updated if its length does not change. (This is done by replacing the records by creating a new file.) New records will always appear at the end of the file.
If the order of the records in a file is not important, sequential organization will suffice, no matter how many records you may have. Sequential output is also useful for report printing or for the sequential reads which some programs prefer to do.
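As an illustration of the update rule above, here is a minimal Python sketch. It assumes a hypothetical fixed-length record layout (a 4-byte id plus a 12-byte name) and uses in-memory byte streams in place of disk files; neither detail comes from the notes.

```python
import io
import struct

# Hypothetical fixed-length layout: 4-byte integer id + 12-byte name.
RECORD = struct.Struct("<i12s")

def append_record(f, rec_id, name):
    """New records always go at the end of the file."""
    f.seek(0, io.SEEK_END)
    f.write(RECORD.pack(rec_id, name.encode("ascii")))

def read_all(f):
    """Read every record sequentially, in the order it was written."""
    f.seek(0)
    out = []
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        rec_id, raw = RECORD.unpack(chunk)
        out.append((rec_id, raw.rstrip(b"\x00").decode("ascii")))
    return out

def rewrite_with_update(src, dst, rec_id, new_name):
    """Update one record by copying the whole file to a new file.

    The replacement record has the same length as the old one,
    so the sequence and size of the file are unchanged.
    """
    src.seek(0)
    while True:
        chunk = src.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        rid, _ = RECORD.unpack(chunk)
        if rid == rec_id:
            chunk = RECORD.pack(rid, new_name.encode("ascii"))
        dst.write(chunk)

old_file = io.BytesIO()   # stands in for the original file on disk
append_record(old_file, 1, "Hayes")
append_record(old_file, 2, "Jones")

new_file = io.BytesIO()   # stands in for the newly created file
rewrite_with_update(old_file, new_file, 1, "Smith")
records = read_all(new_file)
```

Because every record occupies exactly `RECORD.size` bytes, the rewritten file has the same length as the original and the records keep their entry order.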
Line-Sequential Organization
Line-sequential files are like sequential files, except that the records can contain only characters as data. Line-sequential files are maintained by the native byte stream files of the operating system.
In the COBOL environment, line-sequential files that are created with WRITE statements with the ADVANCING phrase can be directed to a printer as well as to a disk.
Indexed-Sequential Organization
This organization also speeds up key searches. The single-level indexing structure is the simplest one: the index is a file whose records are (key, pointer) pairs, where the pointer is the position in the data file of the record with the given key. A subset of the records, evenly spaced along the data file, is indexed in order to mark intervals of data records.
This is how a key search is performed: the search key is compared with the index keys to find the highest index key that does not exceed the search key. A linear search is then performed from the record that this index key points to, until the search key is matched or until the record pointed to by the next index entry is reached. Despite the double file access (index + data) required by this sort of search, the access time reduction is significant compared with sequential file searches.
Let's examine, for the sake of example, a simple linear search on a 1,000-record sequentially organized file. An average of 500 key comparisons are needed (assuming the search keys are uniformly distributed among the data keys). However, with an evenly spaced index of 100 entries, each index entry covers an interval of 10 records, so the total number of comparisons is reduced to about 50 in the index file plus about 5 in the data file: roughly a nine-to-one reduction in the operations count!
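The key search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data file is modelled as an in-memory list of (key, payload) records sorted by key, and the names `build_sparse_index` and `lookup` are hypothetical.

```python
def build_sparse_index(data, step):
    """Index every step-th record as a (key, position) pair."""
    return [(data[pos][0], pos) for pos in range(0, len(data), step)]

def lookup(data, index, search_key):
    """Single-level indexed search: scan the index, then the data interval."""
    # Step 1: find the highest index key that does not exceed the search key.
    slot = -1
    for i, (index_key, _) in enumerate(index):
        if index_key > search_key:
            break
        slot = i
    if slot < 0:
        return None  # the search key precedes every record in the file

    # Step 2: linear search from that record up to the next indexed record.
    start = index[slot][1]
    end = index[slot + 1][1] if slot + 1 < len(index) else len(data)
    for pos in range(start, end):
        if data[pos][0] == search_key:
            return data[pos]
    return None

# Hypothetical data file: 1,000 records with even keys 0, 2, ..., 1998.
data = [(k, f"record-{k}") for k in range(0, 2000, 2)]
index = build_sparse_index(data, step=10)   # 100 evenly spaced index entries

found = lookup(data, index, 1234)
missing = lookup(data, index, 1235)
```

Here `found` is the record with key 1234 and `missing` is `None`, since no record has an odd key. Only one interval of 10 data records is scanned per search.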
A hierarchical extension of this scheme is possible, since an index is itself a sequential file and can in turn be indexed by a second-level index, and so on. Exploiting this hierarchical decomposition of the search further reduces the access time and pays increasing dividends in processing time. There is, however, a point at which this advantage starts to be offset by the increased cost of storage, which in turn increases the index access time.
Hardware for indexed-sequential organization is usually disk-based, rather than tape. Records are physically ordered by primary key, and the index gives the physical location of each record. Records can be accessed sequentially or directly, via the index. The index is stored in a file and read into memory when the file is opened. Also, indexes must be maintained.
As in sequential organization, the data is stored in physically contiguous blocks. The difference is in the use of indexes. There are three areas in the disk storage:
Primary Area: Contains file records stored by key or ID number.
Overflow Area: Contains records that cannot be placed in the primary area.
Index Area: Contains the keys of records and their locations on the disk.
Inverted List
In file organization, this is a file that is indexed on many of the attributes of the data itself. The inverted list method has a single index for each key type. The records are not necessarily stored in a sequence. They are placed in the data storage area, and the indexes are updated with the record keys and locations.
Here's an example: in a company file, one index could be maintained for all products, and another might be maintained for product types. It is thus faster to search the indexes than to search every record. These types of file are also known as "inverted indexes."
Inverted list files use more media space, however, and storage devices fill up quickly with this type of organization. The benefit is apparent immediately because searching is fast, but updating is much slower.
Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism. Data items in these systems are usually stored compressed, which would normally slow the retrieval process, but the compression algorithm is chosen to support this technique.
When querying a file, there are circumstances in which the query is designed to be modal, which means that rules are set which require that different information be held in the index. Here's an example of this modality: when phrase querying is undertaken, the particular algorithm requires that offsets to word classifications are held in addition to document numbers.
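The product example above can be sketched in Python. This is a minimal illustration, not a description of any particular system: the record fields (`product`, `type`) and the helper name `build_inverted_index` are assumptions, and record "locations" are simply list positions.

```python
from collections import defaultdict

def build_inverted_index(records, attribute):
    """Map each value of the attribute to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[attribute]].append(position)
    return dict(index)

# Hypothetical company file; records are stored in no particular order.
products = [
    {"product": "hammer", "type": "tool"},
    {"product": "apple", "type": "food"},
    {"product": "wrench", "type": "tool"},
]

# One index per key type, as in the inverted list method.
by_product = build_inverted_index(products, "product")
by_type = build_inverted_index(products, "type")

# Answer "which products are tools?" from the index, without scanning the file.
tools = [products[pos]["product"] for pos in by_type["tool"]]
```

Every insertion into `products` would also require an update to both indexes, which is the source of the slower updates noted above.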
Direct or Hashed Access
With direct or hashed access, a portion of disk space is reserved and a hashing algorithm computes the record address. Additional space is therefore required in the store for this kind of file. Records are placed randomly throughout the file and are accessed by addresses that specify their disk location. Also, this type of file organization
requires disk storage rather than tape. It has excellent search and retrieval performance, but care must be taken to maintain the indexes. If the indexes become corrupt, what is left may as well go to the bit-bucket, so it is wise to keep regular backups of this kind of file, just as for all valuable stored data!
External File Structure and File Extensions
Microsoft Windows and MS-DOS File Systems
The external structure of a file depends on whether it is being created on a FAT or NTFS partition. The maximum filename length on an NTFS partition is 255 characters, and 11 characters on FAT (8-character name + "." + 3-character extension). NTFS filenames keep their case, whereas FAT filenames have no concept of case (although case is ignored when searching on an NTFS partition). Also, there is the newer VFAT, which permits filenames of up to 255 characters.
UNIX and Apple Macintosh File Systems
The concept of directories and files is fundamental to the UNIX operating system. On Microsoft Windows-based operating systems, directories are depicted as folders, and moving about is accomplished by clicking on the different icons. In UNIX, the directories are arranged as a hierarchy with the root directory at the top of the tree. The root directory is always depicted as /. Within the / directory, there are subdirectories (e.g., etc and sys). Files can be written to any directory, depending on the permissions. Files can be readable, writable, and/or executable.
Organizing files using Libraries
With the advent of Microsoft Windows 7, file organization and management improved considerably through a tool called Libraries. A Library is a file organization system that brings together related files and folders stored in different locations, on the local computer as well as on networked computers, so that they can be accessed centrally through a single access point. For instance, images stored in different folders on the local computer and/or across a computer network can be collected in an Image Library. The aggregated files can then be manipulated, sorted, or accessed conveniently, as and when required, through a single access point on the computer desktop. This feature is particularly useful for accessing similar or related content, and also for managing projects that use related and common data.
Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 40 of 45
Page: 2 of 2
Indexing and Hashing Ordered Indices
- Indexing mechanisms are used to speed up access to desired data. E.g., author catalog in a library.
- Search key: attribute or set of attributes used to look up records in a file.
- An index file consists of records (called index entries) of the form (search-key, pointer).
- Index files are typically much smaller than the original file.
- Two basic kinds of indices:
  - Ordered indices: search keys are stored in sorted order.
  - Hash indices: search keys are distributed uniformly across buckets using a hash function.
Index Evaluation Metrics
Indexing techniques are evaluated on the basis of:
- Access types supported efficiently, e.g.:
  - records with a specified value in an attribute, or
  - records with an attribute value falling in a specified range of values.
- Access time
- Insertion time
- Deletion time
- Space overhead
Ordered Indices
- In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in a library.
- Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  - Also called clustering index.
  - The search key of a primary index is usually but not necessarily the primary key.
- Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
- Index-sequential file: ordered sequential file with a primary index.
Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 41 of 45
Page: 2 of 2
B+ tree Index Files B tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
- Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
- Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
- Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
- Advantages of B+-trees outweigh disadvantages, and they are used extensively.
A B+-tree is a rooted tree satisfying the following properties:
- All paths from root to leaf are of the same length.
- Each node that is not a root or a leaf has between $\lceil n/2 \rceil$ and $n$ children.
- A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values.
- Special cases: if the root is not a leaf, it has at least 2 children. If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and $n-1$ values.
B+-Tree Node Structure
- Typical node: $P_1\ K_1\ P_2\ \ldots\ P_{n-1}\ K_{n-1}\ P_n$
  - $K_i$ are the search-key values.
  - $P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search-keys in a node are ordered: $K_1 < K_2 < K_3 < \ldots < K_{n-1}$
Example of a B+-tree
Root: Perryridge
Non-leaf nodes: (Mianus), (Redwood)
Leaf nodes: (Brighton, Downtown), (Mianus), (Perryridge), (Redwood, Round Hill)
B+-tree for account file (n = 3)
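A lookup on this example tree can be sketched in Python. The `Node` class, the `find` function, and the `"rec:..."` strings standing in for record pointers are hypothetical names introduced only for illustration. The descent rule used here (follow the child whose position equals the number of node keys less than or equal to the search key) is one common convention and is an assumption, not something stated in the notes.

```python
import math
from bisect import bisect_right

class Node:
    """A B+-tree node: non-leaf nodes hold children, leaves hold record pointers."""
    def __init__(self, keys, children=None, records=None):
        self.keys = keys
        self.children = children   # list of child nodes (non-leaf nodes only)
        self.records = records     # list of record pointers (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def leaf(*keys):
    """Build a leaf whose record pointers are placeholder strings."""
    return Node(list(keys), records=[f"rec:{k}" for k in keys])

def find(root, key):
    """Descend from the root to a leaf; return (record pointer or None, path length)."""
    node, depth = root, 0
    while not node.is_leaf:
        # Child i covers keys k with K_i <= k < K_(i+1).
        node = node.children[bisect_right(node.keys, key)]
        depth += 1
    if key in node.keys:
        return node.records[node.keys.index(key)], depth
    return None, depth

def occupancy_bounds(n):
    """Minimum and maximum occupancy for non-root nodes of a B+-tree of order n."""
    return {
        "children": (math.ceil(n / 2), n),
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
    }

# The example tree for the account file (n = 3).
root = Node(
    ["Perryridge"],
    children=[
        Node(["Mianus"], children=[leaf("Brighton", "Downtown"), leaf("Mianus")]),
        Node(["Redwood"], children=[leaf("Perryridge"), leaf("Redwood", "Round Hill")]),
    ],
)

hit = find(root, "Mianus")
miss = find(root, "Clearview")
bounds = occupancy_bounds(3)
```

Both the successful and the unsuccessful search follow a path of the same length, which reflects the property that all root-to-leaf paths are equally long.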
Static Hashing Dynamic Hashing
Static Hashing
- A bucket is a unit of storage containing one or more records (a bucket is typically a disk block). In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
- Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
- The hash function is used to locate records for access, insertion, and deletion.
- Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
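The points above can be sketched in Python as a toy hash file. Everything specific here is an assumption made for illustration: the bucket count, the character-sum hash function `h`, and the (search-key, value) record shape. A real system would use disk blocks as buckets and a better hash function.

```python
NUM_BUCKETS = 4   # hypothetical; the number of buckets is fixed in static hashing

def h(search_key):
    """Toy hash function: sum of character codes modulo the bucket count."""
    return sum(ord(c) for c in search_key) % NUM_BUCKETS

# Each bucket is a list of (search_key, value) records.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(record):
    """Place the record in the bucket chosen by the hash of its search key."""
    buckets[h(record[0])].append(record)

def lookup(search_key):
    """Search the whole bucket sequentially, since keys can collide."""
    return [r for r in buckets[h(search_key)] if r[0] == search_key]

def delete(search_key):
    """Remove matching records from the bucket chosen by the hash function."""
    bucket = buckets[h(search_key)]
    bucket[:] = [r for r in bucket if r[0] != search_key]

# "ab" and "ba" have the same character sum, so they share a bucket.
insert(("ab", 100))
insert(("ba", 200))
insert(("xyz", 300))

same_bucket = h("ab") == h("ba")
before = lookup("ab")
delete("ab")
after = lookup("ab")
survivor = lookup("ba")
```

The same function `h` serves access, insertion, and deletion, and the collision between "ab" and "ba" shows why a lookup must scan the bucket rather than assume it holds a single key.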
Dynamic Hashing
- Good for a database that grows and shrinks in size.
- Allows the hash function to be modified dynamically.
- Extendable hashing: one form of dynamic hashing.
  - The hash function generates values over a large range, typically b-bit integers, with b = 32.
  - At any time, use only a prefix of the hash function to index into a table of bucket addresses. Let the length of the prefix be i bits, $0 \le i \le 32$.
  - Initially i = 0.
  - Value of i grows and shrinks as the