141405 - database management systems

7/27/2019 141405 - DataBase Management Systems

http://csetube.co.nr/

GKMCET

Lecture Plan

Code & Subject Name: 141405 & Database Management Systems


Unit Number & Name: I & Introduction Period: 1 of 45

A DBMS contains information about a particular enterprise:
- A collection of interrelated data
- A set of programs to access the data
- An environment that is both convenient and efficient to use

Database applications:
- Banking: all transactions
- Airlines: reservations, schedules
- Universities: registration, grades
- Sales: customers, products, purchases
- Online retailers: order tracking, customized recommendations
- Manufacturing: production, inventory, orders, supply chain
- Human resources: employee records, salaries, tax deductions

Databases touch all aspects of our lives.

In the early days, database applications were built directly on top of file systems.

Drawbacks of using file systems to store data:
- Data redundancy and inconsistency: multiple file formats, duplication of information in different files
- Difficulty in accessing data: need to write a new program to carry out each new task
- Data isolation: multiple files and formats
- Integrity problems:
  - Integrity constraints (e.g., account balance > 0) become buried in program code rather than being stated explicitly
  - Hard to add new constraints or change existing ones
- Atomicity of updates:
  - Failures may leave the database in an inconsistent state with partial updates carried out
  - Example: a transfer of funds from one account to another should either complete or not happen at all
- Concurrent access by multiple users:
  - Concurrent access is needed for performance
  - Uncontrolled concurrent accesses can lead to inconsistencies
  - Example: two people reading a balance and updating it at the same time
- Security problems: hard to provide user access to some, but not all, data

Database systems offer solutions to all the above problems.
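The integrity problem above (a rule such as account balance > 0 buried in program code) can be contrasted with a constraint stated once in the schema. The following is a minimal sketch using Python's built-in sqlite3 module; the table layout, the helper names open_bank and try_insert, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

def open_bank() -> sqlite3.Connection:
    """Create an in-memory database whose schema states the balance rule explicitly."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " account_number TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance > 0))"
    )
    return conn

def try_insert(conn: sqlite3.Connection, account_number: str, balance: int) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO account VALUES (?, ?)", (account_number, balance))
        return True
    except sqlite3.IntegrityError:
        return False

conn = open_bank()
accepted = try_insert(conn, "A-101", 500)   # satisfies balance > 0
rejected = try_insert(conn, "A-102", -20)   # violates balance > 0
row_count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
```

Because the rule lives in the schema, every program that writes to the table is subject to it, and changing the rule means changing one declaration rather than every application.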


Levels of Abstraction

- Physical level: describes how a record (e.g., customer) is stored.
- Logical level: describes data stored in the database, and the relationships among the data.

      type customer = record
          customer_id : string;
          customer_name : string;
          customer_street : string;
          customer_city : string;
      end;

- View level: application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes.

View of Data




Instances and Schemas

Similar to types and variables in programming languages.
- Schema: the logical structure of the database
  - Example: the database consists of information about a set of customers and accounts and the relationship between them
  - Analogous to type information of a variable in a program
  - Physical schema: database design at the physical level
  - Logical schema: database design at the logical level
- Instance: the actual content of the database at a particular point in time
  - Analogous to the value of a variable
- Physical data independence: the ability to modify the physical schema without changing the logical schema
  - Applications depend on the logical schema
  - In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.

Data Models

A collection of tools for describing:
- Data
- Data relationships
- Data semantics
- Data constraints

Data models include:
- Relational model
- Entity-Relationship data model (mainly for database design)
- Object-based data models (object-oriented and object-relational)
- Semistructured data model (XML)
- Other older models:
  - Network model
  - Hierarchical model




Data Manipulation Language (DML)

Language for accessing and manipulating the data organized by the appropriate data model.
- DML is also known as a query language
- Two classes of languages:
  - Procedural: user specifies what data is required and how to get those data
  - Declarative (nonprocedural): user specifies what data is required without specifying how to get those data
- SQL is the most widely used query language

Data Definition Language (DDL)

Specification notation for defining the database schema.

Example:

    create table account (
        account_number char(10),
        branch_name    char(10),
        balance        integer)

- The DDL compiler generates a set of tables stored in a data dictionary
- The data dictionary contains metadata (i.e., data about data):
  - Database schema
  - Data storage and definition language: specifies the storage structure and access methods used
  - Integrity constraints:
    - Domain constraints
    - Referential integrity (e.g., branch_name must correspond to a valid branch in the branch table)
  - Authorization
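To make the idea of a data dictionary concrete, the sketch below runs the create table statement above in SQLite (through Python's sqlite3 module) and then reads back the metadata that the system recorded. SQLite is only one possible system; its sqlite_master table and PRAGMA table_info are SQLite-specific stand-ins for the data dictionary described in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " account_number CHAR(10),"
    " branch_name CHAR(10),"
    " balance INTEGER)"
)

# sqlite_master holds one row of metadata per schema object (data about data).
catalog = conn.execute("SELECT type, name FROM sqlite_master").fetchall()

# PRAGMA table_info yields one row per column:
# (cid, name, declared type, notnull, default value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(account)")]
```

Here catalog lists the schema objects and columns lists each attribute with its declared domain, which is the kind of information a DDL compiler stores.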




Overall System Structure

The architecture of a database system is greatly influenced by the underlying computer system on which the database is running:
- Centralized
- Client-server
- Parallel (multiple processors and disks)
- Distributed

Lecture Plan R/TP/02




Database Users

Users are differentiated by the way they expect to interact with the system:
- Application programmers: interact with the system through DML calls
- Sophisticated users: form requests in a database query language
- Specialized users: write specialized database applications that do not fit into the traditional data processing framework
- Naive users: invoke one of the permanent application programs that have been written previously
  - Examples: people accessing a database over the web, bank tellers, clerical staff

Database Administrator

Coordinates all the activities of the database system; has a good understanding of the enterprise's information resources and needs.

The database administrator's duties include:
- Storage structure and access method definition
- Schema and physical organization modification
- Granting users authority to access the database
- Backing up data
- Monitoring performance and responding to changes
- Database tuning




The Entity-Relationship Model

Models an enterprise as a collection of entities and relationships:
- Entity: a thing or object in the enterprise that is distinguishable from other objects
  - Described by a set of attributes
- Relationship: an association among several entities

Represented diagrammatically by an entity-relationship diagram.




Entity Sets customer and loan

customer: customer_id, customer_name, customer_street, customer_city
loan: loan_number, amount




Relational Model

Structure of Relational Databases

Fundamental Relational-Algebra-Operations

Additional Relational-Algebra-Operations

Extended Relational-Algebra-Operations

Null Values

Modification of the Database

Example of a Relation



Unit Number & Name: II & RELATIONAL MODEL Period: 10 of 45

The Relational Model - The Catalog - Types - Keys

- Each attribute of a relation has a name
- The set of allowed values for each attribute is called the domain of the attribute
- Attribute values are (normally) required to be atomic; that is, indivisible
  - E.g., the value of an attribute can be an account number, but cannot be a set of account numbers
- A domain is said to be atomic if all its members are atomic
- The special value null is a member of every domain
- The null value causes complications in the definition of many operations
  - We shall ignore the effect of null values in our main presentation and consider their effect later

Keys

Let K ⊆ R.
- K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)
  - By "possible r" we mean a relation r that could exist in the enterprise we are modeling.
  - Example: {customer_name, customer_street} and {customer_name} are both superkeys of Customer, if no two customers can possibly have the same name
  - In real life, an attribute such as customer_id would be used instead of customer_name to uniquely identify customers, but we omit it to keep our examples small, and instead assume customer names are unique.
- K is a candidate key if K is minimal
  - Example: {customer_name} is a candidate key for Customer, since it is a superkey and no proper subset of it is a superkey.
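The definitions of superkey and candidate key can be tried out on a concrete instance. The sketch below is illustrative only: a superkey is a property of every possible relation r(R), so checking a single instance can refute a key but never prove one. The function names and the three sample customers are assumptions.

```python
from itertools import combinations

def is_superkey_on(instance, attrs):
    """True if no two rows of this instance agree on all attributes in attrs."""
    seen = set()
    for row in instance:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

def candidate_keys_on(instance, schema):
    """Minimal attribute sets that are unique on this instance, smallest first."""
    found = []
    for size in range(1, len(schema) + 1):
        for attrs in combinations(schema, size):
            if any(set(k) <= set(attrs) for k in found):
                continue  # a smaller key is contained in attrs, so attrs is not minimal
            if is_superkey_on(instance, attrs):
                found.append(attrs)
    return found

schema = ("customer_name", "customer_street", "customer_city")
customer = [
    {"customer_name": "Jones", "customer_street": "Main", "customer_city": "Harrison"},
    {"customer_name": "Smith", "customer_street": "North", "customer_city": "Rye"},
    {"customer_name": "Hayes", "customer_street": "Main", "customer_city": "Harrison"},
]

name_street_is_superkey = is_superkey_on(customer, ("customer_name", "customer_street"))
street_is_superkey = is_superkey_on(customer, ("customer_street",))
keys = candidate_keys_on(customer, schema)
```

On this sample, {customer_name, customer_street} is unique but not minimal, which matches the distinction between superkey and candidate key drawn above.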




- Primary key: a candidate key chosen as the principal means of identifying tuples within a relation
  - Should choose an attribute whose value never, or very rarely, changes.
  - E.g., email address is unique, but may change

Relational Algebra

- Procedural language
- Six basic operators:
  - select: σ
  - project: ∏
  - union: ∪
  - set difference: −
  - Cartesian product: ×
  - rename: ρ
- The operators take one or two relations as inputs and produce a new relation as a result.
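As a rough illustration of the basic operators, the sketch below models a relation as a Python set of tuples, so duplicates disappear automatically, as they do in relational algebra. The helper names, the positional schema handling, and the sample values ("alpha", "beta") are assumptions for this example; rename is omitted because it only changes attribute names.

```python
def select(rows, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return {t for t in rows if predicate(t)}

def project(rows, schema, attrs):
    """pi: keep only the listed attributes; the set removes duplicate tuples."""
    positions = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in positions) for t in rows}

def product(r, s):
    """Cartesian product: every tuple of r concatenated with every tuple of s."""
    return {t + u for t in r for u in s}

schema = ("A", "B", "C", "D")
r = {
    ("alpha", "alpha", 1, 7),
    ("alpha", "beta", 5, 7),
    ("beta", "beta", 12, 3),
    ("beta", "beta", 23, 10),
}

selected = select(r, lambda t: t[0] == t[1] and t[3] > 5)   # A = B and D > 5
projected = project(r, schema, ("A", "D"))

# Union and set difference are the built-in set operations on compatible relations.
p = {("alpha", 1), ("alpha", 2), ("beta", 1)}
q = {("alpha", 2), ("beta", 3)}
p_union_q = p | q
p_minus_q = p - q
p_times_q = product(p, q)
```

Each function takes relations and returns a new relation, so the operators compose freely, for example project(select(r, ...), schema, ("A",)).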

Select Operation Example

Relation r:

    A  B  C   D
    α  α  1   7
    α  β  5   7
    β  β  12  3
    β  β  23  10

σ_{A=B ∧ D>5}(r):

    A  B  C   D
    α  α  1   7
    β  β  23  10

Project Operation Example

Relation r:

    A  B   C
    α  10  1
    α  20  1
    β  30  1
    β  40  2

∏_{A,C}(r):

    A  C        A  C
    α  1        α  1
    α  1    =   β  1
    β  1        β  2
    β  2




Union Operation Example

Relations r, s:

    r:  A  B        s:  A  B
        α  1            α  2
        α  2            β  3
        β  1

r ∪ s:

    A  B
    α  1
    α  2
    β  1
    β  3




Set Difference Operation Example

Relations r, s:

    r:  A  B        s:  A  B
        α  1            α  2
        α  2            β  3
        β  1

r − s:

    A  B
    α  1
    β  1

Cartesian-Product Operation Example

Relations r, s:

    r:  A  B        s:  C  D   E
        α  1            α  10  a
        β  2            β  10  a
                        β  20  b
                        γ  10  b

r × s:

    A  B  C  D   E
    α  1  α  10  a
    α  1  β  10  a
    α  1  β  20  b
    α  1  γ  10  b
    β  2  α  10  a
    β  2  β  10  a
    β  2  β  20  b
    β  2  γ  10  b







Tuple Relational Calculus - Domain Relational Calculus

Tuple Relational Calculus

A nonprocedural query language, where each query is of the form

    {t | P(t)}

- It is the set of all tuples t such that predicate P is true for t
- t is a tuple variable; t[A] denotes the value of tuple t on attribute A
- t ∈ r denotes that tuple t is in relation r
- P is a formula similar to that of the predicate calculus

Predicate Calculus Formula

1. Set of attributes and constants
2. Set of comparison operators (e.g., <, ≤, =, ≠, >, ≥)
3. Set of connectives: and (∧), or (∨), not (¬)
4. Implication (⇒): x ⇒ y, if x is true, then y is true
       x ⇒ y ≡ ¬x ∨ y
5. Set of quantifiers:
   - ∃ t ∈ r (Q(t)): there exists a tuple t in relation r such that predicate Q(t) is true
   - ∀ t ∈ r (Q(t)): Q is true for all tuples t in relation r

Banking Example

    branch (branch_name, branch_city, assets)
    customer (customer_name, customer_street, customer_city)
    account (account_number, branch_name, balance)
    loan (loan_number, branch_name, amount)
    depositor (customer_name, account_number)
    borrower (customer_name, loan_number)

Example Queries

- Find the loan_number, branch_name, and amount for loans of over $1200:
      {t | t ∈ loan ∧ t[amount] > 1200}
- Find the loan number for each loan of an amount greater than $1200:
      {t | ∃ s ∈ loan (t[loan_number] = s[loan_number] ∧ s[amount] > 1200)}
  Notice that a relation on schema [loan_number] is implicitly defined by the query.
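The two tuple-calculus queries above have the same shape as set comprehensions, which may help in reading the notation. This is only an analogy, sketched in Python; the Loan record type and the four sample loans are assumed data, not part of the banking example in these notes.

```python
from collections import namedtuple

Loan = namedtuple("Loan", ["loan_number", "branch_name", "amount"])

loan = {
    Loan("L-11", "Round Hill", 900),
    Loan("L-14", "Downtown", 1500),
    Loan("L-15", "Perryridge", 1500),
    Loan("L-16", "Perryridge", 1300),
}

# {t | t in loan and t[amount] > 1200}: whole loan tuples
over_1200 = {t for t in loan if t.amount > 1200}

# {t | exists s in loan (t[loan_number] = s[loan_number] and s[amount] > 1200)}:
# the result tuples have the single attribute loan_number
loan_numbers = {(s.loan_number,) for s in loan if s.amount > 1200}
```

In both cases the comprehension states which tuples belong to the result without saying how to find them, which is what nonprocedural means here.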




Domain Relational Calculus

A nonprocedural query language equivalent in power to the tuple relational calculus.
Each query is an expression of the form:

    {<x1, x2, …, xn> | P(x1, x2, …, xn)}

- x1, x2, …, xn represent domain variables
- P represents a formula similar to that of the predicate calculus

Example Queries

- Find the loan_number, branch_name, and amount for loans of over $1200:
      {<l, b, a> | <l, b, a> ∈ loan ∧ a > 1200}
- Find the names of all customers who have a loan of over $1200:
      {<c> | ∃ l, b, a (<c, l> ∈ borrower ∧ <l, b, a> ∈ loan ∧ a > 1200)}
- Find the names of all customers who have a loan from the Perryridge branch and the loan amount:
      {<c, a> | ∃ l (<c, l> ∈ borrower ∧ ∃ b (<l, b, a> ∈ loan ∧ b = "Perryridge"))}
  or, equivalently,
      {<c, a> | ∃ l (<c, l> ∈ borrower ∧ <l, "Perryridge", a> ∈ loan)}




Example Queries

- Find the names of all customers having a loan, an account, or both at the Perryridge branch:
      {<c> | ∃ l (<c, l> ∈ borrower
                 ∧ ∃ b, a (<l, b, a> ∈ loan ∧ b = "Perryridge"))
           ∨ ∃ a (<c, a> ∈ depositor
                 ∧ ∃ b, n (<a, b, n> ∈ account ∧ b = "Perryridge"))}
- Find the names of all customers who have an account at all branches located in Brooklyn:
      {<c> | ∃ s, n (<c, s, n> ∈ customer) ∧
             ∀ x, y, z (<x, y, z> ∈ branch ∧ y = "Brooklyn" ⇒
                 ∃ a, b (<a, x, b> ∈ account ∧ <c, a> ∈ depositor))}
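The universally quantified query (customers with an account at all Brooklyn branches) is the hardest of these to read, so here is a small sketch of how its quantifiers map onto Python's any and all. The relation contents and the helper name has_account_at are assumptions chosen for illustration.

```python
branch = {
    ("Brighton", "Brooklyn", 7100000),
    ("Downtown", "Brooklyn", 9000000),
    ("Perryridge", "Horseneck", 1700000),
}
account = {
    ("A-101", "Downtown", 500),
    ("A-201", "Brighton", 900),
    ("A-215", "Brighton", 700),
    ("A-102", "Perryridge", 400),
}
depositor = {("Johnson", "A-101"), ("Johnson", "A-201"), ("Smith", "A-215"), ("Hayes", "A-102")}
customer = {("Johnson", "Alma", "Palo Alto"), ("Smith", "North", "Rye"), ("Hayes", "Main", "Harrison")}

def has_account_at(c, branch_name):
    """Exists a, b such that (a, branch_name, b) is in account and (c, a) is in depositor."""
    return any((c, a) in depositor for (a, x, b) in account if x == branch_name)

# For all (x, y, z) in branch: y = "Brooklyn" implies c has an account at x.
# The implication becomes a filter: only Brooklyn branches are checked.
all_brooklyn = {
    c
    for (c, s, n) in customer
    if all(has_account_at(c, x) for (x, y, z) in branch if y == "Brooklyn")
}
```

With this data only Johnson holds accounts at both Brooklyn branches; Smith has one of the two and Hayes has none.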




SQL

- Data Definition
- Basic Query Structure
- Set Operations
- Aggregate Functions
- Null Values
- Nested Subqueries
- Complex Queries
- Views
- Modification of the Database
- Joined Relations

Domain Types in SQL

- char(n): fixed-length character string, with user-specified length n.
- varchar(n): variable-length character string, with user-specified maximum length n.
- int: integer (a finite subset of the integers that is machine-dependent).
- smallint: small integer (a machine-dependent subset of the integer domain type).
- numeric(p,d): fixed-point number, with user-specified precision of p digits, with d digits to the right of the decimal point.
- real, double precision: floating-point and double-precision floating-point numbers, with machine-dependent precision.
- float(n): floating-point number, with user-specified precision of at least n digits.

Integrity Constraints

Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency. Examples:
- A checking account must have a balance greater than $10,000.00
- A salary of a bank employee must be at least $4.00 an hour
- A customer must have a (non-null) phone number




Advanced SQL

- SQL Data Types and Schemas
- Integrity Constraints
- Authorization
- Embedded SQL
- Dynamic SQL
- Functions and Procedural Constructs
- Recursive Queries
- Advanced SQL Features

Built-in Data Types in SQL

- date: dates, containing a (4 digit) year, month and date
  - Example: date '2005-7-27'
- time: time of day, in hours, minutes and seconds
  - Example: time '09:00:30', time '09:00:30.75'
- timestamp: date plus time of day
  - Example: timestamp '2005-7-27 09:00:30.75'
- interval: period of time
  - Example: interval '1' day
  - Subtracting a date/time/timestamp value from another gives an interval value
  - Interval values can be added to date/time/timestamp values
- Can extract values of individual fields from date/time/timestamp
  - Example: extract (year from r.starttime)
- Can cast string types to date/time/timestamp
  - Example: cast <string-valued-expression> as date
  - Example: cast <string-valued-expression> as time

Referential Integrity in SQL: Example

    create table customer
        (customer_name   char(20),
         customer_street char(30),
         customer_city   char(30),
         primary key (customer_name))

    create table branch
        (branch_name char(15),
         branch_city char(30),
         assets      numeric(12,2),
         primary key (branch_name))

    create table account
        (account_number char(10),
         branch_name    char(15),
         balance        integer,
         primary key (account_number),
         foreign key (branch_name) references branch)

    create table depositor
        (customer_name  char(20),
         account_number char(10),
         primary key (customer_name, account_number),
         foreign key (account_number) references account,
         foreign key (customer_name) references customer)
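To see what the foreign key declarations buy, the sketch below creates the branch and account tables in SQLite via Python's sqlite3 module and attempts to insert an account for a branch that does not exist. The sample rows are assumptions, and the PRAGMA line is a SQLite-specific detail: SQLite enforces foreign keys only after it is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE branch ("
    " branch_name CHAR(15) PRIMARY KEY,"
    " branch_city CHAR(30),"
    " assets NUMERIC(12,2))"
)
conn.execute(
    "CREATE TABLE account ("
    " account_number CHAR(10) PRIMARY KEY,"
    " branch_name CHAR(15) REFERENCES branch,"
    " balance INTEGER)"
)

conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")
conn.execute("INSERT INTO account VALUES ('A-102', 'Perryridge', 400)")  # valid reference

try:
    # 'Nowhere' is not a branch_name in branch, so this row would dangle.
    conn.execute("INSERT INTO account VALUES ('A-999', 'Nowhere', 100)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

account_rows = conn.execute("SELECT account_number FROM account").fetchall()
```

The second insert is refused by the database itself, so no application code needs to remember to validate branch names.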

Privileges in SQL

- select: allows read access to a relation, or the ability to query using the view
  - Example: grant users U1, U2, and U3 select authorization on the branch relation:
        grant select on branch to U1, U2, U3
- insert: the ability to insert tuples
- update: the ability to update using the SQL update statement
- delete: the ability to delete tuples
- all privileges: used as a short form for all the allowable privileges

Revoking Authorization in SQL

The revoke statement is used to revoke authorization:

    revoke <privilege list>
    on <relation name or view name> from <user list>

Example:

    revoke select on branch from U1, U2, U3

- All privileges that depend on the privilege being revoked are also revoked.
- <privilege list> may be all to revoke all privileges the revokee may hold.
- If the same privilege was granted twice to the same user by different grantors, the user may retain the privilege after the revocation.




Embedded SQL

- The SQL standard defines embeddings of SQL in a variety of programming languages such as C, Java, and Cobol.
- A language to which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.
- The basic form of these languages follows that of the System R embedding of SQL into PL/I.
- The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:

      EXEC SQL <embedded SQL statement> END_EXEC

  Note: this varies by language (for example, the Java embedding uses # SQL { …. }; )

Example: from within a host language, find the names and cities of customers with more than the variable amount dollars in some account.

- Specify the query in SQL and declare a cursor for it:

      EXEC SQL
          declare c cursor for
          select depositor.customer_name, customer_city
          from depositor, customer, account
          where depositor.customer_name = customer.customer_name
            and depositor.account_number = account.account_number
            and account.balance > :amount
      END_EXEC

- The open statement causes the query to be evaluated:

      EXEC SQL open c END_EXEC

- The fetch statement causes the values of one tuple in the query result to be placed on host language variables:

      EXEC SQL fetch c into :cn, :cc END_EXEC

  Repeated calls to fetch get successive tuples in the query result.
- A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to indicate no more data is available.
- The close statement causes the database system to delete the temporary relation that holds the result of the query:

      EXEC SQL close c END_EXEC

Note: above details vary with language. For example, the Java embedding defines Java iterators to step through result tuples.
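The declare / open / fetch / close cycle has a close counterpart in most database APIs. The sketch below uses Python's sqlite3 module rather than a preprocessor-based embedding, so it is an analogy and not embedded SQL proper; the table contents are assumed sample data, and fetchone returning None plays the role that SQLSTATE '02000' plays above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_name TEXT PRIMARY KEY, customer_street TEXT, customer_city TEXT);
CREATE TABLE account (account_number TEXT PRIMARY KEY, branch_name TEXT, balance INTEGER);
CREATE TABLE depositor (customer_name TEXT, account_number TEXT);
INSERT INTO customer VALUES ('Hayes', 'Main', 'Harrison'), ('Smith', 'North', 'Rye');
INSERT INTO account VALUES ('A-102', 'Perryridge', 400), ('A-215', 'Mianus', 700);
INSERT INTO depositor VALUES ('Hayes', 'A-102'), ('Smith', 'A-215');
""")

amount = 500  # host-language variable, bound to :amount in the query

c = conn.cursor()          # declare the cursor
c.execute(                 # open: the query is evaluated
    "SELECT depositor.customer_name, customer_city "
    "FROM depositor, customer, account "
    "WHERE depositor.customer_name = customer.customer_name "
    "  AND depositor.account_number = account.account_number "
    "  AND account.balance > :amount",
    {"amount": amount},
)

results = []
while True:
    row = c.fetchone()     # fetch: one tuple into host variables
    if row is None:        # no more data is available
        break
    cn, cc = row
    results.append((cn, cc))

c.close()                  # close: release the query result
```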




Dynamic SQL

Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program:

    char * sqlprog = "update account "
                     "set balance = balance * 1.05 "
                     "where account_number = ?";
    EXEC SQL prepare dynprog from :sqlprog;
    char account[10] = "A-101";
    EXEC SQL execute dynprog using :account;

The dynamic SQL program contains a ?, which is a place holder for a value that is provided when the SQL program is executed.
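The same placeholder idea can be tried directly with Python's sqlite3 module, which also uses ? for values supplied at execution time. This is a sketch by analogy with the C example above; the two sample accounts are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES ('A-101', 100.0)")
conn.execute("INSERT INTO account VALUES ('A-102', 200.0)")

# The statement text is an ordinary string built at run time; ? marks the value
# that is supplied only when the statement is executed.
sqlprog = "update account set balance = balance * 1.05 where account_number = ?"
account = "A-101"
conn.execute(sqlprog, (account,))

balances = dict(conn.execute("SELECT account_number, balance FROM account"))
```

Passing the value separately from the statement text, rather than pasting it into the string, also keeps user-supplied values from being interpreted as SQL.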



Unit Number & Name: III & DATABASE DESIGN Period: 19 of 45

Functional Dependencies

- Constraints on the set of legal relations.
- Require that the value for a certain set of attributes determines uniquely the value for another set of attributes.
- A functional dependency is a generalization of the notion of a key.

Let R be a relation schema, with α ⊆ R and β ⊆ R.
The functional dependency

    α → β

holds on R if and only if for any legal relation r(R), whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β. That is,

    t1[α] = t2[α] ⇒ t1[β] = t2[β]

Example: Consider r(A, B) with the following instance of r. [table of the instance of r]
On this instance, A → B does NOT hold, but B → A does hold.

- K is a superkey for relation schema R if and only if K → R
- K is a candidate key for R if and only if
  - K → R, and
  - for no α ⊂ K, α → R

Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:

    bor_loan = (customer_id, loan_number, amount)

We expect this functional dependency to hold:

    loan_number → amount

but would not expect the following to hold:

    amount → customer_name

A functional dependency is trivial if it is satisfied by all instances of a relation.
- Example:
      customer_name, loan_number → customer_name
      customer_name → customer_name
- In general, α → β is trivial if β ⊆ α
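The definition t1[α] = t2[α] ⇒ t1[β] = t2[β] translates almost directly into a check over one instance. The sketch below is a minimal illustration: fd_holds is a hypothetical helper name, and the three-row instance of r(A, B) is an assumed example chosen so that A → B fails while B → A holds.

```python
def fd_holds(instance, lhs, rhs):
    """True if every pair of rows that agrees on lhs also agrees on rhs."""
    seen = {}
    for row in instance:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value;
        # any later row with the same lhs must carry the same rhs value.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Assumed instance of r(A, B)
r = [
    {"A": 1, "B": 4},
    {"A": 1, "B": 5},
    {"A": 3, "B": 7},
]

a_determines_b = fd_holds(r, ["A"], ["B"])   # two rows share A = 1 but differ on B
b_determines_a = fd_holds(r, ["B"], ["A"])   # every B value occurs once
trivial_fd = fd_holds(r, ["A", "B"], ["A"])  # rhs is a subset of lhs
```

As with keys, a single instance can show that a dependency does not hold, but a dependency is a constraint on all legal relations and cannot be established from one instance alone.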




Functional Dependencies - First, Second, Third Normal Forms

First Normal Form

- A domain is atomic if its elements are considered to be indivisible units
  - Examples of non-atomic domains:
    - Set of names, composite attributes
    - Identification numbers like CS101 that can be broken up into parts
- A relational schema R is in first normal form if the domains of all attributes of R are atomic
- Non-atomic values complicate storage and encourage redundant (repeated) storage of data
  - Example: set of accounts stored with each customer, and set of owners stored with each account
  - We assume all relations are in first normal form
- Atomicity is actually a property of how the elements of the domain are used.
  - Example: strings would normally be considered indivisible
  - Suppose that students are given roll numbers which are strings of the form CS0012 or EE1127
  - If the first two characters are extracted to find the department, the domain of roll numbers is not atomic.
  - Doing so is a bad idea: it leads to encoding of information in the application program rather than in the database.

Third Normal Form

A relation schema R is in third normal form (3NF) if for all

    α → β in F+

at least one of the following holds:
- α → β is trivial (i.e., β ⊆ α)
- α is a superkey for R
- Each attribute A in β − α is contained in a candidate key for R.
  (NOTE: each attribute may be in a different candidate key)

If a relation is in BCNF it is in 3NF (since in BCNF one of the first two conditions above must hold).
The third condition is a minimal relaxation of BCNF to ensure dependency preservation.




Third Normal Form: Motivation

- There are some situations where
  - BCNF is not dependency preserving, and
  - efficient checking for FD violation on updates is important
- Solution: define a weaker normal form, called Third Normal Form (3NF)
  - Allows some redundancy
  - But functional dependencies can be checked on individual relations without computing a join.
  - There is always a lossless-join, dependency-preserving decomposition into 3NF.

EXAMPLE

- Relation R:
      R = (J, K, L)
      F = {JK → L, L → K}
- Two candidate keys: JK and JL
- R is in 3NF:
  - JK → L: JK is a superkey
  - L → K: K is contained in a candidate key

Testing for 3NF

- Optimization: need to check only FDs in F, need not check all FDs in F+.
- Use attribute closure to check for each dependency α → β, if α is a superkey.
- If α is not a superkey, we have to verify if each attribute in β is contained in a candidate key of R
  - This test is rather more expensive, since it involves finding candidate keys
  - Testing for 3NF has been shown to be NP-hard
  - Interestingly, decomposition into third normal form (described shortly) can be done in polynomial time
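The 3NF test outlined above (attribute closure for the superkey check, then candidate keys for the remaining attributes) can be sketched for small schemas as follows. The function names are assumptions, and the candidate-key search simply enumerates attribute subsets, which is exponential and only sensible for toy examples; that cost is consistent with the NP-hardness remark above.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(k <= attrs for k in keys):
                continue  # contains a smaller key, so not minimal
            if closure(attrs, fds) == set(schema):
                keys.append(attrs)
    return keys

def is_3nf(schema, fds):
    """Check the three 3NF conditions for every dependency in fds (F, not F+)."""
    prime = set().union(*candidate_keys(schema, fds))  # attributes in some candidate key
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue                      # trivial dependency
        if closure(lhs, fds) == set(schema):
            continue                      # lhs is a superkey
        if not (rhs - lhs) <= prime:
            return False                  # some attribute of rhs - lhs is in no candidate key
    return True

R = {"J", "K", "L"}
F = [({"J", "K"}, {"L"}), ({"L"}, {"K"})]
keys_R = candidate_keys(R, F)
R_in_3nf = is_3nf(R, F)

bor_loan = {"customer_id", "loan_number", "amount"}
G = [({"loan_number"}, {"amount"})]
bor_loan_in_3nf = is_3nf(bor_loan, G)
```

For R = (J, K, L) the search finds the two candidate keys JK and JL named in the example, and L → K passes only through the third condition. The bor_loan schema fails because amount belongs to no candidate key.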




Dependency Preservation

- Let Fi be the set of dependencies in F+ that include only attributes in Ri.
- A decomposition is dependency preserving if
      (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
- If it is not, then checking updates for violation of functional dependencies may require computing joins, which is expensive.

Testing for Dependency Preservation

To check if a dependency α → β is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done with respect to F):

    result = α
    while (changes to result) do
        for each Ri in the decomposition
            t = (result ∩ Ri)+ ∩ Ri
            result = result ∪ t

- If result contains all attributes in β, then the functional dependency α → β is preserved.
- We apply the test on all dependencies in F to check if a decomposition is dependency preserving
- This procedure takes polynomial time, instead of the exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+

Example

    R = (A, B, C)
    F = {A → B, B → C}
    Key = {A}

- R is not in BCNF
- Decomposition R1 = (A, B), R2 = (B, C)
  - R1 and R2 in BCNF
  - Lossless-join decomposition
  - Dependency preserving
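The preservation test above is short enough to run by hand, but a sketch in code makes the loop explicit. The function names below are assumptions; the first check uses the example R = (A, B, C), and the second uses the schema (J, K, L) with F = {JK → L, L → K} and an assumed decomposition into (J, L) and (L, K) to show a dependency that is not preserved.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_preserved(lhs, rhs, decomposition, fds):
    """Polynomial-time test: is lhs -> rhs preserved by the decomposition?"""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            t = closure(result & ri, fds) & ri   # (result ∩ Ri)+ ∩ Ri
            if not t <= result:
                result |= t
                changed = True
    return rhs <= result

def is_dependency_preserving(decomposition, fds):
    """Apply the test to every dependency in F."""
    return all(is_preserved(lhs, rhs, decomposition, fds) for lhs, rhs in fds)

F = [({"A"}, {"B"}), ({"B"}, {"C"})]
abc_preserving = is_dependency_preserving([{"A", "B"}, {"B", "C"}], F)

G = [({"J", "K"}, {"L"}), ({"L"}, {"K"})]
jkl_preserving = is_dependency_preserving([{"J", "L"}, {"L", "K"}], G)
```

In the second case JK → L cannot be checked inside either component relation, so enforcing it would require a join.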




Boyce/Codd Normal Form

A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F+ of the form

    α → β

where α ⊆ R and β ⊆ R, at least one of the following holds:
- α → β is trivial (i.e., β ⊆ α)
- α is a superkey for R

Example schema not in BCNF:

    bor_loan = (customer_id, loan_number, amount)

because loan_number → amount holds on bor_loan but loan_number is not a superkey.

How good is BCNF?

- There are database schemas in BCNF that do not seem to be sufficiently normalized
- Consider a database
      classes (course, teacher, book)
  such that (c, t, b) ∈ classes means that t is qualified to teach c, and b is a required textbook for c
- The database is supposed to list for each course the set of teachers any one of which can be the course's instructor, and the set of books, all of which are required for the course




Multi-valued Dependencies and Fourth Normal Form

Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency

    α →→ β

holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:

    t1[α] = t2[α] = t3[α] = t4[α]
    t3[β] = t1[β]
    t3[R − β] = t2[R − β]
    t4[β] = t2[β]
    t4[R − β] = t1[R − β]

[Tabular representation of α →→ β]

Example

Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets

    Y, Z, W

We say that Y →→ Z (Y multidetermines Z) if and only if for all possible relations r(R):

    if <y1, z1, w1> ∈ r and <y1, z2, w2> ∈ r
    then <y1, z1, w2> ∈ r and <y1, z2, w1> ∈ r

Note that since the behavior of Z and W are identical it follows that Y →→ Z if Y →→ W.

In our example:

    course →→ teacher
    course →→ book

The above formal definition is supposed to formalize the notion that given a particular value of Y (course) it has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.

Note:
- If Y → Z then Y →→ Z
- Indeed we have (in above notation) Z1 = Z2. The claim follows.

Use of Multivalued Dependencies

We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.

If a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r.

Fourth Normal Form

A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
- α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
- α is a superkey for schema R

If a relation is in 4NF it is in BCNF.
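The multivalued-dependency definition can be checked mechanically on an instance: for every pair t1, t2 that agrees on α, the tuple taking its α and β values from t1 and everything else from t2 must also be present (t4 is covered when the pair is visited in the other order). The sketch below applies this to the classes (course, teacher, book) example; the function name and the sample course, teacher, and book values are assumptions.

```python
def mvd_holds(rows, schema, alpha, beta):
    """True if alpha ->-> beta holds on this instance (rows are tuples over schema)."""
    idx = {a: i for i, a in enumerate(schema)}
    rest = [a for a in schema if a not in alpha and a not in beta]
    rows = set(rows)
    for t1 in rows:
        for t2 in rows:
            if any(t1[idx[a]] != t2[idx[a]] for a in alpha):
                continue  # the pair does not agree on alpha
            # t3 takes alpha and beta from t1 and the remaining attributes from t2
            wanted = {a: t1[idx[a]] for a in alpha + beta}
            wanted.update({a: t2[idx[a]] for a in rest})
            t3 = tuple(wanted[a] for a in schema)
            if t3 not in rows:
                return False
    return True

schema = ("course", "teacher", "book")
classes = {
    ("database", "Avi", "DB Concepts"),
    ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"),
    ("database", "Hank", "Ullman"),
}

full_ok = mvd_holds(classes, schema, ["course"], ["teacher"])

# Dropping one combination breaks the independence of teachers and books.
incomplete = classes - {("database", "Hank", "Ullman")}
incomplete_ok = mvd_holds(incomplete, schema, ["course"], ["teacher"])
```

The failing case also illustrates the remark above: the violated dependency can be repaired by adding the missing tuple back to the relation.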




Join Dependencies and Fifth Normal Form

- Join dependencies generalize multivalued dependencies
  - They lead to project-join normal form (PJNF), also called fifth normal form
- A class of even more general constraints leads to a normal form called domain-key normal form.
- Problem with these generalized constraints: they are hard to reason with, and no sound and complete set of inference rules exists.
- Hence rarely used



Unit Number & Name: IV & TRANSACTIONS Period: 28 of 45

Transaction Concepts - Transaction Recovery

A transaction is a unit of program execution that accesses and possibly updates various data items.
E.g., transaction to transfer $50 from account A to account B:

    1. read(A)
    2. A := A − 50
    3. write(A)
    4. read(B)
    5. B := B + 50
    6. write(B)

Two main issues to deal with:
- Failures of various kinds, such as hardware failures and system crashes
- Concurrent execution of multiple transactions

Example of Fund Transfer

The requirements below refer to the six-step transfer transaction above.

- Atomicity requirement:
  - If the transaction fails after step 3 and before step 6, money will be lost, leading to an inconsistent database state
    - Failure could be due to software or hardware
  - The system should ensure that updates of a partially executed transaction are not reflected in the database
- Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist even if there are software or hardware failures.
- Consistency requirement in the above example: the sum of A and B is unchanged by the execution of the transaction
  - In general, consistency requirements include:
    - Explicitly specified integrity constraints such as primary keys and foreign keys
    - Implicit integrity constraints, e.g., the sum of balances of all accounts, minus the sum of loan amounts, must equal the value of cash-in-hand
  - A transaction must see a consistent database.
  - During transaction execution the database may be temporarily inconsistent.
  - When the transaction completes successfully the database must be consistent
    - Erroneous transaction logic can lead to inconsistency
- Isolation requirement: if between steps 3 and 6, another transaction T2 is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be).

      T1                      T2
      1. read(A)
      2. A := A − 50
      3. write(A)
                              read(A), read(B), print(A+B)
      4. read(B)
      5. B := B + 50
      6. write(B)

  - Isolation can be ensured trivially by running transactions serially, that is, one after the other.
  - However, executing multiple transactions concurrently has significant benefits, as we will see later.
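The atomicity requirement can be observed with a small experiment. The sketch below uses Python's sqlite3 module, where a connection used as a context manager commits if the block finishes and rolls back if an exception escapes it. The account names, starting balances, and the fail_midway switch that simulates a failure between write(A) and write(B) are all assumptions for illustration.

```python
import sqlite3

def make_bank():
    """In-memory bank with two accounts, A = 100 and B = 200."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 200)])
    conn.commit()
    return conn

def balances(conn):
    return dict(conn.execute("SELECT account_number, balance FROM account"))

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move amount from src to dst as one transaction; return True if it committed."""
    try:
        with conn:  # commit on success, roll back if an exception escapes the block
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE account_number = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated failure after write(A)")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE account_number = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

conn = make_bank()
failed = transfer(conn, "A", "B", 50, fail_midway=True)
after_failure = balances(conn)      # the partial update to A must not be visible
succeeded = transfer(conn, "A", "B", 50)
after_success = balances(conn)
```

After the simulated failure the debit from A has been undone, and after the successful run the sum A + B is the same as at the start, which is the consistency requirement stated above.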




ACID Properties

A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data the database system must ensure:
- Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
- Consistency. Execution of a transaction in isolation preserves the consistency of the database.
- Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
  - That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
- Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

System Recovery - Media Recovery

- Failure Classification
- Storage Structure
- Recovery and Atomicity
- Log-Based Recovery
- Shadow Paging
- Recovery With Concurrent Transactions
- Buffer Management
- Failure with Loss of Nonvolatile Storage
- Advanced Recovery Techniques
- ARIES Recovery Algorithm
- Remote Backup Systems

Failure Classification




Transaction failure:
Logical errors: transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock)

System crash: a power failure or other hardware or software failure causes the system to crash.
Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by system crash
Database systems have numerous integrity checks to prevent corruption of disk data

Disk failure: a head crash or similar disk failure destroys all or part of disk storage
Destruction is assumed to be detectable: disk drives use checksums to detect failures

Recovery Algorithms

Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures. They are the focus of this chapter.

Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability

Recovery and Atomicity

Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state.

Consider transaction Ti that transfers $50 from account A to account B; the goal is either to perform all database modifications made by Ti or none at all.

Several output operations may be required for Ti (to output A and B). A failure may occur after one of these modifications has been made but before all of them are made.

To ensure atomicity despite failures, we first output information describing the modifications to stable storage without modifying the database itself.




We study two approaches: log-based recovery, and shadow paging.

We assume (initially) that transactions run serially, that is, one after the other.
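A minimal sketch of the log-based idea, assuming serial transactions as stated above. The log record layout ("start", "update" with old and new values, "commit") and the function name recover are choices made for this illustration; real systems differ in many details.

```python
def recover(log, database):
    """Rebuild a consistent state from a log after a crash.

    log: list of records in the order they reached stable storage:
         ("start", T), ("update", T, item, old_value, new_value), ("commit", T)
    database: dict with whatever values had reached disk before the crash
    """
    committed = {record[1] for record in log if record[0] == "commit"}
    db = dict(database)

    # Undo: scan backwards and restore old values written by
    # transactions that never committed.
    for record in reversed(log):
        if record[0] == "update" and record[1] not in committed:
            _, _, item, old_value, _ = record
            db[item] = old_value

    # Redo: scan forwards and reapply new values written by
    # committed transactions, in case they never reached disk.
    for record in log:
        if record[0] == "update" and record[1] in committed:
            _, _, item, _, new_value = record
            db[item] = new_value

    return db
```

For the $50 transfer from A to B: if the crash happens after A was written to disk but before the commit record was logged, recovery restores the old value of A; if the commit record is in the log, recovery reapplies both new values.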

Two Phase Commit - Save Points

Lock-Based Protocols

A lock is a mechanism to control concurrent access to a data item.

Data items can be locked in two modes:
1. exclusive (X) mode. Data item can be both read as well as written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S instruction.

Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.

Lock-compatibility matrix

         S       X
S      true    false
X      false   false

A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.

Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item no other transaction may hold any lock on the item.

If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted.
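The granting rule described above can be sketched as a toy lock table. The class and method names (LockTable, request, release) are hypothetical and chosen only for this example; a real concurrency-control manager would also handle lock upgrades, fairness and blocking.

```python
# Compatibility of (mode already held, mode requested).
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_by_others):
    """A request is granted only if it is compatible with every lock held by others."""
    return all(COMPATIBLE[(held, requested)] for held in held_by_others)

class LockTable:
    def __init__(self):
        self.held = {}      # item -> {transaction: mode}
        self.waiting = {}   # item -> list of (transaction, mode)

    def _others(self, txn, item):
        return [m for t, m in self.held.get(item, {}).items() if t != txn]

    def request(self, txn, item, mode):
        """Return True if the lock is granted now, False if txn must wait."""
        if can_grant(mode, self._others(txn, item)):
            self.held.setdefault(item, {})[txn] = mode
            return True
        self.waiting.setdefault(item, []).append((txn, mode))
        return False

    def release(self, txn, item):
        """Release txn's lock and return the waiters that can now be granted."""
        self.held.get(item, {}).pop(txn, None)
        granted, still_waiting = [], []
        for waiter, mode in self.waiting.get(item, []):
            if can_grant(mode, self._others(waiter, item)):
                self.held.setdefault(item, {})[waiter] = mode
                granted.append(waiter)
            else:
                still_waiting.append((waiter, mode))
        self.waiting[item] = still_waiting
        return granted
```

Two readers can share an item, while a writer is queued until every shared lock on that item has been released.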

Example of a transaction performing locking:

T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)

Locking as above is not sufficient to guarantee serializability: if A and B get updated in between the read of A and the read of B, the displayed sum would be wrong.




A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

The Two-Phase Locking Protocol

This is a protocol which ensures conflict-serializable schedules.

Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks

Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks

The protocol assures serializability. It can be proved that the transactions can be serialized in the order of their lock points (i.e. the point where a transaction acquired its final lock).

Two-phase locking does not ensure freedom from deadlocks.

Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks till it commits/aborts.

Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this protocol transactions can be serialized in the order in which they commit.

There can be conflict serializable schedules that cannot be obtained if two-phase locking is used.

However, in the absence of extra information (e.g., ordering of access to data), two-phase locking is needed for conflict serializability in the following sense:

Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
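Whether a single transaction obeys the two-phase rule can be checked mechanically. The sketch below assumes a transaction is written as a list of (action, item) pairs; that representation and the function names are illustrative only.

```python
def is_two_phase(operations):
    """True if no lock is acquired after the first unlock.

    operations: list of (action, item) pairs; actions other than
    "lock" and "unlock" (for example "read") are ignored.
    """
    released = False
    for action, _item in operations:
        if action == "unlock":
            released = True          # shrinking phase has begun
        elif action == "lock" and released:
            return False             # obtaining a lock while shrinking
    return True

def lock_point(operations):
    """Index of the operation that acquires the final lock, or None if there is none."""
    positions = [i for i, (action, _item) in enumerate(operations) if action == "lock"]
    return positions[-1] if positions else None

# The transaction T2 from the earlier example unlocks A before locking B,
# so it is not two-phase.
t2 = [("lock", "A"), ("read", "A"), ("unlock", "A"),
      ("lock", "B"), ("read", "B"), ("unlock", "B")]

# A two-phase rewrite acquires both locks before releasing either.
t2_two_phase = [("lock", "A"), ("read", "A"), ("lock", "B"),
                ("read", "B"), ("unlock", "A"), ("unlock", "B")]
```

In the rewritten version the lock point is the operation that locks B, after which the transaction only releases locks.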




SQL Facilities for recovery

Transaction Definition in SQL

Data manipulation language must include a construct for specifying the set of actions that comprise a transaction.

In SQL, a transaction begins implicitly.

A transaction in SQL ends by:
Commit work: commits current transaction and begins a new one.
Rollback work: causes current transaction to abort.

In almost all database systems, by default, every SQL statement also commits implicitly if it executes successfully.
Implicit commit can be turned off by a database directive,
e.g. in JDBC, connection.setAutoCommit(false);

Concurrency Need for Concurrency

Lock-Based Protocols
Timestamp-Based Protocols
Validation-Based Protocols
Multiple Granularity
Multiversion Schemes
Insert and Delete Operations
Concurrency in Index Structures

Lock-Based Protocols

A lock is a mechanism to control concurrent access to a data item. Data items can be locked in two modes: exclusive (X) and shared (S).

Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.

Graph-Based Protocols

Graph-based protocols are an alternative to two-phase locking.

Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.

If di → dj then any transaction accessing both di and dj must access di before accessing dj.




This implies that the set D may now be viewed as a directed acyclic graph, called a database graph.

The tree protocol is a simple kind of graph protocol.

Tree Protocol

1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently be relocked by Ti.
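The four rules above can be expressed as a small checker for one transaction's lock and unlock sequence. The tree is assumed to be given as a dictionary from each item to its parent; that encoding, the sample tree and the function name are assumptions made for this sketch.

```python
def follows_tree_protocol(parent, operations):
    """Check a sequence of ("lock", item) / ("unlock", item) pairs.

    parent: dict mapping each data item to its parent (the root maps to None).
    All locks are taken to be exclusive (rule 1).
    """
    held = set()          # items currently locked
    ever_locked = set()   # items locked at any earlier point
    for action, item in operations:
        if action == "lock":
            if item in ever_locked:
                return False   # rule 4: an item cannot be relocked
            if ever_locked and parent.get(item) not in held:
                return False   # rule 2: parent must currently be locked
            held.add(item)
            ever_locked.add(item)
        elif action == "unlock":
            if item not in held:
                return False   # cannot unlock an item that is not held
            held.discard(item) # rule 3: unlocking is allowed at any time
    return True

# Illustrative database tree: A is the root, B and C are children of A,
# D is a child of B.
TREE = {"A": None, "B": "A", "C": "A", "D": "B"}
```

Note that a sequence such as lock A, lock B, unlock A, lock D is accepted here even though it acquires a lock after releasing one, so a transaction can follow the tree protocol without being two-phase.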

Graph-Based Protocols (Cont.)

The tree protocol ensures conflict serializability as well as freedom from deadlock.

Unlocking may occur earlier in the tree-locking protocol than in the two-phase locking protocol:
shorter waiting times, and increase in concurrency
protocol is deadlock-free, no rollbacks are required

Drawbacks

Protocol does not guarantee recoverability or cascade freedom.
Need to introduce commit dependencies to ensure recoverability.

Transactions may have to lock data items that they do not access:
increased locking overhead, and additional waiting time
potential decrease in concurrency

Schedules not possible under two-phase locking are possible under tree protocol, and vice versa.




Locking Protocols Two Phase Locking Intent Locking





Deadlock- Serializability Recovery Isolation Levels

System is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set.

Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some prevention strategies:

Require that each transaction locks all its data items before it begins execution (predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data items only in the order specified by the partial order (graph-based protocol).

More Deadlock Prevention Strategies

Deadlock Handling

Consider the following two transactions:

T1: write(X)        T2: write(Y)
    write(Y)            write(X)

Schedule with deadlock:

T1                      T2
lock-X on X
write(X)
                        lock-X on Y
                        write(Y)
                        wait for lock-X on X
wait for lock-X on Y




Following schemes use transaction timestamps for the sake of deadlock prevention alone.

wait-die scheme (non-preemptive)
older transaction may wait for younger one to release data item. Younger transactions never wait for older ones; they are rolled back instead.
a transaction may die several times before acquiring needed data item

wound-wait scheme (preemptive)
older transaction wounds (forces rollback of) younger transaction instead of waiting for it. Younger transactions may wait for older ones.
may be fewer rollbacks than wait-die scheme.

Both in wait-die and in wound-wait schemes, a rolled back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.

Timeout-Based Schemes:
a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
thus deadlocks are not possible
simple to implement; but starvation is possible. Also difficult to determine good value of the timeout interval.
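The two timestamp-based schemes described earlier (wait-die and wound-wait) differ only in what happens when a requester meets a lock holder. A minimal sketch, assuming a smaller timestamp means an older transaction; the function names and return strings are illustrative.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits, a younger requester is rolled back."""
    if requester_ts < holder_ts:
        return "wait"
    return "rollback requester"

def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester forces the holder to roll back, a younger one waits."""
    if requester_ts < holder_ts:
        return "rollback holder"
    return "wait"
```

In both functions the transaction that gets rolled back is always the younger of the two, which is why restarting it with its original timestamp eventually makes it the oldest and lets it finish.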

Deadlock Detection

Deadlocks can be described as a wait-for graph, which consists of a pair G = (V, E),
V is a set of vertices (all the transactions in the system)
E is a set of edges; each element is an ordered pair Ti → Tj.

If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.

When Ti requests a data item currently being held by Tj, then the edge Ti → Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.

The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
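One way to look for cycles is a depth-first search over the wait-for graph. The sketch below assumes the graph is stored as a dictionary from each transaction to the set of transactions it is waiting for; the representation and the name find_cycle are choices made for this example.

```python
def find_cycle(wait_for):
    """Return a list of transactions that form a cycle, or None if there is none.

    wait_for: dict mapping Ti to the set of Tj such that Ti -> Tj
    (Ti is waiting for Tj to release a data item).
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []   # transactions on the current depth-first path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for successor in sorted(wait_for.get(node, ())):
            successor_state = state.get(successor, UNVISITED)
            if successor_state == IN_PROGRESS:
                # Reached a transaction already on the path: a cycle.
                return path[path.index(successor):]
            if successor_state == UNVISITED:
                cycle = visit(successor)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for start in sorted(wait_for):
        if state.get(start, UNVISITED) == UNVISITED:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

Any transaction on the returned cycle is a candidate victim for breaking the deadlock.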




Deadlock Recovery

When deadlock is detected:
Some transaction will have to be rolled back (made a victim) to break deadlock. Select that transaction as victim that will incur minimum cost.
Rollback -- determine how far to roll back transaction
Total rollback: Abort the transaction and then restart it.
More effective to roll back transaction only as far as necessary to break deadlock.
Starvation happens if same transaction is always chosen as victim. Include the number of rollbacks in the cost factor to avoid starvation.

[Figures: wait-for graph without a cycle; wait-for graph with a cycle]




Unit Number & Name: V & IMPLEMENTATION TECHNIQUES Period: 37 of 45


Overview of Physical Storage Media

Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.

Cache: most costly and fastest form of storage. Usually very small, and managed by the operating system.

Main Memory (MM): the storage area for data available to be operated on.
General-purpose machine instructions operate on main memory.
Contents of main memory are usually lost in a power failure or "crash".
Usually too small (even with megabytes) and too expensive to store the entire database.

Flash memory: EEPROM (electrically erasable programmable read-only memory).
Data in flash memory survive power failure.
Reading data from flash memory takes about 10 nanoseconds (roughly as fast as from main memory), and writing data into flash memory is more complicated: write-once takes about 4-10 microseconds.
To overwrite what has been written, one has to first erase the entire bank of the memory. It may support only a limited number of erase cycles ( to ).
It has found its popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).

Magnetic-disk storage: primary medium for long-term storage.
Typically the entire database is stored on disk.
Data must be moved from disk to main memory in order for the data to be operated on.
After operations are performed, data must be copied back to disk if any changes were made.
Disk storage is called direct access storage as it is possible to read data on the disk in any order (unlike sequential access).
Disk storage usually survives power failures and system crashes.

Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disk (for archival storage of data), and juke box (containing a few drives and numerous disks loaded on demand).

Tape storage: used primarily for backup and archival data.
Cheaper, but much slower access, since tape must be read sequentially from the beginning.
Used as protection from disk failures!

The storage device hierarchy is presented in Figure 10.1, where the higher levels are expensive (cost per bit) and fast (access time), but their capacity is smaller.




Figure 10.1: Storage-device hierarchy

Another classification: primary, secondary, and tertiary storage.
1. Primary storage: the fastest storage media, such as cache and main memory.
2. Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.
3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke boxes.

Volatility of storage. Volatile storage loses its contents when the power is removed. Without power backup, data in the volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping.




Magnetic Disks RAID

RAID (an acronym for redundant array of independent disks; originally redundant array of inexpensive disks[1][2]) is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways called "RAID levels", depending on what level of redundancy and performance (via parallel communication) is required.

RAID is an example of storage virtualization and was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987.[3] Marketers representing industry RAID manufacturers later attempted to reinvent the term to describe a redundant array of independent disks as a means of dissociating a low-cost expectation from RAID technology.[4]

RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives. The physical drives are said to be in a RAID,[5] which is accessed by the operating system as one single drive. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). Each scheme provides a different balance between two key goals: increase data reliability and increase input/output performance.

A number of standard schemes have evolved which are referred to as levels. There were five RAID levels originally conceived, but many more variations have evolved, notably several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardised by SNIA in the Common RAID Disk Drive Format (DDF) standard.

Following is a brief textual summary of the most commonly used RAID levels.[6]

RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage but no fault tolerance. Hence simple stripe sets are normally referred to as RAID 0. Any drive failure destroys the array, and the likelihood of failure increases with more drives in the array (at a minimum, catastrophic data loss is almost twice as likely compared to single drives without RAID). A single drive failure destroys the entire array because when data is written to a RAID 0 volume, the data is broken into fragments called blocks. The number of blocks is dictated by the stripe size, which is a configuration parameter of the array. The blocks are written to their respective drives simultaneously on the same sector. This allows smaller sections of the entire chunk of data to be read off the drive in parallel, increasing bandwidth. RAID 0 does not implement error checking, so any error is uncorrectable. More drives in the array means higher bandwidth, but greater risk of data loss.




In RAID 1 (mirroring without parity or striping), data is written identically to multiple drives, thereby producing a "mirrored set"; at least 2 drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only 2; of course, it might be possible to use such a limited level 1 RAID itself as a constituent of a level 1 RAID, effectively masking the limitation. The array continues to operate as long as at least one drive is functioning. With appropriate operating system support, there can be increased read performance, and only a minimal write performance reduction; implementing RAID 1 with a separate controller for each drive in order to perform simultaneous reads (and writes) is sometimes called multiplexing (or duplexing when there are only 2 drives).

In RAID 2 (bit-level striping with dedicated Hamming-code parity), all disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive.

In RAID 3 (byte-level striping with dedicated parity), all disk spindle rotation is synchronized, and data is striped so each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive.

RAID 4 (block-level striping with dedicated parity) is identical to RAID 5 (see below), but confines all parity data to a single drive. In this setup, files may be distributed between multiple drives. Each drive operates independently, allowing I/O requests to be performed in parallel. However, the use of a dedicated parity drive could create a performance bottleneck; because the parity data must be written to a single, dedicated parity drive for each block of non-parity data, the overall write performance may depend a great deal on the performance of this parity drive.

RAID 5 (block-level striping with distributed parity) distributes parity along with the data and requires all drives but one to be present to operate; the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity such that the drive failure is masked from the end user. However, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced and the associated data rebuilt. Additionally, there is the potentially disastrous RAID 5 write hole.

RAID 6 (block-level striping with double distributed parity) provides fault tolerance of two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed drive is replaced and its data rebuilt; the larger the drive, the longer the rebuild takes. Double parity gives additional time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
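For the single-parity levels above, the usual way to compute parity is a byte-wise XOR across the blocks of a stripe, which lets any one missing block be recomputed from the others. The sketch below shows only that arithmetic for one stripe with the parity block stored last; the function names are hypothetical, and the rotation of parity across drives that RAID 5 performs is not modelled.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, failed_index):
    """Recompute the block of a single failed drive from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)
```

Because XOR is its own inverse, the same rebuild function recovers a lost data block or a lost parity block; two simultaneous failures cannot be recovered with a single parity block.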




Tertiary storage File Organization Organization of Records in Files

Optical Disks

1. CD-ROM has become a popular medium for distributing software, multimedia data, and other electronic published information.
2. Capacity of CD-ROM: 500 MB. Disks are cheap to mass produce, and so are drives.
3. CD-ROM: much longer seek time (250 msec), lower rotation speed (400 rpm), leading to high latency and lower data-transfer rate (about 150 KB/sec). Drives that spin at the standard audio CD speed are available.
4. Recently, a new optical format, digital video disk (DVD), has become standard. These disks hold between 4.7 and 17 GB of data.
5. WORM (write-once, read-many) disks are popular for archival storage of data since they have a high capacity (about 500 MB), longer lifetime than hard disks, and can be removed from the drive -- good for audit trails (hard to tamper with).

Magnetic Tapes

6. Long history, slow, and limited to sequential access, and thus are used for backup, storage for infrequent access, and off-line medium for system transfer.
7. Moving to the correct spot may take minutes, but once positioned, tape drives can write data at density and speed approaching those of disk drives.
8. 8mm tape drive has the highest density, and we store 5 GB data on a 350-foot tape.
9. Popularly used for storage of large volumes of data, such as video, image, or remote sensing data.

File organization is the methodology which is applied to structured computer files. Files contain computer records which can be documents or information which is stored in a certain way for later retrieval. File organization refers primarily to the logical arrangement of data (which can itself be organized in a system of records with correlation between the fields/columns) in a file system. It should not be confused with the physical storage of the file in some types of storage media. There are certain basic types of computer file, which can include files stored as blocks of data and streams of data, where the information streams out of the file while it is being read until the end of the file is encountered.

We will look at two components of file organization here:

1. The way the internal file structure is arranged and




2. The external file as it is presented to the O/S or program that calls it. Here we will also examine the concept of file extensions.

We will examine various ways that files can be stored and organized. Files are presented to the application as a stream of bytes and then an EOF (end of file) condition.

A program that uses a file needs to know the structure of the file and needs to interpret its contents.

Internal File Structure

Methods and Design Paradigm

It is a high-level design decision to specify a system of file organization for a computer software program or a computer system designed for a particular purpose. Performance is high on the list of priorities for this design process, depending on how the file is being used. The design of the file organization usually depends mainly on the system environment: for instance, whether the file is going to be used for transaction-oriented processes like OLTP or data warehousing, or whether the file is shared among various processes like those found in a typical distributed system or is standalone. It must also be asked whether the file is on a network and used by a number of users, whether it may be accessed internally or remotely, and how often it is accessed.

However, all things considered, the most important considerations might be:

1. Rapid access to a record or a number of records which are related to each other.
2. The adding, modification, or deletion of records.
3. Efficiency of storage and retrieval of records.
4. Redundancy, being the method of ensuring data integrity.

A file should be organized in such a way that the records are always available for processing with no delay. This should be done in line with the activity and volatility of the information.

Types of File Organization

Organizing a file depends on what kind of file it happens to be: a file in the simplest form can be a text file (in other words, a file which is composed of ASCII (American Standard Code for Information Interchange) text). Files can also be created as binary or executable types (containing elements other than plain text). Also, files are keyed with attributes which help determine their use by the host operating system.

Techniques of File Organization

The three techniques of file organization are:
1. Heap (unordered)
2. Sorted
   1. Sequential (SAM)
   2. Line Sequential (LSAM)
   3. Indexed Sequential (ISAM)
3. Hashed or Direct

In addition to the three techniques, there are five methods of organizing files. They are sequential, line-sequential, indexed-sequential, inverted list, and direct or hashed access organization.

Sequential Organization

A sequential file contains records organized in the order they were entered. The order of the records is fixed. The records are stored and sorted in physical, contiguous blocks; within each block the records are in sequence.

Records in these files can only be read or written sequentially.

Once stored in the file, the record cannot be made shorter, or longer, or deleted. However, the record can be updated if the length does not change. (This is done by replacing the records by creating a new file.) New records will always appear at the end of the file.

If the order of the records in a file is not important, sequential organization will suffice, no matter how many records you may have. Sequential output is also useful for report printing or sequential reads which some programs prefer to do.

Line-Sequential Organization

Line-sequential files are like sequential files, except that the records can contain only characters as data. Line-sequential files are maintained by the native byte stream files of the operating system.

In the COBOL environment, line-sequential files that are created with WRITE statements with the ADVANCING phrase can be directed to a printer as well as to a disk.

Indexed-Sequential Organization

Key searches are improved by this system too. The single-level indexing structure is the simplest one: an index file whose records are (key, pointer) pairs, where the pointer is the position in the data file of the record with the given key. A subset of the records, evenly spaced along the data file, is indexed, in order to mark intervals of data records.

This is how a key search is performed: the search key is compared with the index keys to find the highest index key that does not exceed the search key; a linear search is then performed from the record that the index key points to, until the search key is matched or until the record pointed to by the next index entry is reached. Regardless of the double file access (index + data) required by this sort of search, the access time reduction is significant compared with sequential file searches.

Let's examine, for sake of example, a simple linear search on a 1,000 record sequentially organized file. An average of 500 key comparisons are needed (and this assumes the search keys are uniformly distributed among the data keys). However, using an evenly spaced index with 100 entries, each index entry covers an interval of 10 records, so the total number of comparisons is reduced to about 50 in the index file plus about 5 in the data file: roughly a nine to one reduction in the operations count!
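The comparison counts for the 1,000-record example can be checked by simulation. The sketch below assumes integer keys stored in sorted order and an index that is itself scanned linearly; the function names are illustrative only.

```python
def linear_search(keys, target):
    """Scan the data file from the start; return (position, comparisons)."""
    comparisons = 0
    for position, key in enumerate(keys):
        comparisons += 1
        if key == target:
            return position, comparisons
    return None, comparisons

def build_sparse_index(keys, entries):
    """Index evenly spaced records as (key, position) pairs."""
    step = len(keys) // entries
    return [(keys[position], position) for position in range(0, len(keys), step)]

def indexed_search(keys, index, target):
    """Search the index first, then scan only one interval of the data file."""
    comparisons = 0
    start, end = 0, len(keys)
    for key, position in index:
        comparisons += 1
        if key > target:
            end = position      # the target cannot lie at or beyond this entry
            break
        start = position        # highest index key seen so far that is <= target
    for position in range(start, end):
        comparisons += 1
        if keys[position] == target:
            return position, comparisons
    return None, comparisons

def average_comparisons(search, keys):
    """Average comparison count when every stored key is searched once."""
    return sum(search(key)[1] for key in keys) / len(keys)

keys = list(range(1000))
index = build_sparse_index(keys, 100)
average_linear = average_comparisons(lambda k: linear_search(keys, k), keys)
average_indexed = average_comparisons(lambda k: indexed_search(keys, index, k), keys)
```

Printing average_linear and average_indexed shows how much work the index saves; replacing the linear scan of the index with a binary search would reduce the index part further.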




Hierarchical extension of this scheme is possible since an index is a sequential file in itself, capable of being indexed in turn by a second-level index, and so on. Exploiting this hierarchical decomposition of the searches further decreases the access time and pays increasing dividends in the reduction of processing time. There is, however, a point at which this advantage starts to be reduced by the increased cost of storage, which in turn increases the index access time.

Hardware for indexed-sequential organization is usually disk-based, rather than tape. Records are physically ordered by primary key, and the index gives the physical location of each record. Records can be accessed sequentially or directly, via the index. The index is stored in a file and read into memory at the point when the file is opened. Also, indexes must be maintained.

As in sequential organization, the data is stored in physically contiguous blocks. However, the difference is in the use of indexes. There are three areas in the disk storage:

Primary Area: contains file records stored by key or ID numbers.
Overflow Area: contains records that cannot be placed in the primary area.
Index Area: contains keys of records and their locations on the disk.

Inverted List

In file organization, this is a file that is indexed on many of the attributes of the data itself. The inverted list method has a single index for each key type. The records are not necessarily stored in sequence. They are placed in the data storage area, and the indexes are updated with the record keys and locations.

For example, in a company file, one index could be maintained for all products and another for product types. It is faster to search the indexes than to search every record. These types of file are also known as "inverted indexes." However, inverted list files use more media space, and storage devices fill up quickly with this type of organization. The benefit is apparent immediately because searching is fast, but updating is much slower.

Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism. Data items in these systems are usually stored compressed, which would normally slow the retrieval process, but the compression algorithm is chosen to support this technique.

When querying a file, there are circumstances in which the query is designed to be modal, which means that rules are set that require different information to be held in the index. For example, when phrase querying is undertaken, the particular algorithm requires that offsets to word classifications are held in addition to document numbers.
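
A minimal sketch of an inverted index for text retrieval follows. It assumes that the stored offsets are simply word positions within each document, and the function names `build_inverted_index` and `phrase_query` are hypothetical. It shows why a phrase query needs offsets as well as document numbers.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {document number: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return index

def phrase_query(index, phrase):
    """Sorted document numbers containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit at offset start + i.
            if all(start + i in index[w].get(doc_id, ())
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "fast inverted index search",
    2: "index of inverted lists",
    3: "search the inverted index",
}
idx = build_inverted_index(docs)
print(sorted(idx["inverted"]))            # documents containing the word
print(phrase_query(idx, "inverted index"))
```

Document 2 contains both words but not next to each other, so document numbers alone could not answer the phrase query.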

Direct or Hashed Access

With direct or hashed access, a portion of disk space is reserved and a hashing algorithm computes the record address, so this kind of file requires additional space in the store. Records are placed randomly throughout the file and are accessed by addresses that specify their disk location. This type of file organization also requires disk storage rather than tape. It has excellent search and retrieval performance, but care must be taken to maintain the indexes. If the indexes become corrupt, the remaining data is effectively unusable, so this kind of file should be backed up regularly, as should all valuable stored data.
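
The address computation can be sketched as follows. Everything specific here is an assumption made for illustration: the record size, the number of reserved slots, the use of `zlib.crc32` as the hashing algorithm, linear probing to resolve collisions, and a `bytearray` standing in for the reserved disk space. Keys are assumed not to contain ":".

```python
import zlib

RECORD_SIZE = 16   # bytes per fixed-length record (assumed)
NUM_SLOTS = 8      # space reserved up front (assumed)

store = bytearray(RECORD_SIZE * NUM_SLOTS)  # stands in for the reserved disk area

def address(key, probe=0):
    """Byte offset of the slot for `key`; `probe` steps past occupied slots."""
    slot = (zlib.crc32(key.encode()) + probe) % NUM_SLOTS
    return slot * RECORD_SIZE

def put(key, value):
    record = f"{key}:{value}".encode().ljust(RECORD_SIZE, b"\0")
    if len(record) > RECORD_SIZE:
        raise ValueError("record too long")
    for probe in range(NUM_SLOTS):
        off = address(key, probe)
        current = store[off:off + RECORD_SIZE]
        # Use the slot if it is empty or already holds this key.
        if current[0] == 0 or current.startswith(key.encode() + b":"):
            store[off:off + RECORD_SIZE] = record
            return off
    raise RuntimeError("file is full")

def get(key):
    for probe in range(NUM_SLOTS):
        off = address(key, probe)
        current = bytes(store[off:off + RECORD_SIZE])
        if current[0] == 0:
            return None  # empty slot: the key is not stored
        if current.startswith(key.encode() + b":"):
            return current.rstrip(b"\0").split(b":", 1)[1].decode()
    return None


put("A-101", "500")
put("A-215", "700")
print(get("A-101"), get("A-215"), get("A-999"))
```

The whole area is allocated before any record is stored, which is the extra space the text refers to.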

External File Structure and File Extensions

Microsoft Windows and MS-DOS File Systems

The external structure of a file depends on whether it is created on a FAT or an NTFS partition. The maximum filename length is 255 characters on an NTFS partition and 11 characters on FAT (an 8-character name plus a 3-character extension, separated by "."). NTFS filenames keep their case (although case is ignored when searching under Windows), whereas FAT filenames have no concept of case. There is also the newer VFAT extension to FAT, which permits filenames of up to 255 characters.

UNIX and Apple Macintosh File Systems

The concept of directories and files is fundamental to the UNIX operating system. On Microsoft Windows-based operating systems, directories are depicted as folders, and moving about is accomplished by clicking on the different icons. In UNIX, the directories are arranged as a hierarchy with the root directory at the top of the tree. The root directory is always depicted as /. Within the / directory there are subdirectories (e.g. etc and sys). Files can be written to any directory, depending on the permissions. Files can be readable, writable and/or executable.

Organizing files using Libraries

With the advent of Microsoft Windows 7, file organization and management improved considerably through a tool called Libraries. A Library is a file organization system that brings together related files and folders stored in different locations on the local computer and on the network, so that they can be accessed centrally through a single access point. For instance, images stored in different folders on the local computer and/or across a computer network can be gathered in an Image Library. A collection of similar files can then be manipulated, sorted or accessed conveniently, as and when required, through a single access point on the desktop. This feature is particularly useful for accessing similar or related content, and for managing projects that use related and common data.




Indexing and Hashing Ordered Indices

Indexing mechanisms are used to speed up access to desired data, e.g. the author catalog in a library.
- Search key: an attribute or set of attributes used to look up records in a file.
- An index file consists of records (called index entries) of the form (search-key, pointer).
- Index files are typically much smaller than the original file.
- Two basic kinds of indices:
  - Ordered indices: search keys are stored in sorted order.
  - Hash indices: search keys are distributed uniformly across buckets using a hash function.

Index Evaluation Metrics

Indexing techniques are evaluated on the basis of:
- Access types supported efficiently, e.g. records with a specified value in an attribute, or records with an attribute value falling in a specified range of values
- Access time
- Insertion time
- Deletion time
- Space overhead

Ordered Indices

- In an ordered index, index entries are stored sorted on the search-key value, e.g. the author catalog in a library.
- Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called a clustering index. The search key of a primary index is usually, but not necessarily, the primary key.
- Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.
- Index-sequential file: an ordered sequential file with a primary index.
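
A small sketch of ordered indices, assuming an in-memory list as the data file and list positions as pointers. The sample rows and the helper names (`build_index`, `lookup`, `lookup_range`) are invented for illustration. The same code serves as a primary index when the search key matches the file order and as a secondary index otherwise.

```python
import bisect

# Data file, stored in order of account number (sample rows, invented).
accounts = [
    ("A-101", "Downtown", 500),
    ("A-110", "Downtown", 600),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-305", "Round Hill", 350),
]

def build_index(records, field):
    """Ordered index: (search-key, pointer) entries sorted on the search key."""
    return sorted((rec[field], pos) for pos, rec in enumerate(records))

def lookup_range(index, records, low, high):
    """Records with low <= search key <= high, in search-key order."""
    keys = [k for k, _ in index]  # rebuilt on each call for brevity
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [records[pos] for _, pos in index[start:end]]

def lookup(index, records, key):
    """Records whose search key equals `key`."""
    return lookup_range(index, records, key, key)

primary = build_index(accounts, 0)     # same order as the file: clustering
secondary = build_index(accounts, 1)   # different order: non-clustering

print(lookup(secondary, accounts, "Downtown"))
print(lookup_range(primary, accounts, "A-200", "A-300"))
```

Both access types from the evaluation metrics are covered: a specified value (`lookup`) and a range of values (`lookup_range`). The pointers in `primary` run in file order, while those in `secondary` jump around the file.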




B+ tree Index Files B tree Index Files

B+-tree indices are an alternative to indexed-sequential files.
- Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
- Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
- Disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead.
- The advantages of B+-trees outweigh the disadvantages, and they are used extensively.

A B+-tree is a rooted tree satisfying the following properties:
- All paths from the root to a leaf are of the same length.
- Each node that is not the root or a leaf has between $\lceil n/2 \rceil$ and $n$ children.
- A leaf node has between $\lceil (n-1)/2 \rceil$ and $n-1$ values.
- Special cases: if the root is not a leaf, it has at least 2 children. If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and $n-1$ values.
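
As a quick arithmetic check, the occupancy limits can be computed for a given n, reading the lower bounds as the ceiling of n/2 and of (n-1)/2. The function name `bplus_bounds` and the dictionary keys are hypothetical.

```python
import math

def bplus_bounds(n):
    """Occupancy limits for a B+-tree whose nodes hold up to n pointers."""
    return {
        "internal_children": (math.ceil(n / 2), n),
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
        "root_children_min": 2,           # when the root is not a leaf
        "root_leaf_values": (0, n - 1),   # when the root is the only node
    }

for n in (3, 4, 5):
    print(n, bplus_bounds(n))
```

For n = 3, an internal node has 2 or 3 children and a leaf holds 1 or 2 values.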

B+-Tree Node Structure

- Typical node: $P_1, K_1, P_2, \ldots, P_{n-1}, K_{n-1}, P_n$
  - $K_i$ are the search-key values.
  - $P_i$ are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search keys in a node are ordered: $K_1 < K_2 < K_3 < \ldots < K_{n-1}$

Example of a B+-tree

[Figure: B+-tree for account file (n = 3). Keys shown: root Perryridge; second level Mianus, Redwood; leaves Brighton, Downtown, Mianus, Perryridge, Redwood, Round Hill.]
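
A lookup in such a tree can be sketched as below. The `Node` class, the grouping of keys into leaves, and the convention that a key equal to K_i is found under the pointer to the right of K_i are all assumptions made for this sketch. Only the branch names come from the account-file example.

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # search keys in sorted order
        self.children = children    # child nodes; None marks a leaf

def find_leaf(root, key):
    """Walk from the root to the leaf that would hold `key`."""
    node = root
    while node.children is not None:
        # bisect_right counts the keys <= key, which is the index of the
        # child pointer to follow under the convention assumed here.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node

def contains(root, key):
    return key in find_leaf(root, key).keys

# Assumed layout of the account-file tree (n = 3).
leaves = [Node(["Brighton", "Downtown"]), Node(["Mianus"]),
          Node(["Perryridge"]), Node(["Redwood", "Round Hill"])]
root = Node(["Perryridge"], [
    Node(["Mianus"], leaves[0:2]),
    Node(["Redwood"], leaves[2:4]),
])

print(contains(root, "Mianus"), contains(root, "Clearview"))
```

Every search visits the same number of nodes, because all paths from the root to a leaf have the same length.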

Static Hashing Dynamic Hashing

Static Hashing

- A bucket is a unit of storage containing one or more records (a bucket is typically a disk block). In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
- Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
- The hash function is used to locate records for access, insertion and deletion.
- Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
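
A minimal sketch of a static hash file, assuming four buckets held as in-memory lists and `zlib.crc32` as a stand-in hash function. The names `h`, `buckets`, `insert`, `lookup` and `delete` are illustrative only.

```python
import zlib

NUM_BUCKETS = 4  # fixed when the file is created (assumed value)

def h(key):
    """Hash function: search-key value -> bucket address 0..NUM_BUCKETS-1."""
    return zlib.crc32(key.encode()) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)].append((key, record))

def lookup(key):
    # Different keys can share a bucket, so scan it sequentially.
    return [rec for k, rec in buckets[h(key)] if k == key]

def delete(key):
    b = h(key)
    buckets[b] = [(k, rec) for k, rec in buckets[b] if k != key]


for branch in ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood"]:
    insert(branch, {"branch": branch})
print(lookup("Mianus"))
delete("Mianus")
print(lookup("Mianus"))
```

With five keys and four buckets, at least two keys must share a bucket, which is why `lookup` filters on the key after hashing.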

Dynamic Hashing

- Good for a database that grows and shrinks in size.
- Allows the hash function to be modified dynamically.
- Extendable hashing: one form of dynamic hashing.
  - The hash function generates values over a large range, typically b-bit integers, with b = 32.
  - At any time, use only a prefix of the hash function to index into a table of bucket addresses. Let the length of the prefix be i bits, $0 \le i \le 32$.
  - Initially i = 0.
  - Value of i grows and shrinks as the
