3 f6 8_databases

Databases

Elena Punskaya

Post on 01-Nov-2014


Page 2: 3 f6 8_databases

© 2012 Elena Punskaya, Cambridge University Engineering Department

Big Data

• Facebook stores over 100 petabytes of media (photos and videos) uploaded by its 845 million users

• Amazon S3 stores 762 billion objects and processes over 500,000 requests per second for these objects at peak times

Credits: aws.typepad.com (Amazon); Bryce Durbin, TechCrunch

Page 3: 3 f6 8_databases

Big Data

• Storing large amounts of data requires managing complexity:
- mapping the real world to data
- providing concurrent access for creating, reading and changing the data
- providing distributed access to, and storage of, the data

• Database Management Systems decouple the business logic of applications working with data from the details of physical storage and transaction (operations on data) management

• Any non-trivial system needs to store its application data:
- user/password
- credit cards
- product information
- health records ...

• It is possible to store all data directly as files, but a typical filesystem isn't built for transaction management and high performance

Image: Cambridge High Performance Computing Cluster Darwin

Page 4: 3 f6 8_databases

Transaction Processing

Handout: Database Systems I

• Databases are a common component of many distributed systems. They store records for a large number of distinct entities and will typically support a small set of operations to access and manipulate those entities. These operations can be assumed to be atomic, i.e. they cannot be interrupted.

• External clients execute transactions, which are sequences of operations applied to one or more database entities and designed to achieve a single logical effect.

• The transaction manager ensures that transactions appear atomic to clients. The client receives an acknowledgement of every successful transaction.

[Figure: Client A and Client B send transactions (TA, TB) to the Transaction Manager, which issues atomic operations on the Database; a Recovery Log is also shown.]

Page 5: 3 f6 8_databases

Example: Bank Transfer

• Each account is represented by a different database object, which guarantees that each operation is atomic

• A key issue is what happens if there is a failure part-way through the transaction

class account {
    // link to required account records
    DBaseAccessInfo dbinfo;
public:
    // Constructor - open an account
    account(string account_name);
    // Atomic operations
    void debit(float amount);
    void credit(float amount);
    float read_balance();
};

// A typical transaction would be
void transfer(account& A, account& B, float amount) {
    float balance = A.read_balance();
    if (balance >= amount) {
        A.debit(amount);
        B.credit(amount);
    }
}

Page 6: 3 f6 8_databases

System Crash

• What happens if the system crashes in the middle of a transaction?

• Account A will have had its money debited, but the money will never appear in account B: an invalid state

• The transaction manager (or any transaction processing system) must have a means of recovering from errors, always leaving the system in a valid state

• We need to ensure that the debit/credit pair is ATOMIC, i.e. it can only be performed as a WHOLE, not in parts

// a typical transaction
void transfer(account& A, account& B, float amount) {
    float balance = A.read_balance();
    if (balance >= amount) {
        A.debit(amount);
        // <----------------------------- CRASH!
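The failure on this slide can be replayed in a few lines. The sketch below is only an illustration: `account` is a hypothetical in-memory stand-in for the database object, and the `crash_after_debit` flag is an invented switch that throws between the two operations to imitate the crash.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical in-memory stand-in for the slide's account class.
struct account {
    float balance;
    void debit(float amount)  { balance -= amount; }
    void credit(float amount) { balance += amount; }
    float read_balance() const { return balance; }
};

// Same logic as the slide's transfer(), plus an injected failure.
// crash_after_debit is an illustrative switch, not part of any real API.
void transfer(account& A, account& B, float amount, bool crash_after_debit) {
    assert(amount >= 0.0f);  // precondition for this sketch
    float balance = A.read_balance();
    if (balance >= amount) {
        A.debit(amount);
        if (crash_after_debit)
            throw std::runtime_error("simulated crash");
        B.credit(amount);
    }
}

// Runs one failing transfer and returns the total money left in the system.
float total_after_crash() {
    account A{100.0f}, B{50.0f};
    try {
        transfer(A, B, 30.0f, true);
    } catch (const std::runtime_error&) {
        // Nothing undoes the debit: A has lost 30 and B never gained it.
    }
    return A.read_balance() + B.read_balance();
}
```

The system started with 150 in total; after the interrupted transfer only 120 remains, which is the invalid state the slide describes.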

Page 7: 3 f6 8_databases

ACID

• A transaction may fail in many different ways (e.g. two clients try to access the same entity at the same time, temporary network failure, software fault, disk crash, etc.). The transaction processor tries to ensure that transactions have the following properties:

• Atomicity
- Either all or none of the transaction's operations are performed

• Consistency
- Transactions transform the system from one consistent state to another

• Isolation
- An incomplete transaction cannot reveal its result to other transactions before it is complete

• Durability
- Once the transaction is committed, the system must guarantee that the results of its operations will persist, even if there are subsequent system failures

Page 8: 3 f6 8_databases

Recovery

• In order to maintain the ACID properties, a transaction processor must be able to recover from errors by restoring the system to a consistent state.

• To achieve this, transactions are modelled on the following state machine:

[Figure: transaction state machine (not reproduced in this transcript)]

Example: the transfer transaction

// A typical transaction with commit
void transfer(account& A, account& B, float amount)
{
    // Record transaction start (declared outside the try block so that
    // the catch handler can still refer to id)
    int id = BeginTransaction();
    try {
        float balance = A.read_balance();
        if (balance >= amount) {
            A.debit(amount);
            B.credit(amount);
        }
        Commit(id);   // success, so commit (finish)
    }
    catch (...) {
        Abort(id);    // transaction failed, so undo (recover/revert)
    }
}

Note: the transaction processor might invalidate this transaction (see last slide).
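One common way to make Abort possible is an undo log that records old values before they are overwritten. The sketch below assumes that approach; the slides do not say how the course's transaction processor implements recovery. The `transaction` class, its `write`/`commit`/`abort` members and the `fail_midway` flag are all hypothetical names for illustration.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <vector>

struct account {
    float balance;
};

// Minimal undo log: remembers the old balance of every account touched.
class transaction {
    struct entry { account* acc; float old_balance; };
    std::vector<entry> undo_log;
public:
    void write(account& a, float new_balance) {
        undo_log.push_back({&a, a.balance});  // log the old value first
        a.balance = new_balance;
    }
    void commit() { undo_log.clear(); }       // changes become permanent
    void abort() {                            // restore in reverse order
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            it->acc->balance = it->old_balance;
        undo_log.clear();
    }
};

// Returns true if the transfer committed, false if it was rolled back.
bool transfer(account& A, account& B, float amount, bool fail_midway) {
    assert(amount >= 0.0f);                    // precondition for this sketch
    transaction tx;                            // plays the role of BeginTransaction()
    try {
        if (A.balance >= amount) {
            tx.write(A, A.balance - amount);   // debit
            if (fail_midway)
                throw std::runtime_error("simulated failure");
            tx.write(B, B.balance + amount);   // credit
        }
        tx.commit();
        return true;
    } catch (const std::exception&) {
        tx.abort();                            // undo the partial work
        return false;
    }
}
```

With this in place a failure between the debit and the credit leaves both balances exactly as they were. A real recovery log would also have to survive a process crash, which an in-memory vector does not.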

Page 9: 3 f6 8_databases

Concurrency

• In practice, a database transaction processor will be receiving a stream of transaction requests, and will need to execute transactions in parallel in order to provide acceptable response times.

• When two transactions reference the same account, uncontrolled interleaving of operations can produce an incorrect result. There are three classes of concurrency problem:

• The uncommitted dependency problem

Time  Transaction 1  Transaction 2
t1    –              A.write()
t2    A.read()       –
t3    –              abort()

In this case, transaction 1 reads an updated account value, but transaction 2 then aborts, undoing the effect of the update. Transaction 1 is left holding an incorrect account value.

Note: A.read() indicates any operation which reads a value from account A but does not change it (e.g. A.read_balance()); A.write() indicates any operation which changes account A (e.g. A.credit() or A.debit()).
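The t1 to t3 schedule in the table can be replayed deterministically in a single thread. This is a sketch only: the struct and function names are invented for the illustration, and the concrete amounts are arbitrary.

```cpp
#include <cassert>

struct account { float balance; };

struct dirty_read_result {
    float seen_by_t1;    // value transaction 1 read at t2
    float final_value;   // value in the database after t3
};

// Replays the uncommitted dependency schedule step by step.
dirty_read_result replay_uncommitted_dependency() {
    account A{100.0f};

    float saved = A.balance;   // t1: transaction 2 remembers the old value...
    A.balance = 150.0f;        //     ...and writes a new one (not yet committed)

    float seen = A.balance;    // t2: transaction 1 reads the uncommitted value

    A.balance = saved;         // t3: transaction 2 aborts, undoing its write

    // Transaction 1 now holds a value that was never committed.
    assert(seen != A.balance);
    return {seen, A.balance};
}
```

Transaction 1 carries on with 150 while the database says 100, which is the incorrect value the slide refers to.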

Page 10: 3 f6 8_databases

Concurrency

Handout: Engineering Part IIA: 3F6 - Software Engineering and Design

• The lost update problem

Time  Transaction 1  Transaction 2
t1    A.read()       –
t2    –              A.read()
t3    A.write()      –
t4    –              A.write()

In this case, the change made to account A at t3 by transaction 1 is lost because it is overwritten at time t4 by transaction 2.

• The inconsistent analysis problem

Time  Transaction 1  Transaction 2
t1    A.read()       –
t2    –              A.read()
t3    –              A.write()
t4    –              commit()

In this case, transaction 2 updates account A after transaction 1 has read its value. Hence, transaction 1 is left holding an incorrect value for account A.
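The lost update schedule can be replayed the same way, without real threads. In this sketch both transactions credit the account; the amounts and the function name are made up for the example.

```cpp
#include <cassert>

struct account { float balance; };

// Replays the lost update schedule: both transactions read, then both write.
float replay_lost_update() {
    account A{100.0f};

    float t1_copy = A.balance;     // t1: transaction 1 reads 100
    float t2_copy = A.balance;     // t2: transaction 2 reads 100

    A.balance = t1_copy + 10.0f;   // t3: transaction 1 credits 10, writes 110
    A.balance = t2_copy + 20.0f;   // t4: transaction 2 credits 20, writes 120

    // Running the two credits one after the other would have given 130.
    assert(A.balance != 130.0f);
    return A.balance;
}
```

The final balance is 120, so the credit of 10 made by transaction 1 has vanished.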

Page 11: 3 f6 8_databases

Managing Concurrency

• The problems discussed can be managed by applying Pessimistic or Optimistic concurrency control

• Pessimistic
- When a transaction wishes to access an account, it first secures a lock on that account; when it has finished, it releases the lock. If a lock is already taken, the transaction must wait until it is released.
- Locking could be on the whole table or a single row, and could be declared at different levels of exclusivity (e.g. no one else can access the data, or some access is allowed)
- Could cause deadlocks, e.g. T1 and T2 require two resources R1 and R2 to proceed:
‣ T1 holds R1 and is waiting for R2
‣ T2 holds R2 and is waiting for R1
- Useful when there is a lot of data that is often updated by many users

• Optimistic
- Allows uncontrolled access to accounts, and then simply aborts any transactions which might have suffered a conflict
- Implemented by creating a new copy of the data that may be updated and, when the update is completed, checking whether the master copy has changed in the meantime:
‣ if changed: abort
‣ if not: complete
- Useful when most operations are reading data and changes occur rarely
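The optimistic scheme can be sketched with a version counter on the master copy. The counter is an assumption made for this example (the slide only says that the master copy is checked for changes), and all type and function names are hypothetical.

```cpp
#include <cassert>

// A record with a version counter that is bumped on every committed change.
struct versioned_account {
    float balance;
    unsigned version;
};

// A transaction's private working copy plus the version it started from.
struct working_copy {
    float balance;
    unsigned base_version;
};

working_copy begin(const versioned_account& master) {
    return {master.balance, master.version};
}

// Commit succeeds only if nobody changed the master copy in the meantime.
bool try_commit(versioned_account& master, const working_copy& copy) {
    if (master.version != copy.base_version)
        return false;                // changed: abort
    master.balance = copy.balance;   // unchanged: complete
    ++master.version;
    return true;
}

// Two transactions start from the same state; only the first commit wins.
float replay_optimistic_conflict() {
    versioned_account A{100.0f, 0};
    working_copy t1 = begin(A);
    working_copy t2 = begin(A);
    t1.balance += 10.0f;
    t2.balance += 20.0f;
    bool first = try_commit(A, t1);
    bool second = try_commit(A, t2);
    assert(first && !second);        // t2 has to start again from the new state
    return A.balance;
}
```

Compared with the lost update replay, the second write is rejected rather than silently overwriting the first. The cost is that the aborted transaction must be retried, which is why the slide recommends this scheme when changes are rare.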

Page 12: 3 f6 8_databases

Relational Databases

• By the late 1960s, the "Software Crisis" had already been declared, and data storage wasn't doing much better

• In 1970, Edgar Codd, an English mathematician working for IBM, published a paper, "A Relational Model of Data for Large Shared Data Banks". It started with the following words: "Future users of large data banks must be protected from having to know how the data is organised in the machine..."

Image: Edgar 'Ted' Codd, 1923-2003 (image © IBM)

Computer calculations cost hundreds of dollars a minute, so great human effort was spent to make programs as efficient as possible before they were run. Early databases used either a rigid hierarchical structure or a complex navigational plan of pointers to the physical locations of the data on magnetic tapes. Teams of programmers were needed to express queries to extract meaningful information. While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.

IBM Research News, www.research.ibm.com/resources/news/20030423_edgarpassaway.shtml

Page 13: 3 f6 8_databases

Relational Databases

• Codd suggested moving away from the hierarchical or navigational structure of early databases to simple tables with rows and columns

• Based on Relational Algebra, this approach greatly simplified database queries (the ability to access and analyse data)

• Many relational database management systems (RDBMS) are accessed using SQL (Structured Query Language)
- SQL is defined by industry standards and has been developed over many revisions from SQL-87 to SQL 2008

• There are many free and commercial databases available:
- Free: PostgreSQL, MySQL, SQLite...
- Commercial: Oracle, DB2, SQL Server...

• SQLite is the easiest database to start using, as it requires no setup and is available on the teaching system.
- Type: sqlite3 <db-name>
- Then enter SQL commands, each followed by a ';'. The database will be stored in a file called <db-name>, which will be created if it does not already exist.

Page 14: 3 f6 8_databases

The Relational Model

• The relational model is related to set theory. A relation is a table. A relation contains a set of tuples (rows).

relation: course

scheme  Title                       Leader      Lectures
t1      RISC Processors             Sanchez     8
t2      QAM for modems              Sanchez     34
t3      Introduction to Mainframes  Belford     20
t4      Fast refresh LCDs           Richard     1
t5      t5[Title]                   t5[Leader]  t5[Lectures]
t6      t6[Title, Leader]

- The meaning of the data is described by the scheme, which is a set of column names. Column names are known as attributes.

course scheme = (Title, Leader, Lectures)

- There is no ordering or grouping of attributes. The table is a relation over this scheme. A relation r over a scheme R is written as r(R). Each column has a domain, D. So:

$D_{\text{Title}} = \text{strings}, \quad D_{\text{Lectures}} = \mathbb{Z}^+$

So each element $t_i[j] \in D_j$ and $t_i \in D_1 \times D_2 \times \cdots \times D_n$

For example, for the scheme (x, y) with $D_x = D_y = \mathbb{R}$, the domain of the tuples is the set of all two-dimensional vectors.

SQL:

CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0));
INSERT INTO course VALUES ('RISC Processors', 'Sanchez', 10);
UPDATE course SET Lectures=8 WHERE Leader='Sanchez';
DELETE FROM course WHERE Lectures=8 AND Leader='Sanchez';
DROP TABLE course;

In the CREATE TABLE statement, the column types (text, int) give the domains and CHECK(Lectures > 0) is a constraint.

SQL allows the domain of tuples to be restricted: $D_t \subseteq D_1 \times D_2 \times \cdots \times D_n$.
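The idea that a relation is a set of tuples drawn from a restricted domain maps directly onto standard containers. The sketch below is one possible modelling, not the course's: `course_row`, `course_relation` and `insert_row` are hypothetical names, and the check in `insert_row` plays the role of CHECK(Lectures > 0).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>

// One tuple over the scheme (Title, Leader, Lectures).
using course_row = std::tuple<std::string, std::string, int>;

// A relation is a set of tuples, so duplicate rows cannot occur.
using course_relation = std::set<course_row>;

// Mirrors CHECK(Lectures > 0): rows outside the allowed domain are rejected.
// Returns true only if the row was actually added.
bool insert_row(course_relation& r, const course_row& row) {
    if (std::get<2>(row) <= 0)
        return false;
    return r.insert(row).second;   // false if the row was already present
}

// Builds the course relation shown on the slide.
course_relation make_course() {
    course_relation r;
    insert_row(r, course_row{"RISC Processors", "Sanchez", 8});
    insert_row(r, course_row{"QAM for modems", "Sanchez", 34});
    insert_row(r, course_row{"Introduction to Mainframes", "Belford", 20});
    insert_row(r, course_row{"Fast refresh LCDs", "Richard", 1});
    assert(r.size() == 4);         // four distinct, valid rows
    return r;
}
```

Note that a real SQL table is a bag rather than a set: without extra constraints it would accept the same row twice, a point the projection slide returns to.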

Page 15: 3 f6 8_databases

The Relational Model

• Built on the principles of Relational Algebra
- Projection, Selection, Union, Intersection, Subtraction, Join

Page 16: 3 f6 8_databases

Relational algebra: Projection Π

Handout: Database Systems II

The projection operator, Π, removes columns by listing the ones to be retained. The operator is written as:

$\Pi_{\text{column1},\text{column2},\ldots}(\text{relation})$

An example of applying projection is:

$\Pi_{\text{Leader},\text{Lectures}}(\text{course})$ =

Leader   Lectures
Sanchez  8
Sanchez  34
Belford  20
Richard  1

Consider a relation r(R) where R = (x, y, z) and $x, y, z \in \mathbb{R}$. Each row represents a 3D vector. The relation $\Pi_{x,y}(r)$ contains the projection of the vectors onto the x, y plane.

In SQL, the SELECT statement performs all of the primitive relational algebra functionality. The projection above is rendered as:

SELECT Leader,Lectures FROM course

The general form is:

SELECT Col1[, Col2[, ...]] FROM table

Note that SQL is not entirely relational: the result of the expression

SELECT Leader FROM course

has duplicate rows. To remove duplicates, use:

SELECT DISTINCT Leader FROM course

There is a shorthand for the identity projection:

SELECT * FROM table
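The difference between SELECT and SELECT DISTINCT can be mimicked with a vector (which keeps duplicates) and a set (which does not). This is a sketch with invented names, using the course table from the earlier slide as data.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct course_row {
    std::string title;
    std::string leader;
    int lectures;
};

// The course table from the relational model slide.
std::vector<course_row> course_table() {
    return {
        {"RISC Processors", "Sanchez", 8},
        {"QAM for modems", "Sanchez", 34},
        {"Introduction to Mainframes", "Belford", 20},
        {"Fast refresh LCDs", "Richard", 1},
    };
}

// SELECT Leader FROM course: one output row per input row.
std::vector<std::string> select_leader(const std::vector<course_row>& rows) {
    std::vector<std::string> out;
    for (const course_row& row : rows)
        out.push_back(row.leader);
    return out;
}

// SELECT DISTINCT Leader FROM course: a set drops the duplicates,
// which is what the relational projection operator does.
std::set<std::string> select_distinct_leader(const std::vector<course_row>& rows) {
    std::set<std::string> out;
    for (const course_row& row : rows)
        out.insert(row.leader);
    return out;
}

// Duplicates survive the plain projection but not the DISTINCT one.
void check_projection() {
    const std::vector<course_row> rows = course_table();
    assert(select_leader(rows).size() == 4);           // Sanchez appears twice
    assert(select_distinct_leader(rows).size() == 3);
}
```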

Page 18: 3 f6 8_databases

Relational algebra: Selection σ

The selection operator accepts a predicate, Θ, and a relation. Rows matching the predicate are retained:

$\sigma_{\text{Leader}=\text{"Sanchez"}}(\text{course})$ =

Title            Leader   Lectures
RISC Processors  Sanchez  8
QAM for modems   Sanchez  34

The general form of the resulting relation can be written in set-builder notation:

$\sigma_\Theta(r) = \{\, t \mid t \in r,\ \Theta(t) \,\}$

That is, the result consists of all tuples t that are in the relation r and for which the predicate applied to the tuple, Θ(t), is true.

In SQL, selection is also performed with the SELECT statement, with the predicate being specified by the WHERE clause:

SELECT * FROM course WHERE Leader='Sanchez'

Predicates can contain expressions involving any or all of the columns. SQL has more or less the same set of numeric operators as C, and also AND, OR, NOT and BETWEEN:

SELECT * FROM course WHERE Lectures BETWEEN 2 AND 10

and IN:

WHERE Leader IN ('Belford', 'Richard')

Projection and selection can be readily composed, so in general:

$\Pi_S(\sigma_\Theta(r))$ translates to SELECT S FROM r WHERE Θ
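Selection is a filter driven by a predicate, and composing it with projection gives the SELECT ... WHERE form. The sketch below shows both; `select_where` and `titles_where` are hypothetical helper names, and the predicates are written as C++ lambdas standing in for the WHERE clause.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct course_row {
    std::string title;
    std::string leader;
    int lectures;
};

using predicate = std::function<bool(const course_row&)>;

// sigma_theta(r): keep exactly the rows for which theta(row) is true.
std::vector<course_row> select_where(const std::vector<course_row>& r,
                                     const predicate& theta) {
    std::vector<course_row> out;
    for (const course_row& row : r)
        if (theta(row))
            out.push_back(row);
    return out;
}

// Pi_Title(sigma_theta(r)), i.e. SELECT Title FROM r WHERE theta.
std::vector<std::string> titles_where(const std::vector<course_row>& r,
                                      const predicate& theta) {
    std::vector<std::string> out;
    for (const course_row& row : select_where(r, theta))
        out.push_back(row.title);
    return out;
}

void check_selection() {
    const std::vector<course_row> course = {
        {"RISC Processors", "Sanchez", 8},
        {"QAM for modems", "Sanchez", 34},
        {"Introduction to Mainframes", "Belford", 20},
        {"Fast refresh LCDs", "Richard", 1},
    };

    // WHERE Leader='Sanchez'
    const std::vector<course_row> by_sanchez =
        select_where(course, [](const course_row& row) {
            return row.leader == "Sanchez";
        });
    assert(by_sanchez.size() == 2);

    // WHERE Lectures BETWEEN 2 AND 10 (BETWEEN includes both end points)
    const std::vector<std::string> short_courses =
        titles_where(course, [](const course_row& row) {
            return row.lectures >= 2 && row.lectures <= 10;
        });
    assert(short_courses.size() == 1);
    assert(short_courses[0] == "RISC Processors");
}
```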

Page 20: 3 f6 8_databases

Union, intersection, subtraction

• In SQL, union, intersection and subtraction behave much more like set theory than relational algebra. For these operations it is the order of the attributes, not the names of the attributes, which has significance.

• Set union, ∪, aggregates the rows of two sets together. If there are two relations, r(R) and s(R), then the union, r ∪ s, can be computed:

SELECT * FROM r UNION SELECT * FROM s

Likewise, intersection can be computed using:

SELECT * FROM r INTERSECT SELECT * FROM s

Set differencing is either MINUS or EXCEPT, depending on the database:

SELECT * FROM r EXCEPT SELECT * FROM s

[Figure: Venn diagram of the relations r and s]

Since ordering, not naming, matters, with the schemes R = (a, b), S = (b, a) and the tables r(R), s(S):

r        s
a  b     b  a
1  2     3  5
3  4     1  2

r - s =

a  b
3  4
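Because only column position matters, the r and s tables above can be treated as plain sets of integer pairs, and the three SQL operators correspond to the standard library's set algorithms. The function names below are invented for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>

// Rows are compared by position only, so a row of r(a, b) and a row of
// s(b, a) are both just pairs of integers.
using row = std::pair<int, int>;
using relation = std::set<row>;

// r UNION s
relation union_of(const relation& r, const relation& s) {
    relation out;
    std::set_union(r.begin(), r.end(), s.begin(), s.end(),
                   std::inserter(out, out.end()));
    return out;
}

// r INTERSECT s
relation intersect(const relation& r, const relation& s) {
    relation out;
    std::set_intersection(r.begin(), r.end(), s.begin(), s.end(),
                          std::inserter(out, out.end()));
    return out;
}

// r EXCEPT s
relation except(const relation& r, const relation& s) {
    relation out;
    std::set_difference(r.begin(), r.end(), s.begin(), s.end(),
                        std::inserter(out, out.end()));
    return out;
}

void check_set_operations() {
    const relation r = {{1, 2}, {3, 4}};   // scheme (a, b)
    const relation s = {{3, 5}, {1, 2}};   // scheme (b, a)

    relation expected_difference;
    expected_difference.insert(row(3, 4));

    assert(except(r, s) == expected_difference);  // matches the slide
    assert(union_of(r, s).size() == 3);           // (1,2) is counted once
    assert(intersect(r, s).size() == 1);          // only (1,2) is shared
}
```

The row (1, 2) in s is removed from r even though its columns are named (b, a), which is the point the slide makes about ordering versus naming.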

Page 21: 3 f6 8_databases

Join / cartesian product ×

• The cartesian product is the only primitive operator which combines two tables with different schemes. Joining two relations, a × b, generates a new relation with every row in a paired with every row in b. Joining is very useful for extracting related information.

• Joining students and labs:

students

Student  Supervisor
Gibson   Sanchez
Murphy   Belford
Libby    Goldstein
Cook     Sanchez

labs

Lab    Demonstrator
3F27   Cook
3F89   Libby
4F185  Margo
3F34   Ray

The table students × labs is on the next page. Note that the attributes get augmented with the table name to avoid ambiguity. The table name may be omitted if it is not ambiguous. SQL:

SELECT * FROM students, labs

Find all students of "Sanchez" who are demonstrating:

$\Pi_{\text{Student}}(\sigma_{\text{Student}=\text{Demonstrator} \wedge \text{Supervisor}=\text{"Sanchez"}}(\text{students} \times \text{labs}))$

SELECT Student FROM students, labs
    WHERE Student=Demonstrator AND Supervisor='Sanchez'

The result is Cook. Selection is often composed with joining, so the combination is given its own non-primitive operator, the theta join:

$a \bowtie_\Theta b \equiv \sigma_\Theta(a \times b)$

Page 22: 3 f6 8_databases

Join / cartesian product ×

students × labs

students.Student  students.Supervisor  labs.Lab  labs.Demonstrator
Gibson            Sanchez              3F27      Cook
Gibson            Sanchez              3F89      Libby
Gibson            Sanchez              4F185     Margo
Gibson            Sanchez              3F34      Ray
Murphy            Belford              3F27      Cook
Murphy            Belford              3F89      Libby
Murphy            Belford              4F185     Margo
Murphy            Belford              3F34      Ray
Libby             Goldstein            3F27      Cook
Libby             Goldstein            3F89      Libby
Libby             Goldstein            4F185     Margo
Libby             Goldstein            3F34      Ray
Cook              Sanchez              3F27      Cook
Cook              Sanchez              3F89      Libby
Cook              Sanchez              4F185     Margo
Cook              Sanchez              3F34      Ray
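The cartesian product and the theta join built on it can be written as two nested loops. The sketch below reproduces the 16-row product and the "students of Sanchez who are demonstrating" query; the struct and function names are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct student_row { std::string student; std::string supervisor; };
struct lab_row     { std::string lab; std::string demonstrator; };

// One row of students x labs: every attribute of both inputs.
struct product_row { student_row s; lab_row l; };

// students x labs: every student row paired with every lab row.
std::vector<product_row> cartesian_product(const std::vector<student_row>& students,
                                           const std::vector<lab_row>& labs) {
    std::vector<product_row> out;
    for (const student_row& s : students)
        for (const lab_row& l : labs)
            out.push_back({s, l});
    return out;
}

// Pi_Student(sigma_{Student=Demonstrator AND Supervisor=supervisor}(students x labs))
std::vector<std::string> demonstrating_students_of(const std::vector<student_row>& students,
                                                   const std::vector<lab_row>& labs,
                                                   const std::string& supervisor) {
    std::vector<std::string> out;
    for (const product_row& p : cartesian_product(students, labs))
        if (p.s.student == p.l.demonstrator && p.s.supervisor == supervisor)
            out.push_back(p.s.student);
    return out;
}

void check_join() {
    const std::vector<student_row> students = {
        {"Gibson", "Sanchez"}, {"Murphy", "Belford"},
        {"Libby", "Goldstein"}, {"Cook", "Sanchez"},
    };
    const std::vector<lab_row> labs = {
        {"3F27", "Cook"}, {"3F89", "Libby"}, {"4F185", "Margo"}, {"3F34", "Ray"},
    };
    assert(cartesian_product(students, labs).size() == 16);

    const std::vector<std::string> result =
        demonstrating_students_of(students, labs, "Sanchez");
    assert(result.size() == 1);
    assert(result[0] == "Cook");
}
```

Materialising the full product before filtering is the literal reading of σΘ(a × b). It is fine for a sketch, though a real database would normally avoid building all the pairs.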

Page 23: 3 f6 8_databases

Natural Join
• A ‘natural join’ is a join followed by some selection and projection:
- Perform a join
- Perform selection so that attributes with the same name must be equal
- Perform projection to remove duplicated attributes

• Note that there are no attribute ambiguities.

• If attributes with the same name are semantically the same, then the natural join is usually the correct kind of join to use. In addition to the ‘labs’ table, we also have a table listing lab sessions:

sessions
Lab    Title
3F27   Mainframe filesystems
3F27   Filesystem security
3F89   Large vehicle control
4F185  Networks for finance systems
3F34   Magnetic storage forensics

• The natural join matches up the shared attributes:

sessions ⋈ labs =

Demonstrator  Lab    Title
Cook          3F27   Filesystem security
Cook          3F27   Mainframe filesystems
Libby         3F89   Large vehicle control
Margo         4F185  Networks for finance systems
Ray           3F34   Magnetic storage forensics
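The join of the two lab tables can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module; the column layout of ‘labs’ (Lab, Demonstrator) is an assumption inferred from the join result, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Assumption: 'labs' has columns (Lab, Demonstrator); 'sessions' is as on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE labs (Lab TEXT, Demonstrator TEXT);
    CREATE TABLE sessions (Lab TEXT, Title TEXT);
    INSERT INTO labs VALUES
        ('3F27', 'Cook'), ('3F89', 'Libby'), ('4F185', 'Margo'), ('3F34', 'Ray');
    INSERT INTO sessions VALUES
        ('3F27', 'Mainframe filesystems'),
        ('3F27', 'Filesystem security'),
        ('3F89', 'Large vehicle control'),
        ('4F185', 'Networks for finance systems'),
        ('3F34', 'Magnetic storage forensics');
""")

# NATURAL JOIN equates the shared column (Lab) and keeps a single copy of it.
rows = conn.execute(
    "SELECT Demonstrator, Lab, Title FROM sessions NATURAL JOIN labs "
    "ORDER BY Demonstrator, Title"
).fetchall()

for row in rows:
    print(row)
```

Each session row picks up the demonstrator of its lab, so the two 3F27 sessions both appear with Cook.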

Page 24: 3 f6 8_databases

Natural Join

More formally:

There are two relations r(R) and s(S).

The set of shared attributes is A:

$$A = \{A_1, \cdots, A_n\} = R \cap S$$

where n = |A|. The set of all attributes with no duplicates is:

$$R \cup S$$

The natural join is therefore:

$$r \bowtie s \equiv \Pi_{R \cup S}\, \sigma_{r.A_1 = s.A_1 \wedge \cdots \wedge r.A_n = s.A_n}(r \times s)$$

In SQL, natural joins are performed with NATURAL JOIN:

SELECT * FROM sessions NATURAL JOIN labs

In practice, you will usually design databases by considering the type of data, how it is stored in tables and how to extract the relevant information. Relational algebra will not crop up much in day-to-day design, but it is essential for understanding how the various operations in a relational database work.
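The formal definition can be followed step by step in code. This is an illustrative sketch only, not how a database engine implements joins: relations are modelled as lists of dicts, and the function name `natural_join` is made up for this example.

```python
from itertools import product


def natural_join(r, s):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    if not r or not s:
        return []
    shared = set(r[0]) & set(s[0])                 # A = R ∩ S
    result = []
    for tr, ts in product(r, s):                   # Cartesian product r × s
        if all(tr[a] == ts[a] for a in shared):    # selection: shared attributes equal
            merged = {**tr, **ts}                  # projection onto R ∪ S (one copy each)
            if merged not in result:               # relations are sets: no duplicate rows
                result.append(merged)
    return result


labs = [
    {"Lab": "3F27", "Demonstrator": "Cook"},
    {"Lab": "3F89", "Demonstrator": "Libby"},
]
sessions = [
    {"Lab": "3F27", "Title": "Mainframe filesystems"},
    {"Lab": "3F27", "Title": "Filesystem security"},
    {"Lab": "3F89", "Title": "Large vehicle control"},
]
joined = natural_join(sessions, labs)
```

If the two relations share no attributes, the selection is vacuously true and the function returns the plain Cartesian product, which matches the definition with n = 0.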

Page 25: 3 f6 8_databases

Example
• Let’s consider an example of a movie database for LOVEFiLM.com

• It is likely to have a table such as:

movie
Title                 Year  Actor
Pulp Fiction          1994  John Travolta
Hackers               1995  Angelina Jolie
The Matrix            1999  Keanu Reeves
The Devil’s Advocate  1997  Keanu Reeves

• SQL:
CREATE TABLE movie (Title text, Year int, Actor text)
INSERT INTO movie VALUES ('Pulp Fiction', 1994, 'John Travolta')
INSERT INTO movie VALUES ('Hackers', 1995, 'Angelina Jolie')
etc.

SQL (slide annotations: domain, constraint):
CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0))
INSERT INTO course VALUES ('RISC Processors', 'Sanchez', 10)
UPDATE course SET Lectures=8 WHERE Leader='Sanchez'
DELETE FROM cour…
DROP TABLE course

Page 26: 3 f6 8_databases

Example

• Projection
SELECT Actor FROM movie

Actor
John Travolta
Angelina Jolie
Keanu Reeves
Keanu Reeves

• Distinct
SELECT DISTINCT Actor FROM movie

Actor
John Travolta
Angelina Jolie
Keanu Reeves

• Selection
SELECT * FROM movie WHERE Actor='Keanu Reeves'

movie
Title                 Year  Actor
The Matrix            1999  Keanu Reeves
The Devil’s Advocate  1997  Keanu Reeves

• Projection and Selection composed
SELECT Title FROM movie WHERE Actor='Keanu Reeves'

Title
The Matrix
The Devil’s Advocate

Page 27: 3 f6 8_databases

Example

• Selection may use AND, OR, NOT, BETWEEN, IN, etc.
SELECT * FROM movie WHERE Year BETWEEN 1995 AND 1997
(BETWEEN 1995 AND 1997 is inclusive of both endpoints)

movie
Title                 Year  Actor
Hackers               1995  Angelina Jolie
The Devil’s Advocate  1997  Keanu Reeves
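The projection, selection, DISTINCT and BETWEEN queries on the movie table can be run as written. A minimal sketch with Python's sqlite3 module follows; the variable names are only for illustration, and string values are passed as parameters or in single quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (Title TEXT, Year INT, Actor TEXT)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [
        ("Pulp Fiction", 1994, "John Travolta"),
        ("Hackers", 1995, "Angelina Jolie"),
        ("The Matrix", 1999, "Keanu Reeves"),
        ("The Devil's Advocate", 1997, "Keanu Reeves"),
    ],
)

# Projection: one column, duplicates kept.
projection = conn.execute("SELECT Actor FROM movie").fetchall()
# Distinct: the same column with duplicates removed.
distinct = conn.execute("SELECT DISTINCT Actor FROM movie").fetchall()
# Selection: whole rows that satisfy a predicate.
selection = conn.execute(
    "SELECT * FROM movie WHERE Actor = ?", ("Keanu Reeves",)
).fetchall()
# Projection and selection composed.
composed = conn.execute(
    "SELECT Title FROM movie WHERE Actor = 'Keanu Reeves'"
).fetchall()
# BETWEEN includes both endpoints.
between = conn.execute(
    "SELECT Title FROM movie WHERE Year BETWEEN 1995 AND 1997"
).fetchall()
```

Note that a plain projection returns Keanu Reeves twice; SQL tables are multisets unless DISTINCT is requested.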

Page 28: 3 f6 8_databases

Example
• Let us now take a simplified table

movie
Title         Actor
Pulp Fiction  John Travolta
Hackers       Angelina Jolie

• Imagine we also have some information on the number of Oscars won

people
Actor           Oscars
John Travolta   0
Angelina Jolie  1

Page 29: 3 f6 8_databases

Example
• Cartesian product
SELECT movie.Title, movie.Actor, people.Actor, people.Oscars FROM movie, people
- the only operation that can create new records (if one doesn’t count renaming)
- BUT it creates too many records!

movie x people
movie.Title   movie.Actor     people.Actor    people.Oscars
Pulp Fiction  John Travolta   John Travolta   0
Pulp Fiction  John Travolta   Angelina Jolie  1
Hackers       Angelina Jolie  John Travolta   0
Hackers       Angelina Jolie  Angelina Jolie  1

• Natural join would give information on whether there are Oscar-winning actors in the movie
SELECT * FROM movie, people WHERE movie.Actor = people.Actor
or
SELECT * FROM movie NATURAL JOIN people
(the first form keeps both Actor columns; NATURAL JOIN keeps a single Actor column)

movie ⋈ people
Title         Actor           Oscars
Pulp Fiction  John Travolta   0
Hackers       Angelina Jolie  1
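The difference in size between the Cartesian product and the natural join is easy to check. A minimal sketch with Python's sqlite3 module, using the simplified movie and people tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (Title TEXT, Actor TEXT);
    CREATE TABLE people (Actor TEXT, Oscars INT);
    INSERT INTO movie VALUES ('Pulp Fiction', 'John Travolta'), ('Hackers', 'Angelina Jolie');
    INSERT INTO people VALUES ('John Travolta', 0), ('Angelina Jolie', 1);
""")

# Cartesian product: every movie row paired with every people row.
# The Actor columns are qualified because the name exists in both tables.
cartesian = conn.execute(
    "SELECT movie.Title, movie.Actor, people.Actor, people.Oscars FROM movie, people"
).fetchall()

# Natural join: only the pairs whose Actor values match, with one Actor column.
joined = conn.execute("SELECT * FROM movie NATURAL JOIN people").fetchall()
```

With two rows in each table the product has 2 × 2 = 4 rows, of which only the 2 matching pairs survive the join.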

Page 30: 3 f6 8_databases

Example
• Let us consider two tables with Oscar and BAFTA nominations

Oscar
John Travolta   Pulp Fiction
Angelina Jolie  Girl, Interrupted
Angelina Jolie  Changeling

BAFTA
John Travolta    Pulp Fiction
Angelina Jolie   Changeling
Jesse Eisenberg  The Social Network

• Union
(SELECT * FROM Oscar) UNION (SELECT * FROM BAFTA)

Oscar ∪ BAFTA
John Travolta    Pulp Fiction
Angelina Jolie   Girl, Interrupted
Angelina Jolie   Changeling
Jesse Eisenberg  The Social Network

Page 31: 3 f6 8_databases

Example
• Intersection
(SELECT * FROM Oscar) INTERSECT (SELECT * FROM BAFTA)

Oscar ∩ BAFTA
John Travolta   Pulp Fiction
Angelina Jolie  Changeling

• Difference
(SELECT * FROM Oscar) EXCEPT (SELECT * FROM BAFTA)

Oscar – BAFTA
Angelina Jolie  Girl, Interrupted

NOTE: some operators are treated differently in different databases, and some may not be present
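The three set operations can be checked against the nomination tables. This is a sketch using Python's sqlite3 module; the column names Actor and Title are assumptions, since the slide tables have no headers. As an instance of the note about database differences, SQLite does not accept parentheses around the individual SELECTs of a compound query, so they are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Oscar (Actor TEXT, Title TEXT);
    CREATE TABLE BAFTA (Actor TEXT, Title TEXT);
    INSERT INTO Oscar VALUES
        ('John Travolta', 'Pulp Fiction'),
        ('Angelina Jolie', 'Girl, Interrupted'),
        ('Angelina Jolie', 'Changeling');
    INSERT INTO BAFTA VALUES
        ('John Travolta', 'Pulp Fiction'),
        ('Angelina Jolie', 'Changeling'),
        ('Jesse Eisenberg', 'The Social Network');
""")

# UNION removes duplicate rows; UNION ALL would keep them.
union = conn.execute("SELECT * FROM Oscar UNION SELECT * FROM BAFTA").fetchall()
intersection = conn.execute("SELECT * FROM Oscar INTERSECT SELECT * FROM BAFTA").fetchall()
difference = conn.execute("SELECT * FROM Oscar EXCEPT SELECT * FROM BAFTA").fetchall()
```

The difference is not symmetric: swapping the two tables in the EXCEPT query would return only the Jesse Eisenberg row.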

Page 32: 3 f6 8_databases

Keys and Uniqueness
• Rows in a relation can be uniquely identified by a key, which can consist of one or more columns
- A key must be able to uniquely identify all possible rows that the relation could have in the domain of tuples, not just the rows that currently exist.

• Superkey
- Any collection of columns which can uniquely identify a row. There may be more than one valid superkey.

• Candidate key
- A minimal superkey, i.e. a superkey from which no column can be removed: no proper subset of its columns is itself a superkey. There may be more than one candidate key.

• Primary key
- A candidate key (or, less strictly, a superkey) which has been selected to have a special status. A table can have at most one primary key. It should be small and constant.

• Foreign key
- If two relations r and s share a key k, then r[k] is a foreign key if k is the primary key of s. The foreign key k therefore does not necessarily uniquely identify the rows of r.
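The superkey and candidate-key definitions can be made concrete with a small check over existing rows. This is a sketch only: the function names are made up for illustration, and the rows are the (Name, ID, Phone) employee records used elsewhere in these slides. A check over current data can refute a key but can never prove one, because a key must hold for every row the relation could ever contain.

```python
from itertools import combinations


def is_superkey(rows, columns):
    """True if no two of the given rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def candidate_keys(rows, attributes):
    """Minimal superkeys for the given rows, smallest first."""
    found = []
    for size in range(1, len(attributes) + 1):
        for cols in combinations(attributes, size):
            if any(set(k) <= set(cols) for k in found):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(rows, cols):
                found.append(cols)
    return found


attributes = ("Name", "ID", "Phone")
data = [
    ("Sanchez", 456, 65960), ("Belford", 9, 65536), ("Richard", 82, 32768),
    ("Sanchez", 456, 60294), ("Sanchez", 456, 70231), ("Richard", 82, 16384),
]
rows = [dict(zip(attributes, values)) for values in data]
keys = candidate_keys(rows, attributes)
```

For these six rows the search reports Phone alone as a key, simply because no two employees happen to share a number yet. Whether Phone really is a key depends on the domain (can two people share a telephone?), which is why keys are chosen by reasoning about all possible rows rather than by inspecting the current ones.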

Page 33: 3 f6 8_databases

Keys

Name             Address                 DoB              Gender  Relationship  Email
John Smith       34 West rd, Cambridge   2 Jan 1981       Male    Single        [email protected]
Thomas Anderson  Flat 303,               11 March 1962    Male    Single        [email protected]
...
Mia Wallace      20 Sunset rd, Carlsbad  10 October 1994  Female  Married       [email protected]

Page 34: 3 f6 8_databases

Keys

id     Name             Address                     DoB              Gender  Relationship  Email
1      John Smith       34 West rd, Cambridge       2 Jan 1981       Male    Single        [email protected]
2      Thomas Anderson  Flat 303, 101 Red st, Zion  11 March 1962    Male    Single        [email protected]
...    ...              ...                         ...              ...     ...           ...
10001  Mia Wallace      20 Sunset rd, Carlsbad      10 October 1994  Female  Married       [email protected]

Page 35: 3 f6 8_databases

Normalization
• If a database has duplicated information then it is subject to update anomalies, and the information can become inconsistent. Imagine adding contact details to the ‘course’ table to allow lecturers to be contacted easily:

Title                       Leader   Lectures  Telephone
RISC Processors             Sanchez  8         65960
QAM for modems              Sanchez  34        65960
Introduction to Mainframes  Belford  20        65536
Low latency LCD screens     Richard  1         32768

• If the table is updated, for instance with the SQL command:

UPDATE course SET Leader='Libby' WHERE Title='RISC Processors'

• Then the contact details will become incorrect: the row now names Libby as leader but still carries Sanchez’s telephone number. The process of normalizing a database involves splitting up large tables with only weakly related information into a number of smaller tables. Normalized data is then accessed by joining tables together and performing selections on the results.
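The update anomaly can be reproduced in a few lines. A minimal sketch with Python's sqlite3 module follows; the final query, which looks for telephone numbers attached to more than one leader, is an illustrative consistency check and not part of the original notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (Title TEXT, Leader TEXT, Lectures INT, Telephone INT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?, ?)",
    [
        ("RISC Processors", "Sanchez", 8, 65960),
        ("QAM for modems", "Sanchez", 34, 65960),
        ("Introduction to Mainframes", "Belford", 20, 65536),
        ("Low latency LCD screens", "Richard", 1, 32768),
    ],
)

# Telephone numbers that belong to more than one leader (none to start with).
check = "SELECT Telephone FROM course GROUP BY Telephone HAVING COUNT(DISTINCT Leader) > 1"
before = conn.execute(check).fetchall()

# The update from the slide changes the leader but not the duplicated contact detail.
conn.execute("UPDATE course SET Leader = 'Libby' WHERE Title = 'RISC Processors'")

stale = conn.execute(
    "SELECT Leader, Telephone FROM course WHERE Title = 'RISC Processors'"
).fetchone()
after = conn.execute(check).fetchall()
```

After the update the same number is listed for both Libby and Sanchez, and nothing in the schema prevents it.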

Page 36: 3 f6 8_databases

Normalization

• The database above is not normalized because there is duplicated data. More intuitively, the telephone number has merely been inserted as a convenience and has nothing directly to do with courses.

• Much like type safety and object-oriented design, database normalization allows databases to be designed such that certain errors (for instance data inconsistency) are less likely.

• Normalization is the process of designing the database to comply with normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF and DKNF).

Page 37: 3 f6 8_databases

First Normal Form
• Make sure that your database really obeys the relational model:
- No ordering over rows
- No ordering over columns
- No duplicates

• Each row/column intersection contains exactly one datum

• Consider storing multiple phone numbers for the Leader. Both of the following designs are BAD:

BAD (several values in one cell)
Title  Lectures  ID   Numbers
···    8         456  65950, 60294, 70231
···    8         456  65950, 60294, 70231
···    34        20   65536
···    1         82   32768, 16384

BAD (repeated columns)
Title  Lectures  ID   Phone 1  Phone 2  Phone 3
···    8         456  65960    60294    70231
···    34        456  65960    60294    70231
···    20        9    65536
···    1         82   32768    16384

Page 38: 3 f6 8_databases

First Normal Form
• The employees table is used for details of course leaders

• A Phone column is added to store each employee’s contact details

• To support multiple phone numbers, the Name/ID data must be duplicated

• Note the use of IDs to avoid duplicates, as names make bad keys:

employees
Name     ID   Phone
Sanchez  456  65960
Belford  9    65536
Richard  82   32768
Sanchez  456  60294

• The list of phone numbers for the leader of a particular course can now be extracted using relational algebra:

$$\Pi_{Phone}(\sigma_{Title = \text{“RISC Processors”}}(course \bowtie employees))$$

or in SQL:

SELECT Phone FROM course NATURAL JOIN employees WHERE Title='RISC Processors'

Page 39: 3 f6 8_databases

Second Normal Form (2NF)
• A table is in second normal form if it satisfies:
- It is in first normal form (1NF).
- All non-prime attributes depend on the whole candidate key.

• From the previous example, the complete relation, employees(E), is:

employees
Name     ID   Phone
Sanchez  456  65960
Belford  9    65536
Richard  82   32768
Sanchez  456  60294
Sanchez  456  70231
Richard  82   16384

• The candidate key is C = (ID, Phone). The non-prime attribute is therefore E − C = (Name). The employees’ names do not depend on the phone number, only the ID. Therefore the table is not in 2NF.

• Lack of normalization allows buggy programs to create inconsistencies:
- Inserting the record (“Belford”, 10, 131072) leads to a mismatch between the name and ID.
- An employee name change requires updates across multiple rows, which may be done incorrectly. It also requires more locking.

Page 40: 3 f6 8_databases

Second Normal Form (2NF)
• A 2NF design is:

employee_names
Name     ID
Sanchez  456
Belford  9
Richard  82

contacts
ID   Phone
456  65960
9    65536
82   32768
456  60294
456  70231
82   16384

• ID is the Primary Key in employee_names

• Phone is the Primary Key in contacts

• ID is a Foreign Key in contacts connecting employee names and their phone numbers
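The two-table 2NF design can be exercised to see what it buys. A sketch using Python's sqlite3 module follows; the column types and the replacement name 'Sanchez-Libby' are assumptions made for illustration. SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE employee_names (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE contacts (
        Phone INTEGER PRIMARY KEY,
        ID INTEGER NOT NULL REFERENCES employee_names (ID)
    );
    INSERT INTO employee_names VALUES (456, 'Sanchez'), (9, 'Belford'), (82, 'Richard');
    INSERT INTO contacts VALUES
        (65960, 456), (65536, 9), (32768, 82), (60294, 456), (70231, 456), (16384, 82);
""")

# A name change now touches exactly one row ('Sanchez-Libby' is a made-up new name).
changed = conn.execute(
    "UPDATE employee_names SET Name = 'Sanchez-Libby' WHERE ID = 456"
).rowcount

# The original three-column view is recovered with a natural join on ID.
phones = conn.execute(
    "SELECT Name, ID, Phone FROM employee_names NATURAL JOIN contacts "
    "WHERE ID = 456 ORDER BY Phone"
).fetchall()

# The mismatched record ("Belford", 10, 131072) can no longer be stored:
# there is no employee with ID 10, so the foreign key rejects the phone number.
try:
    conn.execute("INSERT INTO contacts VALUES (131072, 10)")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True
```

All three of the employee's phone numbers pick up the new name through the join, without any of the contacts rows being rewritten.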

Page 41: 3 f6 8_databases

Third Normal Form (3NF)
• “I swear by Codd that each non-prime attribute shall depend upon the key, the whole key and nothing but the key.”

• More formally, a table over R is in 3NF if and only if:
- It is in 2NF (and therefore 1NF)
- Every non-prime attribute is directly (non-transitively) dependent on every candidate key of R

Practical              Date       Demonstrator  Contact  Pay rate
Acoustic coupling      Mon 1 Feb  Dade          45102    10
Acoustic coupling      Sat 7 Feb  Dade          45102    15
Self-propagating code  Tue 2 Mar  Joey          67822    10
Self-propagating code  Sun 9 Mar  Kate          62341    15

• The candidate key is (Practical, Date). However, the table is not fully normalized because there is repetition of data (the contact numbers and the pay rates). The table is not in 3NF because:
- Pay rate depends on the key, but not the whole key. Specifically, it only depends on the date.
- Contact depends upon the whole key, but the dependence is transitive, not direct: Contact depends on Demonstrator, which in turn depends on (Practical, Date). Written as functional dependencies: (Practical, Date) → Demonstrator → Contact

• Updating the date requires an update of the pay rate. Updating a demonstrator requires an update of the contact number.
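One way to remove both problems is to split the table along its dependencies. The decomposition below is a sketch and not taken from the slides: the table names `rota`, `demonstrators` and `rates`, and the column name `PayRate`, are assumptions. It uses Python's sqlite3 module and checks that joining the three tables gives back the original rows.

```python
import sqlite3

original = [
    ("Acoustic coupling", "Mon 1 Feb", "Dade", 45102, 10),
    ("Acoustic coupling", "Sat 7 Feb", "Dade", 45102, 15),
    ("Self-propagating code", "Tue 2 Mar", "Joey", 67822, 10),
    ("Self-propagating code", "Sun 9 Mar", "Kate", 62341, 15),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Who demonstrates which practical on which date: depends on the whole key.
    CREATE TABLE rota (Practical TEXT, Date TEXT, Demonstrator TEXT,
                       PRIMARY KEY (Practical, Date));
    -- Contact depends only on the demonstrator.
    CREATE TABLE demonstrators (Demonstrator TEXT PRIMARY KEY, Contact INT);
    -- Pay rate depends only on the date.
    CREATE TABLE rates (Date TEXT PRIMARY KEY, PayRate INT);
""")
conn.executemany("INSERT INTO rota VALUES (?, ?, ?)",
                 sorted({(p, d, who) for p, d, who, _, _ in original}))
conn.executemany("INSERT INTO demonstrators VALUES (?, ?)",
                 sorted({(who, contact) for _, _, who, contact, _ in original}))
conn.executemany("INSERT INTO rates VALUES (?, ?)",
                 sorted({(d, pay) for _, d, _, _, pay in original}))

# Joining the three tables reconstructs the original, unnormalized view.
rebuilt = conn.execute(
    "SELECT Practical, Date, Demonstrator, Contact, PayRate "
    "FROM rota NATURAL JOIN demonstrators NATURAL JOIN rates"
).fetchall()
contact_rows = conn.execute("SELECT COUNT(*) FROM demonstrators").fetchone()[0]
```

Dade's contact number is now stored once instead of twice, so changing it is a single-row update.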

Page 42: 3 f6 8_databases

SQL Constraints
• In addition to normal forms, which can be represented in relational algebra, SQL allows tables to be constructed with additional constraints which make the database more robust. Unlike normalization, constraints do not make it impossible to construct errors. However, providing invalid data that breaks a constraint causes the transaction to abort, rather than creating inconsistent data.

• Types of constraints
- NOT NULL – ensures that the value of this column cannot be omitted
- UNIQUE – ensures that the value of this column is unique
- PRIMARY/FOREIGN KEY – designates the column as a key

• NOT NULL prevents missing attributes (helpful for 1NF):

CREATE TABLE course (Name string NOT NULL, ...)

• A primary key can be specified. This will ensure that ID is unique, and therefore all rows are also unique:

CREATE TABLE people (Name string, ID int PRIMARY KEY)

• Known candidate keys can be marked as unique:

CREATE TABLE r (a, b, c, d, UNIQUE(a, b), UNIQUE(a, c, d))

• A particularly important constraint is FOREIGN KEY, which ensures that an attribute’s value appears as a primary key in another table:

CREATE TABLE course (Title string PRIMARY KEY, ID int, Lectures int, FOREIGN KEY (ID) REFERENCES employees)

• The ID of the course leader is now constrained to be a valid employee ID. The database will abort a transaction which attempts to add an invalid ID, or change an ID to an invalid one. Additionally, the database will abort any transaction which would invalidate an existing ID. For example, the database will not allow erasure of employees with courses still assigned.
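The effect of each constraint can be observed by attempting statements that violate it. The sketch below uses Python's sqlite3 module; the `employees` layout (ID as primary key) and the helper `rejected` are assumptions made for illustration, and in SQLite foreign keys are only enforced after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE employees (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE course (
        Title TEXT PRIMARY KEY,
        ID INTEGER,
        Lectures INTEGER CHECK (Lectures > 0),
        FOREIGN KEY (ID) REFERENCES employees (ID)
    );
    INSERT INTO employees VALUES (456, 'Sanchez');
    INSERT INTO course VALUES ('RISC Processors', 456, 8);
""")


def rejected(sql):
    """Run one statement in its own transaction; True if a constraint aborted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "unknown leader": rejected("INSERT INTO course VALUES ('QAM for modems', 999, 34)"),
    "zero lectures": rejected("INSERT INTO course VALUES ('Low latency LCD screens', 456, 0)"),
    "missing name": rejected("INSERT INTO employees VALUES (9, NULL)"),
    "duplicate title": rejected("INSERT INTO course VALUES ('RISC Processors', 456, 10)"),
    "delete leader in use": rejected("DELETE FROM employees WHERE ID = 456"),
    "valid insert": rejected("INSERT INTO employees VALUES (9, 'Belford')"),
}
```

Every invalid statement is rolled back and leaves the tables unchanged; only the last, valid insert goes through.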

Page 43: 3 f6 8_databases

Entity-Relationship (E/R) Modelling
• As in the Object Oriented approach, designing a database schema requires finding conceptual abstractions (that represent the data) and defining relationships between them

• Notation suggested by Peter Chen in “The Entity Relationship Model: Toward a Unified View of Data”, 1976
- UML can also be used

• Relationships have cardinality
- 1 to 1
- 1 to Many
- Many to Many etc.

[Figure: E/R diagram. Entity sets Employee (attribute Name) and Course (attributes Title, No. lectures) are connected by the relationship set Leads.]

Page 44: 3 f6 8_databases

Entity-Relationship (E/R) Modelling

[Figure: E/R diagram. Labels in the diagram: Employee (Number, Name), ISA, Mechanic, Salesman, Does, RepairJob (Number, Description, Cost, Parts, Work), Repairs, Car (License, Model, Year, Manufacturer), Buys (Price, Date, Value), Sells (Date, Value, Commission), Client (ID, Name, Phone, Address), with the roles buyer and seller.]

Pável Calado, http://www.texample.net/tikz/examples/entity-relationship-diagram/

Page 45: 3 f6 8_databases

Objects and Databases
• Relational Database Management Systems were mature, stable products by the 1980s

• The Object-Oriented approach reached wide adoption in the 1990s

• Any large software system still needs to persist data, and hence to store it in databases

• Question: how do we map Objects in a software system at runtime to Data stored in databases?

• Originally, two options emerged:
- Object-Relational Mapping (ORM) – a software layer that can provide database persistence to an OO system (e.g. Hibernate, TopLink) – commonly used
- Object Databases – a nice idea that failed to reach mainstream adoption

• More recently, further developments have included non-relational approaches (NoSQL) to working with large distributed datasets, e.g. Hadoop (hadoop.apache.org) and related projects
- Map/Reduce: distributed processing of large data sets on compute clusters
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Cassandra: a scalable multi-master database with no single points of failure
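The idea behind an object-relational mapping layer can be shown in miniature. The following is a hand-written sketch using Python's dataclasses and sqlite3 modules; it is not how Hibernate or TopLink work internally, and the `Employee` class and `EmployeeStore` helper are hypothetical names invented for this example.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Employee:
    """An ordinary runtime object; it knows nothing about the database."""
    id: int
    name: str


class EmployeeStore:
    """Maps Employee objects to rows of an 'employee' table and back."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def save(self, employee):
        # Object -> row: the fields become column values, keyed by id.
        self.conn.execute("INSERT OR REPLACE INTO employee VALUES (?, ?)", astuple(employee))

    def load(self, employee_id):
        # Row -> object: returns None when no such row exists.
        row = self.conn.execute(
            "SELECT id, name FROM employee WHERE id = ?", (employee_id,)
        ).fetchone()
        return Employee(*row) if row else None


store = EmployeeStore(sqlite3.connect(":memory:"))
store.save(Employee(456, "Sanchez"))
loaded = store.load(456)
```

Real ORM libraries generate this kind of mapping code from class definitions or configuration, and also handle relationships between objects, caching and transactions.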

Page 46: 3 f6 8_databases

No(t only)SQL at guardian.co.uk
• The Guardian online, 1999

[Figure: architecture diagram, “Guardian journalism online: 1999”. Components shown: three web servers, three app servers, Memcached (20Gb), Oracle, CMS, data feeds.]

Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation

Page 47: 3 f6 8_databases

No(t only)SQL at guardian.co.uk
• The Guardian online, 2010

[Figure: architecture diagram, “Guardian journalism online: 2010”. Components shown: web servers, app server, Memcached (20Gb), core, rdbms, CouchDB?, CMS, data feeds, M/Q (in, out), proxy, several Solr instances, several apps, cloud (EC2), external hosting (app engine etc).]

Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation

Page 48: 3 f6 8_databases

Security and SQL Injection
• Consider the following example

// allowing a user to search by product name
string name;
cout << "Enter product name:" << endl;
getline(cin, name);
// the user's input is pasted directly into the SQL text
string query = "SELECT * FROM products WHERE name=\"" + name + "\"";
do_sql(query);

• What happens if the user enters: " ; DROP TABLE products; --

• The query becomes

-- going to delete the table products
SELECT * FROM products WHERE name="" ; DROP TABLE products; -- "

• SQL Injection could also be used to steal data from a database (news.bbc.co.uk/1/hi/8206305.stm)
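The usual defence is to keep user input out of the SQL text altogether by using a parameterized query. The sketch below uses Python's sqlite3 module rather than the C++ of the slide; the table contents and the function name `search` are made up for illustration. The vulnerable string is built only to show what it looks like and is never executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT);
    INSERT INTO products VALUES ('widget'), ('gadget');
""")

attack = '" ; DROP TABLE products; --'

# Vulnerable pattern: the input becomes part of the SQL text itself.
vulnerable = 'SELECT * FROM products WHERE name="' + attack + '"'


def search(name):
    """Safe pattern: the '?' placeholder passes the input as data, never as SQL."""
    return conn.execute("SELECT * FROM products WHERE name = ?", (name,)).fetchall()


normal = search("widget")
attacked = search(attack)
remaining = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

With the placeholder, the attack string is simply compared against product names, matches nothing, and the table survives.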