3 f6 8_databases
© 2012 Elena Punskaya, Cambridge University Engineering Department
Big Data
• Facebook stores over 100 petabytes of media (photos and videos) uploaded by its 845 million users
• There are 762 billion objects stored in Amazon S3, which processes over 500,000 requests per second for these objects at peak times
Credits: aws.typepad.com (Amazon); Bryce Durbin, TechCrunch
Big Data
• Storing large amounts of data requires managing complexity:
  - mapping the real world to data
  - providing concurrent access for creating, reading and changing the data
  - providing distributed access to and storage of the data
• Database Management Systems decouple the business logic of applications working with data from the details of physical storage and transaction (operations on data) management
• Any non-trivial system needs to store its application data:
  - user/password
  - credit cards
  - product information
  - health records ...
• It is possible to store all data directly as files, but a typical filesystem isn’t built for transaction management and high performance
[Image: Cambridge High Performance Computing Cluster Darwin]
Transaction Processing
• Databases are a common component of many distributed systems. They store records for a large number of distinct entities and will typically support a small set of operations to access and manipulate those entities. These operations can be assumed to be atomic, i.e. they cannot be interrupted.
• External clients execute transactions, which are sequences of operations applied to one or more database entities, designed to achieve a single logical effect.
• The transaction manager ensures that transactions appear atomic to clients. The client receives an acknowledgement of every successful transaction.
Database Systems I
[Figure: Client A and Client B send transactions (TA, TB) to the Transaction Manager, which applies atomic operations to the Database and keeps a Recovery Log]
Example: Bank Transfer
• Each account is represented by a different database object, which guarantees that each operation is atomic
• A key issue is: what happens if there is a failure part-way through the transaction?
#include <string>
using std::string;

class Account {
    // link to required account records
    DBaseAccessInfo dbinfo;
public:
    // Constructor - open an account
    Account(string account_name);
    // Atomic operations
    void debit(float amount);
    void credit(float amount);
    float read_balance();
};

// A typical transaction would be
void transfer(Account& A, Account& B, float amount) {
    float balance = A.read_balance();
    if (balance >= amount) {
        A.debit(amount);
        B.credit(amount);
    }
}
System Crash
• What happens if the system crashes in the middle of a transaction?
• Account A will have had its money debited, but the money will never appear in account B: an invalid state
• The transaction manager (or any transaction processing system) must have a means of recovering from errors, and of always leaving the system in a valid state
• Need to ensure that the Credit/Debit pair is ATOMIC, i.e. it can only be performed as a WHOLE, not in parts
// a typical transaction
void transfer(Account& A, Account& B, float amount) {
    float balance = A.read_balance();
    if (balance >= amount) {
        A.debit(amount);    // <----- CRASH!
ACID
• A transaction may fail in many different ways (e.g. two clients try to access the same entity at the same time, temporary network failure, software fault, disk crash, etc.). The transaction processor tries to ensure that transactions have the following properties:
• Atomicity
  - Either all or none of the transaction’s operations are performed
• Consistency
  - Transactions transform the system from one consistent state to another
• Isolation
  - An incomplete transaction cannot reveal its result to other transactions before it is complete
• Durability
  - Once the transaction is committed, the system must guarantee that the results of its operations will persist, even if there are subsequent system failures
Recovery
• In order to maintain the ACID properties, a transaction processor must be able to recover from errors by restoring the system to a consistent state.
• To achieve this, transactions are modelled on the following state machine:
Example: the transfer transaction
// A typical transaction with commit
void transfer(Account& A, Account& B, float amount)
{
    // Record transaction start (declared outside try so that catch can use id)
    int id = BeginTransaction();
    try {
        float balance = A.read_balance();
        if (balance >= amount) {
            A.debit(amount);
            B.credit(amount);
        }
        Commit(id);    // success, so commit (finish)
    }
    catch (...) {
        Abort(id);     // transaction failed, so recover/revert (undo)
    }
}
Note: the transaction processor might invalidate this transaction (see last slide).
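One way to picture what Abort has to do is an undo log that remembers the old value of every record a transaction overwrites. The sketch below is only an illustration under stated assumptions: the `TransactionManager` class, the `Database` map, the `transfer` signature, the `crash_after_debit` flag and the use of whole-number amounts are all hypothetical, and the log lives in memory, so it shows rollback but not durability.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory "database": account name -> balance (whole units).
using Database = std::map<std::string, long>;

// Hypothetical transaction manager that keeps an undo log of old values.
class TransactionManager {
public:
    explicit TransactionManager(Database& db) : db_(db) {}

    void begin() { undo_log_.clear(); }

    // Record the old value before overwriting it, so abort() can restore it.
    void write(const std::string& account, long new_balance) {
        undo_log_.push_back({account, db_.at(account)});
        db_[account] = new_balance;
    }

    // Success: the changes stay, and the undo information is discarded.
    void commit() { undo_log_.clear(); }

    // Failure: restore the old values in reverse order.
    void abort() {
        for (auto it = undo_log_.rbegin(); it != undo_log_.rend(); ++it)
            db_[it->first] = it->second;
        undo_log_.clear();
    }

private:
    Database& db_;
    std::vector<std::pair<std::string, long>> undo_log_;
};

// Transfer with an optional simulated failure between debit and credit.
bool transfer(Database& db, const std::string& from, const std::string& to,
              long amount, bool crash_after_debit = false) {
    TransactionManager tm(db);
    tm.begin();
    try {
        if (db.at(from) < amount) { tm.abort(); return false; }
        tm.write(from, db.at(from) - amount);
        if (crash_after_debit) throw std::runtime_error("simulated failure");
        tm.write(to, db.at(to) + amount);
        tm.commit();
        return true;
    } catch (...) {
        tm.abort();   // leaves both accounts exactly as they were
        return false;
    }
}
```

With the simulated failure switched on, the debit is rolled back and both balances are unchanged, which is the all-or-nothing behaviour that atomicity asks for. A real transaction processor would write the log to stable storage before changing the data.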
Concurrency
• In practice, a database transaction processor will be receiving a stream of transaction requests, and will need to execute transactions in parallel in order to provide acceptable response times.
• When two transactions reference the same account, uncontrolled interleaving of operations can produce an incorrect result. There are three classes of concurrency problem:
• The uncommitted dependency problem

  Time   Transaction 1   Transaction 2
  t1     –               A.write()
  t2     A.read()        –
  t3     –               abort()

  In this case, transaction 1 reads an updated account value, but transaction 2 aborts, undoing the effect of the update. Transaction 1 is then left holding an incorrect account value.

  Note: A.read() indicates any operation which reads a value from account A but does not change it (e.g. A.read_balance()); A.write() indicates any operation which changes account A (e.g. A.credit() or A.debit()).
Engineering Part IIA: 3F6 - Software Engineering and Design
Concurrency
• The lost update problem

  Time   Transaction 1   Transaction 2
  t1     A.read()        –
  t2     –               A.read()
  t3     A.write()       –
  t4     –               A.write()

  In this case, the change made to account A at t3 by transaction 1 is lost because it is overwritten at time t4 by transaction 2.

• The inconsistent analysis problem

  Time   Transaction 1   Transaction 2
  t1     A.read()        –
  t2     –               A.read()
  t3     –               A.write()
  t4     –               commit()

  In this case, transaction 2 updates account A after transaction 1 has read its value. Hence, transaction 1 is left holding an incorrect value for account A.
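The lost update interleaving can be replayed deterministically in a single thread, which makes the effect easy to see. This is a minimal sketch: the `Account` struct and the `replay_lost_update` function are hypothetical names, and both transactions are assumed to be simple deposits.

```cpp
#include <cassert>

struct Account { long balance; };

// Replays the steps t1..t4 of the lost update table in order.
long replay_lost_update(long initial, long deposit1, long deposit2) {
    Account A{initial};
    long seen_by_t1 = A.balance;          // t1: transaction 1 reads A
    long seen_by_t2 = A.balance;          // t2: transaction 2 reads A
    A.balance = seen_by_t1 + deposit1;    // t3: transaction 1 writes A
    A.balance = seen_by_t2 + deposit2;    // t4: transaction 2 overwrites A
    return A.balance;                     // deposit1 has been lost
}
```

Starting from 100 with deposits of 50 and 20, running the two transactions one after the other would give 170, whereas this interleaving leaves 120.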
Managing Concurrency
• The problems discussed can be managed by applying pessimistic or optimistic concurrency control
• Pessimistic
  - When a transaction wishes to access an account, it first secures a lock on that account; when it has finished, it releases the lock. If a lock is already taken, the transaction must wait until it is released.
  - Locking could be on the whole table or on a single row, and could be declared at different levels of exclusivity (e.g. no one else can access the data, or some access is allowed)
  - Could cause deadlocks, e.g. T1 and T2 both require two resources R1 and R2 to proceed:
    ‣ T1 holds R1 and is waiting for R2
    ‣ T2 holds R2 and is waiting for R1
  - Useful when there is a lot of data that is often updated by many users
• Optimistic
  - Allows uncontrolled access to accounts, and then simply aborts any transactions which might have suffered a conflict
  - Implemented by creating a new copy of the data that may be updated and, when the update is completed, checking whether the master copy has changed in the meantime:
    ‣ if changed: abort
    ‣ if not: complete
  - Useful when most operations are reading data and changes occur rarely
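Both strategies can be sketched in a few lines of C++. The code below is an illustration only: the types `LockedAccount`, `VersionedAccount` and `Snapshot` and the functions `transfer_locked`, `read_snapshot` and `try_commit` are hypothetical names, and the version counter is one assumed way of detecting that the master copy has changed. The pessimistic part uses `std::lock` from the standard library, which acquires several mutexes with a deadlock-avoidance algorithm, so the T1/T2 situation above cannot arise between two opposite transfers.

```cpp
#include <cassert>
#include <mutex>

// --- Pessimistic: take locks before touching the data ---
struct LockedAccount {
    std::mutex m;
    long balance = 0;
};

bool transfer_locked(LockedAccount& from, LockedAccount& to, long amount) {
    if (&from == &to) return false;    // never lock the same mutex twice
    std::lock(from.m, to.m);           // acquires both without deadlock
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    if (from.balance < amount) return false;   // guards release the locks
    from.balance -= amount;
    to.balance += amount;
    return true;
}

// --- Optimistic: work on a copy, check the master copy at commit time ---
struct VersionedAccount {
    long balance = 0;
    unsigned version = 0;    // incremented on every committed change
};

struct Snapshot {
    long balance;
    unsigned version;
};

Snapshot read_snapshot(const VersionedAccount& a) {
    return {a.balance, a.version};
}

// Single-threaded sketch: a real system must make this check-and-write
// step itself atomic.
bool try_commit(VersionedAccount& master, const Snapshot& snap,
                long new_balance) {
    if (master.version != snap.version) return false;   // changed: abort
    master.balance = new_balance;                        // unchanged: complete
    ++master.version;
    return true;
}
```

If two transactions take a snapshot of the same account and both try to commit, the first succeeds and the second is rejected because the version has moved on; the rejected transaction would then be retried from a fresh snapshot.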
Relational Databases
• By the late 1960s, the “Software Crisis” had already been declared, and data storage wasn’t doing much better
• In 1970, Edgar Codd, an English mathematician working for IBM, published the paper “A Relational Model of Data for Large Shared Data Banks”. It started with the following words: “Future users of large data banks must be protected from having to know how the data is organised in the machine...”
[Image: Edgar 'Ted' Codd, 1923-2003; image © IBM]
Computer calculations cost hundreds of dollars a minute, so great human effort was spent to make programs as efficient as possible before they were run. Early databases used either a rigid hierarchical structure or a complex navigational plan of pointers to the physical locations of the data on magnetic tapes. Teams of programmers were needed to express queries to extract meaningful information. While such databases could be efficient in handling the specific data and queries they were designed for, they were absolutely inflexible. New types of queries required complex reprogramming, and adding new types of data forced a total redesign of the database itself.
IBM Research News, www.research.ibm.com/resources/news/20030423_edgarpassaway.shtml
Relational Databases
• Codd suggested moving away from the hierarchical or navigational structure of early databases to simple tables with rows and columns
• Based on Relational Algebra, this approach greatly simplified database queries (the ability to access and analyse data)
• Many relational database management systems (RDBMS) are accessed using SQL (Structured Query Language)
  - SQL is defined by industry standards and has been developed over many revisions from SQL-87 to SQL 2008
• There are many free and commercial databases available:
  - Free: PostgreSQL, MySQL, SQLite...
  - Commercial: Oracle, DB2, SQL Server...
• SQLite is the easiest database to start using, as it requires no setup and is available on the teaching system.
  - Type: sqlite3 <db-name>
  - Then enter SQL commands, each followed by a ‘;’. The database will be stored in a file called <db-name>, which will be created if it does not already exist.
The Relational Model
• The relational model is related to set theory. A relation is a table. A relation contains a set of tuples (rows).

  relation: course
  scheme   Title                        Leader       Lectures
  t1       RISC Processors              Sanchez      8
  t2       QAM for modems               Sanchez      34
  t3       Introduction to Mainframes   Belford      20
  t4       Fast refresh LCDs            Richard      1
  t5       t5[Title]                    t5[Leader]   t5[Lectures]
  t6       t6[Title, Leader]

  (Rows t5 and t6 show the notation for the elements of a tuple.)

  - The meaning of the data is described by the scheme, which is a set of column names. Column names are known as attributes.

    course scheme = (Title, Leader, Lectures)

  - There is no ordering or grouping of attributes. The table is a relation over this scheme. A relation r over a scheme R is written as r(R). Each column has a domain, D. So:

    $D_{\text{Title}} = \text{strings}, \quad D_{\text{Lectures}} = \mathbb{Z}^+$

    So each element $t_i[j] \in D_j$ and $t_i \in D_1 \times D_2 \times \cdots \times D_n$.

  - For example, for the scheme (x, y) with $D_x = D_y = \mathbb{R}$, the domain of the tuples is the domain of all two-dimensional vectors.

SQL (in CREATE TABLE, the column types such as text and int give the domains, and CHECK(...) is a constraint):

  CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0))
  INSERT INTO course VALUES ('RISC Processors', 'Sanchez', 10)
  UPDATE course SET Lectures=8 WHERE Leader='Sanchez'
  DELETE FROM course WHERE Lectures=8 AND Leader='Sanchez'
  DROP TABLE course

SQL allows the domain of tuples to be constrained: $D_t \subseteq D_1 \times D_2 \times \cdots \times D_n$.
The Relational Model
• Built on the principles of Relational Algebra
  - Projection, Selection, Union, Intersection, Subtraction, Join
Database Systems II
Relational algebra: Projection Π
The projection operator, Π, removes columns by listing the ones to be retained. The operator is written as:

  $\Pi_{\text{column1}, \text{column2}, \ldots}(\text{relation})$

An example of applying projection is:

  $\Pi_{\text{Leader}, \text{Lectures}}(\text{course})$ =

  Leader    Lectures
  Sanchez   8
  Sanchez   34
  Belford   20
  Richard   1

Consider a relation r(R), where R = (x, y, z) and $x, y, z \in \mathbb{R}$. Each row represents a 3D vector. The relation $\Pi_{x,y}(r)$ contains the projection of the vectors onto the x, y plane.

In SQL the SELECT statement performs all of the primitive relational algebra functionality. The projection above is rendered as:

  SELECT Leader, Lectures FROM course

The general form being:

  SELECT Col1[, Col2[, ...]] FROM table

Note that SQL is not entirely relational, and the result of the expression:

  SELECT Leader FROM course

has duplicate rows. To remove duplicates, use:

  SELECT DISTINCT Leader FROM course

There is a shorthand for the identity projection:

  SELECT * FROM table
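The difference between a plain SELECT and SELECT DISTINCT can be mimicked in C++ by collecting the projected column into a sequence (duplicates kept) or into a set (duplicates dropped). This is a rough sketch; the `Course` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One tuple of the course relation.
struct Course {
    std::string title;
    std::string leader;
    int lectures;
};

// Like SELECT Leader FROM course: one output row per input row.
std::vector<std::string> project_leader(const std::vector<Course>& course) {
    std::vector<std::string> out;
    for (const Course& t : course) out.push_back(t.leader);
    return out;
}

// Like SELECT DISTINCT Leader FROM course: a set holds each value once,
// which matches the relational (set-based) meaning of projection.
std::set<std::string> project_leader_distinct(const std::vector<Course>& course) {
    std::set<std::string> out;
    for (const Course& t : course) out.insert(t.leader);
    return out;
}
```

Applied to the four-row course table, the first function returns four leaders (Sanchez twice) and the second returns three.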
Relational algebra: Selection σ
The selection operator accepts a predicate, Θ, and a relation. Rows matching the predicate are retained:

  $\sigma_{\text{Leader}=\text{"Sanchez"}}(\text{course})$ =

  Title             Leader    Lectures
  RISC Processors   Sanchez   8
  QAM for modems    Sanchez   34

The general form of the resulting relation can be written in set-builder notation:

  $\sigma_\Theta(r) = \{\, t \mid t \in r,\ \Theta(t) \,\}$

That is, the result consists of all tuples t that are in the relation r and for which the predicate applied to the tuple, i.e. Θ(t), is true.

In SQL, selection is also performed with the SELECT statement, with the predicate being specified by the WHERE clause:

  SELECT * FROM course WHERE Leader='Sanchez'

Predicates can contain expressions involving any or all of the columns. SQL has more or less the same set of numeric operators as C, and also AND, OR, NOT, BETWEEN:

  SELECT * FROM course WHERE Lectures BETWEEN 2 AND 10

and IN:

  WHERE Leader IN ('Belford', 'Richard')

Projection and selection can be readily composed, so in general:

  $\Pi_S(\sigma_\Theta(r))$ translates to SELECT S FROM r WHERE Θ
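The set-builder definition of selection maps directly onto a filter that keeps the tuples for which a predicate returns true. The sketch below assumes a hypothetical `Course` struct and hypothetical function names; composing `project_title` with `select_rows` corresponds to SELECT Title FROM course WHERE Θ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One tuple of the course relation.
struct Course {
    std::string title;
    std::string leader;
    int lectures;
};

// Selection: keep every tuple t of r for which theta(t) is true.
std::vector<Course> select_rows(const std::vector<Course>& r,
                                const std::function<bool(const Course&)>& theta) {
    std::vector<Course> out;
    for (const Course& t : r)
        if (theta(t)) out.push_back(t);
    return out;
}

// Projection onto the Title column.
std::vector<std::string> project_title(const std::vector<Course>& r) {
    std::vector<std::string> out;
    for (const Course& t : r) out.push_back(t.title);
    return out;
}
```

A predicate such as `t.lectures >= 2 && t.lectures <= 10` plays the role of BETWEEN 2 AND 10, which includes both end points.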
Union, intersection, subtraction
• In SQL, union, intersection and subtraction behave much more like set theory than relational algebra. For these operations it is the order of the attributes, not the names of the attributes, which has significance.
• Set union, ∪, aggregates the rows of two sets together. If there are two relations, r(R) and s(R), then the union, r ∪ s, can be computed:

  SELECT * FROM r UNION SELECT * FROM s

• Likewise, intersection can be computed using:

  SELECT * FROM r INTERSECT SELECT * FROM s

• Set differencing is either MINUS or EXCEPT depending on the database:

  SELECT * FROM r EXCEPT SELECT * FROM s

  [Figure: Venn diagram of the relations r and s]

• Since ordering, not naming, matters, with the schemes R = (a, b), S = (b, a) and the tables r(R), s(S):

  r          s
  a   b      b   a
  1   2      3   5
  3   4      1   2

  r - s =
  a   b
  3   4
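Because these operations compare rows by position, a row can be modelled as a plain pair of values with no column names at all. The sketch below reproduces the r - s example under that assumption; `Row`, `Relation` and `difference` are hypothetical names, and `std::set_difference` is the standard library algorithm for sorted ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>

using Row = std::pair<int, int>;   // values by position; names are ignored
using Relation = std::set<Row>;    // a set, so there are no duplicate rows

// Rows of r that do not appear in s (like r EXCEPT s).
Relation difference(const Relation& r, const Relation& s) {
    Relation out;
    std::set_difference(r.begin(), r.end(), s.begin(), s.end(),
                        std::inserter(out, out.end()));
    return out;
}
```

With r = {(1, 2), (3, 4)} and s = {(3, 5), (1, 2)}, the row (1, 2) is removed even though s labels its columns (b, a), and only (3, 4) remains.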
Join / cartesian product ×
• The cartesian product is the only primitive operator which combines two tables with different schemes. Joining two relations, a × b, generates a new relation with every row in a paired with every row in b. Joining is very useful for extracting related information.
• Joining students and labs:

  students                 labs
  Student   Supervisor     Lab     Demonstrator
  Gibson    Sanchez        3F27    Cook
  Murphy    Belford        3F89    Libby
  Libby     Goldstein      4F185   Margo
  Cook      Sanchez        3F34    Ray

The table students × labs is shown below. Note that the attributes get augmented with the table name to avoid ambiguity. The table name may be omitted if it is not ambiguous. SQL:

  SELECT * FROM students, labs

Find all students of “Sanchez” who are demonstrating:

  $\Pi_{\text{Student}}(\sigma_{\text{Student}=\text{Demonstrator} \wedge \text{Supervisor}=\text{"Sanchez"}}(\text{students} \times \text{labs}))$

  SELECT Student FROM students, labs
  WHERE Student=Demonstrator AND Supervisor='Sanchez'

The result is Cook. Selection is often composed with joining, so this combination is given its own non-primitive operator, the theta join:

  $a \bowtie_\Theta b \equiv \sigma_\Theta(a \times b)$

students × labs

  students.Student   students.Supervisor   labs.Lab   labs.Demonstrator
  Gibson             Sanchez               3F27       Cook
  Gibson             Sanchez               3F89       Libby
  Gibson             Sanchez               4F185      Margo
  Gibson             Sanchez               3F34       Ray
  Murphy             Belford               3F27       Cook
  Murphy             Belford               3F89       Libby
  Murphy             Belford               4F185      Margo
  Murphy             Belford               3F34       Ray
  Libby              Goldstein             3F27       Cook
  Libby              Goldstein             3F89       Libby
  Libby              Goldstein             4F185      Margo
  Libby              Goldstein             3F34       Ray
  Cook               Sanchez               3F27       Cook
  Cook               Sanchez               3F89       Libby
  Cook               Sanchez               4F185      Margo
  Cook               Sanchez               3F34       Ray
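The definition of the theta join as a selection over a cartesian product can be written almost literally as two nested loops followed by a filter. This is a naive sketch for illustration, with hypothetical struct and function names; real database engines use far more efficient join algorithms.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct Student { std::string student; std::string supervisor; };
struct Lab     { std::string lab;     std::string demonstrator; };

// One row of students x labs: the attributes of both tuples side by side.
struct Joined  { Student s; Lab l; };

// Cartesian product: every row of a paired with every row of b.
std::vector<Joined> cartesian_product(const std::vector<Student>& a,
                                      const std::vector<Lab>& b) {
    std::vector<Joined> out;
    for (const Student& s : a)
        for (const Lab& l : b)
            out.push_back(Joined{s, l});
    return out;
}

// Theta join: selection with predicate theta applied to the product.
std::vector<Joined> theta_join(const std::vector<Student>& a,
                               const std::vector<Lab>& b,
                               const std::function<bool(const Joined&)>& theta) {
    std::vector<Joined> out;
    for (const Joined& row : cartesian_product(a, b))
        if (theta(row)) out.push_back(row);
    return out;
}
```

With the students and labs tables above, the product has 4 × 4 = 16 rows, and the predicate Student = Demonstrator AND Supervisor = "Sanchez" keeps the single row for Cook.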
Natural Join ⋈
• A ‘natural join’ is a join followed by some selection and projection:
  - Perform a join
  - Perform selection so that attributes with the same name must be equal
  - Perform projection to remove duplicated attributes
  Note that there are no attribute ambiguities.
• If attributes with the same name are semantically the same, then the natural join is usually the correct kind of join to use. In addition to the ‘labs’ table, we also have a table listing lab sessions:

  sessions
  Lab     Title
  3F27    Mainframe filesystems
  3F27    Filesystem security
  3F89    Large vehicle control
  4F185   Networks for finance systems
  3F34    Magnetic storage forensics

The natural join matches up the shared attributes:

  sessions ⋈ labs =

  Demonstrator   Lab     Title
  Cook           3F27    Filesystem security
  Cook           3F27    Mainframe filesystems
  Libby          3F89    Large vehicle control
  Margo          4F185   Networks for finance systems
  Ray            3F34    Magnetic storage forensics
More formally:
There are two relations r(R) and s(S).
The set of shared attributes is A:
A = {A1, · · · , An} = R ∩ S
where n = |A|. The set of all attributes with no duplicates is:
R ∪ S.
The natural join is therefore:
r �� s ≡ ΠR ∪ Sσr.A1=s.A1∧···∧r.An=s.An(r× s)
In SQL, natural joins are performed with NATURAL JOIN:
SELECT * FROM sessions NATURAL JOIN labs
In practice, you will usually design databases by considering the
type of data, how it is stored in tables and how to extract the
relevant information. Relation algebra will not crop up much in
day-to-day design, but it is essential for understanding how the
various operations in a relational database work.
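The formal definition can be followed step by step in a few lines of Python. This is only a sketch: relations are modelled as lists of dicts, the helper name `natural_join` is made up for illustration, and the rows of the 'labs' table are an assumption inferred from the join result shown earlier.

```python
from itertools import product


def natural_join(r, s):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    if not r or not s:
        return []
    shared = set(r[0]) & set(s[0])  # A = R ∩ S
    joined = []
    for row_r, row_s in product(r, s):  # Cartesian product r × s
        # Selection: every shared attribute must have the same value in both rows.
        if all(row_r[a] == row_s[a] for a in shared):
            # Projection onto R ∪ S: merging the dicts keeps one copy of each attribute.
            joined.append({**row_r, **row_s})
    return joined


sessions = [
    {"Lab": "3F27", "Title": "Mainframe filesystems"},
    {"Lab": "3F27", "Title": "Filesystem security"},
    {"Lab": "3F89", "Title": "Large vehicle control"},
    {"Lab": "4F185", "Title": "Networks for finance systems"},
    {"Lab": "3F34", "Title": "Magnetic storage forensics"},
]

# Assumed contents of 'labs' (not listed in this section).
labs = [
    {"Demonstrator": "Cook", "Lab": "3F27"},
    {"Demonstrator": "Libby", "Lab": "3F89"},
    {"Demonstrator": "Margo", "Lab": "4F185"},
    {"Demonstrator": "Ray", "Lab": "3F34"},
]

result = natural_join(sessions, labs)
```

Because the two relations share only the Lab attribute, each session row is paired with the demonstrator of the same lab, and the Lab column appears once in every output row.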
Example
• Let's consider an example of a movie database for LOVEFiLM.com
• It is likely to have a table like this:

movie
Title                 Year  Actor
Pulp Fiction          1994  John Travolta
Hackers               1995  Angelina Jolie
The Matrix            1999  Keanu Reeves
The Devil's Advocate  1997  Keanu Reeves

• SQL:
CREATE TABLE movie (Title text, Year int, Actor text)
INSERT INTO movie VALUES ("Pulp Fiction", 1994, "John Travolta")
INSERT INTO movie VALUES ("Hackers", 1995, "Angelina Jolie")
etc.
SQL, with the domain (text, int) and constraint (CHECK) parts of the table definition marked:
CREATE TABLE course (Title text, Leader text, Lectures int, CHECK(Lectures > 0))
INSERT INTO course VALUES ("RISC Processors", "Sanchez", 10)
UPDATE course SET Lectures=8 WHERE Leader="Sanchez"
DELETE FROM cour...
DROP TABLE course
Example
• Projection
SELECT Actor FROM movie

Actor
John Travolta
Angelina Jolie
Keanu Reeves
Keanu Reeves

• Selection
SELECT * FROM movie WHERE Actor="Keanu Reeves"

movie
Title                 Year  Actor
The Matrix            1999  Keanu Reeves
The Devil's Advocate  1997  Keanu Reeves

• Projection and Selection composed
SELECT Title FROM movie WHERE Actor="Keanu Reeves"

Title
The Matrix
The Devil's Advocate

• Distinct
SELECT DISTINCT Actor FROM movie

Actor
John Travolta
Angelina Jolie
Keanu Reeves
Example
• Selection may use AND, OR, NOT, BETWEEN, IN, etc.
SELECT * FROM movie WHERE Year BETWEEN 1995 AND 1997
(BETWEEN 1995 AND 1997 is inclusive)

movie
Title                 Year  Actor
Hackers               1995  Angelina Jolie
The Devil's Advocate  1997  Keanu Reeves
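The queries on the movie table can be tried out with Python's built-in sqlite3 module. This is a minimal sketch assuming an in-memory SQLite database; the helper `run` is introduced here purely for illustration, and other database systems may differ in details such as quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE movie (Title TEXT, Year INT, Actor TEXT)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [
        ("Pulp Fiction", 1994, "John Travolta"),
        ("Hackers", 1995, "Angelina Jolie"),
        ("The Matrix", 1999, "Keanu Reeves"),
        ("The Devil's Advocate", 1997, "Keanu Reeves"),
    ],
)


def run(sql, params=()):
    """Run one query and return all result rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()


# Projection: keep one column, duplicates included.
projection = run("SELECT Actor FROM movie")
# Selection: keep the rows that satisfy a condition.
selection = run("SELECT * FROM movie WHERE Actor = ?", ("Keanu Reeves",))
# Projection and selection composed.
composed = run("SELECT Title FROM movie WHERE Actor = ?", ("Keanu Reeves",))
# DISTINCT removes duplicate rows from the projection.
distinct = run("SELECT DISTINCT Actor FROM movie")
# BETWEEN includes both end points.
between = run("SELECT * FROM movie WHERE Year BETWEEN 1995 AND 1997")
```

The projection returns four rows while the DISTINCT version returns three, which is the difference between SQL's default multiset behaviour and the set semantics of relational algebra.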
Example
• Let us now take a simplified table

movie
Title         Actor
Pulp Fiction  John Travolta
Hackers       Angelina Jolie

• Imagine we also have some information about the number of Oscars won

people
Actor           Oscars
John Travolta   0
Angelina Jolie  1
Example
• Cartesian product
SELECT Title, movie.Actor, people.Actor, Oscars FROM movie, people
- the only operation that can create new records (if one doesn't count renaming)
- BUT it creates too many records!

movie × people
movie.Title   movie.Actor     people.Actor    people.Oscars
Pulp Fiction  John Travolta   John Travolta   0
Pulp Fiction  John Travolta   Angelina Jolie  1
Hackers       Angelina Jolie  John Travolta   0
Hackers       Angelina Jolie  Angelina Jolie  1

• Natural join would tell us whether there are Oscar-winning actors in the movie
SELECT * FROM movie, people WHERE movie.Actor = people.Actor
(this form returns the Actor column twice) or
SELECT * FROM movie NATURAL JOIN people

Title         Actor           Oscars
Pulp Fiction  John Travolta   0
Hackers       Angelina Jolie  1
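The difference between the Cartesian product and the two join forms can be checked with a short sqlite3 script. This is a sketch assuming SQLite's behaviour, where NATURAL JOIN keeps a single copy of the shared Actor column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (Title TEXT, Actor TEXT)")
conn.execute("CREATE TABLE people (Actor TEXT, Oscars INT)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("Pulp Fiction", "John Travolta"), ("Hackers", "Angelina Jolie")],
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("John Travolta", 0), ("Angelina Jolie", 1)],
)

# Cartesian product: every movie row paired with every people row.
cartesian = conn.execute(
    "SELECT Title, movie.Actor, people.Actor, Oscars FROM movie, people"
).fetchall()

# Product followed by a selection: the right rows, but Actor appears twice.
where_join = conn.execute(
    "SELECT * FROM movie, people WHERE movie.Actor = people.Actor"
).fetchall()

# Natural join: same rows, with the duplicated Actor column projected away.
natural = conn.execute("SELECT * FROM movie NATURAL JOIN people").fetchall()
```

With two rows in each table the product has four rows, of which only the two with matching actors survive the join.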
Example
• Let us consider two tables with Oscar and BAFTA nominations

Oscar
John Travolta   Pulp Fiction
Angelina Jolie  Girl, Interrupted
Angelina Jolie  Changeling

BAFTA
John Travolta    Pulp Fiction
Angelina Jolie   Changeling
Jesse Eisenberg  The Social Network

• Union
(SELECT * FROM Oscar) UNION (SELECT * FROM BAFTA)

Oscar ∪ BAFTA
John Travolta    Pulp Fiction
Angelina Jolie   Girl, Interrupted
Angelina Jolie   Changeling
Jesse Eisenberg  The Social Network
Example
• Intersection
(SELECT * FROM Oscar) INTERSECT (SELECT * FROM BAFTA)

Oscar ∩ BAFTA
John Travolta   Pulp Fiction
Angelina Jolie  Changeling

• Difference
(SELECT * FROM Oscar) EXCEPT (SELECT * FROM BAFTA)

Oscar – BAFTA
Angelina Jolie  Girl, Interrupted

NOTE: some operators are treated differently in different databases, and some may not be present
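The three set operators can be exercised on the two nomination tables with sqlite3. This is a sketch: the column names Nominee and Film are assumptions, since the tables are shown without headers. It also illustrates the note about database differences, because SQLite does not accept parentheses around the operands of a compound SELECT, so they are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("Oscar", "BAFTA"):
    # Column names are hypothetical; the slides do not name them.
    conn.execute(f"CREATE TABLE {table} (Nominee TEXT, Film TEXT)")

conn.executemany(
    "INSERT INTO Oscar VALUES (?, ?)",
    [
        ("John Travolta", "Pulp Fiction"),
        ("Angelina Jolie", "Girl, Interrupted"),
        ("Angelina Jolie", "Changeling"),
    ],
)
conn.executemany(
    "INSERT INTO BAFTA VALUES (?, ?)",
    [
        ("John Travolta", "Pulp Fiction"),
        ("Angelina Jolie", "Changeling"),
        ("Jesse Eisenberg", "The Social Network"),
    ],
)


def rows(sql):
    """Return the result of a query as a set of tuples."""
    return set(conn.execute(sql).fetchall())


union = rows("SELECT * FROM Oscar UNION SELECT * FROM BAFTA")
intersection = rows("SELECT * FROM Oscar INTERSECT SELECT * FROM BAFTA")
difference = rows("SELECT * FROM Oscar EXCEPT SELECT * FROM BAFTA")
```

UNION removes duplicates, so the two shared nominations appear only once in the four-row result.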
Keys and Uniqueness
• Rows in a relation can be uniquely identified by a key, which can consist of one or more columns
- A key must be able to uniquely identify all possible rows that the relation could have in the domain of tuples, not just the rows that currently exist.
• Superkey
- Any collection of columns which can uniquely identify a row. There may be more than one valid superkey.
• Candidate key
- A minimal superkey, i.e. a superkey from which no column can be removed: no proper subset of its columns is itself a superkey. There may be more than one candidate key.
• Primary key
- A candidate key which has been selected to have a special status. A table can have at most one primary key. It should be small and constant.
• Foreign key
- If two relations r and s share a set of attributes k, then r[k] is a foreign key if k is the primary key of s. Therefore, the foreign key k does not necessarily uniquely identify the rows of r.
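The superkey and candidate key definitions can be illustrated with a small brute-force check. This is only a sketch with made-up helper names (`is_superkey`, `candidate_keys`), and it can only inspect the rows that currently exist, which is exactly the limitation mentioned in the first bullet.

```python
from itertools import combinations


def is_superkey(rows, columns):
    """True if no two of the given rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def candidate_keys(rows, all_columns):
    """Minimal superkeys of the given rows, smallest column sets first."""
    found = []
    for size in range(1, len(all_columns) + 1):
        for cols in combinations(all_columns, size):
            # Skip column sets that contain an already found key: not minimal.
            if any(set(key) <= set(cols) for key in found):
                continue
            if is_superkey(rows, cols):
                found.append(cols)
    return found


movie = [
    {"Title": "Pulp Fiction", "Year": 1994, "Actor": "John Travolta"},
    {"Title": "Hackers", "Year": 1995, "Actor": "Angelina Jolie"},
    {"Title": "The Matrix", "Year": 1999, "Actor": "Keanu Reeves"},
    {"Title": "The Devil's Advocate", "Year": 1997, "Actor": "Keanu Reeves"},
]

keys = candidate_keys(movie, ("Title", "Year", "Actor"))
```

On these four rows the check reports both Title and Year as candidate keys. Year is unique only by accident, since two films can share a release year, so a real key has to be chosen from knowledge of the domain rather than from the data at hand.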
Keys

Name             Address                     DoB              Gender  Relationship  Email
John Smith       34 West rd, Cambridge       2 Jan 1981       Male    Single        [email protected]
Thomas Anderson  Flat 303, 101 Red st, Zion  11 March 1962    Male    Single        [email protected]
...
Mia Wallace      20 Sunset rd, Carlsbad      10 October 1994  Female  Married       [email protected]
Keys

id     Name             Address                     DoB              Gender  Relationship  Email
1      John Smith       34 West rd, Cambridge       2 Jan 1981       Male    Single        [email protected]
2      Thomas Anderson  Flat 303, 101 Red st, Zion  11 March 1962    Male    Single        [email protected]
...    ...              ...                         ...              ...     ...           ...
10001  Mia Wallace      20 Sunset rd, Carlsbad      10 October 1994  Female  Married       [email protected]
Normalization
• If a database has duplicated information then it is subject to update anomalies, and the information can become inconsistent. Imagine adding contact details to the 'course' table to allow lecturers to be contacted easily:

course
Title                       Leader   Lectures  Telephone
RISC Processors             Sanchez  8         65960
QAM for modems              Sanchez  34        65960
Introduction to Mainframes  Belford  20        65536
Low latency LCD screens     Richard  1         32768

• If the table is updated, for instance with the SQL command:
UPDATE course SET Leader="Libby" WHERE Title="RISC Processors"
• Then the contact details will become incorrect, because the row still holds the previous leader's telephone number. The process of normalizing a database involves splitting up large tables with only weakly related information into a number of smaller tables. Normalized data is then accessed by joining tables together and performing selections on the results.
Normalization
• The database above is not normalized because there is duplicated data. More intuitively, the telephone number has merely been inserted as a convenience and has nothing directly to do with courses.
• Much like type safety and object-oriented design, database normalization allows databases to be designed such that certain errors (for instance data inconsistency) are less likely.
• Normalization is the process of designing the database to comply with normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF and DKNF).
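The update anomaly can be reproduced in a few lines with sqlite3. This is a sketch under stated assumptions: the table names `course_n` and `staff` for the normalized design are made up for illustration, and Libby's telephone number 12345 is a hypothetical value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized design: the telephone number is stored alongside each course.
conn.execute("CREATE TABLE course (Title TEXT, Leader TEXT, Lectures INT, Telephone INT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?, ?)",
    [
        ("RISC Processors", "Sanchez", 8, 65960),
        ("QAM for modems", "Sanchez", 34, 65960),
        ("Introduction to Mainframes", "Belford", 20, 65536),
        ("Low latency LCD screens", "Richard", 1, 32768),
    ],
)
conn.execute("UPDATE course SET Leader = 'Libby' WHERE Title = 'RISC Processors'")
# The leader changed but the telephone number did not follow.
stale = conn.execute(
    "SELECT Leader, Telephone FROM course WHERE Title = 'RISC Processors'"
).fetchone()

# Normalized design: telephone numbers live in their own table.
conn.execute("CREATE TABLE course_n (Title TEXT, Leader TEXT, Lectures INT)")
conn.execute("CREATE TABLE staff (Leader TEXT PRIMARY KEY, Telephone INT)")
conn.execute("INSERT INTO course_n VALUES ('RISC Processors', 'Sanchez', 8)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Sanchez", 65960), ("Libby", 12345)],  # 12345 is a hypothetical number
)
conn.execute("UPDATE course_n SET Leader = 'Libby' WHERE Title = 'RISC Processors'")
# The join looks the number up from the single place where it is stored.
fixed = conn.execute(
    "SELECT Leader, Telephone FROM course_n NATURAL JOIN staff "
    "WHERE Title = 'RISC Processors'"
).fetchone()
```

In the first design Libby ends up listed with Sanchez's number, while in the second the same UPDATE leaves the data consistent because each telephone number is stored exactly once.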
First Normal Form (1NF)
• Make sure that your database really obeys the relational model:
- No ordering over rows
- No ordering over columns
- No duplicates
• Each row/column intersection contains exactly one datum
• Consider storing multiple phone numbers for the Leader. Both of the following designs break 1NF:

BAD (several values in one cell)
Title  Lectures  ID   Numbers
···    8         456  65960, 60294, 70231
···    34        456  65960, 60294, 70231
···    20        9    65536
···    1         82   32768, 16384

BAD (repeating columns)
Title  Lectures  ID   Phone 1  Phone 2  Phone 3
···    8         456  65960    60294    70231
···    34        456  65960    60294    70231
···    20        9    65536
···    1         82   32768    16384
First Normal Form
• Employees table is used for details of course Leaders
• A Phone column is added to store each employee's contact details
• To support multiple phone numbers, the Name/ID data has to be duplicated
• Note the use of IDs to avoid duplicates, as names make bad keys:

employees
Name     ID   Phone
Sanchez  456  65960
Belford  9    65536
Richard  82   32768
Sanchez  456  60294

• The list of phone numbers for the leader of a particular course can now be extracted using relational algebra:
$$\Pi_{\text{Phone}}(\sigma_{\text{Title}=\text{``RISC Processors''}}(\text{course} \bowtie \text{employees}))$$
or in SQL:
SELECT Phone FROM course NATURAL JOIN employees WHERE Title="RISC Processors"
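The phone-number query can be run against the 1NF design with sqlite3. This is a sketch: the course titles are taken from the earlier 'course' table on the assumption that the rows with 8, 34, 20 and 1 lectures are the same courses, and the only attribute shared by the two tables is ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (Title TEXT, Lectures INT, ID INT)")
conn.execute("CREATE TABLE employees (Name TEXT, ID INT, Phone INT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [
        ("RISC Processors", 8, 456),
        ("QAM for modems", 34, 456),
        ("Introduction to Mainframes", 20, 9),
        ("Low latency LCD screens", 1, 82),
    ],
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Sanchez", 456, 65960),
        ("Belford", 9, 65536),
        ("Richard", 82, 32768),
        ("Sanchez", 456, 60294),
    ],
)

# Join on the shared ID, select the course, project the Phone column.
phones = [
    row[0]
    for row in conn.execute(
        "SELECT Phone FROM course NATURAL JOIN employees WHERE Title = ?",
        ("RISC Processors",),
    )
]
```

Each phone number is one row of employees, so the query returns one result row per number without any string splitting or fixed Phone 1 / Phone 2 columns.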
Second Normal Form (2NF)
• A table is in second normal form if it satisfies:
- It is in first normal form (1NF).
- All non-prime attributes depend on the whole of each candidate key.
• From the previous example, the complete relation, employees(E), is:

employees
Name     ID   Phone
Sanchez  456  65960
Belford  9    65536
Richard  82   32768
Sanchez  456  60294
Sanchez  456  70231
Richard  82   16384

• The candidate key is C = (ID, Phone). The non-prime attribute is therefore E − C = (Name). The employees' names do not depend on the phone number, only the ID. Therefore the table is not in 2NF.
• Lack of normalization allows buggy programs to create inconsistencies:
- Inserting the record ("Belford", 10, 131072) leads to a mismatch between the name and ID.
- An employee name change requires updates across multiple rows, which may be done incorrectly. It also requires more locking.
Second Normal Form (2NF)
• A 2NF design is:

employee_names
Name     ID
Sanchez  456
Belford  9
Richard  82

contacts
ID   Phone
456  65960
9    65536
82   32768
456  60294
456  70231
82   16384

• ID is the Primary Key in employee_names
• (ID, Phone) is the Primary Key in contacts, matching the candidate key C = (ID, Phone); Phone alone would be enough only if no two employees could ever share a number
• ID is a Foreign Key in contacts, connecting employee names and their phone numbers
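The 2NF design can be tried out with sqlite3 to see that nothing is lost by the split. This is a sketch: the helper `employees_view` and the new name 'Sanchez-Libby' are hypothetical, introduced only to show that a rename now touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_names (ID INT PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE contacts (ID INT, Phone INT, PRIMARY KEY (ID, Phone))")
conn.executemany(
    "INSERT INTO employee_names VALUES (?, ?)",
    [(456, "Sanchez"), (9, "Belford"), (82, "Richard")],
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [(456, 65960), (9, 65536), (82, 32768), (456, 60294), (456, 70231), (82, 16384)],
)


def employees_view(connection):
    """Rebuild the original employees relation by joining the two tables."""
    return sorted(
        connection.execute(
            "SELECT Name, ID, Phone FROM employee_names NATURAL JOIN contacts"
        ).fetchall()
    )


before = employees_view(conn)

# A name change is now a one-row update (the new name is hypothetical).
cursor = conn.execute("UPDATE employee_names SET Name = 'Sanchez-Libby' WHERE ID = 456")
rows_changed = cursor.rowcount

after = employees_view(conn)
```

The join reproduces all six rows of the original employees table, and after the rename every phone number of employee 456 shows the new name even though only one stored row was changed.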
Third Normal Form (3NF)
• “I swear by Codd that each non-prime attribute shall depend upon the key, the whole key and nothing but the key.”
• More formally a table over R is in 3NF if and only if:
- It is in 2NF (and therefore 1NF)
- Every non-prime attribute is directly (non-transitively) dependent on every candidate key of R

Practical              Date       Demonstrator  Contact  Pay rate
Acoustic coupling      Mon 1 Feb  Dade          45102    10
Acoustic coupling      Sat 7 Feb  Dade          45102    15
Self-propagating code  Tue 2 Mar  Joey          67822    10
Self-propagating code  Sun 9 Mar  Kate          62341    15

• The candidate key is (Practical, Date). However, the table is not fully normalized because there is repetition of data (the contact numbers and the pay rates). The table is not in 3NF because:
- Pay rate depends on the key, but not the whole key. Specifically, it only depends on the date.
- Contact depends upon the whole key, but the dependence is transitive, not direct: the key (Practical, Date) determines the Demonstrator, and the Demonstrator determines the Contact, i.e. (Practical, Date) → Demonstrator → Contact
• Updating the date requires an update of the pay rate. Updating a demonstrator requires an update of the contact number.
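One possible decomposition of the practicals table removes both problem dependencies. This is a sketch, not the lecture's own design: the table names `allocation`, `demonstrators` and `rates` and the column name `PayRate` are assumptions, and the join is checked against the original rows to confirm that no information is lost.

```python
import sqlite3

original = [
    ("Acoustic coupling", "Mon 1 Feb", "Dade", 45102, 10),
    ("Acoustic coupling", "Sat 7 Feb", "Dade", 45102, 15),
    ("Self-propagating code", "Tue 2 Mar", "Joey", 67822, 10),
    ("Self-propagating code", "Sun 9 Mar", "Kate", 62341, 15),
]

conn = sqlite3.connect(":memory:")
# Key (Practical, Date) determines the demonstrator and nothing else.
conn.execute(
    "CREATE TABLE allocation (Practical TEXT, Date TEXT, Demonstrator TEXT, "
    "PRIMARY KEY (Practical, Date))"
)
# Contact depends only on the demonstrator.
conn.execute("CREATE TABLE demonstrators (Demonstrator TEXT PRIMARY KEY, Contact INT)")
# Pay rate depends only on the date.
conn.execute("CREATE TABLE rates (Date TEXT PRIMARY KEY, PayRate INT)")

for practical, date, demonstrator, contact, pay in original:
    conn.execute("INSERT INTO allocation VALUES (?, ?, ?)", (practical, date, demonstrator))
    # INSERT OR IGNORE stores each demonstrator and each date only once.
    conn.execute("INSERT OR IGNORE INTO demonstrators VALUES (?, ?)", (demonstrator, contact))
    conn.execute("INSERT OR IGNORE INTO rates VALUES (?, ?)", (date, pay))

rebuilt = conn.execute(
    "SELECT Practical, Date, Demonstrator, Contact, PayRate "
    "FROM allocation JOIN demonstrators USING (Demonstrator) JOIN rates USING (Date)"
).fetchall()

contact_rows = conn.execute("SELECT COUNT(*) FROM demonstrators").fetchone()[0]
```

Dade's contact number is now stored once instead of twice, so changing it is a single-row update, and joining the three tables gives back exactly the original four rows.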
SQL Constraints
• In addition to normal forms, which can be represented in relational algebra, SQL allows tables to be constructed with additional constraints which make the database more robust. Unlike normalization, constraints do not make it impossible to construct errors. However, supplying invalid data that breaks a constraint causes the transaction to abort, rather than producing inconsistent data.
• Types of constraints
• NOT NULL – ensures that the value of this column cannot be omitted (helpful for 1NF)
CREATE TABLE course (Name string NOT NULL, ...)
• UNIQUE – ensures that the value of this column (or combination of columns) is unique. Known candidate keys can be marked as unique:
CREATE TABLE r (a, b, c, d, UNIQUE(a, b), UNIQUE(a, c, d))
• PRIMARY KEY – designates the column as the primary key. This will ensure that ID is unique, and therefore all rows are also unique.
CREATE TABLE people (Name string, ID int PRIMARY KEY)
• FOREIGN KEY – a particularly important constraint, which ensures that each value of an attribute appears as a primary key value in another table:
CREATE TABLE course (Title string PRIMARY KEY, ID int,
                     Lectures int,
                     FOREIGN KEY (ID) REFERENCES employees)
• The ID of the course leader is now constrained to be a valid employee ID. The database will abort a transaction which attempts to add an invalid ID, or change an ID to an invalid one. Additionally the database will abort any transaction which invalidates an existing ID. For example, the database will not allow erasure of employees with courses still assigned.
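The effect of these constraints can be observed with sqlite3. This is a sketch under stated assumptions: the employees table here holds one row per employee with ID as its primary key, the helper `rejected` is made up for illustration, and SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.execute("CREATE TABLE employees (ID INT PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE course (Title TEXT PRIMARY KEY, ID INT, "
    "Lectures INT CHECK (Lectures > 0), "
    "FOREIGN KEY (ID) REFERENCES employees(ID))"
)
conn.execute("INSERT INTO employees VALUES (456, 'Sanchez')")
conn.execute("INSERT INTO course VALUES ('RISC Processors', 456, 8)")
conn.commit()


def rejected(sql):
    """Run one statement in its own transaction; True if a constraint aborted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


bad_leader = rejected("INSERT INTO course VALUES ('QAM for modems', 999, 34)")  # no employee 999
orphaning = rejected("DELETE FROM employees WHERE ID = 456")  # still leads a course
missing_name = rejected("INSERT INTO employees VALUES (9, NULL)")  # NOT NULL
duplicate_id = rejected("INSERT INTO employees VALUES (456, 'Belford')")  # PRIMARY KEY
no_lectures = rejected("INSERT INTO course VALUES ('Empty course', 456, 0)")  # CHECK
valid = rejected("INSERT INTO employees VALUES (9, 'Belford')")  # accepted
```

Every invalid statement is aborted and rolled back, so the tables still contain only consistent rows afterwards.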
Entity-Relationship (E/R) Modelling
• As in the Object Oriented approach, designing a database schema requires finding conceptual abstractions (that represent the data) and defining relationships between them
• Notation suggested by Peter Chen in “The Entity Relationship Model: Toward a Unified View of Data”, 1976
- UML can also be used
• Relationships have cardinality
- 1 to 1
- 1 to Many
- Many to Many etc.

[Figure: E/R diagram in Chen notation. The entity sets Employee (attribute Name) and Course (attributes Title and No. lectures) are connected by the relationship set Leads.]
Entity-Relationship (E/R) Modelling
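
[Figure: a larger E/R diagram covering employees, repair jobs, cars and clients. Entity sets: Employee (Number, Name), specialised through ISA into Mechanic and Salesman; RepairJob (Number, Description, Cost, Parts, Work); Car (License, Model, Year, Manufacturer); Client (ID, Name, Phone, Address). Relationship sets: Does, Repairs, Buys (Price, Date, Value) and Sells (Date, Value, Commission), with clients in the roles buyer and seller. Source: Pável Calado, http://www.texample.net/tikz/examples/entity-relationship-diagram/]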
Objects and Databases
• Relational Database Management Systems were mature, stable products by the 1980s
• The Object-Oriented approach reached wide adoption in the 1990s
• Any large software system still needs to persist data, and hence store it in databases
• Question: how do we map Objects in a software system at runtime to Data stored in databases?
• Originally, two options emerged:
- Object-Relational Mapping – a software layer that provides database persistence to an OO system (e.g. Hibernate, TopLink) – commonly used
- Object Databases – a nice idea that failed to reach mainstream adoption
• More recently, further developments have included non-relational approaches (NoSQL) to working with large distributed datasets, e.g. Hadoop (hadoop.apache.org) and related projects
- Map/Reduce: distributed processing of large data sets on compute clusters
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Cassandra: a scalable multi-master database with no single points of failure
No(t only)SQL at guardian.co.uk
• The Guardian online, 1999

[Figure: "Guardian journalism online: 1999", captioned "I bring you NEWS!!!". Components shown: web servers, app servers, CMS, data feeds, Oracle and Memcached (20Gb). Source: Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation]
No(t only)SQL at guardian.co.uk
• The Guardian online, 2010

[Figure: "Guardian journalism online: 2010". Components shown: web servers, app server, CMS, data feeds, Memcached (20Gb), a core Solr instance with further Solr instances, M/Q, a proxy with In and Out paths, several Apps, Cloud/EC2, external hosting (app engine etc.), "CouchDB?" and an rdbms. Source: Matthew Wall, Simon Willison, www.slideshare.net/matwall/nosql-presentation]
Security and SQL Injection
• Consider the following example

// allowing a user to search by product name
string name;
cout << "Enter product name:" << endl;
getline(cin, name);
string query = "SELECT * FROM products WHERE name=\"" + name + "\"";
do_sql(query);

• What happens if the user enters: " ; DROP TABLE products; --
• The query becomes

// going to delete the table products
SELECT * FROM products WHERE name="" ; DROP TABLE products; -- "

• SQL Injection could also be used to steal data from a database (news.bbc.co.uk/1/hi/8206305.stm)
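The usual defence is to pass user input as a bound parameter instead of pasting it into the SQL text. The sketch below contrasts the two approaches using Python's sqlite3 module rather than the C++ `do_sql` helper from the slide. The function names and the single-quoted variant of the payload are assumptions made for this illustration, and `executescript` stands in for any interface that will run several statements from one string.

```python
import sqlite3

PAYLOAD = "'; DROP TABLE products; --"  # single-quoted variant of the attack string


def fresh_db():
    """Create an in-memory database with a one-row products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT)")
    conn.execute("INSERT INTO products VALUES ('widget')")
    conn.commit()
    return conn


def table_exists(conn, table):
    """True if a table with the given name is present in the database."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] == 1


def search_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL text, so it can end the
    # string literal early and append statements of its own.
    query = "SELECT * FROM products WHERE name='" + name + "'"
    conn.executescript(query)


def search_safe(conn, name):
    # The ? placeholder sends the input as data; it is never parsed as SQL.
    return conn.execute("SELECT * FROM products WHERE name = ?", (name,)).fetchall()


victim = fresh_db()
search_unsafe(victim, PAYLOAD)
dropped = not table_exists(victim, "products")

protected = fresh_db()
no_match = search_safe(protected, PAYLOAD)
survived = table_exists(protected, "products")
```

With the concatenated query the table is gone after a single search, whereas the parameterized query simply looks for a product whose name happens to be the payload string and finds none.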