
INFO2120 SUMMARY BOOKLET By the awesome people in Jenna’s tutorials

These notes are merged from multiple groups summarizing the same chapters

Topics missing from this booklet

Week 1: Introduction to Databases and Transactions

SEMESTER 1, 2014


Week 2: Conceptual DB Design (ER diagrams)

Conceptual design – a technique for understanding and capturing business information requirements graphically. It facilitates planning, operation and maintenance of various data resources.

Entities

Entity – a person, place, object, event, or concept about which you want to gather and store data. It must be distinguishable from other entities. E.g. John Doe, unit COMP5138, account 4711.

Entity type (set) – a collection of entities that share common properties or characteristics, e.g. student, course, account (represented by a rectangle). NOTE: entity sets need not be disjoint.

Attribute – describes one aspect of an entity type, e.g. people have a name and an address.

Relationships

Relationship – relates two or more entities; the number of participating entities is known as the degree of the relationship. E.g. John is enrolled in INFO2120.

Relationship type (relationship set) – a set of similar relationships, e.g. Student (entity type) related to UnitOfStudy (entity type) by EnrolledIn (relationship type).

Distinction – a relation (relational model) is a set of tuples, while a relationship (E-R model) describes an association between entities; both entity sets and relationship sets (E-R model) may be represented as relations (in the relational model).

Schema of relationship types

- The combination of the primary keys of the participating entity types forms a super key of a relationship.

- A relationship set schema consists of: relationship name, role names, relationship attributes and their types, and key.

Key constraint – if, for a particular participating entity type, each entity participates in at most one relationship, the corresponding role is a key of the relationship type, e.g. the employee role is unique in workIn.

Participation constraint – holds if every entity participates in at least one relationship. A participation constraint of entity type E having role ρ in relationship type R states that for every e in E there is an r in R such that ρ(r) = e. (Representation in E-R diagram: thick line.)

Cardinality constraints – a generalisation of key and participation constraints. A cardinality constraint for the participation of an entity set E in a relationship R specifies how often an entity of set E participates in R at least (minimum cardinality) and at most (maximum cardinality).

Weak entities – an entity type that does not have a primary key of its own, e.g. a child of an employee, or a payment of a loan. The primary key of a weak entity type is formed by the primary key of the strong entity type(s) on which the weak entity type is existence dependent, plus the weak entity type's discriminator.

Constraints On ISA Hierarchies

- Overlap constraints – disjoint: an entity can belong to only one lower-level entity set; overlapping: an entity can belong to more than one lower-level entity set.

- Covering constraints – total: an entity must belong to one of the lower-level entity sets; partial (the default): an entity need not belong to one of the lower-level entity sets.


Week 3: The Relational Data Model (NULLs, keys, referential integrity)

The relational data model is based on the mathematical concept of a relation.

The strength of the relational approach to data management comes from its simple way of structuring data.

Data model vs. schema
- Data model: a collection of concepts for describing data.
- Schema: a description of a particular collection of data at some abstraction level, using a given data model. The relational data model is the most widely used model today.

Definition of relation: a relation is a named, two-dimensional table of data; it consists of rows (records) and columns (attributes or fields).

Relation schema vs. relation instance
- A relation R has a relation schema: it specifies the name of the relation and the name and data type of each attribute.
- A relation instance: a set of tuples (a table) for a schema.

Creating and deleting relations in SQL
- Creating a table (relation): CREATE TABLE name (list of columns)
- Deleting a table (relation): DROP TABLE name

Base data types of SQL
- SMALLINT/INTEGER/BIGINT – integer values
- DECIMAL/NUMERIC – fixed-point numbers
- FLOAT/REAL – floating-point numbers with precision p
- CHAR/VARCHAR/CLOB – alphanumerical character string types

NULL 'value': an RDBMS allows the special entry NULL in a column to represent facts that are not relevant or not yet known. Pro: NULL is useful because using an ordinary value with special meaning does not always work. Con: NULL causes complications in the definition of many operations.

Modifying relations using SQL
- Inserting new data into a table: INSERT INTO table (list of columns) VALUES (list of expressions)
- Updating tuples in a table: UPDATE table SET column = expression {, column = expression}
- Deleting tuples from a table: DELETE FROM table [WHERE search_condition]

Relational database
- Data structure: a relational database is a set of relations with tuples and fields – a simple and consistent structure.
- Data manipulation: powerful operators to manipulate the data stored in relations.
- Data integrity: facilities to specify a variety of rules that maintain the integrity of data when it is manipulated.

Integrity constraints: an integrity constraint is a condition that must be true for any instance of the database. A legal instance of a relation is one that satisfies all specified ICs.
- Non-null columns: one domain constraint is to insist that no value in a given column can be null.
- Note: in an SQL-based RDBMS it is possible to insert a row where every attribute has the same value as an existing row (unless a key constraint forbids it).

Relational keys
- Primary keys are unique, minimal identifiers in a relation, e.g. CONSTRAINT Student_PK PRIMARY KEY (sid).
- Foreign keys are identifiers that enable a dependent relation to refer to its parent relation, e.g. FOREIGN KEY (lecturer) REFERENCES Lecturer (empid).

Mapping E-R diagrams into relations
- Each entity type becomes a relation; simple attributes map directly into the relation.
- Composite attributes are flattened out by creating a separate field for each component attribute.
- A weak entity type becomes a separate relation with a foreign key taken from the superior entity.

Mapping of relationship types
- Many-to-many: create a new relation with the primary keys of the two entity types as its primary key.
- One-to-many: the primary key on the 'one' side becomes a foreign key on the 'many' side.
- One-to-one: the primary key on the mandatory side becomes a foreign key on the optional side.
- Relationship attributes become fields of the dependent relation (or of the new relation, respectively).

Relational views: a view is a virtual relation. Syntax: CREATE VIEW name AS <query expression>, where <query expression> is any legal query expression (it can even combine multiple relations).
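The mappings above can be tried out directly. Below is a minimal sketch using Python's sqlite3 module (SQLite syntax differs slightly from the SQL standard in places); the Student/UnitOfStudy/Enrolled names follow the booklet's own examples, the column details are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.execute("CREATE TABLE Student (sid INTEGER PRIMARY KEY, name VARCHAR(50) NOT NULL)")
conn.execute("CREATE TABLE UnitOfStudy (uos_code CHAR(8) PRIMARY KEY, title VARCHAR(100))")
# Many-to-many relationship -> new relation whose PK combines both entity PKs
conn.execute("""CREATE TABLE Enrolled (
    sid      INTEGER REFERENCES Student(sid),
    uos_code CHAR(8) REFERENCES UnitOfStudy(uos_code),
    PRIMARY KEY (sid, uos_code))""")

conn.execute("INSERT INTO Student VALUES (1, 'John Doe')")
conn.execute("INSERT INTO UnitOfStudy VALUES ('INFO2120', 'Database Systems')")
conn.execute("INSERT INTO Enrolled VALUES (1, 'INFO2120')")

# Referential integrity: an enrolment for an unknown student is rejected
try:
    conn.execute("INSERT INTO Enrolled VALUES (99, 'INFO2120')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# A view is a named virtual relation defined by a query
conn.execute("""CREATE VIEW ClassList AS
    SELECT s.name, e.uos_code FROM Student s JOIN Enrolled e ON s.sid = e.sid""")
rows = conn.execute("SELECT name, uos_code FROM ClassList").fetchall()
```

Note the PRAGMA: unlike most DBMSs, SQLite ships with foreign-key checking off by default.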


Week 4A: Introduction to Declarative Querying – Relational Algebra

1. Set Operations

Union ( ∪ ) – tuples in relation 1 OR in relation 2.
  o Example: R ∪ S
  o Definition: R ∪ S = { t | t ∈ R ∨ t ∈ S }

Intersection ( ∩ ) tuples in relation 1, AND in relation 2.

o Example: R ∩ S

o Definition: R ∩ S = { t | t ∈ R ∧ t ∈ S }

Difference ( - ) – tuples in relation 1, but not in relation 2.
  o Example: R - S
  o Definition: R - S = { t | t ∈ R ∧ t ∉ S }

Important: R and S have the same schema

R and S have the same arity (same number of fields)

Corresponding fields must have the same names and domains

2. Operations that remove parts of a relation

Selection ( σ ) – selects a subset of rows from a relation.
  o Example: σ country='AUS' (Student)

Projection ( π ) – deletes unwanted columns from a relation.
  o Example: π name, country (Student)

3. Operations that combine tuples from two relations

Cross-product ( X ) allows us to fully combine two relations. Also called the Cartesian product

Join ( ⋈condition ) – combines matching tuples from two relations.
  o Example: Student ⋈ family_name=last_name Lecturer

Natural Join ( ⋈ ) – equijoin on all common fields.
  o Example: R ⋈ S
  o Result schema similar to the cross-product, but with only one copy of the fields for which equality is specified.

4. A schema-level 'rename' operation

Rename ( ρ ) – renames relations and fields.
  o Example: ρ Classlist(2→cid, 4→uos_code) ( Enrolled X UnitOfStudy )

Six basic operations

We can distinguish between basic and derived RA operations

1. Union ( ∪ )  2. Set Difference ( - )  3. Selection ( σ )
4. Projection ( π )  5. Cross-product ( X )  6. Rename ( ρ )

Additional (derived) operations:

Intersection, join, division:
  o Not essential, but VERY useful

Composition and equivalence rules

Commutation rules

1. πA( σp( R ) ) = σp( πA ( R ) ), provided p only references attributes in A

2. R ⋈ S = S ⋈ R

Association rule

1. R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T

Idempotence rules

1. πA( πB ( R ) ) = πA( R ) if A ⊆ B

2. σp1 (σp2 ( R )) = σp1 ∧ p2 ( R )

Distribution rules

1. πA ( R ∪ S ) = πA( R ) ∪ πA ( S )

2. σP ( R ∪ S ) = σP ( R ) ∪ σP ( S )

3. σP ( R ⋈ S ) = σP (R) ⋈ S if P only references R

4. πA,B( R ⋈ S ) = πA (R) ⋈ πB ( S ) if the join attributes are in (A ∩ B)

5. R ⋈ ( S ∪ T ) = ( R ⋈ S ) ∪ ( R ⋈ T )
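To get a feel for the set semantics, the basic operators can be mimicked over plain Python data. This is a didactic sketch (relations as lists of dicts), not how a DBMS evaluates queries; the Student/Enrolled sample data is made up.

```python
def select(pred, rel):                    # sigma_p(R)
    return [t for t in rel if pred(t)]

def project(attrs, rel):                  # pi_A(R), duplicates removed (set semantics)
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append({a: t[a] for a in attrs})
    return out

def natural_join(r, s):                   # R |><| S: equijoin on all common fields
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[c] == u[c] for c in common)]

Student  = [{"sid": 1, "name": "John", "country": "AUS"},
            {"sid": 2, "name": "Mei",  "country": "CHN"}]
Enrolled = [{"sid": 1, "uos_code": "INFO2120"}]

aus    = select(lambda t: t["country"] == "AUS", Student)   # sigma_country='AUS'
names  = project(["name"], Student)                         # pi_name
joined = natural_join(Student, Enrolled)                    # Student |><| Enrolled
```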


Week 4B: Introduction to SQL (+ Joins)

1. DDL (Data Definition Language) – create, drop, or alter the relation schema; specify integrity constraints: PK, FK, NULL/NOT NULL constraints.
DML (Data Manipulation Language) – query, insert, delete and modify information in the DB: INSERT INTO, UPDATE, DELETE FROM.
DCL (Data Control Language) – control the DB, e.g. administering privileges and users.

SELECT – lists the columns (and expressions) that should be returned from the query. DISTINCT removes duplicates, * = all columns; +, -, *, / can be used as arithmetic operators.
FROM – indicates the table(s) from which data will be obtained; lists the relations involved in the query. AS renames relations and attributes.
WHERE – indicates the conditions for including a tuple in the result. Comparison operators: =, >, >=, <, <=, !=, <>; combine with AND, OR, and NOT. BETWEEN allows a range query; LIKE is used for string matching (% = any substring, _ = any single character); || = concatenate.
GROUP BY – indicates the categorisation of tuples.
HAVING – indicates the conditions for including a category.
ORDER BY – sorts the result according to specified criteria, ASC (default) or DESC.

Date and time: 4 types: DATE, TIME, TIMESTAMP, INTERVAL. CURRENT_DATE and CURRENT_TIME can be used as constants; the normal time-order comparisons apply: =, >, <, <=, >=. Main operations: EXTRACT( component FROM date ), DATE string, +/- INTERVAL.

2. Join:

You can join two or more tables using attribute conditions.

Types of joins: NATURAL JOIN, INNER JOIN and OUTER JOIN

R NATURAL JOIN S

R INNER JOIN S ON <join condition>

R INNER JOIN S USING (<list of attributes>)

R LEFT OUTER JOIN S

R RIGHT OUTER JOIN S

R FULL OUTER JOIN S
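The difference between an inner and an outer join is easiest to see on data. A hedged sketch via Python's sqlite3, with made-up Student/Enrolled rows; only LEFT OUTER JOIN is shown because SQLite added RIGHT/FULL OUTER JOIN only in version 3.39.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Enrolled (sid INTEGER, uos_code TEXT);
    INSERT INTO Student VALUES (1, 'John'), (2, 'Mei');
    INSERT INTO Enrolled VALUES (1, 'INFO2120');
""")

# INNER JOIN keeps only students with a matching enrolment
inner = conn.execute("""SELECT s.name, e.uos_code
                        FROM Student s INNER JOIN Enrolled e
                        ON s.sid = e.sid""").fetchall()

# LEFT OUTER JOIN also keeps students with no enrolment, padded with NULL (None)
outer = conn.execute("""SELECT s.name, e.uos_code
                        FROM Student s LEFT OUTER JOIN Enrolled e
                        ON s.sid = e.sid""").fetchall()
```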

3. NULL Value and Three Valued logic:

Three-valued logic uses three different result values for logical expressions:

TRUE if a condition holds;

FALSE if a condition does not hold; and

UNKNOWN if a comparison includes a NULL

The use of three-valued logic is needed because of possible NULL values in databases

and because a logical condition to be decidable needs all values to be known.
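The practical consequence of three-valued logic: a comparison with NULL is UNKNOWN, so neither = NULL nor <> NULL ever matches a row; IS NULL must be used instead. A quick demonstration with sqlite3 (table and data made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Assessment (sid INTEGER, mark INTEGER);
    INSERT INTO Assessment VALUES (1, 75), (2, NULL);
""")

# Both comparisons evaluate to UNKNOWN for the NULL row, so nothing matches
eq  = conn.execute("SELECT sid FROM Assessment WHERE mark = NULL").fetchall()
neq = conn.execute("SELECT sid FROM Assessment WHERE mark <> NULL").fetchall()

# IS NULL is the correct test for missing values
isn = conn.execute("SELECT sid FROM Assessment WHERE mark IS NULL").fetchall()
```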

4. Set operators: The set operations UNION, INTERSECT, and EXCEPT (Oracle: MINUS) operate on relations and correspond to the relational algebra operations union, intersection and set difference. Example:

(SELECT customer_name FROM depositor)
UNION
(SELECT customer_name FROM borrower)


Week 5: Nested Subqueries, Grouping, and Relational Division

Nested Subqueries

A sub query is a SELECT-FROM-WHERE expression that is nested within another query

Common uses: set membership, set comparisons and set cardinality.

Non-correlated subqueries:
- Don't depend on data from the outer query
- Execute once for the entire outer query
- Typically used with IN: a comparison operation that compares a value v with a set/multiset of values V, and evaluates to true if v is one of the elements in V

Correlated subqueries:
- Make use of data from the outer query
- Execute once for each row of the outer query
- Can use the EXISTS operator, which checks whether the result of a correlated nested query is empty (contains no tuples) or not

The following checks, for each student, whether there is at least one entry in the Enrolled table for that student in INFO2120:

Non-correlated (IN):
SELECT sid, name
FROM Student
WHERE sid IN ( SELECT E.sid
               FROM Enrolled E
               WHERE E.uos_code = 'INFO2120' )

Correlated (EXISTS):
SELECT sid, name
FROM Student S
WHERE EXISTS ( SELECT *
               FROM Enrolled E
               WHERE E.sid = S.sid
                 AND E.uos_code = 'INFO2120' )
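The two formulations return the same students, which can be checked by running them side by side via sqlite3 (toy data, names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Enrolled (sid INTEGER, uos_code TEXT);
    INSERT INTO Student VALUES (1, 'John'), (2, 'Mei');
    INSERT INTO Enrolled VALUES (1, 'INFO2120'), (2, 'COMP5138');
""")

# Non-correlated: the inner query runs once, independent of the outer row
in_q = conn.execute("""SELECT sid, name FROM Student
                       WHERE sid IN (SELECT sid FROM Enrolled
                                     WHERE uos_code = 'INFO2120')""").fetchall()

# Correlated: the inner query references S.sid from the outer row
exists_q = conn.execute("""SELECT sid, name FROM Student S
                           WHERE EXISTS (SELECT * FROM Enrolled E
                                         WHERE E.sid = S.sid
                                           AND E.uos_code = 'INFO2120')""").fetchall()
```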

Grouping

A group is a set of tuples that have the same value for all attributes in grouping list

NOTE: an attribute in the SELECT clause must appear in the GROUP BY clause as well (unless it only appears inside an aggregate function)

SYNTAX – it must follow this order

SELECT target-list

FROM relation-list

WHERE qualification

GROUP BY grouping-list

HAVING group-qualification

Relational Division

Definition

R (a1, … an, b1, …bm)

S (b1 …bm)

R/S, with attributes a1, …an, is the set of all tuples <a> such that for every tuple <b> in S, there is an <a,b> tuple in R

It is not an essential operator: just a useful shorthand
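The definition can be sketched directly in Python to make it concrete: keep each value a such that (a, b) appears in R for every b in S. The enrolment data below is made up ("which students are enrolled in every listed unit?").

```python
def divide(r, s):
    """Relational division R / S.
    r: set of (a, b) pairs; s: set of b values.
    Returns {a | for every b in s, (a, b) is in r}."""
    r = set(r)
    candidates = {a for (a, _b) in r}
    return {a for a in candidates if all((a, b) in r for b in s)}

# Student 1 is enrolled in both units, student 2 only in one
enrolled = {(1, "INFO2120"), (1, "COMP5138"), (2, "INFO2120")}
units    = {"INFO2120", "COMP5138"}
result = divide(enrolled, units)
```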

EXAMPLE (grouping) – What was the average mark of each course?

SELECT uos_code as unit_of_study , AVG (mark)

FROM Assessment

GROUP BY uos_code


Week 6: Schema Normalization (including BCNF)

Motivation
The most important requirement of DB design is adequacy – every important process can be done using the data in the database. If a design is adequate, seek to avoid redundancy in the data – the same information repeated in several places. Redundancy is at the root of several problems associated with relational schemas: redundant storage, insertion anomalies, deletion anomalies, and update anomalies.

Functional Dependencies and Normal Forms

Functional Dependency (“FD”): the value of one attribute (the determinant) determines the value of another attribute.

X → Y means “X functionally determines Y” and “Y is functionally dependent on X”.

If you know the FDs, you can check whether a column (or set) is a key for the relation. There may be several candidate

keys. Choose one candidate key as the primary key. A superkey is a column/set that includes a candidate key.

Schema Normalisation (“SN”): Only allow FDs of the form of key constraints. SN is the process of validating and

improving a logical design so that it satisfies certain constraints (Normal Forms) that avoid unnecessary duplication of

data.

First normal form (“1NF”): domains of all attributes are atomic

Second normal form (“2NF”): 1NF + no partial dependencies

Third normal form (“3NF”): 2NF + no transitive dependencies

BCNF: the only non-trivial FDs that hold are key constraints

Table Decomposition
A decomposition of R consists of replacing R by two or more relations such that each new relation scheme contains a subset of the attributes of R (and no attributes that do not appear in R), every attribute of R appears as an attribute of at least one of the new relations, and all new relations differ. Example: R ( A, B, C, D ) with FDs { A → BD, B → C }.

Overall Design Process: Consider a proposed schema | Find out application domain properties expressed as

functional dependencies | See whether every relation is in BCNF | If not, use a bad FD to decompose one of the

relations; start with partial dependencies (Replace the original relation by its decomposed tables) | Repeat the above,

until you find that every relation is in BCNF.

Making it Precise It is essential that all decompositions used to deal with redundancy be lossless! Dependency-preserving: If R is decomposed into S and T, then all FDs that were given to hold on R must also hold

on S and/or T. (Dependency preserving does not imply lossless join & vice-versa!)

Must consider whether all FDs are preserved. If a dependency-preserving decomposition into BCNF is not possible

(or unsuitable, given typical queries), should consider decomposition into 3NF.

Candidate Key: Main idea – only allow FDs in the form of a key constraint. Each non-key field is functionally dependent on every candidate key. Candidate key identification: identify all FDs that hold on our data set, then reason over those FDs using a set of rules on how FDs can be combined to infer candidate keys, or alternatively, use these FDs to verify whether a given set of attributes is a candidate key or not.

From FDs to Keys: Candidate keys are defined by functional dependencies | Consequently, FDs help us to identify

candidate keys. From the Attribute Closure to Keys: The set of Functional Dependencies can be used to find

candidate keys.
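The attribute-closure computation described above is short to implement. A sketch, with FDs written as (LHS, RHS) pairs of attribute sets, using the example FDs from the decomposition section (R(A, B, C, D) with A → BD, B → C):

```python
def closure(attrs, fds):
    """Compute X+ : repeatedly apply every FD whose left side is contained
    in the result, until nothing new is added."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [({"A"}, {"B", "D"}), ({"B"}, {"C"})]   # A -> BD, B -> C

a_plus = closure({"A"}, fds)   # reaches all of R, so A is a candidate key
b_plus = closure({"B"}, fds)   # only {B, C}, so B is not a key
```

X is a superkey exactly when X+ contains every attribute of R; it is a candidate key if no proper subset of X has that property.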


Week 7: Database Security and Integrity (+ Triggers)

Database security needs to be managed at some level; that is why there is database access control. There are two types of access control: authentication and authorization. Authentication makes use of logins and passwords to make sure the person who tries to log in really is who they claim to be. Authorization lets the owner of the database give other people rights on their tables and views using the syntax:

GRANT privilege ON tablename TO personname;

and revoke access using the syntax:

REVOKE privilege ON tablename FROM personname;

Privilege: INSERT / DELETE / SELECT / UPDATE

Now we can create views and grant or revoke access on them. We can add constraints to the database like ON DELETE NO ACTION, so that a parent table's tuple cannot be deleted while child tuples still reference it. Why are all these measures taken? To protect the private data of individuals.

Semantic integrity constraints were introduced so that data consistency is not lost when changes are made to the database. One example of a semantic integrity constraint is the UNIQUE keyword on a student ID. Integrity constraints are conditions that must be satisfied for every instance of the database. They are specified in the database schema and are checked when the database is modified; if a condition is not satisfied, the transaction is aborted.

There are two types of integrity constraints: static and dynamic. A static integrity constraint is a condition that every legal instance of a database must satisfy; examples are domain constraints, key constraints and assertions. A dynamic integrity constraint is a condition that a legal database state change must satisfy, e.g. triggers.

If you have a database containing many VARCHAR columns with the same restrictions and you don't want to rewrite the same definition again and again, you can use a domain to create a named type that is available to all tables in the database, with a check that values stay within limits. E.g.

CREATE DOMAIN domain_name AS data_type CHECK ( VALUE IN (...) );

A DEFERRABLE constraint lets the transaction complete first and checks the constraint only at commit time; a NOT DEFERRABLE constraint is checked immediately, every time the database is modified.

ASSERTIONs are schema objects and are static integrity constraints that make the database always satisfy a condition. E.g.

CREATE TABLE Student ( sid INTEGER PRIMARY KEY, name VARCHAR(50) );

CREATE ASSERTION checksid CHECK ( (SELECT COUNT(sid) FROM Student) <= 100 );

checks that the number of students must not exceed 100.

One example of a dynamic integrity constraint is the trigger. A trigger is a statement that automatically fires when specific modifications occur on the database. E.g.

CREATE TRIGGER trigger_name
AFTER/BEFORE INSERT OR UPDATE OF column ON tablename
BEGIN action END;
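To watch a trigger fire, the statement can be tried in SQLite via Python (SQLite's trigger syntax differs slightly from the generic form above, and the Student/Audit tables are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Audit (note TEXT);

    -- Dynamic integrity/auditing rule: log every insert into Student
    CREATE TRIGGER log_insert AFTER INSERT ON Student
    BEGIN
        INSERT INTO Audit VALUES ('inserted sid ' || NEW.sid);
    END;
""")

conn.execute("INSERT INTO Student VALUES (1, 'John')")
audit = conn.execute("SELECT note FROM Audit").fetchall()  # written by the trigger
```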


Week 8: DB Application Development

Database Application Architectures
- Data-intensive systems provide three types of functionality: presentation logic, processing logic, and data management.
- System architectures can be 1-, 2-, or 3-tiered depending on the presence of a client, a DB server, and a web/application server.
- Interactive SQL refers to SQL statements input directly at the terminal, with the DBMS output going to the screen; non-interactive SQL refers to SQL statements included in an application program.

Client-side Database Application Development
To integrate SQL with a host language (e.g. Java, C), one can either embed SQL in the language (statement-level interface) or call SQL commands through an API (call-level interface).

PHP – a scripting language for dynamic websites that is embedded into HTML.
- Variables: begin with $; a value must belong to a class, but variables can be declared without giving a type.
- Strings: double quotes substitute variables into the string, single quotes do not.
- Arrays: numeric arrays are indexed 0, 1, etc.; associative arrays are key-value pairs.

PDO – PHP Data Objects, an extension to PHP that provides a database abstraction layer (used to connect PHP to a database). Problems when interfacing with SQL:
1. Establishing a database connection: $conn = new PDO( DSN, $userid, $passwd [, $params] );
   a. PDO is DBMS independent; when creating a new connection, the DBMS prefix must be given in the DSN.
   b. New connections take some time, so a connection should only be created once in a program.
2. Executing SQL statements. Three different ways: semi-static (PDO::query(sql)), parameterized (PDO::prepare(sql)), or immediately run (PDO::exec(sql)).
   - Anonymous placeholders are written as ? inside a query and bound using $stmt->bindValue(1, $variable), where the 1 refers to the first ? in the query.
   - Named placeholders are written as :name and bound using $stmt->bindValue(':name', $variable).

NULL: PHP supports NULL by default. isset($var) checks whether var exists and is not NULL; empty($var) is true when var does not exist or has an empty or zero value.

Error handling: never show database errors to the end user. Exception handling: PDOException::getMessage() returns the exception message; PDOException::getCode() returns the exception code.

SQL injection attacks most frequently occur when an unauthorised user exploits unchecked user input or buffer overflows in the database. Queries built dynamically by splicing user input into the SQL string are the main source of injection vulnerabilities; parameterized (prepared) statements, which pass user input separately from the query text, are the better choice to avoid this.

Stored procedures run application logic from within the database server. There are many advantages to stored procedures: improved maintainability, reduced data transfer, fewer locks held for long periods, and an abstraction layer (programmers need not know the schema).
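The injection point is easy to demonstrate. The sketch below uses Python's sqlite3 standing in for PDO (its ? placeholder plays the role of PDO's anonymous placeholder); the Student data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER, name TEXT);
    INSERT INTO Student VALUES (1, 'John'), (2, 'Mei');
""")

malicious = "x' OR '1'='1"   # attacker-controlled "name"

# Unsafe: splicing input into the SQL text changes the query's meaning,
# turning the WHERE clause into a tautology that matches every row
unsafe = conn.execute(
    "SELECT sid FROM Student WHERE name = '" + malicious + "'").fetchall()

# Safe: the ? placeholder passes the whole string as a single literal value
safe = conn.execute(
    "SELECT sid FROM Student WHERE name = ?", (malicious,)).fetchall()
```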


Week 9: Transaction Management (ACID, serialisability)

Transaction – a collection of one or more operations on one or more databases, which reflects a discrete unit of work.

A transaction can:
- return information from the database
- update the database to reflect the occurrence of a real-world event
- cause the occurrence of a real-world event

ACID Properties:
- Atomicity: a transaction should either complete or have no effect at all.
- Consistency: execution of a transaction in isolation preserves the consistency of the database.
- Isolation: although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions.
- Durability: the effect of a transaction on the database state should not be lost once the transaction has committed.

Commit: the transaction successfully completes. Abort: the transaction does not successfully complete. The database is consistent if all static integrity constraints are satisfied.

A sequence of database operations is serializable if it is equivalent to a serial execution of the involved transactions. A serializable execution guarantees correctness in that it moves a database from one consistent state to another consistent state. Basically, each transaction preserves database consistency; it thus follows ACID and fulfils the consistency component.

Concurrency control is the protocol that manages simultaneous operations against a database so that serializability is assured.

Locking Protocol => Two-Phase Locking Protocol (2PL). To access a data item, a transaction must obtain either:

o an S (shared) lock – the item can only be read, and the lock may be shared with other readers; or
o an X (exclusive) lock – the item can be read and written by only one transaction.

Under 2PL, once a transaction has released any lock it may not request additional locks, and all locks are released once the transaction completes. Be careful of deadlocks: a cycle of transactions each waiting for locks to be released by another.

Versioning / Snapshot Isolation => each transaction reads from a snapshot of the items it accesses; updates create new versions, which are merged back into the database when the transaction commits.

There are different isolation levels, and different applications require different levels. From lowest to highest:

o Read uncommitted – uncommitted records may be read.
o Read committed – only committed records can be read, but successive reads of a record may return different values. The most used level in practice.
o Repeatable read – only committed records can be read, and repeated reads of the same record must return the same value. Does not mean a transaction is always 100% serializable.
o Serializable – the default according to the SQL standard. All transactions appear serialized and follow ACID.
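The S/X rule at the heart of 2PL can be sketched in a few lines of Python. This is a toy model, not a real lock manager (real systems also queue waiters, escalate locks, and detect deadlocks); it only answers whether a lock request could be granted immediately.

```python
# Compatibility matrix: (held, requested) -> grantable?
# Shared locks coexist; an exclusive lock is compatible with nothing.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_modes):
    """held_modes: lock modes other transactions already hold on this item."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

ok_read  = can_grant("S", ["S", "S"])   # two readers happily admit a third
no_write = can_grant("X", ["S"])        # a writer must wait for the reader
```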


Week 10: Indexing and Tuning

A database is a collection of relations; each relation is a set of records; a record is a sequence of attributes.

1. Indexes – data structures that organize records via trees or hashing.
   1.1. Two examples:
        1.1.1. Ordered index: search keys are stored in sorted order.
        1.1.2. Hash index: search keys are distributed uniformly across 'buckets' using a hash function.

An index is an access path to efficiently locate row(s) via search key fields without having to scan the entire table.

Primary index: an index whose search key specifies the sequential order of the file. Also called a main index or integrated index.

Secondary index: an index whose structure is separate from the data file and whose search key typically specifies an order different from the sequential order of the file.

In SQL, index is: CREATE INDEX name ON relation-name (<attributelist>)

Clustered index:
- index entries and rows are ordered in the same way
- good for range searches over a range of search key values
- there can be at most one clustered index on a table
- CREATE TABLE generally creates an integrated, clustered (main) index on the primary key

Unclustered (secondary) index:
- index entries and rows are not ordered in the same way
- there can be many unclustered indices on a table
- unclustered is never as good as clustered, but may be necessary for attributes other than the primary key

Types of Indexes:

Tree-based indexes: B+-tree
  o Very flexible; the only index type here that supports point queries, range queries and prefix searches

Hash-based indexes
  o Fast for equality searches

Special indexes
  o Such as bitmap indexes for OLAP or R-trees for spatial databases
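Whether a query actually uses an index can be checked with the DBMS's plan explainer. A sketch using SQLite's EXPLAIN QUERY PLAN via Python (the plan text and its row layout are SQLite-specific; other systems have their own EXPLAIN variants):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (sid INTEGER PRIMARY KEY, country TEXT)")
conn.executemany("INSERT INTO Student VALUES (?, ?)",
                 [(i, "AUS" if i % 2 else "NZL") for i in range(1000)])

query = "SELECT * FROM Student WHERE country = 'AUS'"

# Without an index on country, the plan is a full table scan
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

conn.execute("CREATE INDEX idx_country ON Student (country)")

# With the index, SQLite switches to an index search for the equality predicate
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# The last column of each plan row is a human-readable description
used_index_before = any("idx_country" in row[-1] for row in plan_before)
used_index_after  = any("idx_country" in row[-1] for row in plan_after)
```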


Week 11: Data Analysis – OLAP and Data Warehousing

The Problem/Motivation:

- Data such as current and historical data are analyzed to identify useful patterns and support business strategies.
- Businesses aim at complex, interactive and exploratory analysis of datasets by integrating data collected across the enterprise.
- The Internet makes it easier to share big data sets, and correlating them with one's own data becomes more important.
- Data visualization turns large amounts of data into information that businesses can understand easily and base decisions on. Examples: Google Fusion Tables, Maptd.com.
- Data needs to be gathered in a form suitable for analysis.

Data Warehousing: Issues and the ETL Process

Three complementary trends of data analysis in the enterprise:
- Data Warehousing: consolidate data from many sources in one large repository.
- OLAP: interactive and "online" queries based on spreadsheet-style operations and a "multidimensional" view of data.
- Data Mining: exploratory search for interesting trends and anomalies.

OLTP vs OLAP vs Data Mining

- OLTP (On-Line Transaction Processing) maintains a database of some real-world enterprise and supports day-to-day operations: short, simple transactions with frequent updates, each accessing only a small fraction of the database.
- OLAP (On-Line Analytic Processing) uses mainly historic data in the database to guide strategic decisions: complex queries with infrequent updates, where transactions access a large fraction of the database.
- Traditionally, OLAP queries data collected by the organisation's own OLTP system, but newer applications such as Internet companies prefer gathering whatever data they need, potentially even purchasing it. Data are queried in more sophisticated and more specific ways.
- Data mining attempts to find patterns and extract useful information from a database without setting strict guidelines for the query.

Data Warehouse

- Data (often derived from OLTP) for OLAP and data mining applications is usually stored in a special database called a data warehouse.
- A data warehouse contains large amounts of read-only data, gathered at different times spanning long periods, provided by different vendors and with different schemas.
- Populating such warehouses is non-trivial (data integration etc.).
- Issues in data warehousing include semantic integration (eliminating mismatches between different sources, e.g. different attribute names or domains), heterogeneous sources (accessing data from a variety of source formats), load/refresh/purge (loading data, refreshing it periodically, and purging old data) and metadata management (keeping track of source, loading time, and other information for data in the warehouse).
- A warehouse must include a metadata repository: information about the physical and logical organization of the data.

Populating a data warehouse: the ETL process (Capture/Extract, Transform, and Load)

- Typical operational data is transient, not comprehensive, and potentially contains inconsistencies and errors. After ETL, data should be detailed, periodic, and comprehensive.

New techniques for database design, indexing, and analytical querying need to be supported:
- Star schema
- CUBE, ROLLUP and GROUPING SETS
- Window and ranking queries
- ROLAP/MOLAP


Week 12: Introduction to Data Exchange with XML

XML has 4 core specifications: XML Documents, Document Type Definitions (DTDs), Namespaces, and XML Schema.

SQL can be ignorant of how data is stored, but a schema is still required. How do we transport semi-structured data? XML!

Semistructured data: "self-describing, irregular data, no a priori structure".
- Origins: integration of heterogeneous sources; data sources with non-rigid structure (biological or Web data).
- Characteristics: missing or additional attributes; multiple attributes; different types in different objects; heterogeneous collections.

XML describes content whereas HTML describes presentation. Specifics of XML: syntactic structure, elements and attributes, character set; it has a logical and a physical structure (a DTD with 'entities').

Database issues: model XML using graphs, store XML, query XML using XQuery, process XML.

Paradigm shifts:
- Web: from HTML to XML; from information retrieval to data management.
- Databases: from the relational model to semistructured data; from data processing to data/query translation; from storage to transport.

XML vs. JSON. JSON (JavaScript Object Notation) is a text-based, semi-structured format for data interchange that originates from object serialization à la JavaScript. It is a low-overhead format as opposed to XML.

XML:
<person name="John Smith">
  <address street="1 Cleveland Street" city="Sydney"
           state="NSW" zipcode="2006" />
</person>

JSON:
{
  "name": "John Smith",
  "address": {
    "street": "1 Cleveland Street",
    "city": "Sydney",
    "state": "NSW",
    "zipcode": 2006
  }
}
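The correspondence between the two representations can be checked mechanically; a minimal sketch using only Python's standard library, which parses the XML person document above and rebuilds the equivalent JSON object:

```python
import json
import xml.etree.ElementTree as ET

XML_DOC = ('<person name="John Smith">'
           '<address street="1 Cleveland Street" city="Sydney"'
           ' state="NSW" zipcode="2006" /></person>')

def person_to_dict(xml_text):
    # The person's attributes become top-level JSON keys; the nested
    # <address> element becomes a nested JSON object.
    person = ET.fromstring(xml_text)
    address = person.find("address")
    result = {"name": person.get("name"), "address": dict(address.attrib)}
    # In XML every attribute value is a string; JSON distinguishes
    # numbers, so convert the zipcode explicitly.
    result["address"]["zipcode"] = int(result["address"]["zipcode"])
    return result

print(json.dumps(person_to_dict(XML_DOC)))
```

Note the one lossy spot: the XML attribute `zipcode="2006"` is a string, while the JSON `"zipcode": 2006` is a number, so a faithful round-trip needs explicit type conversion.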

DTD (Document Type Definition):
<!ELEMENT book (title)>

XML Schema:
<xsd:simpleType name="Score">
  <xsd:restriction base="xsd:integer">
    <xsd:minInclusive value="0"/>
    <xsd:maxInclusive value="100"/>
  </xsd:restriction>
</xsd:simpleType>

DTD vs. XML Schema:
- DTD: grammar; XML Schema: structure and typing
- DTD: elements + attributes; XML Schema: elements, attributes, simple and complex types, groups
- DTD: only "part of" relationships; XML Schema: supports "includes" relationships - inheritance
- DTD: specified as part of the prologue of an XML document; XML Schema: specified as attribute of the document elements
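XML Schema validation is not in the Python standard library, but the Score restriction above (an integer between 0 and 100 inclusive) can be checked by hand after parsing; a sketch, with the `<score>` element name invented for illustration:

```python
import xml.etree.ElementTree as ET

def valid_score(xml_text):
    # Mirrors the XML Schema restriction: base type xsd:integer,
    # minInclusive 0, maxInclusive 100.
    elem = ET.fromstring(xml_text)
    try:
        value = int(elem.text)
    except (TypeError, ValueError):
        return False          # not an integer at all
    return 0 <= value <= 100  # the min/max facets

print(valid_score("<score>85</score>"))    # True
print(valid_score("<score>120</score>"))   # False
print(valid_score("<score>high</score>"))  # False
```

A DTD could not express this at all: DTDs have no numeric types, which is exactly the "structure and typing" advantage of XML Schema listed above.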

Modern databases support SQL/XML ● Provide an XML datatype to store XML in the database - stored in native tree form.

● Integrate XML support functions for querying and inserting XML data:

● XMLPARSE() parses XML fragments or documents so that they can be stored in SQL.

● XPATH(xpath, xml): Selects the XML content specified by the xpath expression from the xml data.

● XMLEXTRACT and XMLEXISTS: test whether the set of nodes returned by an XPath expression is non-empty (not supported

by PostgreSQL – will be added in upcoming version 9.3)

● XMLELEMENT() produces a single nested XML element

● XMLATTRIBUTES() only as optional part of an XMLELEMENT call, adds attribute(s) to a new XML element.

● XMLCONCAT() concatenates individual XML values

● XMLAGG() an aggregate function that concatenates several input xml rows to a single XML output value

● XMLCOMMENT() creates an XML comment containing the given text

An SQL query does not return XML directly; it produces tables that can have columns of type XML.
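What XMLELEMENT, XMLATTRIBUTES and XMLAGG produce can be mimicked outside the database; a sketch in Python (the student table and element names are invented for illustration) that builds one element per row and aggregates them, roughly what SELECT XMLAGG(XMLELEMENT(...)) would return:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical table, stood up in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])

# In the spirit of XMLAGG(XMLELEMENT(NAME student, XMLATTRIBUTES(sid), name)):
# one <student> element per row, concatenated under a root element.
root = ET.Element("students")
for sid, name in conn.execute("SELECT sid, name FROM student ORDER BY sid"):
    elem = ET.SubElement(root, "student", sid=str(sid))  # XMLATTRIBUTES part
    elem.text = name                                     # element content
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
# <students><student sid="1">Ann</student><student sid="2">Bob</student></students>
```

Inside a database the same shape would come back as a single value in a column of type XML, rather than as a Python string.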


Jenna’s Super Summary

1. Basic database stuff
   - Types of relations
     - Relational schema (e.g. ER diagram)
     - Relational schema instance (e.g. DB)
     - Relation (e.g. foreign key field)
     - Relational instance (e.g. foreign key value, e.g. 3)
   - Types of fields
     - INTEGER
     - VARCHAR
     - CHAR
     - TEXT
     - ENUM: CREATE TYPE x AS ENUM
   - Types of keys
     1. Primary
     2. Candidate
     3. Super
     4. Foreign
        - personid INTEGER REFERENCES person (id) ON DELETE NO ACTION
        - ON DELETE: CASCADE, NO ACTION (default, post-triggers), RESTRICT (pre-triggers), SET NULL, SET DEFAULT
   - Types of constraints
     - Integrity constraints (all constraints; enforce data integrity)
     - Static constraints:
       - Domain constraints (fields must be of the correct data domain) (constraint on ONE attribute)
         1. Null/not null
         2. ENUM checks
         3. Unique and unique checks
       - Key constraints
         1. Keys (including foreign keys)
       - Semantic integrity constraints (constraints on MULTIPLE attributes)
         1. Checks
            - anonymous: status VARCHAR CHECK (status = 'A' OR status = 'B')
            - named: CONSTRAINT chk_status CHECK (status = 'A' OR status = 'B')
         2. Assertions
            - CREATE ASSERTION x CHECK ( NOT EXISTS ( SELECT ... ) )
         3. Functional dependencies
     - Dynamic constraints
       1. Triggers

2. ER diagrams - syntax
   - entity: square
   - attribute: ellipse
     - keys are underlined
     - double ellipses: multi-valued attributes
     - ellipses attached to ellipses: composite attributes
   - relationships: diamonds
     - arrow: at most one
     - thick line: at least one
     - thick arrow: exactly one
     - the constraint applies to the FURTHEST entity
       - Employee works in AT MOST ONE department: Employee --> WorksIn --- Department
       - Employee works in AT LEAST ONE department: Employee === WorksIn --- Department
       - Employee works in EXACTLY ONE department: Employee ==> WorksIn --- Department
       - Employee works in 1 TO 3 departments: Employee ==1..3== WorksIn --0..*-- Department
   - Weak entity types: double rectangles
   - Weak entity (identifying) relationship: double diamonds
   - Discriminator (aka partial key): discriminates among all entities related to one of the other entity
   - Superclass/subclass
     - triangle - superclass at tip of triangle
     - overlapping: default (can belong to 1 or more)
     - disjoint: write "disjoint" (can belong to only 1)
     - total: default (an entity must belong to one)
     - partial: thick line (an entity doesn't have to belong to any)

3. SQL
   - CREATE TABLE
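The key, CHECK, and ON DELETE constraints above can be exercised in any SQL database; a minimal sketch using Python's built-in sqlite3 (the person/account tables are invented for illustration, and SQLite needs a PRAGMA before it enforces foreign keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Primary key, a CHECK constraint, and a foreign key with ON DELETE CASCADE.
conn.execute("""CREATE TABLE person (
                  id     INTEGER PRIMARY KEY,
                  status VARCHAR CHECK (status = 'A' OR status = 'B'))""")
conn.execute("""CREATE TABLE account (
                  accno    INTEGER PRIMARY KEY,
                  personid INTEGER REFERENCES person (id) ON DELETE CASCADE)""")

conn.execute("INSERT INTO person VALUES (1, 'A')")
conn.execute("INSERT INTO account VALUES (4711, 1)")

# Violating the CHECK constraint raises an error.
rejected = False
try:
    conn.execute("INSERT INTO person VALUES (2, 'X')")
except sqlite3.IntegrityError:
    rejected = True
print("CHECK violation rejected:", rejected)

# Deleting the parent row cascades to the referencing account.
conn.execute("DELETE FROM person WHERE id = 1")
remaining, = conn.execute("SELECT COUNT(*) FROM account").fetchone()
print("accounts left after cascade:", remaining)  # 0
```

With ON DELETE NO ACTION or RESTRICT instead, the DELETE itself would fail while the account row still references the person.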

   - SELECT FROM WHERE GROUP BY HAVING ORDER BY
   - SELECT stuff
     - SELECT ALL (keep dups - default) or SELECT DISTINCT (remove dups)
     - SELECT * (all columns)
     - SELECT 3 * 4 (arithmetic operations)
     - SELECT x AS y (rename operator)
   - WHERE stuff
     - = , > , >= , < , <= , != , <>
     - AND, OR, NOT
     - BETWEEN 75 AND 100
     - LIKE 'POST%' (and lots of other string/regex operations)
     - CURRENT_DATE and CURRENT_TIME
     - EXTRACT(year FROM enrolmentDate)
     - DATE '2012-03-01' - '2012-04-01' + INTERVAL '36 HOUR'
   - JOIN stuff
     - join: combine fields
     - equi-join: join where fields are equal
     - natural join: join on duplicate column names
     - outer join: non-matches included as NULL
       - left outer join: joined table can have null attributes
       - right outer join: non-joined table can have null attributes
       - full outer join: both tables can have null attributes
     - union join: all columns included, all rows included (cartesian join)
     - e.g.
       - R NATURAL JOIN S
       - R INNER JOIN S ON <join condition>
       - R INNER JOIN S USING (<list of attributes>)
   - Aggregate functions
     - AVG, MIN, MAX, SUM, COUNT
     - SELECT COUNT(*)
     - SELECT COUNT(DISTINCT sid) FROM Enrolled
     - SELECT AVG(mark)
   - Set operations
     - UNION (add rows)
     - INTERSECT (duplicate rows only)
     - EXCEPT (minus duplicate rows)
   - Subqueries
     - correlated vs uncorrelated
   - NULLs
     - IS NULL, not = NULL
     - 5 + NULL returns NULL
     - most aggregate functions ignore NULLs
     - three-valued logic - OR, AND, NOT

4. Relational Algebra
   1. set operations
      - union - OR
      - intersection - AND
      - difference - MINUS
   2. remove parts
      - selection (sigma) - WHERE clause (select rows)
      - projection (pi) - SELECT clause (select only specified cols)
   3. combine parts
      - cross-product (X) - fully combine relations (col x row = for each thing in the col, that plus the row)
      - join (triangular-infinity symbol) - combine matching tuples (col x row = same as col, but with extra row-part for matches)
        - natural join: join on all equal fields
        - conditional join: join on specified fields
   4. rename parts
      - rename (rho, looks like a rounded p) - rename one field to another

5. Functional dependencies
   - Why is a table called a relation? Relation from primary key to every column
   - Data redundancy causes anomalies
     1. insertion: duplicate data or null values
     2. delete: loss of data needed for future rows
     3. update: changes in one row cause changes to all rows (biggest problem)
   - A --> B if 'A functionally determines B', or 'B is functionally dependent on A'
     - a primary key functionally determines the whole row
     - a candidate key determines every column
     - a superkey is a set of columns that contains a candidate key
   - Attribute closure X^+ of some attributes X is 'all attributes that are determined by X' (functionally dependent on X), including X itself
     1. Initialise result with the given set of attributes: X = {A1, …, An}
     2. Repeatedly search for some FD: A1 A2 … Am -> C


        such that all A1, …, Am are already in the set result, but C is not. Add C to the set result.
     3. Repeat step 2 until no more attributes can be added to result
     4. The set result is the correct value of X^+ (the closure of the attributes)
   - To find all candidate keys, look at each set of attributes K and calculate the attribute closure K^+
     - if K^+ contains all columns, K is a superkey
     - check each subset of K to see if it is also a superkey
     - find the candidate keys (the smallest subsets that are still superkeys)
     - pick one candidate key to be the primary key

6. Normalisation ('decomposing' into normal forms)
   - 1NF: all attributes are atomic (no multivalued or composite attributes)
   - 2NF: no partial dependencies (not important)
   - 3NF: no transitive dependencies (not important)
   - BCNF: no remaining anomalies from functional dependencies (good!)
     - the only non-trivial FDs are key constraints
     - a trivial FD is X --> Y where Y is a subset of X (you determine yourself)
     - formally: for every FD A --> B, either the FD is trivial or A is a superkey (primary key, candidate key, or more)
   - 4NF: no multivalued dependencies (not important)
   - 5NF: no remaining anomalies (not important)
   - Decomposition properties
     - Lossless-join decomposition
       - when you join the decomposed relations, you get the original relation back
       - "not lossless-join" doesn't usually mean whole rows are lost; it can mean that meaningless rows are added
       - if R(A, B, C) has A -> B, then the decomposition R1(A, B) and R2(A, C) is always lossless-join
     - Dependency-preserving decomposition
       - every dependency from the original is still in the decomposed relations
       - often, we say every original dependency is in exactly ONE of the decomposed relations
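The attribute-closure algorithm from the functional-dependencies section is short enough to implement directly; a sketch in Python (the relation R(A, B, C, D) and its FD set are invented for illustration):

```python
def closure(attrs, fds):
    """Compute X+ : all attributes functionally determined by `attrs`,
    given `fds` as a list of (lhs, rhs) pairs of attribute sets."""
    result = set(attrs)          # step 1: start with X itself
    changed = True
    while changed:               # step 3: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            # step 2: if the whole left side is already in result
            # but the right side is not, add the right side
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result                # step 4: this is X+

# Hypothetical relation R(A, B, C, D) with FDs A -> B and B, C -> D.
fds = [({"A"}, {"B"}), ({"B", "C"}, {"D"})]
print(sorted(closure({"A", "C"}, fds)))  # ['A', 'B', 'C', 'D']
print(sorted(closure({"A"}, fds)))       # ['A', 'B']
```

Since {A, C}+ covers every attribute of R, {A, C} is a superkey; and since neither {A}+ nor {C}+ does, it is also a candidate key.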
7. Serialisability
   - ACID
     - Atomicity (all or nothing)
     - Consistency (db always in a valid state: triggers, cascading deletes, CHECKs, etc)
     - Isolation (transactions do not interfere)
     - Durability (committing MEANS committed; once a commit returns, any crash can recover to that commit)
   - a transaction is a list of SQL statements that are ACID, one logical 'unit of work'
     - they happen in order, together; if one fails they all fail
   - 'auto-commit' means every SQL statement is an entire transaction
   - serialisability means an interleaved execution is equivalent to some batch execution: given 2 transactions, the final state is the same as running them in some serial order
   - anomalies
     - dirty read (reading uncommitted data, WR conflict)
       T1: R(A), W(A),             R(B), W(B), Abort
       T2:             R(A), W(A), Commit
     - unrepeatable read (two reads in a transaction give different results, RW conflict)
       T1: R(A),             R(A), W(A), Commit
       T2:       R(A), W(A), Commit
     - lost update (overwriting uncommitted data, WW conflict)
       T1: W(A),       W(B), Commit
       T2:       W(A), W(B), Commit
   - 2-phase locking (2PL) ensures serialisable executions, but can mean some operations are blocked
     - before reading, take a shared lock
     - before writing, take an exclusive lock
     - hold locks until the transaction commits/aborts
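Atomicity (the "all or nothing" above) can be seen directly with Python's sqlite3 module, whose connection object doubles as a transaction context manager; the account schema and the simulated failure are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

# Transfer 30 from account 1 to account 2 as one transaction:
# both updates commit together, or neither does.
try:
    with conn:  # sqlite3 commits on success, rolls back on exception
        conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
print(balances)  # {1: 100, 2: 50} - the half-done transfer was rolled back
```

If the same two UPDATEs ran in auto-commit mode, the crash would leave account 1 debited but account 2 not credited, which is exactly the inconsistency transactions exist to prevent.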

8. Indexing
   - records are stored in pages; each page contains a maximum number of records
   - an index is a type of page
   - Types of indexes
     - sorted (uncommon, tree is better)
     - tree (like sorted but better; good for range, equality and prefix searches)
       - multi-level, e.g. 2 levels mean records are at most 2 indexes away
     - hash (good for equality and that's it)
     - special (e.g. bitmap indexes, R-trees for spatial data)
   - with an index, selecting takes less time, inserting takes more time
   - a covering index (for a query) means all fields in the query are indexed, so the records are not accessed at all
   - an "access path" is the journey you take to reach the data (e.g. query --> table scan --> record)
   - a "search key" is a sequence of attributes that are indexed; includes the primary key
   - Properties of indexes
     1. Main [or primary] (index entries contain the whole row) vs secondary (index entries contain a pointer)
     2. Unique (index over a candidate key) vs non-unique
     3. Clustered (data records are ordered the same way as the index) vs unclustered
        - there can be at most one clustered index on a table
        - clustered is good for "range searches" (key is between two limits)
     4. Single- vs multi-attribute
   - CREATE TABLE usually creates a unique, clustered, main index on the primary key
   - CREATE INDEX usually creates a secondary, unclustered index
     - CREATE INDEX name ON table (field)
   - Space and time problems
     - how much space per row? add up the space per field (e.g. a 20-byte record)
     - how many records per block? divide the size of a block by the record size (round down!) (e.g. a 4K block)
     - how many blocks? divide the total # of records by the # of records per block, rounding up (e.g. 50 blocks)
     - how long does the query take? multiply the number of blocks by the time one access takes (reading a disk block into memory)
     - assumptions:
       - if a field has 3 possible values, there are an equal number of records with each value
       - 10% of the records with A = a also have B = b
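The space/time arithmetic above, worked in Python with the example numbers (20-byte records and a 4K block are from the notes; the 10,000-row table and the 10 ms per block access are assumed for illustration):

```python
RECORD_SIZE = 20          # bytes per record (sum of the field sizes)
BLOCK_SIZE = 4 * 1024     # a 4K disk block
NUM_RECORDS = 10_000      # assumed table size
ACCESS_TIME_MS = 10       # assumed time to read one block into memory

# Records per block: round DOWN - a record cannot span two blocks here.
records_per_block = BLOCK_SIZE // RECORD_SIZE
# Number of blocks: round UP - a partially filled last block still counts.
num_blocks = -(-NUM_RECORDS // records_per_block)  # ceiling division
scan_time_ms = num_blocks * ACCESS_TIME_MS         # full table scan cost

print(records_per_block)  # 204
print(num_blocks)         # 50
print(scan_time_ms)       # 500
```

The two rounding directions are where exam marks are usually lost: records per block rounds down, blocks per table rounds up.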
9. OLAP
   - OLAP stands for "online analytical processing"
   - Data warehousing
     - the db needs to be optimised for SELECT queries; UPDATE, DELETE etc can be slow
     - LOTS of tricks used: indexes, redundant fields, etc - maximise (to a point) redundancy
   - Star schema
     - 1 central fact table, n dimension tables referenced by FKs from the fact table
     - for each dimension, we have a hierarchy
   - getting totals and subtotals for the hierarchies:
     - CUBE(x, y, z) does a GROUP BY for every combination of x, y, z, including GROUP BY (nothing)
     - ROLLUP(x, y, z) does GROUP BY (x, y, z), GROUP BY (x, y), GROUP BY (x), GROUP BY (nothing)
   - WINDOW queries
     SELECT AGG(…) OVER name FROM ...
     WINDOW name AS (
       [ PARTITION BY attributelist ]                          (attributes to group rows by)
       [ ORDER BY attributelist ]                              (attributes to order by)
       [ (RANGE|ROWS) BETWEEN v1 PRECEDING AND v2 FOLLOWING ]  (rows to look at)
     )

10. XML - not summarised
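SQLite (used here through Python's sqlite3) has no ROLLUP, which makes it a convenient way to show what ROLLUP expands to: a UNION ALL of the successive GROUP BY levels. The sales table is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("NSW", "Sydney", 10), ("NSW", "Newcastle", 5),
                  ("VIC", "Melbourne", 7)])

# GROUP BY ROLLUP(state, city) is equivalent to unioning three levels:
# (state, city) subtotals, (state) subtotals, and the grand total,
# with NULL filling the rolled-up columns.
rollup = conn.execute("""
    SELECT state, city, SUM(amount) FROM sales GROUP BY state, city
    UNION ALL
    SELECT state, NULL, SUM(amount) FROM sales GROUP BY state
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()

for row in rollup:
    print(row)
```

CUBE(state, city) would add one more level, the GROUP BY (city) subtotals, since CUBE covers every subset of the grouping attributes rather than only the hierarchy's prefixes.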