INTRODUCTION TO
RELATIONAL DATABASE SYSTEMS
DATENBANKSYSTEME 1 (INF 3131)
Torsten Grust, Universität Tübingen
Winter 2015/16
THE RELATIONAL ALGEBRA
- The Relational Algebra (RA) is a query language for the relational data model.
- The definition of RA is concise: the core of RA consists of five basic operators. More operators can be defined in terms of the core, but this does not add to RA’s expressiveness.
- RA is expressive: all SQL (DML) queries as we have studied them in this course have an equivalent RA form (if we omit ORDER BY, GROUP BY, aggregate functions).
- RA is the original query language for the relational model, proposed by E.F. Codd in 1970. Query languages that are as expressive as RA are called relationally complete. (SQL is relationally complete.)
- There are no RDBMSs that expose RA as a user-level query language, but almost all systems use RA as an internal representation of queries.
- Knowledge of RA will help in understanding SQL, relational query processing, and the performance of relational queries in general (→ course “Datenbanksysteme II”).
THE RELATIONAL ALGEBRA
- RA is an algebra in the mathematical sense: an algebra is a system that comprises
  1. a set (the carrier) and
  2. operations, of given arity, that are closed with respect to the set.
- Example: (ℕ, {*, +}) forms an algebra with two binary (arity 2) operations.
- In the case of RA,
  1. the carrier is the set of all finite relations (= sets of tuples ⚠),
  2. the five operations are $\sigma$ (selection), $\pi$ (projection), $\times$ (Cartesian product), $\cup$ (set union), and $\setminus$ (set difference).
- Closure: Any RA operator
  - takes as input one or two relations (the unary operators $\sigma$, $\pi$ take additional parameters) and
  - returns one relation.
- Relations and operators may be composed to form complex expressions (or queries).
RELATIONAL ALGEBRA: SELECTION ($\sigma$)
Selection
If the unary selection operator $\sigma_p$ is applied to input relation $R$, the output relation holds the subset of tuples in $R$ that satisfy predicate $p$.
- Example: apply $\sigma_{A=1}$ (also consider: $\sigma_B$, $\sigma_{A=A}$, $\sigma_{C>20}$, ⚠ $\sigma_{D=0}$) to relation $R$

    A  B      C
    1  true   20
    1  true   10
    2  false  10

  to obtain

    A  B      C
    1  true   20
    1  true   10

- Selection does not alter the input relation schema, i.e., $\mathrm{sch}(\sigma_p(R)) = \mathrm{sch}(R)$.
RELATIONAL ALGEBRA: SELECTION ($\sigma$)
- Predicate $p$ in $\sigma_p$ is restricted: $p$ is evaluated for each input tuple in isolation and thus must exclusively be expressed in terms of
  1. literals,
  2. attribute references (occurring in $\mathrm{sch}(R)$ of input relation $R$),
  3. arithmetic, comparison ($=$, $<$, $\leqslant$, …), and Boolean operators ($\wedge$, $\vee$, $\neg$).
- In particular, quantifiers ($\exists$, $\forall$) or nested algebra expressions are not allowed in $p$.
- In PyQL, $\sigma_p(r)$ has the following implementation:

    def select(p, r):
        """Return those rows of relation r that satisfy predicate p."""
        return [ row for row in r if p(row) ]

- select(), and thus $\sigma$, are higher-order functions.
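The selection example can be replayed in plain Python. This is a minimal sketch that assumes relations are modelled as lists of dicts (one dict per tuple); that encoding and the variable names are assumptions of the sketch, not part of the slides.

```python
# Relation R(A, B, C) from the selection example, as a list of dicts (assumed encoding).
R = [
    {"A": 1, "B": True,  "C": 20},
    {"A": 1, "B": True,  "C": 10},
    {"A": 2, "B": False, "C": 10},
]

def select(p, r):
    """Return those rows of relation r that satisfy predicate p."""
    return [row for row in r if p(row)]

a_is_1 = select(lambda row: row["A"] == 1, R)    # sigma_{A=1}(R)
b_holds = select(lambda row: row["B"], R)        # sigma_B(R): B itself is Boolean
c_gt_20 = select(lambda row: row["C"] > 20, R)   # sigma_{C>20}(R): empty result
```

The predicate is passed as a function that sees exactly one row, which mirrors the restriction that $p$ is evaluated for each tuple in isolation.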
RELATIONAL ALGEBRA: PROJECTION ($\pi$)
Projection
If the unary operator $\pi_\ell$ is applied to input relation $R$, it applies function $\ell$ to all tuples in $R$. The resulting tuples form the output relation.
- Function argument $\ell$ (the “projection list”) computes one output tuple from one input tuple. $\ell$ constructs new tuples that may
  1. discard input attributes (DB slang: “attributes are projected away”),
  2. contain newly created output attributes whose value is derived in terms of expressions over input attributes, literals, and pre-defined (arithmetic, string, comparison, Boolean) operators.
- Note:
  - $\mathrm{sch}(\pi_\ell(R)) \neq \mathrm{sch}(R)$ in general
  - ⚠ $|\pi_\ell(R)| \leqslant |R|$ (Why?)
RELATIONAL ALGEBRA: PROJECTION ($\pi$)
- PyQL implementation of $\pi_\ell$:

    def project(l, r):
        """Apply function l to relation r and return resulting relation."""
        return dupe([ l(row) for row in r ])   # dupe(): eliminate duplicate list elements

- $\pi_\ell$ is a higher-order function (in functional programming languages, $\pi$ is known as map).
- Common cases and notation for $\pi$ (refer to relation $R(A, B, C)$ shown above):
  - Retain some attributes of the input relation, i.e., project (throw) away all others: $\pi_{C,A}(R)$
  - Rename the attributes of the input relation, leaving their value unchanged: $\pi_{X \leftarrow A,\, Y \leftarrow B}(R)$
  - Compute new attribute values: $\pi_{X \leftarrow A+C,\; Y \leftarrow \neg B,\; Z \leftarrow \text{"LEGO"}}(R)$
RELATIONAL ALGEBRA: PROJECTION ($\pi$)
- Examples:
  1. Apply $\pi_{X \leftarrow A+C,\; Y \leftarrow \neg B,\; Z \leftarrow \text{"LEGO"}}$ to relation $R$

    A  B      C
    1  true   20
    1  true   10
    2  false  10

     to obtain

    X   Y      Z
    21  false  LEGO
    11  false  LEGO
    12  true   LEGO

  2. Apply $\pi_{X \leftarrow A,\, Y \leftarrow B}$ to obtain

    X  Y
    1  true
    2  false
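The two projection examples can be checked with a small runnable sketch. It assumes relations are lists of dicts; the helper `dupe()` is only named on the slides, so the version below is one possible implementation and not necessarily PyQL's.

```python
def dupe(rows):
    """Eliminate duplicate rows, keeping the first occurrence (assumed implementation)."""
    out = []
    for row in rows:
        if row not in out:
            out.append(row)
    return out

def project(l, r):
    """Apply function l to every row of relation r, then remove duplicates."""
    return dupe([l(row) for row in r])

R = [
    {"A": 1, "B": True,  "C": 20},
    {"A": 1, "B": True,  "C": 10},
    {"A": 2, "B": False, "C": 10},
]

# pi_{X <- A+C, Y <- not B, Z <- "LEGO"}(R)
ex1 = project(lambda t: {"X": t["A"] + t["C"], "Y": not t["B"], "Z": "LEGO"}, R)
# pi_{X <- A, Y <- B}(R): two input rows collapse into one output row
ex2 = project(lambda t: {"X": t["A"], "Y": t["B"]}, R)
```

In the second example the output has fewer rows than the input because the two rows with A = 1 become indistinguishable once C is projected away.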
REL. ALGEBRA: CARTESIAN PRODUCT ($\times$)
Cartesian Product
If the binary operator $\times$ is applied to two input relations $R_1$, $R_2$, the output relation contains all possible concatenations of all tuples in both inputs.
- The schemata of inputs $R_1$ and $R_2$ must not share any attribute names. (This is no real restriction because we have $\pi$.) We have:

    $\mathrm{sch}(R \times S) = \mathrm{sch}(R) \mathbin{\dot\cup} \mathrm{sch}(S)$        $|R \times S| = |R| \cdot |S|$

- PyQL implementation:

    def cross(r1, r2):
        """Return the Cartesian product of relations r1, r2."""
        assert(not(schema(r1) & schema(r2)))                           # &: set intersection
        return [ row1 ⨁ row2 for row1 in r1 for row2 in r2 ]           # ⨁: dict merging
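A runnable approximation of `cross()` is sketched below. It assumes rows are dicts, replaces the slide's ⨁ by Python's `{**row1, **row2}` dict merge, and uses a hypothetical `schema()` helper that reads the attribute names off the first row.

```python
def schema(r):
    """Attribute names of relation r (hypothetical helper: inspects the first row)."""
    return set(r[0]) if r else set()

def cross(r1, r2):
    """Return the Cartesian product of relations r1, r2."""
    assert not (schema(r1) & schema(r2)), "input schemata must be disjoint"
    return [{**row1, **row2} for row1 in r1 for row2 in r2]

R = [{"A": 1}, {"A": 2}]
S = [{"B": "x"}, {"B": "y"}, {"B": "z"}]
RS = cross(R, S)
```

Both stated properties can be read off the result: every output row carries the attributes of both inputs, and the row count is the product of the input cardinalities.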
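REL. ALGEBRA: CARTESIAN PRODUCT ($\times$)
- Example: Given graph adjacency (edge) relation $G(\mathit{from}, \mathit{to})$, compute the paths of length two (“Where can I go in two hops?”):

    from  to
    A     B
    B     A
    B     C
    A     D

- Algebraic query:

    $\pi_{\mathit{from},\, \mathit{to} \leftarrow \mathit{to2}}(\sigma_{\mathit{to} = \mathit{from2}}(G \times \pi_{\mathit{from2} \leftarrow \mathit{from},\, \mathit{to2} \leftarrow \mathit{to}}(G)))$

- Result:

    from  to
    A     A
    A     C
    B     B
    B     D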
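The two-hop query can be traced step by step in Python. The sketch assumes the list-of-dicts encoding and compact re-implementations of the three operators; it is meant as an illustration of the algebraic query, not as PyQL itself.

```python
def select(p, r):
    return [row for row in r if p(row)]

def project(l, r):
    out = []
    for row in (l(t) for t in r):
        if row not in out:          # set semantics: drop duplicates
            out.append(row)
    return out

def cross(r1, r2):
    return [{**a, **b} for a in r1 for b in r2]

G = [{"from": "A", "to": "B"}, {"from": "B", "to": "A"},
     {"from": "B", "to": "C"}, {"from": "A", "to": "D"}]

# Rename G's attributes so that both product inputs have disjoint schemata.
G2 = project(lambda t: {"from2": t["from"], "to2": t["to"]}, G)
# Keep only combinations where the first edge ends where the second one starts.
hops = select(lambda t: t["to"] == t["from2"], cross(G, G2))
two_hops = project(lambda t: {"from": t["from"], "to": t["to2"]}, hops)
```

Note that `cross(G, G2)` materializes all 16 combinations before the selection discards 12 of them.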
RELATIONAL ALGEBRA: JOIN ($\bowtie_p$)
- The algebraic two-hop query relied on a combination of $\sigma$-$\times$ that is typical: (1) generate all possible (arbitrary) combinations of tuples, then (2) filter for the interesting/sensible combinations.
Join
The join $\bowtie_p$ of input relations $R_1$, $R_2$ with respect to predicate $p$ is defined as

    $R_1 \bowtie_p R_2 := \sigma_p(R_1 \times R_2)$

- Note:
  - Join does not add to the expressiveness of RA ($\bowtie_p$ is a derived operator or an “RA macro” in a sense).
  - $\bowtie_p$ comes with the same preconditions as $\sigma$ and $\times$: $\mathrm{sch}(R_1) \cap \mathrm{sch}(R_2) = \emptyset$ and $p$ may only refer to attributes in $\mathrm{sch}(R_1) \mathbin{\dot\cup} \mathrm{sch}(R_2)$.
RELATIONAL ALGEBRA: JOIN ($\bowtie_p$)
- RDBMS are equipped with an entire family of algorithms that efficiently compute joins. In particular, the potentially large intermediate result (after $\times$) is not materialized.
- Consider a join implementation in PyQL. Equational reasoning:

      join(p, r1, r2)
        # definition of join
      ≣ select(p, cross(r1, r2))
        # definition of cross
      ≣ select(p, [ row1 ⨁ row2 for row1 in r1 for row2 in r2 ])
        # definition of select
      ≣ [ row for row in [ row1 ⨁ row2 for row1 in r1 for row2 in r2 ] if p(row) ]
        # [ e1(y) for y in [ e2(x) for x in xs ] ] = [ e1(y) for x in xs for y in [e2(x)] ]
      ≣ [ row for row1 in r1 for row2 in r2 for row in [row1 ⨁ row2] if p(row) ]
        # [ e1(y) for x in xs for y in [e2(x)] ] = [ e1(e2(x)) for x in xs ]
      ≣ [ row1 ⨁ row2 for row1 in r1 for row2 in r2 if p(row1 ⨁ row2) ]
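The first and the last line of the derivation can be compared directly. In this sketch (rows as dicts, ⨁ written as `{**row1, **row2}`), the names `join_naive` and `join_fused` are hypothetical labels for the two ends of the equational chain.

```python
def select(p, r):
    return [row for row in r if p(row)]

def cross(r1, r2):
    return [{**row1, **row2} for row1 in r1 for row2 in r2]

def join_naive(p, r1, r2):
    """Definition of join: select over the fully materialized product."""
    return select(p, cross(r1, r2))

def join_fused(p, r1, r2):
    """Last line of the derivation: test p while pairing, no intermediate product list."""
    return [{**row1, **row2} for row1 in r1 for row2 in r2 if p({**row1, **row2})]

R1 = [{"A": 1}, {"A": 2}, {"A": 3}]
R2 = [{"B": 2}, {"B": 3}]
p = lambda t: t["A"] < t["B"]

naive = join_naive(p, R1, R2)
fused = join_fused(p, R1, R2)
```

Both variants still inspect every pair of rows; the fused form merely avoids building the intermediate list. The join algorithms mentioned on the slide go further than this sketch.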
PLANS (OPERATOR TREES)
- Depict RA queries (or plans) as data-flow trees. Tuples flow from the leaves (relations) to the root, which represents the final result.
- Example: Two-hop query using $\times$ and $\bowtie_p$:
  [Figure: operator trees for the two-hop query]
RELATIONAL ALGEBRA: NATURAL JOIN ($\bowtie$)
- Idiomatic relational database design often leads to joins between input relations $R_1$, $R_2$ in which the join predicate performs equality comparisons of attributes of the same name (think of key–foreign key joins).
- Example: Consider the LEGO database:

    sets(set, name, cat, x, y, z, weight, year, img)
    contains(set, piece, color, extra, quantity)

- We have $\mathrm{sch}(\mathit{sets}) \cap \mathrm{sch}(\mathit{contains}) = \{\mathit{set}\}$ and attribute set exactly determines the join condition.
- Associated RA key–foreign key join query:

    $\pi_{\mathit{set},\mathit{name},\mathit{cat},x,y,z,\mathit{weight},\mathit{year},\mathit{img},\mathit{piece},\mathit{color},\mathit{extra},\mathit{quantity}}(\mathit{sets} \bowtie_{\mathit{set2} = \mathit{set}} \pi_{\mathit{set2} \leftarrow \mathit{set},\mathit{piece},\mathit{color},\mathit{extra},\mathit{quantity}}(\mathit{contains}))$
RELATIONAL ALGEBRA: NATURAL JOIN ($\bowtie$)
Natural Join
The natural join $\bowtie$ of input relations $R_1$, $R_2$ performs a $\bowtie_p$ operation where $p$ is a conjunction of equality comparisons between the attributes $\{a_1, \ldots, a_n\} = \mathrm{sch}(R_1) \cap \mathrm{sch}(R_2)$:

    $R_1 \bowtie R_2 := \pi_{\mathrm{sch}(R_1) \cup \mathrm{sch}(R_2)}\bigl(R_1 \bowtie_{a_1 = a_1' \wedge \cdots \wedge a_n = a_n'} \pi_{a_1' \leftarrow a_1, \ldots, a_n' \leftarrow a_n,\; \mathrm{sch}(R_2) \setminus \{a_1, \ldots, a_n\}}(R_2)\bigr)$

- Note: the final (top-most) projection ensures that the shared attributes $\{a_1, \ldots, a_n\}$ only occur once in the result schema (we have $a_i = a_i'$ anyway).
- Terminology: Joins $\bowtie_p$ with a conjunctive all-equalities predicate $p$ are also known as equi-joins. Otherwise, $\bowtie_p$ is also referred to as $\theta$-join (theta-join).
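A compact Python sketch of the natural join follows. It is a shortcut rather than a literal translation of the definition: instead of renaming, joining, and projecting, it compares the shared attributes directly and lets the dict merge keep a single copy of each. The sample rows are made up for illustration and are not taken from the LEGO database.

```python
def schema(r):
    """Attribute names of relation r (hypothetical helper: inspects the first row)."""
    return set(r[0]) if r else set()

def natural_join(r1, r2):
    """Join r1 and r2 on equality of all attributes they share."""
    shared = schema(r1) & schema(r2)
    return [{**row1, **row2}
            for row1 in r1 for row2 in r2
            if all(row1[a] == row2[a] for a in shared)]

# Hypothetical sample data in the shape of sets/contains (only a few attributes shown).
sets_ = [{"set": "s1", "name": "Castle"}, {"set": "s2", "name": "Farm"}]
contains = [{"set": "s1", "piece": "p1"},
            {"set": "s1", "piece": "p2"},
            {"set": "s3", "piece": "p1"}]

joined = natural_join(sets_, contains)
```

Set s2 has no partner in `contains` and set s3 has none in `sets_`, so neither shows up in the result.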
RELATIONAL ALGEBRA: NATURAL JOIN ($\bowtie$)
- Natural Join Quiz: Consider relations $R_1$

    A  B      C
    1  true   20
    1  true   10
    2  false  10

  and $R_2$

    B      C   D
    true   20  X
    false  10  Y
    true   30  Z

  and natural joins
  1. $R_1 \bowtie R_2$
  2. $\pi_{B,C}(R_1) \bowtie \pi_{B,C}(R_2)$
  3. $R_1 \bowtie R_1$
  4. $\pi_{A,C}(R_1) \bowtie \pi_{B,D}(R_2)$
RELATIONAL ALGEBRA: LAWS
- Much like for the algebra (ℕ, {*, +}), the operators of RA obey laws, i.e. strict equivalences that hold regardless of the state of the input relations.
RA Laws (⚠ Excerpt Only)
- $\bowtie$ (and $\bowtie_p$) are associative and commutative:
    $(R_1 \bowtie R_2) \bowtie R_3 = R_1 \bowtie (R_2 \bowtie R_3)$ and $R_1 \bowtie R_2 = R_2 \bowtie R_1$
- $\sigma_p$ may be pushed down into $\bowtie_q$, provided that … ⟨fill in precondition⟩ …:
    $\sigma_p(R_1 \bowtie_q R_2) = R_1 \bowtie_q \sigma_p(R_2)$
- $\sigma_p$ and $\sigma_q$ may be merged:
    $\sigma_p(\sigma_q(R)) = \sigma_{p \wedge q}(R)$
- Among these laws, selection pushdown is considered essential for query optimization. Why?
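Two of these laws (merging selections, commutativity of the natural join) can be spot-checked on sample data. A check on one database state does not prove a law, which must hold for every state; the sketch only illustrates what the equations claim. Relations `R` and `T` and the helper `as_set()` are made up for this purpose.

```python
def select(p, r):
    return [row for row in r if p(row)]

def natural_join(r1, r2):
    shared = (set(r1[0]) & set(r2[0])) if r1 and r2 else set()
    return [{**a, **b} for a in r1 for b in r2 if all(a[c] == b[c] for c in shared)]

def as_set(r):
    """Compare relations as sets of tuples, ignoring row and attribute order."""
    return {frozenset(row.items()) for row in r}

R = [{"A": 1, "C": 20}, {"A": 1, "C": 10}, {"A": 2, "C": 10}]
T = [{"C": 10, "D": "Y"}, {"C": 30, "D": "Z"}]
p = lambda t: t["A"] == 1
q = lambda t: t["C"] < 20

merged_lhs = select(p, select(q, R))                 # sigma_p(sigma_q(R))
merged_rhs = select(lambda t: p(t) and q(t), R)      # sigma_{p and q}(R)
comm_lhs = as_set(natural_join(R, T))                # R join T
comm_rhs = as_set(natural_join(T, R))                # T join R
```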
REL. ALGEBRA: A COMMON QUERY PATTERN
This plan has the SQL equivalent:
RELATIONAL ALGEBRA: $\pi$-$\sigma$-$\bowtie$ QUERIES
- Quiz: Find piece ID and name of all LEGO bricks that belong to a category related to animals.
- Relevant relations:

    bricks(piece, name, cat, weight, img, x, y, z)
    categories(cat, name)        sample value: name = Animal

- Algebraic plan:
REL. ALGEBRA: SET OPERATIONS ($\cup$, $\setminus$)
- Relations are sets of tuples. The usual family of binary set operations applies:
Union ($\cup$), Difference ($\setminus$)
The binary set operations $\cup$ and $\setminus$ compute the union and difference of two input relations $R_1$, $R_2$. The schemata of the input relations must match:

    $\mathrm{sch}(R_1) \stackrel{!}{=} \mathrm{sch}(R_2) = \mathrm{sch}(R_1 \cup R_2) = \mathrm{sch}(R_1 \setminus R_2)$.

- The two set operations complete the operator core of RA ($\sigma$, $\pi$, $\times$, $\cup$, $\setminus$). Any query language that is as expressive as this core is relationally complete.
- ⚠ Set intersection ($\cap$) is not considered a core RA operator. There is more than one way to express intersection as an RA macro:

    $R_1 \cap R_2 :=$
REL. ALGEBRA: SET OPERATIONS ($\cup$, $\setminus$)

    def union(r1, r2):
        """Return the union of relations r1, r2."""
        assert(matches(schema(r1), schema(r2)))
        return dupe([ row1 for row1 in r1 ] + [ row2 for row2 in r2 ])

    def difference(r1, r2):
        """Return the difference of r1, r2 (a subset of r1)."""
        assert(matches(schema(r1), schema(r2)))
        return [ row1 for row1 in r1 if row1 not in r2 ]   # ⚠ Note the negation (not)

More RA Laws
- $\cup$ is associative and commutative:
    $(R_1 \cup R_2) \cup R_3 = R_1 \cup (R_2 \cup R_3)$ and $R_1 \cup R_2 = R_2 \cup R_1$
- The empty relation $\emptyset$ is the neutral element for $\cup$:
    $R \cup \emptyset = \emptyset \cup R = R$
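The helpers `schema()`, `matches()`, and `dupe()` are used but not shown on the slides. The sketch below fills them in with assumed definitions (attribute names taken from the first row, an empty relation matching any schema) so that union, difference, and the neutral-element law can be tried out.

```python
def schema(r):
    """Attribute names of relation r (assumed: inspect the first row)."""
    return set(r[0]) if r else set()

def matches(s1, s2):
    """Assumed schema check: equal attribute sets; an empty relation matches anything."""
    return not s1 or not s2 or s1 == s2

def dupe(rows):
    out = []
    for row in rows:
        if row not in out:
            out.append(row)
    return out

def union(r1, r2):
    assert matches(schema(r1), schema(r2))
    return dupe(r1 + r2)

def difference(r1, r2):
    assert matches(schema(r1), schema(r2))
    return [row1 for row1 in r1 if row1 not in r2]

R1 = [{"A": 1}, {"A": 2}]
R2 = [{"A": 2}, {"A": 3}]
u = union(R1, R2)
d = difference(R1, R2)
```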
REL. ALGEBRA: SET OPERATIONS ($\cup$, $\setminus$)
- In RA queries, $\cup$ can straightforwardly express case distinction.
- Example: Categorize LEGO sets according to their size (volume measured in stud³).

    sets(set, name, cat, x, y, z, weight, year, img)        relevant attributes: x, y, z
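One way to read "case distinction via $\cup$" is one selection per case, each followed by a projection that attaches the case label, with the branches combined by union. The sketch below follows that reading; the threshold of 1000 stud³, the labels, the volume formula x·y·z, and the sample rows are all assumptions made for illustration.

```python
def select(p, r):
    return [row for row in r if p(row)]

def project(l, r):
    out = []
    for row in (l(t) for t in r):
        if row not in out:
            out.append(row)
    return out

def union(r1, r2):
    return project(lambda t: t, r1 + r2)    # concatenate, then drop duplicates

# Hypothetical sample rows of relation sets (only the attributes needed here).
sets_ = [
    {"set": "s1", "x": 2,  "y": 4,  "z": 1},    # volume 8
    {"set": "s2", "x": 16, "y": 16, "z": 10},   # volume 2560
]
THRESHOLD = 1000   # assumed boundary between "small" and "large"

def volume(t):
    return t["x"] * t["y"] * t["z"]

small = project(lambda t: {"set": t["set"], "size": "small"},
                select(lambda t: volume(t) < THRESHOLD, sets_))
large = project(lambda t: {"set": t["set"], "size": "large"},
                select(lambda t: volume(t) >= THRESHOLD, sets_))
categorized = union(small, large)
```

Because the two selection predicates are mutually exclusive and exhaustive, every set lands in exactly one branch.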
SQL: CASE…WHEN (MULTI-WAY CONDITIONAL)
Conditional Expression (CASE…WHEN)
The SQL expression CASE…WHEN implements case distinction ("multi-way if…else"):

    CASE WHEN ⟨condition₁⟩ THEN ⟨expression₁⟩
         [WHEN …]
         [ELSE ⟨expression₀⟩]
    END

- CASE…WHEN evaluates to the first ⟨expressionᵢ⟩ whose Boolean ⟨conditionᵢ⟩ evaluates to TRUE. If no WHEN clause is satisfied, return the value of the fall-back ⟨expression₀⟩, if present (otherwise return NULL).
- Expression ⟨expressionᵢ⟩ is never evaluated if ⟨conditionᵢ⟩ evaluates to FALSE.
- The types of the ⟨expressionᵢ⟩ must be convertible into a single output type.
SQL: SET OPERATIONS
- The family of set operations is available in SQL as well. Since SQL operates over unordered lists (or: bags) of rows, modifiers control the inclusion/removal of duplicates:
UNION, EXCEPT, INTERSECT
The binary set (bag) operations connect two SQL SFW blocks. Schemata must match (columns of the same name have convertible types). Modifier DISTINCT (i.e. set semantics) is the default:

    ⟨SFW⟩ { UNION | EXCEPT | INTERSECT } [ ALL | DISTINCT ] ⟨SFW⟩

- Bag semantics (ALL) with m, n duplicate rows contributed by the two SFW blocks:

    Operation        Duplicates in result
    UNION ALL        m + n
    EXCEPT ALL       max(m − n, 0)
    INTERSECT ALL    min(m, n)
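The duplicate counts in this table coincide with the multiset arithmetic of Python's `collections.Counter`, which makes for a quick sanity check. The sketch models a bag as a Counter from rows to multiplicities; it illustrates the counting rules only and does not execute any SQL.

```python
from collections import Counter

# A row that occurs m = 3 times on the left and n = 1 time on the right.
left = Counter({("brick",): 3})
right = Counter({("brick",): 1})

union_all = left + right        # multiplicities add up: m + n
except_all = left - right       # max(m - n, 0); non-positive counts are dropped
intersect_all = left & right    # min(m, n)
reverse_except = right - left   # max(n - m, 0) = 0: the row disappears
```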
REL. ALGEBRA: SET OPERATIONS ($\cup$, $\setminus$)
- With $\cup$, $\setminus$ (and $\cap$) now being available, we may be even more restrictive with respect to the admissible predicates $p$ in $\sigma_p$:
  1. literals, attribute references, arithmetics, comparisons are OK,
  2. the Boolean connectives ($\wedge$, $\vee$, $\neg$) are not allowed.
- $\sigma_{p \wedge q}(R) =$
- $\sigma_{p \vee q}(R) =$
- $\sigma_{\neg p}(R) =$
RELATIONAL ALGEBRA: MONOTONICITY
Monotonic Operators and Queries
An algebraic operator $\otimes$ is monotonic if a growing input relation implies that the output relation grows, too: $R \subseteq S \Rightarrow \otimes(R) \subseteq \otimes(S)$.
An RA query is monotonic if it exclusively uses monotonic operators.

    Operator (□: Input)        Monotonic?
    σ_p(□)                     ✓
    π_ℓ(□)                     ✓
    □ × _ ,  _ × □             ✓
    □ ∪ _ ,  _ ∪ □             ✓
    □ ∖ _                      ✓
    _ ∖ □                      ❌

    def difference(r1, r2):
        ⋮
        return [ row1 for row1 in r1 if row1 not in r2 ]   # ⚠ If r2 grows, result may shrink
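The asymmetry between the two inputs of difference shows up on a tiny example. In the sketch below (rows as dicts, sample relations made up), growing the left input can only add result rows, whereas growing the right input removes one.

```python
def difference(r1, r2):
    return [row1 for row1 in r1 if row1 not in r2]

def is_subset(r, s):
    return all(row in s for row in r)

R = [{"A": 1}, {"A": 2}]
R_grown = R + [{"A": 3}]          # R is a subset of R_grown
S = [{"A": 1}]
S_grown = S + [{"A": 2}]          # S is a subset of S_grown

left_before = difference(R, S)            # left input: R
left_after = difference(R_grown, S)       # left input grew
right_before = difference(R, S)           # right input: S
right_after = difference(R, S_grown)      # right input grew
```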
RELATIONAL ALGEBRA: MONOTONICITY
- It follows that we require the difference operator whenever a non-monotonic query is to be answered:
  - “Find those LEGO sets that do not contain LEGO bricks with stickers.”
  - “Find those LEGO sets in which all bricks are colored yellow.”
  - “Find the LEGO set that is the heaviest.”
- More generally: Typical non-monotonic query problems are those that …
  1. … check for the non-existence of an element in a set $S$,
  2. … check that a condition holds for all elements in a set $S$,
  3. … find an element that is minimal/maximal among all other elements in a set $S$.
  ⚠ Note that cases 2 and 3 are in fact instances of case 1. How?
- Insertion into $S$ may invalidate tuples that were valid query responses before $S$ grew.
REL. ALGEBRA: NON-MONOTONIC QUERY
“Find those LEGO sets that do not contain LEGO bricks with stickers.”

    sets(set, name, cat, x, y, z, weight, year, img)        sample row: set = s
    contains(set, piece, color, extra, quantity)            sample row: set = s, piece = p
    bricks(piece, name, cat, weight, img, x, y, z)          sample row: piece = p, name = Sticker

1. Identify the LEGO bricks with stickers. (bricks)
2. Find the sets that contain these bricks. (contains)
3. These are exactly those sets that do not interest us. (contains)
4. Attach set name and other set information of interest. (sets)
REL. ALGEBRA: NON-MONOTONIC QUERY
“Find those LEGO sets in which all bricks are colored yellow.”

    sets(set, name, cat, x, y, z, weight, year, img)                sample row: set = s
    contains(set, piece, color, extra, quantity)                    sample row: set = s, color = c
    colors(color, name, finish, rgb, from_year, to_year)            sample row: color = c, name = Yellow

- Query/problem indeed is non-monotonic: Insertion into relation ____________ can invalidate a formerly valid result tuple.
- Plan of attack resembles the no stickers query (see following slide).
REL. ALGEBRA: NON-MONOTONIC QUERY
“Find those LEGO sets in which all bricks are colored yellow.”
- ➊ Lookup the colors of the individual bricks (identify yellow in relation colors).
- ➋ Select the bricks that are not yellow.

    ➊  set  color        ➋  set  color
        s1   █                s1   █
        s1   █                s3   █
        s2   █                s3   █
        s2   █
        s3   █
        s3   █
REL. ALGEBRA: NON-MONOTONIC QUERY
“Find those LEGO sets in which all bricks are colored yellow.”
- ➌ Identify the sets that contain (at least one) such non-yellow brick.
- Among all sets ➍, the sets of ➌ are exactly those we are not interested in.
- ➎ Thus, return all other sets.

    ➌  set        ➍  set        ➎  set
        s1             s1             s2
        s3             s2
                       s3
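Steps ➊ to ➎ can be followed on made-up data. In the sketch below, the relation contents (pieces, color IDs, the color names Red and Blue) are hypothetical and merely chosen so that sets s1 and s3 contain a non-yellow brick while s2 does not, as in the tables above.

```python
def dupe(rows):
    out = []
    for row in rows:
        if row not in out:
            out.append(row)
    return out

# Hypothetical instance (only the attributes needed for the query).
sets_ = [{"set": "s1"}, {"set": "s2"}, {"set": "s3"}]
contains = [
    {"set": "s1", "piece": "p1", "color": 1}, {"set": "s1", "piece": "p2", "color": 2},
    {"set": "s2", "piece": "p1", "color": 1}, {"set": "s2", "piece": "p3", "color": 1},
    {"set": "s3", "piece": "p2", "color": 2}, {"set": "s3", "piece": "p4", "color": 3},
]
colors = [{"color": 1, "name": "Yellow"}, {"color": 2, "name": "Red"},
          {"color": 3, "name": "Blue"}]

# (1) look up the color names of the bricks in each set
step1 = dupe([{"set": c["set"], "color": k["name"]}
              for c in contains for k in colors if c["color"] == k["color"]])
# (2) keep the bricks that are not yellow
step2 = [t for t in step1 if t["color"] != "Yellow"]
# (3) sets with at least one non-yellow brick
step3 = dupe([{"set": t["set"]} for t in step2])
# (4) all sets
step4 = dupe([{"set": t["set"]} for t in sets_])
# (5) difference: all sets minus the disqualified ones
step5 = [t for t in step4 if t not in step3]
```

The difference in the last step is what makes the query non-monotonic: adding a single non-yellow brick to s2 would remove s2 from the result.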
RELATIONAL ALGEBRA: SEMIJOIN ($\ltimes_p$)
- Join used as a filter: above $\bowtie$, only attributes of contains are relevant.
(Left) Semijoin
The left semijoin $\ltimes_p$ of input relations $R_1$, $R_2$ returns those tuples of $R_1$ for which at least one join partner in $R_2$ exists:

    $R_1 \ltimes_p R_2 := \pi_{\mathrm{sch}(R_1)}(R_1 \bowtie_p R_2)$

- $\ltimes_p$ acts like a filter on $R_1$: $R_1 \ltimes_p R_2 \subseteq R_1$. Can be used to implement the semantics of existential quantification ($\exists$) in RA.
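RELATIONAL ALGEBRA: ANTIJOIN ($\rhd_p$)
(Left) Antijoin
The left antijoin $\rhd_p$ of input relations $R_1$, $R_2$ returns those tuples of $R_1$ for which there does not exist any join partner in $R_2$:

    $R_1 \rhd_p R_2 := R_1 \setminus (R_1 \ltimes_p R_2) = R_1 \setminus \pi_{\mathrm{sch}(R_1)}(R_1 \bowtie_p R_2)$

- $\rhd_p$ can be used to implement the semantics of universal quantification:

    $R_1 \rhd_p R_2 = \{x \mid x \in R_1, \neg\exists y \in R_2 : p(x, y)\} = \{x \mid x \in R_1, \forall y \in R_2 : \neg p(x, y)\}$

- Example: Use self-left-antijoin on $S(A)$ to compute $\max(S)$:

    $S \rhd_{A < A'} \pi_{A' \leftarrow A}(S) = \{x \mid x \in S, \neg\exists y \in \pi_{A' \leftarrow A}(S) : x.A < y.A'\}$
    $\phantom{S \rhd_{A < A'} \pi_{A' \leftarrow A}(S)} = \{x \mid x \in S, \forall y \in \pi_{A' \leftarrow A}(S) : x.A \geqslant y.A'\}$
    $\phantom{S \rhd_{A < A'} \pi_{A' \leftarrow A}(S)} = \{\max(S)\}$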
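Semijoin and antijoin translate naturally into Python's `any()`. The sketch below assumes a predicate `p(x, y)` that receives one row from each input, matching the set-comprehension formulation rather than the merged-row convention of the earlier `select()`. The relation S is made up.

```python
def semijoin(p, r1, r2):
    """Rows of r1 that have at least one join partner in r2."""
    return [x for x in r1 if any(p(x, y) for y in r2)]

def antijoin(p, r1, r2):
    """Rows of r1 that have no join partner in r2."""
    return [x for x in r1 if not any(p(x, y) for y in r2)]

S = [{"A": 3}, {"A": 7}, {"A": 5}]
S_renamed = [{"A'": t["A"]} for t in S]           # pi_{A' <- A}(S)

smaller_than = lambda x, y: x["A"] < y["A'"]
not_max = semijoin(smaller_than, S, S_renamed)    # rows for which some larger value exists
maximum = antijoin(smaller_than, S, S_renamed)    # rows for which no larger value exists
```

Together, the two results partition S, which reflects the definition of the antijoin as $R_1$ minus the semijoin.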
RELATIONAL ALGEBRA: DIVISION ($\div$)
- Certain query scenarios involving quantifiers can be concisely formulated using the derived RA operator division ($\div$):
Division
The relational division ($\div$) of input relation $R_1(A, B)$ by $R_2(B)$ returns those $A$ values $a$ of $R_1$ such that for every $B$ value $b$ in $R_2$ there exists a tuple $(a, b)$ in $R_1$. Let $s = \mathrm{sch}(R_1) \setminus \mathrm{sch}(R_2)$:

    $R_1 \div R_2 := \pi_s(R_1) \setminus \pi_s((\pi_s(R_1) \times R_2) \setminus R_1)$

- Notes:
  - Schemata in general: $\mathrm{sch}(R_2) \subset \mathrm{sch}(R_1)$ and $\mathrm{sch}(R_1 \div R_2) = \mathrm{sch}(R_1) \setminus \mathrm{sch}(R_2)$.
  - Division? Division! If $R_1 \times R_2 = S$ then $S \div R_1 = R_2$ and $S \div R_2 = R_1$.
RELATIONAL ALGEBRA: DIVISION ($\div$)
- Example: Divide $R_1(A, B)$ by $R_2(B)$:

    R1:  A  B        R2:  B        R2:  B
         1  a             a             a
         1  c             c             b
         2  b                           c
         2  a
         2  c
         3  b
         3  c
         3  a
         3  d
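The division formula can be evaluated literally on this example. The sketch below deviates from the list-of-dicts encoding and represents $R_1$ as a set of (A, B) pairs and $R_2$ as a set of B values, purely to keep the set operations short; the function name `divide` is hypothetical.

```python
R1 = {(1, "a"), (1, "c"), (2, "b"), (2, "a"), (2, "c"),
      (3, "b"), (3, "c"), (3, "a"), (3, "d")}

def divide(r1, r2):
    """r1: set of (a, b) pairs, r2: set of b values. Follows the RA definition."""
    candidates = {a for (a, _) in r1}                              # pi_s(R1)
    missing = {(a, b) for a in candidates for b in r2} - r1        # (pi_s(R1) x R2) minus R1
    disqualified = {a for (a, _) in missing}                       # pi_s(...)
    return candidates - disqualified

by_ac = divide(R1, {"a", "c"})
by_abc = divide(R1, {"a", "b", "c"})
```

A value 1 pairs with a and c but not with b, so it survives division by {a, c} only.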
RELATIONAL ALGEBRA: OUTER JOIN
Find the bricks of LEGO Set 336–1 along with possible replacement bricks

    contains(set, piece, color, extra, quantity)            sample row: set = s (= 336-1), piece = p1
    matches(piece1, piece2, set)                            sample row: piece1 = p1, piece2 = p2, set = s
    bricks(piece, name, cat, weight, img, x, y, z)          sample row: piece = pi

- Relation matches: In set $s$, piece $p_2$ is considered a replacement for original piece $p_1$.
- Query (⚠ unexpectedly returns way too few rows…):

    $\pi_{\mathit{piece},\mathit{name},\mathit{piece2}}(\pi_{\mathit{piece},\mathit{name}}(\mathit{bricks}) \bowtie \pi_{\mathit{set},\mathit{piece}}(\sigma_{\mathit{set} = \text{336–1}}(\mathit{contains})) \bowtie \pi_{\mathit{piece} \leftarrow \mathit{piece1},\mathit{piece2},\mathit{set}}(\mathit{matches}))$
RELATIONAL ALGEBRA: OUTER JOIN
(Left) Outer Join
The (left) outer join ⟕$_p$ of input relations $R_1$ and $R_2$ returns all tuples of the join of $R_1$, $R_2$ plus those tuples of $R_1$ that did not find a join partner (padded with ▢). Let $\mathrm{sch}(R_2) = \{a_1, \ldots, a_n\}$:

    $R_1$ ⟕$_p$ $R_2 := (R_1 \bowtie_p R_2) \cup ((R_1 \rhd_p R_2) \times \{(a_1 : ▢, \ldots, a_n : ▢)\})$

- Notes:
  1. A Cartesian product with a singleton relation can conveniently implement the required padding.
  2. ⟕$_p$ is non-monotonic: insertion into $R_1$ or $R_2$ may invalidate a former ▢-padded result tuple.
  3. The variants right (⟖) and full outer join (⟗) are defined in the obvious fashion.
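The definition combines three earlier building blocks: join, antijoin, and a product with a one-row padding relation. The sketch below spells this out, using Python's `None` as a stand-in for ▢ (an assumption of the sketch). The caller passes the schema of $R_2$ explicitly, and the sample rows are invented.

```python
def left_outer_join(p, r1, r2, r2_schema):
    """(r1 join_p r2) united with (r1 antijoin_p r2) x {(a1: None, ..., an: None)}."""
    joined = [{**x, **y} for x in r1 for y in r2 if p({**x, **y})]
    dangling = [x for x in r1 if not any(p({**x, **y}) for y in r2)]   # antijoin
    padding = {a: None for a in r2_schema}                             # singleton relation
    return joined + [{**x, **padding} for x in dangling]

# Hypothetical rows: piece p1 has a replacement, piece p2 has none.
pieces = [{"piece": "p1"}, {"piece": "p2"}]
matches = [{"piece1": "p1", "piece2": "p9"}]

result = left_outer_join(lambda t: t["piece"] == t["piece1"],
                         pieces, matches, ["piece1", "piece2"])
```

Unlike with a plain join, piece p2 is retained in the output, padded with `None` in the columns that `matches` would have contributed.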
SQL: ALTERNATIVE JOIN SYNTAX
Alternative SQL Join Syntax
In the FROM clause, two ⟨from_item⟩s may be joined using an RA-inspired syntax as follows:
⟨from_item⟩ ⟨join_type⟩ JOIN ⟨from_item⟩ [ ON ⟨condition⟩ | USING (⟨column⟩ [, …]) ]
where ⟨join_type⟩ is defined as
{ [NATURAL] { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } | CROSS }
to indicate $\bowtie$, ⟕, ⟖, ⟗, and $\times$, respectively.
- For all join types but CROSS, exactly one of NATURAL, ON, or USING must be specified.
- USING ($a_1$,…,$a_n$) abbreviates a conjunctive equality condition over the $n$ columns.
SQL: USING OUTER JOIN
- List all LEGO colors ordered by popularity (= number of bricks available in that color)

    colors(color, name, finish, rgb, from_year, to_year)        sample row: color = c
    available_in(color, brick)                                  sample row: color = c

- Recall: the relationship between colors and bricks was described in the LEGO database ER diagram as follows, i.e., there may be unpopular colors that are not used (anymore):

    [brick] --(0,*)-- ⟨available in⟩ --(0,*)-- [color]

- Formulate the query with output columns: name, finish, popularity (= ⟨brick count⟩ ⩾ 0) …
  1. … using SQL’s alternative join syntax,
  2. … using all SQL constructs but the alternative join syntax.