database systems 236363 introduction. bureaucratic info (1/2) lecturer: prof. roy friedman –...
TRANSCRIPT
Database Systems236363
Introduction
Bureaucratic Info (1/2)
• Lecturer: Prof. Roy Friedman– Office: Taub 605– E-mail: [email protected]
• TA in Charge: Roni Licher– Office: Taub 225– E-mail: [email protected]
• TA: Hadar Levy– Office: Taub 315– E-mail: [email protected]
Bureaucratic Info (2/2)
• Web site: http://webcourse.cs.technion.ac.il/236363/
• Home assignments: 3 dry, 1 wet
• Grade structure:– 80% final exam, 20% home assignments (takef)– 8% for the wet assignment, 4% each dry one– Home assignments are mandatory – no final grade without a
pass grade in each home assignment• Those who repeat the course must submit all home assignments• Exceptions to this rule according to post-”Zuk Eitan” regulations
– Only applies to students that took the course in Spring 2014 semester
Topics
• Introduction• Entity Relationship Diagrams (ERD)• Relational Algebra• Relational Calculus• Database Design Theory
– Functional Dependencies– Schemas Decomposition and Normal Forms
• XML and the Query Language Xpath• NoSQL and graph databases
• In the recitation: The SQL Query Language
In general, this course is about using DBs
rather than implem
entation details
Let the Fun Begin…
• Database– A (persistent) collection of data– Often with some logical structure– Examples include, e.g., bank accounts, students listed to courses
and their grades, geographical data used by a map/navigation service, customers of an online web-site, dentist’s patients
• A Query Language– Enables querying and manipulating the database– Examples include, e.g., Structured Query Language (SQL),
Datalog, Cassandra Query Language (CQL), Cypher• Database Management System (DBMS)
– The system that manages the database and supports the execution of queries on the database
Why Do We Need a DBMS?
• After all, we have operating systems and file systems…• DBMS provides a data oriented abstraction for
manipulating the data• Enables direct manipulation of the data without
worrying about storage and execution issues– Frees the programmer from worrying about many low level
details such as:• Serializing and de-serializing the data to the storage• Organizing the data in the storage and masking storage latencies
and inefficiencies• How to enable concurrent access to the database?• What to do when the database is larger than physical memory?• Split/cached database operation in combined mobile/cloud
DBMS Functionality
• Storage management• Query processor– Optimizes query processing for efficient information
retrieval and query execution• Concurrency control and recovery• Data integrity– E.g., that an ID number is a unique 9 digit number
• Security– Access control, authentication and encryption
Data Model
• The data model defines the framework for how data is represented– For example, in the relational model, data is
represented as tables (relations)• The query language enables extracting data
from the database according to the given data model– For example, SQL can express queries on tables
and the results are tables as well
• To obtain independence between the data model and the physical storage, we separate between three levels– User view, logical level, and physical level
Data Representation Independence
View
Logical
Physical
Independence of physicallayout beyond this level
Independence of logicallayout beyond this levelData is organized here
according to the datamodel (relational)
Upper Database Layer
• In the end-user layer, each user is provided with a (potentially partial) view that may be different from the actual data layout
Operations on a Database
• Database structure definition (following an analysis of the application’s needs): Includes the logical structures for representing the data and their relations– Data Definition Language (DDL)
• Query execution to retrieve data from the database
• Data manipulation: adding, deleting, and updating– Data Manipulation Language (DML)
• Administrative operations:– Defining views, indexes, etc.
Database Administrator (DBA)
• Responsible for– Planning the logical database layout and adapting
it to the physical layer– Security and access control– Recovery management (after failures)– Performance fine tuning
Data Models (1/2)
• Relational data model– Data is represented using tables– Correspondence between tables is obtained by using same
values in specific columns (keys) with the same name– The main focus of this course
• Entity Relation– A tool for analyzing the requirements of a database and
designing its schema– This model is an abstract one and has no actual direct
implementation
Data Models (2/2)
• Object Oriented data model– A model in which the data is represented as objects
similarly to what is done in OOP• ERD can be mapped to OO
• Semi-structured data model– Data is represented as a graph (independent of the
physical layout)• Extensible Markup Language (XML)– A specific instance of the semi-structured model in
which the graph is a tree
The Relational Data Model• We wish to represent a collection of objects of a given type, each
characterized by a fixed set of properties– For example, student’s name, date of birth, department
• How would we do this in C or Java?
• In the relational data model, we maintain these objects in a table– Each row is used to store one object– Each column represents one of the properties
• The table is a logical structure– Might be physically stored in a completely different manner– Each row must be different than each other row in at least one attribute
• The table represents a set rather than a multi-set
Department Date of Birth Student Name
Computer Science 01/01/1990 John Doe
Electrical Engineering 07/07/1992 Jane Roe
… … …
Relational Model: Terminology
Department Date of Birth Student Name
Computer Science 01/01/1990 John Doe
Electrical Engineering 07/07/1992 Jane Roe
… … …
A relation (the entire table)A record (an entire row)
Schema (title of table) An attribute (column name)
A Formal Definition
• For a given set of attributes A1,…,An and a set of corresponding domains D1,…,Dn (each domain is a set of values)
• Denote R(A1,…,An) the relational schema that contains the attributes A1,…,An
• A relation r over R is a subset of the domain product r ⊆ D1xD2x…xDn
The Formal Definition - Visualized
• Instead of viewing each record as a line in a table, it is viewed as a point in the space of all possible value assignments. The relation is a finite subset of “all possible records”.
Name
Birth Date
John Doe, 01/01/1990
Jane Roe, 07/07/1992
Keys
• A superkey of a relation r is a subset of attributes of r’s schema such that specific values of these attributes identify a single record in r. In other words, there are no two records in r whose values in all attributes of the superkey are the same.
• A relation may have several superkeys. A superkey is called minimal if none of its subsets is a superkey. Such a key is also called a candidate key.
• One of the superkeys can be selected as the primary key. The primary key is used to identify a row in the implementation of the database.
Yet, in PostgreSQL, when no key is defined, a table can include multiple records with the same values in all attributes
Simple Databases
• In the simplest case, all objects of interest are of the same “kind”, meaning that they all have the same attributes list – they are only distinguished by their specific attributes values
• For example, the list of songs on my computer – each such object is characterized by the name of the song, the format (mp3/wma/…), playing time, and size
• In these cases, all objects are organized in a single table
More Involved Databases
• Suppose we wish to design a database for the faculty’s administrative assistants. Here, we can identify at least two types of entities:– Students: student name, id, address– Courses: course name, catalogue number, lecturer
• If a student is registered to a given course, we should be able to know about it and be able to retrieve the student’s final grade– Hence, each student’s participation in a given course should be
recorded somewhere
• The question is how to organize the database for this?
Possible Organization
• A simple option: Have a single large table – for each student’s registration, we will hold the student’s name, id, address, course name, catalogue number, lecturer, and final grade
• Drawbacks:– Redundancy: Why should the student’s address be stored in each
course she takes?– Inadequacy: How can we maintain the details of a student that
does not take any course?– Difficult to update: If a student changes his address, we will need
to update all records of all courses he is registered to. This is both expensive and a source of inconsistency
Another Option
• We can maintain one table for students, one for courses, and one for registration
• The registration table schema can include a primary key for the students (e.g., student id) and a primary key for the courses (e.g., catalogue number) as well as the final grade
• Now each student’s data is independent of each course’s data and vice versa
Another Option
Address Student ID Student Name
Technion City 12345678 Jane Roe
…
Lecturer Catalogue Number Course Name
Rony Superstar 236363 Database Systems
…
Final Grade Student ID Catalogue Number
95 12345678 236363
…
Students
Courses
Registrations
Life is Full of Difficult Choices
• What if we wish to retrieve the names of all lecturers who taught a given student?
• If in the registrations table we only maintain the course’ catalogue number, this query will require a long time to compute
• Further, if the lecturers change over time, the reply will not be accurate– How do we fix this?
• Should we add a lecturer attribute to the registrations table?• Should we define a new “lecturers” table and corresponding relations?
• The database organizational design choices are not trivial – this is the subject of much of this course