Slides: CMPSCI 645 Database Design & Implementation, avid.cs.umass.edu/courses/645/s2008/slides/01-Intro.pdf

Welcome to
CMPSCI 645 Database Design & Implementation

Instructor: Gerome Miklau

Overview of Databases

Gerome Miklau
CMPSCI 645 – Database Design & Implementation
UMass Amherst
Jan 29, 2008

Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom

Today

• Student information form
• Overview of databases
• Course topics
• Course requirements

Databases & DBMSs

• A database is a large, integrated collection of data.
• A database management system (DBMS) is a collection of software designed to store and manage databases. It lets users:
– Define the kind of data stored
– Query and update data through a common interface
– Rely on storage and recovery for 100s of GB
– Control access to data from many concurrent users

Can filesystems do it?

• Schema for files is limited
• No query language for data in files
• Files can store large amounts of data, but
– no efficient access to items within file
– no recovery from failure
• Concurrent access not safe

No

Evolution

• Early DBMSs (1960s) evolved from file systems.
• Data with many small items & many queries or modifications:
– Airline reservations
– Banking

Early DB systems

• Tree-based hierarchical data model
• Graph-based network data model
• Encouraged users to think about data the way it was stored.
• No high-level query language

Data model: the data model includes basic assumptions about what an “item” of data is, and how to represent and interpret it.

The Relational Model

• The relational data model (Codd, 1970):
– Data independence: details of physical storage are hidden from users
– High-level declarative query language
  • say what you want, not how to compute it.
  • mathematical foundation
– A theory of normalization guides the design of relations
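
To make "say what you want, not how to compute it" concrete, here is a minimal Python sketch using the standard sqlite3 module. The student table and its three rows are assumptions borrowed from the sample data on a later slide. The same lookup is written twice: once as a hand-coded loop that spells out how to find the row, and once as a declarative query that only states the condition.

```python
import sqlite3

# Hypothetical sample data (sid, name), borrowed from the schema slide.
rows = [(1, "Jill"), (2, "Bo"), (3, "Maya")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT)")
with conn:
    conn.executemany("INSERT INTO student VALUES (?, ?)", rows)


def procedural_lookup(sid):
    """'How': the caller walks the stored rows and picks the match itself."""
    for row_sid, name in rows:
        if row_sid == sid:
            return name
    return None


def declarative_lookup(sid):
    """'What': the caller states the condition; the DBMS decides how to find it."""
    row = conn.execute("SELECT name FROM student WHERE sid = ?", (sid,)).fetchone()
    return row[0] if row else None


print(procedural_lookup(3), declarative_lookup(3))
```

Both calls return the same name. Only the loop version would need rewriting if the storage layout changed, which is the point of data independence.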

Side-note: Turing Awards in Databases
1973: Bachman, network data model
1981: Codd, relational model
1998: Jim Gray, transaction processing

DBMS Benefit #1: Generality and Declarativity

• The programmer or user does not need to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.

• Instead, the programmer/user programs with a logical model in mind

• The DBMS “makes it happen” based on an understanding of relative costs of different methods
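
As a rough illustration of this benefit, the sketch below asks SQLite (through Python's sqlite3 module) how it would run the same query before and after an index is created. The table, the sample rows and the index name idx_student_name are assumptions made for the example. The exact wording of the plan text varies between SQLite versions, so the code makes no claim about it beyond printing it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)",
        [(1, "Jill"), (2, "Bo"), (3, "Maya")],
    )

QUERY = "SELECT sid FROM student WHERE name = 'Jill'"


def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row is a human-readable step description.
    return "; ".join(row[-1] for row in rows)


plan_without_index = plan(QUERY)
result_without_index = conn.execute(QUERY).fetchall()

# A physical design change: the query text stays exactly the same.
conn.execute("CREATE INDEX idx_student_name ON student (name)")

plan_with_index = plan(QUERY)
result_with_index = conn.execute(QUERY).fetchall()

print(plan_without_index)
print(plan_with_index)
```

The query string and its result are identical in both runs. Only the plan chosen by the system changes, and the programmer never mentions the index.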

Benefit #2: Efficiency and Scale

• Efficient storage of hundreds of GBs of data

• Efficient access to data

• Rapid processing of transactions

Benefit #3: Management of Concurrency and Reliability

• Simultaneous transactions handled safely.
• Recovery of system data after system failure.
• More formally: the ACID properties
– Atomicity - all or nothing
– Consistency - sensible state not violated
– Isolation - separated from the effects of concurrent transactions
– Durability - once completed, never lost
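
The "all or nothing" property can be sketched with Python's sqlite3 module. The account table, the balances and the transfer function below are hypothetical and exist only for illustration. The example covers atomicity and a simple consistency rule; it does not demonstrate isolation or durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint encodes a consistency rule: no negative balances.
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])


def transfer(src, dst, amount):
    """Move money between accounts as a single all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return (id, balance) pairs in id order."""
    return conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()


ok = transfer(1, 2, 30)       # both updates are applied
failed = transfer(1, 2, 500)  # the debit violates the CHECK, so the credit is undone too
print(ok, failed, balances())
```

In the failed transfer the credit to account 2 has already executed when the debit is rejected, yet the rollback leaves neither update visible.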

How Does One Build a Database?

• Start with a conceptual model
• Design & implement schema
• Write applications using DBMS and other tools
– Many ways of doing this (DBMS, API writers, library authors, web server, etc.)
– Common applications include PHP/JSP/servlet-driven web sites
• The DBMS takes care of query optimization and execution

Conceptual Design

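[ER diagram: entities STUDENT (sid, name), COURSE (cid, name) and PROFESSOR (fid, name); relationship Takes links STUDENT and COURSE, and relationship Teaches links PROFESSOR and COURSE; the diagram also shows a semester attribute.]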

Designing a Schema (Set of Relations)

• Convert to tables + constraints

• Then need to do “physical” design: the layout on disk, indices, etc.

STUDENT
sid  name
1    Jill
2    Bo
3    Maya

Takes
sid  cid
1    645
1    683
3    635

COURSE
cid  name  sem
645  DB    F05
683  AI    S05
635  Arch  F05

PROFESSOR
fid  name
1    Diao
2    Saul
8    Weems

Teaches
fid  cid
1    645
2    683
8    635
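
One possible rendering of "tables + constraints" for the sample data on this slide, sketched with Python's sqlite3 module. The choice of primary keys, foreign keys and NOT NULL constraints is an assumption, since the slide does not state them, and the enroll helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE student   (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course    (cid INTEGER PRIMARY KEY, name TEXT NOT NULL, sem TEXT);
CREATE TABLE professor (fid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE takes (
    sid INTEGER REFERENCES student(sid),
    cid INTEGER REFERENCES course(cid),
    PRIMARY KEY (sid, cid)
);
CREATE TABLE teaches (
    fid INTEGER REFERENCES professor(fid),
    cid INTEGER REFERENCES course(cid),
    PRIMARY KEY (fid, cid)
);
""")

with conn:
    conn.executemany("INSERT INTO student VALUES (?, ?)",
                     [(1, "Jill"), (2, "Bo"), (3, "Maya")])
    conn.executemany("INSERT INTO course VALUES (?, ?, ?)",
                     [(645, "DB", "F05"), (683, "AI", "S05"), (635, "Arch", "F05")])
    conn.executemany("INSERT INTO professor VALUES (?, ?)",
                     [(1, "Diao"), (2, "Saul"), (8, "Weems")])
    conn.executemany("INSERT INTO takes VALUES (?, ?)",
                     [(1, 645), (1, 683), (3, 635)])
    conn.executemany("INSERT INTO teaches VALUES (?, ?)",
                     [(1, 645), (2, 683), (8, 635)])


def enroll(sid, cid):
    """Try to add a Takes row; the constraints reject dangling references."""
    try:
        with conn:
            conn.execute("INSERT INTO takes VALUES (?, ?)", (sid, cid))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = enroll(2, 645)  # Bo takes DB: both keys exist
rejected = enroll(2, 999)  # there is no course 999: foreign key violation
print(accepted, rejected)
```

The physical design mentioned in the next bullet (layout on disk, indices) is left entirely to SQLite's defaults in this sketch.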

Queries

• Find all courses that “Mary” takes

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.sid = T.sid AND T.cid = C.cid

• What happens behind the scenes?
– Query processor figures out how to answer the query efficiently.
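
The query on this slide can be tried against the sample rows from the schema slide with a short sqlite3 sketch. Some choices here are assumptions: the table names follow the query (Students, Takes, Courses) rather than the schema slide, the student name is passed as a parameter, and an ORDER BY is added only to make the printed result deterministic. The sample data contains no student named Mary, so the example also looks up Jill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses  (cid INTEGER PRIMARY KEY, name TEXT, sem TEXT);
CREATE TABLE Takes    (sid INTEGER, cid INTEGER);
INSERT INTO Students VALUES (1, 'Jill'), (2, 'Bo'), (3, 'Maya');
INSERT INTO Courses  VALUES (645, 'DB', 'F05'), (683, 'AI', 'S05'), (635, 'Arch', 'F05');
INSERT INTO Takes    VALUES (1, 645), (1, 683), (3, 635);
""")


def courses_taken_by(student_name):
    """Run the slide's three-way join for a given student name."""
    rows = conn.execute(
        """
        SELECT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.sid = T.sid AND T.cid = C.cid
        ORDER BY C.name
        """,
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(courses_taken_by("Jill"))  # ['AI', 'DB']
print(courses_taken_by("Mary"))  # []
```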

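Queries, behind the scenes

Declarative SQL query:

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.sid = T.sid AND T.cid = C.cid

Query execution plan:

[Plan diagram: an operator tree in which Students and Takes are joined on sid=sid, the result is joined with Courses on cid=cid, and a selection name=“Mary” and a projection on name produce the answer.]

The optimizer chooses the best execution plan for a query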
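
To show what an execution plan looks like as a program, the sketch below hand-codes one candidate plan for this query in plain Python: select on the student name first, then two hash joins, then a projection. This is an illustrative assumption, not the plan a particular optimizer would pick and not necessarily the exact tree drawn on the slide. The operator functions and the sample rows (taken from the schema slide) are hypothetical.

```python
# Sample rows from the schema slide, one dictionary per row.
STUDENTS = [{"sid": 1, "name": "Jill"}, {"sid": 2, "name": "Bo"}, {"sid": 3, "name": "Maya"}]
TAKES = [{"sid": 1, "cid": 645}, {"sid": 1, "cid": 683}, {"sid": 3, "cid": 635}]
COURSES = [{"cid": 645, "name": "DB"}, {"cid": 683, "name": "AI"}, {"cid": 635, "name": "Arch"}]


def scan(table, alias):
    """Leaf operator: emit each row with alias-qualified column names."""
    for row in table:
        yield {f"{alias}.{col}": val for col, val in row.items()}


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def hash_join(left, right, left_key, right_key):
    """Equi-join: build a hash table on the left input, probe it with the right."""
    buckets = {}
    for left_row in left:
        buckets.setdefault(left_row[left_key], []).append(left_row)
    for right_row in right:
        for left_row in buckets.get(right_row[right_key], []):
            yield {**left_row, **right_row}


def project(rows, column):
    """Projection: keep a single column."""
    return (row[column] for row in rows)


def run_plan(student_name):
    """Evaluate the operator tree bottom-up for the given student name."""
    students = select(scan(STUDENTS, "S"), lambda r: r["S.name"] == student_name)
    s_t = hash_join(students, scan(TAKES, "T"), "S.sid", "T.sid")
    s_t_c = hash_join(s_t, scan(COURSES, "C"), "T.cid", "C.cid")
    return sorted(project(s_t_c, "C.name"))


print(run_plan("Jill"))  # ['AI', 'DB']
```

Other plans (a different join order, nested-loop joins, or selecting after the joins) return the same answer at a different cost. Choosing among them is the optimizer's job.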

An Issue: 80% of the World’s Data is Not in a DB!

Examples:
– Scientific data (large images, complex programs that analyze the data)
– Personal data
– WWW and email (some of it is stored in something resembling a DBMS)

Data management is expanding to tackle these problems

DBMSs in the Real World

A huge industry for 20% of the world’s data!

• Big, mature relational databases
– IBM DB2, Oracle, Microsoft SQL Server
– Adding advanced features, including “native XML” support
• “Middleware” above these systems
– SAP, Siebel, PeopleSoft, dozens of special-purpose apps
• Integration and warehousing systems
– BEA AquaLogic, DB2 Information Integrator
• Current trends:
– Web services; XML everywhere
– Smarter, self-tuning systems
– Distributed databases, column-stores

Database Research

• One of the broadest, most exciting areas in CS!
• A microcosm of CS in general
– languages, operating systems, concurrent programming, data structures, algorithms, theory, distributed systems, statistical techniques.

• Theory and systems well-integrated.

Recent Trends in Databases

• XML
– Relational databases with XML support
– Middleware between XML and relational databases
– Large-scale XML message systems
• Main memory database systems
• Peer data management
• Stream data management
• Model management, provenance
• Security and privacy
• Modeling uncertainty, probabilistic databases

What is the Field of Databases?

• To an applied researcher (SIGMOD/VLDB/ICDE)
– Query optimization
– Query processing (yet-another join algorithm)
– Transaction processing, recovery (but most stuff is already done)
– Novel applications: data mining, high-dimensional search
• To a theoretical researcher (PODS/ICDT/LICS)
– Focus on the query languages
– Query language = logic = complexity classes

Course topics

• Fundamentals: relational design, query languages.

• Theory: expressiveness of query languages, static analysis, complexity.

• Database internals: storage, indexing, query processing, query optimization, transaction management.

• XML and semi-structured data models.
• Security: access control, privacy.
• Advanced topics: incomplete/probabilistic DBs, parallel and distributed DBs.

Prerequisites

• Official: undergrad course in DB or OS
• Also:
– Elementary complexity theory

Grading

• Homework: 30%
• Paper reviews & participation: 15%
• Midterm: 30%
• Project: 25%

Homework: 30%

• ~4 assignments throughout the course
– written problem sets
– practical experience with SQL, XQuery

Paper Reviews & Participation: 15%

• Approximately 5 classic papers will be assigned

• Short written reviews are due before the day of class. Email to:
– [email protected]

First paper review: read through Section 1.4 of Codd’s paper. Due Wed Feb 5th

Project: 25%

• General theme: apply database principles to a new problem
• Suggested topics will be discussed next Tuesday
• Groups of 2 preferred. 3 possible.
• Project work will include:
– Reading some of the research literature
– Implementation
– Written report
– In-class presentation
• Periodic consultation with the instructor

Midterm Exam (30%)

• Midterm scheduled for Apr 17th at 7pm
• (No Final!)

Textbook

Database Management Systems (Ramakrishnan and Gehrke)

Other useful resources

• Database Systems: The Complete Book (Ullman, Widom and Garcia-Molina)
• Readings in Database Systems (Stonebraker and Hellerstein)
• Foundations of Databases (Abiteboul, Hull, Vianu)
• Data on the Web (Abiteboul, Buneman, Suciu)
• Parallel and Distributed DBMS (Ozsu and Valduriez)
• Transaction Processing (Gray and Reuter)
• Data and Knowledge based Systems (volumes I, II) (Ullman)
• Proceedings of SIGMOD, VLDB, PODS conferences.

Communication

• Instructor
– Office hours: by appointment
– Email: miklau at cs dot umass dot edu
• Check the course webpage often
• You should have been added to the mailing list.

Questions about the course?
