Advanced Database Techniques
Martin.Kersten @ cwi.nl
Stefan [email protected]
Sandor Heman @ cwi.nl
Jennie Zhang @ cwi.nl
Romulo Goncalves @ cwi.nl
Administrative details
• The website evolves during the course
• Exam material is marked explicitly
• Lab work deadlines are strict
• Email is the preferred way to communicate
• Tomorrow the assistants will be available in person between 11:00-12:00, room REC-P.123
Relational systems
• A database system should simplify the organization, validation, sharing, and bookkeeping of information
• Prerequisite knowledge
– Relational data model and algebra
– Data structures (B-tree, hash)
– Operating system concepts
– Using a SQL database system
• What is your practical experience? [Ruby on Rails expertise needed]
Applications
• Bread-and-butter applications?
– Web-shop
– Banking systems
– Inventory systems
– Production systems
– Shopping systems
– Government systems
– Health systems
– Multimedia systems
– Science systems …
Advanced Applications
• Bread-and-butter applications?
– Banking systems
  • What happens if you install a stock trading system that must handle >100K transactions/minute?
  • How to derive trading advice using compute-intensive applications?
  • How to warn thousands of users about their trading opportunity?
– … need for parallel, distributed main-memory database technology …
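
To make the ">100K transactions/minute" figure concrete, the short sketch below converts it into a per-second rate and a per-transaction time budget. It assumes, purely for illustration, that transactions are handled one at a time by a single worker; the variable names are not from the lecture.

```python
# Lower bound quoted on the slide.
transactions_per_minute = 100_000

# Sustained rate the system has to absorb.
per_second = transactions_per_minute / 60

# Time available per transaction if they are processed strictly one after another
# (an illustrative assumption; a parallel system divides the load over workers).
budget_ms = 1000 / per_second

print(f"{per_second:.0f} transactions/s, {budget_ms:.2f} ms per transaction")
```

A budget well under a millisecond leaves little room for a disk access per transaction, which is one way to read the slide's call for parallel, main-memory technology.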
Advanced application requirements
• Bread-and-butter applications
– Inventory applications
  • How to install a battlefield inventory system?
  • How to deliver goods just in time?
  • How to keep track of moving objects/persons?
• … need for sensor-based database support and RFID tags … need for a new DBMS? …
Advanced Applications
• Production systems
– How to interact with component suppliers
– How to manage the production workflow
– How to avoid bad production steps
– How to maintain a database with 12000 tables (SAP)
• … need for interoperability between autonomous systems … datamining and knowledge discovery …
Advanced Applications
• Health information systems
– How to monitor your health over 30 years
– How to enable quick response to a heart attack
• … need for interoperable database systems …
The Ambient Home
[Figure sequence: a "HELP" signal in the ambient home reaches the MonetDB DataCell, and "911 called" is the response.]
MonetDB DataCell: a shared tuple space using an SQL DBMS
[Figure sequence: labelled components are receptors, nucleus, and emitters; successive frames are labelled Recall, Keep, Forget, and Aggregate.]
SQL workload
-- SQL queries
insert into hospital select 'John', * from medic where temp > 40.0;
insert into epd select * from medic where temp >= 38.0;
delete from medic;
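
The three statements can be tried out with a small script. The sketch below is only an illustration: it uses Python's standard `sqlite3` module instead of MonetDB, and the table layouts (`ts`, `temp`, `patient`), the helper `process_batch`, and the sample readings are assumptions, since the slides show only the statements themselves.

```python
import sqlite3

# Hypothetical schemas: the slides do not define the tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table medic    (ts integer, temp real);
    create table hospital (patient text, ts integer, temp real);
    create table epd      (ts integer, temp real);
""")

def process_batch(conn, readings):
    """Load one batch of readings into medic, run the three statements, empty medic."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("insert into medic values (?, ?)", readings)
        conn.execute("insert into hospital select 'John', * from medic where temp > 40.0")
        conn.execute("insert into epd select * from medic where temp >= 38.0")
        conn.execute("delete from medic")

# Made-up readings: normal, fever, high fever.
process_batch(conn, [(1, 36.8), (2, 38.5), (3, 40.2)])

hospital_rows = conn.execute("select * from hospital").fetchall()
epd_rows = conn.execute("select * from epd order by ts").fetchall()
medic_count = conn.execute("select count(*) from medic").fetchone()[0]
print(hospital_rows, epd_rows, medic_count)
```

Running the statements in one transaction keeps the final `delete` from discarding readings that the two `insert` statements have not yet seen; whether the DataCell does it this way is not stated in the slides.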
Query optimization
The queries in a DataCell have
- a soft or hard deadline
- a strong flow dependency
The operands of the queries are small tables:
- empty
- a single value
- a few values
Traditional query optimizers are biased towards large operands.
Query optimization
Challenges:
• How to optimize the individual SQL programs to select the proper query execution plan (QEP)?
• How to weave the collection of SQL programs into an optimal multi-query version?
Advanced Applications
• Multimedia systems
– Narrow/broadcasting, selective dissemination of volumetric information
– Searching in multimedia storage
• … need for P2P infrastructure … search facilities over feature spaces …
Advanced Applications
• Government systems
– Security
  • Biometric data management issues, finger/image matching
– Public safety
  • Forensics, manipulating complex objects using proprietary algorithms
• … need for extensible database technology … need to support unstructured data …
Advanced Applications
• Science systems
– The new accelerator at CERN
  • How to handle files of more than 1 PByte
– The Sloan Digital SkyServer schema is 200 pages and the catalogued data 2.5 TByte
  • How to query this efficiently
– … need for P2P and … a novel way to organize data …
LOFAR central processor specs
• Streaming data
– Input: 320 Gbit/s
– Internally within correlator: 20 Tbit/s
– Into storage: 25 Gbit/s = 250 TByte/day
– Final products: 1-3 TByte/day
• High-performance computing
– Correlation: 15 Tflops
– Preprocessing and filtering: 5 Tflops
– Off-line processing (calibration, analysis): 5-10 Tflops
– Visualisation, control, scheduling etc.: 2 Tflops
• Storage
– On-line temporal storage: 500 TByte
– Archive: PByte range of data stored in Grid
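
The streaming figures can be related to each other with a little arithmetic. The sketch below converts the quoted rates to TByte/day, assuming decimal units (1 Gbit = 10^9 bit, 1 TByte = 10^12 byte) and a rate sustained over 24 hours; the function name is made up for this example. Under these assumptions 25 Gbit/s comes out at about 270 TByte/day, so the slide's 250 TByte/day reads as a round figure.

```python
def gbit_per_s_to_tbyte_per_day(gbit_per_s: float) -> float:
    """Convert a sustained rate in Gbit/s to TByte/day, using decimal units."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return bytes_per_s * 86_400 / 1e12  # 86,400 seconds in a day

storage_tb_day = gbit_per_s_to_tbyte_per_day(25)   # rate into storage
input_tb_day = gbit_per_s_to_tbyte_per_day(320)    # raw input rate

# Days until the 500 TByte on-line store is full if nothing is removed.
days_to_fill = 500 / storage_tb_day

print(round(storage_tb_day), round(input_tb_day), round(days_to_fill, 2))
```

At the quoted storage rate the 500 TByte on-line store fills in under two days, which fits the slide's description of it as temporal storage.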
Technological challenges
• Data is often not structured as tables
– XML and XQuery
• Data does not always fit on one system
– Distributed and parallel databases
• Querying is more like world-wide searching
– Continuous and streaming queries
• A database tells more than facts
– Datamining and knowledge discovery
Code bases
• Database management systems are BIG software systems
– Oracle, SQL Server, DB2: > 1M lines
– PostgreSQL: 300K lines
– MySQL: 500K lines
– MonetDB: 200-800K lines
– SQLite: 40K lines
• Programmer teams for DBMS kernels range from a few to a few hundred
Performance components
• Hardware platform
• Data structures
• Algebraic optimizer
• SQL parser
• Application code
– What is the total cost of execution?
– How many tasks can be performed per minute?
– How good is the optimizer?
– What is the overhead of the data structures?
Not all are equal
Test                           MonetDB  PostgreSQL  MySQL  SQLite  SQLite nosync
1000 inserts transactions         0.27        4.30   0.15   13.06           0.22
25000 inserts 1 transaction       6.71        4.91   2.18    0.94           1.42
100 range selects                 0.18        3.62   2.76    2.49           2.52
100 string range selects          2.15       13.40   4.64    3.36           3.37
5000 range index selects          5.22        4.61   1.27    1.12           1.16
1000 updates                      0.43        1.73   8.41    0.63           0.63
25000 updates with index          8.33       18.79   8.13    3.52           3.10
25000 updates on text            10.32       48.13   6.98    2.40           1.72
Insert from select                0.65       61.36   1.53    2.78           1.59
Delete on text index              0.32        1.50   0.97    4.00           0.56
Delete with index                 0.22        1.31   2.26    2.06           0.75
Big insert after delete           0.36       13.16   1.81    3.21           1.48
Big delete and small insert       0.93        4.55   1.70    0.61           0.40
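
The rows "1000 inserts transactions" and "25000 inserts 1 transaction" contrast committing every insert with committing once. The sketch below shows how such a micro-benchmark could be set up with Python's standard `sqlite3` module. It is not the benchmark behind the table: the table `t1`, the row contents, the row count of 200, and the helper `timed_inserts` are assumptions chosen to keep the run short, and the timings depend entirely on the machine.

```python
import os
import sqlite3
import tempfile
import time

def timed_inserts(path, n, one_transaction):
    """Insert n rows into a fresh t1 and return (elapsed seconds, row count)."""
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions exist only where we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("create table if not exists t1 (a integer, b integer, c text)")
    conn.execute("delete from t1")
    start = time.perf_counter()
    if one_transaction:
        conn.execute("begin")
    for i in range(n):
        conn.execute("insert into t1 values (?, ?, ?)", (i, i * 7, f"row {i}"))
    if one_transaction:
        conn.execute("commit")
    elapsed = time.perf_counter() - start
    count = conn.execute("select count(*) from t1").fetchone()[0]
    conn.close()
    return elapsed, count

with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "bench.db")  # on-disk file, so every commit reaches storage
    t_auto, n_auto = timed_inserts(db, 200, one_transaction=False)
    t_txn, n_txn = timed_inserts(db, 200, one_transaction=True)

print(f"one commit per insert: {t_auto:.3f}s, single transaction: {t_txn:.3f}s")
```

On most disks the per-insert commits are noticeably slower because each commit waits for the data to be written out, which is the kind of effect the SQLite and SQLite nosync columns separate.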
Why does it take so long to build a 10M x 2 table?
How long will it take to do 10M x 32 on SQL Server Beta 2?
Gaining insight
• Study the code base (inspection + profiling)
– Often not accessible outside the development lab
• Study individual techniques (data structures + simulation)
– Focus of most PhD research in DBMS
• This yields detailed knowledge, but ignores the total cost of execution
• Study as a functional black box
– Analyse a small application framework
The Jack The Ripper Project
• Study the snippet of the database technology and design an XQuery and SQL application
• What is the schema?
• What are the queries?
• What are unorthodox solutions?
Learning points
• Poor knowledge of relational databases? Read the chapters on SQL and relational algebra. Knowledge of data structures comes in handy.
• Database systems are much more than administrative bookkeeping systems
– Advanced applications challenge the technology provided by a DBMS
– Many techniques do not easily scale in size, complexity, or functionality
– The effectiveness of a DBMS is determined by many tightly interlocked components