distributed databases based on material provided by: jim gray (microsoft), heinz stockinger (cern),...
TRANSCRIPT
Distributed Distributed DatabasesDatabases
Based on material provided by:Based on material provided by:Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu
Ramakrishnan (Wisconsin)Ramakrishnan (Wisconsin)
Dr. Julian BunnDr. Julian BunnCenter for Advanced Computing ResearchCenter for Advanced Computing Research
CaltechCaltech
J.J.Bunn, Distributed Databases, 2001 2
OutlineOutline
Introduction to Introduction to Database SystemsDatabase Systems
Distributed Databases Distributed Databases Distributed SystemsDistributed Systems Distributed Databases Distributed Databases
for Physicsfor Physics
Part IPart I
Introduction to Introduction to Database SystemsDatabase Systems
..
Julian Bunn
California Institute of Technology
J.J.Bunn, Distributed Databases, 2001 4
What is a Database?What is a Database? A large, integrated collection A large, integrated collection
of dataof data Entities (things) and Entities (things) and
Relationships (connections)Relationships (connections) Objects and Objects and
Associations/ReferencesAssociations/References A Database Management A Database Management
System (DBMS) is a software System (DBMS) is a software package designed to store package designed to store and manage Databasesand manage Databases
““Traditional” (ER) Databases Traditional” (ER) Databases and “Object” Databasesand “Object” Databases
J.J.Bunn, Distributed Databases, 2001 5
Why Use a DBMS?Why Use a DBMS?
Data IndependenceData Independence Efficient AccessEfficient Access Reduced Application Development Reduced Application Development
TimeTime Data IntegrityData Integrity Data SecurityData Security Data Analysis ToolsData Analysis Tools Uniform Data AdministrationUniform Data Administration Concurrent AccessConcurrent Access Automatic ParallelismAutomatic Parallelism Recovery from crashesRecovery from crashes
J.J.Bunn, Distributed Databases, 2001 6
Cutting Edge Cutting Edge DatabasesDatabases
Scientific ApplicationsScientific Applications Digital Libraries, Interactive Digital Libraries, Interactive
Video, Human Genome Video, Human Genome project, Particle Physics project, Particle Physics Experiments, National Digital Experiments, National Digital Observatories, Earth ImagesObservatories, Earth Images
Commercial Web Systems Commercial Web Systems Data Mining / Data WarehouseData Mining / Data Warehouse Simple data but very high Simple data but very high
transaction rate and transaction rate and enormous volume (e.g. click enormous volume (e.g. click through)through)
J.J.Bunn, Distributed Databases, 2001 7
Data ModelsData Models Data Model: A Collection of Data Model: A Collection of
Concepts for Describing DataConcepts for Describing Data Schema: A Set of Descriptions Schema: A Set of Descriptions
of a Particular Collection of of a Particular Collection of Data, in the context of the Data, in the context of the Data ModelData Model
Relational Model:Relational Model: E.g. A Lecture E.g. A Lecture is attended byis attended by
zero or more Studentszero or more Students Object Model:Object Model:
E.g. A Database Lecture E.g. A Database Lecture inherits inherits attributesattributes from a general from a general LectureLecture
J.J.Bunn, Distributed Databases, 2001 8
Data IndependenceData Independence
Applications insulated from Applications insulated from how data in the Database is how data in the Database is structured and storedstructured and stored Logical Data Independence: Logical Data Independence:
Protection from changes in the Protection from changes in the logical structure of the datalogical structure of the data
Physical Data Independence: Physical Data Independence: Protection from changes in the Protection from changes in the physical structure of the dataphysical structure of the data
J.J.Bunn, Distributed Databases, 2001 9
Concurrency ControlConcurrency Control
Good DBMS performance Good DBMS performance relies on allowing concurrent relies on allowing concurrent access to the data by more access to the data by more than one client than one client
DBMS ensures that DBMS ensures that interleaved actions coming interleaved actions coming from different clients do not from different clients do not cause inconsistency in the cause inconsistency in the data data E.g. two simultaneous bookings E.g. two simultaneous bookings
for the same airplane seatfor the same airplane seat Each client is unaware of how Each client is unaware of how
many other clients are using many other clients are using the DBMSthe DBMS
J.J.Bunn, Distributed Databases, 2001 10
TransactionsTransactions
A Transaction is an atomic A Transaction is an atomic sequence of actions in the sequence of actions in the Database (reads and writes)Database (reads and writes)
Each Transaction has to be Each Transaction has to be executed executed completelycompletely, and , and must leave the Database in a must leave the Database in a consistent stateconsistent state The definition of “consistent” is ultimately the client’s The definition of “consistent” is ultimately the client’s
responsibility!responsibility!
If the Transaction fails or If the Transaction fails or aborts midway, then the aborts midway, then the Database is “rolled back” to Database is “rolled back” to its initial consistent state its initial consistent state (when the Transaction (when the Transaction began).began).
J.J.Bunn, Distributed Databases, 2001 11
What Is A What Is A Transaction?Transaction?
Programmer’s view: Programmer’s view: Bracket a collection of actionsBracket a collection of actions
A A simplesimple failure model failure model Only two outcomes:Only two outcomes:
Begin()Begin() actionaction actionaction actionaction actionactionCommit()Commit()
Success!Success!
Begin()Begin()action action actionactionactionactionRollback()Rollback()
Begin()Begin()action action actionactionactionaction
Rollback()Rollback()
Failure!Failure!
Fail !Fail !Fail !Fail !
J.J.Bunn, Distributed Databases, 2001 12
ACIDACID
AtomicAtomic: all or nothing: all or nothing ConsistentConsistent: state : state
transformationtransformation IsolatedIsolated: no concurrency : no concurrency
anomaliesanomalies DurableDurable: committed : committed
transaction effects persisttransaction effects persist
J.J.Bunn, Distributed Databases, 2001 13
Why Bother: Why Bother: Atomicity?Atomicity?
RPC semantics:RPC semantics: At most once: try one time At most once: try one time
At least once: keep trying At least once: keep trying ’till acknowledged’till acknowledged
Exactly once: keep trying Exactly once: keep trying ’till acknowledged and ’till acknowledged and serverserverdiscards duplicate requestsdiscards duplicate requests
???
J.J.Bunn, Distributed Databases, 2001 14
Why Bother: Why Bother: Atomicity?Atomicity? Example: insert record in fileExample: insert record in file
At most onceAt most once: time-out means “maybe”: time-out means “maybe” At least onceAt least once: retry may get “duplicate” : retry may get “duplicate”
error error or retry may do or retry may do second insertsecond insert
Exactly onceExactly once: you do not have to worry: you do not have to worry What if operation involvesWhat if operation involves
Insert several records? Insert several records? Send several messages?Send several messages?
Want ALL or NOTHING for group of Want ALL or NOTHING for group of actionsactions
J.J.Bunn, Distributed Databases, 2001 15
Why Bother: Why Bother: ConsistencyConsistency
Begin-Commit brackets a set of Begin-Commit brackets a set of operationsoperations
You can violate consistency inside You can violate consistency inside bracketsbrackets Debit but not credit (destroys money)Debit but not credit (destroys money) Delete old file before create new file in a Delete old file before create new file in a
copycopy Print document before delete from spool Print document before delete from spool
queuequeue Begin and commit are points of Begin and commit are points of
consistencyconsistencyState transformationsState transformations
new state under constructionnew state under construction
Be
gin
Be
gin
Co
mm
itC
om
mit
J.J.Bunn, Distributed Databases, 2001 16
Why Bother: Why Bother: Isolation Isolation
Running programs concurrentlyRunning programs concurrentlyon same data can createon same data can createconcurrency anomaliesconcurrency anomalies The shared checking account The shared checking account
exampleexample
Programming is hard enough Programming is hard enough without having to worry about without having to worry about concurrencyconcurrency
Begin()Begin() read BALread BAL add 10add 10 write BALwrite BALCommit()Commit()
Bal = 100Bal = 100
Bal = 70Bal = 70
Bal = 110Bal = 110
Bal = 100Bal = 100
Begin()Begin() read BALread BAL Subtract 30Subtract 30 write BALwrite BALCommit()Commit()
J.J.Bunn, Distributed Databases, 2001 17
IsolationIsolation It is as though programs run one at a It is as though programs run one at a
timetime No concurrency anomaliesNo concurrency anomalies
System automatically protects System automatically protects applicationsapplications Locking (DB2, Informix, MicrosoftLocking (DB2, Informix, Microsoft® ®
SQL ServerSQL Server™™, Sybase…), Sybase…) Versioned databases (Oracle, Interbase…)Versioned databases (Oracle, Interbase…)
Begin()Begin() read BALread BAL add 10add 10 write BALwrite BALCommit()Commit()
Bal = 100Bal = 100
Bal = 110Bal = 110
Bal = 80Bal = 80
Bal = 110Bal = 110
Begin()Begin() read BALread BAL Subtract 30Subtract 30 write BALwrite BALCommit()Commit()
J.J.Bunn, Distributed Databases, 2001 18
Why Bother: Why Bother: DurabilityDurability Once a transaction commits,Once a transaction commits,
want effects to survive failureswant effects to survive failures Fault tolerance:Fault tolerance:
old master-new master won’t work: old master-new master won’t work: Can’t do daily dumps: Can’t do daily dumps:
would lose recent workwould lose recent work Want “continuous” dumpsWant “continuous” dumps
Redo “lost” transactions Redo “lost” transactions in case of failurein case of failure
Resend unacknowledged messages Resend unacknowledged messages
J.J.Bunn, Distributed Databases, 2001 19
Why ACID For Why ACID For Client/Server And Client/Server And
DistributedDistributed ACID is important for ACID is important for centralized systemscentralized systems
Failures in centralized systems Failures in centralized systems are simplerare simpler
In distributed systems:In distributed systems: More and more-independent More and more-independent
failuresfailures ACID is harder to implementACID is harder to implement
That makes it even MORE That makes it even MORE IMPORTANTIMPORTANT Simple failure modelSimple failure model Simple repair modelSimple repair model
J.J.Bunn, Distributed Databases, 2001 20
ACID GeneralizationsACID Generalizations Taxonomy of actions Taxonomy of actions
Unprotected: not undone or redoneUnprotected: not undone or redone Temp filesTemp files
Transactional: can be undone before Transactional: can be undone before commitcommit Database and message operationsDatabase and message operations
Real: cannot be undoneReal: cannot be undone Drill a hole in a piece of metal,Drill a hole in a piece of metal,
print a checkprint a check Nested transactions: Nested transactions:
subtransactionssubtransactions Work flow: long-lived transactionsWork flow: long-lived transactions
J.J.Bunn, Distributed Databases, 2001 21
Scheduling Scheduling TransactionsTransactions
The DBMS has to take care of The DBMS has to take care of a set of Transactions that a set of Transactions that arrive concurrentlyarrive concurrently
It converts the concurrent It converts the concurrent Transaction set into a new set Transaction set into a new set that can be executed that can be executed sequentiallysequentially
It ensures that, before It ensures that, before reading or writing an Object, reading or writing an Object, each Transaction waits for a each Transaction waits for a LockLock on the Object on the Object
Each Transaction releases all Each Transaction releases all its Locks when finishedits Locks when finished (Strict Two-Phase-Locking Protocol)(Strict Two-Phase-Locking Protocol)
J.J.Bunn, Distributed Databases, 2001 22
Concurrency ControlConcurrency ControlLockingLocking
How to automatically preventHow to automatically preventconcurrency bugs?concurrency bugs?
Serialization theorem:Serialization theorem: If you lock all you touch and hold to If you lock all you touch and hold to
commit: commit: no bugsno bugs
If you do not follow these rules, you may If you do not follow these rules, you may see bugssee bugs
Automatic Locking:Automatic Locking: Set automatically (well-formed)Set automatically (well-formed) Released at commit/rollback (two-phase Released at commit/rollback (two-phase
locking)locking)
Greater concurrency for locks:Greater concurrency for locks: Granularity: objects or containers or serverGranularity: objects or containers or server Mode: shared or exclusive or…Mode: shared or exclusive or…
J.J.Bunn, Distributed Databases, 2001 23
Reduced Isolation Reduced Isolation LevelsLevels
It is possible to lock less and risk fuzzy It is possible to lock less and risk fuzzy datadata
Example: want statistical summary of DB Example: want statistical summary of DB But do not want to lock whole databaseBut do not want to lock whole database
Reduced levels:Reduced levels: Repeatable Read: may see fuzzy Repeatable Read: may see fuzzy
inserts/deleteinserts/delete But will serialize all updatesBut will serialize all updates
Read Committed: see only committed dataRead Committed: see only committed data Read Uncommitted: may see uncommitted Read Uncommitted: may see uncommitted
updatesupdates
J.J.Bunn, Distributed Databases, 2001 24
Ensuring AtomicityEnsuring Atomicity
The DBMS ensures the The DBMS ensures the atomicityatomicity of a Transaction, even if the of a Transaction, even if the system crashes in the middle of itsystem crashes in the middle of it
In other words In other words allall of the of the Transaction is applied to the Transaction is applied to the Database, or Database, or nonenone of it is of it is
How?How? Keep a Keep a log/historylog/history of all actions of all actions
carried out on the Databasecarried out on the Database Before making a change, put the log Before making a change, put the log
for the change somewhere “safe”for the change somewhere “safe” After a crash, effects of partially After a crash, effects of partially
executed transactions are undone executed transactions are undone using the logusing the log
J.J.Bunn, Distributed Databases, 2001 25
Each action generates a log Each action generates a log recordrecord
Has an UNDO action Has an UNDO action
Has a REDO actionHas a REDO action
DO/UNDO/REDODO/UNDO/REDO
New stateNew stateOld stateOld state
DODO
Log Log
New stateNew state Old stateOld state
UNDOUNDO
Log Log
New stateNew stateOld stateOld state
REDOREDO
Log Log
J.J.Bunn, Distributed Databases, 2001 26
What Does A Log What Does A Log Record Look Like?Record Look Like?
Log record has Log record has Header (transaction ID, timestamp… )Header (transaction ID, timestamp… ) Item IDItem ID Old valueOld value New valueNew value
For messages: just message textFor messages: just message textand sequence #and sequence #
For records: old and new valueFor records: old and new valueon updateon update
Keep records small Keep records small
? Log ? ? Log ?
J.J.Bunn, Distributed Databases, 2001 27
Transaction Is A Transaction Is A Sequence Of ActionsSequence Of Actions Each action changes stateEach action changes state
Changes database Changes database Sends messagesSends messages Operates a display/printer/drill Operates a display/printer/drill
presspress Leaves a log trailLeaves a log trail
New stateNew stateOld stateOld state
DODO
Log Log
Log Log
New stateNew stateOld stateOld state
DODO
DODO
Log Log
New stateNew stateOld stateOld state
Old stateOld state
DODO
Log Log
New stateNew state
J.J.Bunn, Distributed Databases, 2001 28
Transaction UNDO Is EasyTransaction UNDO Is Easy
Read log backwardsRead log backwards UNDO one step at a timeUNDO one step at a time Can go half-way back toCan go half-way back to
get nested transactionsget nested transactions
New stateNew stateOld stateOld state
UNDOUNDO
Log Log
Log Log
Old stateOld state
UNDOUNDO
New stateNew state
UNDOUNDO
Log Log
Old stateOld state New stateNew state
Old stateOld state
UNDOUNDO
Log Log
New stateNew state
New stateNew stateOld stateOld state
UNDOUNDO
Log Log
Log Log
Old stateOld state
UNDOUNDO
New stateNew state
UNDOUNDO
Log Log
Old stateOld state New stateNew state
New stateNew stateOld stateOld state
UNDOUNDO
Log Log
Log Log
Old stateOld state
UNDOUNDO
New stateNew state
New stateNew stateOld stateOld state
UNDOUNDO
Log Log
J.J.Bunn, Distributed Databases, 2001 29
Durability: Protecting Durability: Protecting The LogThe Log
When transaction commitsWhen transaction commits Put its log in a durable place Put its log in a durable place
(duplexed disk)(duplexed disk) Need log to redo transaction Need log to redo transaction
in case of failurein case of failure System failure: lostSystem failure: lost
in-memory updatesin-memory updates Media failure (lost disk)Media failure (lost disk)
This makes transaction durableThis makes transaction durable Log is sequential fileLog is sequential file
Converts random IO to single Converts random IO to single sequential IOsequential IO
See NTFS or newer UNIX file systemsSee NTFS or newer UNIX file systems
Log Log Log Log Log Log Log Log Log Log Log Log Log Log Log Log
Writ
eW
rite
J.J.Bunn, Distributed Databases, 2001 30
Recovery After System Recovery After System FailureFailure During normal processing, During normal processing,
write checkpoints on non-volatile write checkpoints on non-volatile storagestorage
When recovering from a system When recovering from a system failure…failure… return to the checkpoint statereturn to the checkpoint state Reapply log of all committed transactionsReapply log of all committed transactions Force-at-commit insures log will survive Force-at-commit insures log will survive
restartrestart Then UNDO all uncommitted Then UNDO all uncommitted
transactionstransactions
New stateNew stateOld stateOld state
REDOREDO
Log Log
New stateNew stateOld stateOld state
REDOREDO
Log Log
REDOREDO
New stateNew stateOld stateOld state
Log Log Log Log
Old stateOld state
REDOREDO
New stateNew state
J.J.Bunn, Distributed Databases, 2001 31
IdempotenceIdempotenceDealing with failureDealing with failure
What if fail during restart? What if fail during restart? REDO many timesREDO many times
What if new state not around at What if new state not around at restart?restart? UNDO something not doneUNDO something not done
Old stateOld state
REDOREDO
New stateNew state
Log Log Log Log
REDOREDO
New stateNew state New stateNew state
UNDOUNDO
Old stateOld state
Log Log Log Log
UNDOUNDO
Old stateOld state
J.J.Bunn, Distributed Databases, 2001 32
IdempotenceIdempotenceDealing with failureDealing with failure
Solution: make F(F(x))=F(x) Solution: make F(F(x))=F(x) (idempotence)(idempotence) Discard duplicates Discard duplicates
Message sequence numbers Message sequence numbers to discard duplicatesto discard duplicates
Use sequence numbers on pages to Use sequence numbers on pages to detect statedetect state
(Or) make operations idempotent (Or) make operations idempotent Move to position x, write value V to byte Move to position x, write value V to byte
B…B…Old stateOld state
REDOREDO
New stateNew state
Log Log Log Log
REDOREDO
New stateNew state New stateNew state
UNDOUNDO
Old stateOld state
Log Log Log Log
UNDOUNDO
Old stateOld state
J.J.Bunn, Distributed Databases, 2001 33
The Log: More DetailThe Log: More Detail Actions recorded in the LogActions recorded in the Log
Transaction writes an ObjectTransaction writes an Object Store in the Log: Transaction Store in the Log: Transaction
Identifier, Object Identifier, Identifier, Object Identifier, new value and old valuenew value and old value
This must happen before This must happen before actually writing the Object!actually writing the Object!
Transaction commits or abortsTransaction commits or aborts Duplicate Log on “stable” Duplicate Log on “stable”
storagestorage Log records chained by Log records chained by
Transaction Identifier: easy to Transaction Identifier: easy to undo a Transactionundo a Transaction
J.J.Bunn, Distributed Databases, 2001 34
Structure of a Structure of a DatabaseDatabase
Typical DBMS has a layered Typical DBMS has a layered architecturearchitecture
Query Optimisation & Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
Disk
J.J.Bunn, Distributed Databases, 2001 35
Database Database AdministrationAdministration
Design Logical/Physical Design Logical/Physical SchemaSchema
Handle Security and Handle Security and AuthenticationAuthentication
Ensure Data Availability, Ensure Data Availability, Crash RecoveryCrash Recovery
Tune Database as needs and Tune Database as needs and workload evolvesworkload evolves
J.J.Bunn, Distributed Databases, 2001 36
SummarySummary Databases are used to Databases are used to
maintain and query large maintain and query large datasetsdatasets
DBMS benefits include DBMS benefits include recovery from crashes, recovery from crashes, concurrent access, data concurrent access, data integrity and security, quick integrity and security, quick application developmentapplication development
Abstraction ensures Abstraction ensures independenceindependence
ACIDACID Increasingly Important (and Increasingly Important (and
Big) in Scientific and Big) in Scientific and Commercial EnterprisesCommercial Enterprises
Part 2Part 2
Distributed Distributed DatabasesDatabases
..
Julian Bunn
California Institute of Technology
J.J.Bunn, Distributed Databases, 2001 38
Distributed Distributed DatabasesDatabases
Data are stored at several Data are stored at several locationslocations Each managed by a DBMS that Each managed by a DBMS that
can run autonomouslycan run autonomously Ideally, location of data is Ideally, location of data is
unknown to clientunknown to client Distributed Data IndependenceDistributed Data Independence
Distributed Transactions are Distributed Transactions are supportedsupported Clients can write Transactions Clients can write Transactions
regardless of where the regardless of where the affected data are locatedaffected data are located
Distributed Transaction Distributed Transaction AtomicityAtomicity
Hard, and in some cases Hard, and in some cases undesirableundesirable E.g. need to avoid overhead of ensuring location E.g. need to avoid overhead of ensuring location
transparencytransparency
J.J.Bunn, Distributed Databases, 2001 39
Types of Distributed Types of Distributed DatabaseDatabase
Homogeneous: Every site runs Homogeneous: Every site runs the same type of DBMSthe same type of DBMS
Heterogeneous: Different Heterogeneous: Different sites run different DBMS sites run different DBMS (maybe even RDBMS and (maybe even RDBMS and ODBMS)ODBMS)
J.J.Bunn, Distributed Databases, 2001 40
Distributed DBMS Distributed DBMS ArchitecturesArchitectures
Client-ServersClient-Servers Client sends query to each Client sends query to each
database server in the database server in the distributed systemdistributed system
Client caches and accumulates Client caches and accumulates responsesresponses
Collaborating ServerCollaborating Server Client sends query to “nearest” Client sends query to “nearest”
ServerServer Server executes query locallyServer executes query locally Server sends query to other Server sends query to other
Servers, as requiredServers, as required Server sends response to ClientServer sends response to Client
J.J.Bunn, Distributed Databases, 2001 41
Storing the Storing the Distributed DataDistributed Data
In fragments at each siteIn fragments at each site Split the data upSplit the data up Each site stores one or more Each site stores one or more
fragmentsfragments In complete replicas at each In complete replicas at each
sitesite Each site stores a replica of the Each site stores a replica of the
complete datacomplete data A mixture of fragments and A mixture of fragments and
replicasreplicas Each site stores some replicas Each site stores some replicas
and/or fragments or the dataand/or fragments or the data
J.J.Bunn, Distributed Databases, 2001 42
Partitioned Data Partitioned Data Break file into disjoint Break file into disjoint
groupsgroups Exploit data access Exploit data access
localitylocality Put data near consumerPut data near consumer Less network trafficLess network traffic Better response timeBetter response time Better availabilityBetter availability Owner controls data Owner controls data
autonomy autonomy Spread LoadSpread Load
data or traffic may exceed data or traffic may exceed single storesingle store
OrdersN.A. S.A. Europe Asia
J.J.Bunn, Distributed Databases, 2001 43
How to Partition Data?How to Partition Data? How to PartitionHow to Partition
by attribute or by attribute or random or random or by source or by source or by useby use
Problem: to find it must Problem: to find it must havehave Directory (replicated) orDirectory (replicated) or AlgorithmAlgorithm
Encourages Encourages attribute-based attribute-based partitioningpartitioning
N.A. S.A. Europe Asia
J.J.Bunn, Distributed Databases, 2001 44
Replicated DataReplicated DataPlace fragment at many Place fragment at many
sitessites Pros:Pros:+ Improves availabilityImproves availability+ Disconnected (mobile) Disconnected (mobile)
operationoperation+ Distributes loadDistributes load+ Reads are cheaperReads are cheaper
Cons:Cons: N times more updates N times more updates N times more storageN times more storage
Placement strategies:Placement strategies: Dynamic: cache on demandDynamic: cache on demand Static: place specific Static: place specific
Catalog
J.J.Bunn, Distributed Databases, 2001 45
ID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
FragmentationFragmentation
Horizontal – Horizontal – “Row-wise”“Row-wise” E.g. rows of the table make up one E.g. rows of the table make up one
fragmentfragment Vertical – Vertical – “Column-Wise”“Column-Wise”
E.g. columns of the table make up E.g. columns of the table make up one fragmentone fragment
J.J.Bunn, Distributed Databases, 2001 46
ReplicationReplication
Make synchronised or Make synchronised or unsynchronised copies of data unsynchronised copies of data at serversat servers Synchronised: data are always Synchronised: data are always
current, updates are constantly current, updates are constantly shipped between replicasshipped between replicas
Unsynchronised: good for read-Unsynchronised: good for read-only dataonly data
Increases availability of dataIncreases availability of data Makes query execution fasterMakes query execution faster
J.J.Bunn, Distributed Databases, 2001 47
Distributed Catalogue Distributed Catalogue ManagementManagement
Need to know where data are Need to know where data are distributed in the systemdistributed in the system
At each site, need to name each At each site, need to name each replica of each data fragmentreplica of each data fragment ““Local name”, “Birth Place”Local name”, “Birth Place”
Site Catalogue: Site Catalogue: Describes all fragments and replicas Describes all fragments and replicas
at the siteat the site Keeps track of replicas of relations at Keeps track of replicas of relations at
the sitethe site To find a relation, look up Birth site’s To find a relation, look up Birth site’s
catalogue: “Birth Place” site never catalogue: “Birth Place” site never changes, even if relation is movedchanges, even if relation is moved
J.J.Bunn, Distributed Databases, 2001 48
Replication CatalogueReplication Catalogue Which objects are being Which objects are being
replicatedreplicated Where objects are being Where objects are being
replicated toreplicated to How updates are propagatedHow updates are propagated
Catalogue is a set of tables Catalogue is a set of tables that can be backed up, and that can be backed up, and recovered (as any other table)recovered (as any other table)
These tables are themselves These tables are themselves replicated to each replication replicated to each replication sitesite No single point of failure in the No single point of failure in the
Distributed DatabaseDistributed Database
J.J.Bunn, Distributed Databases, 2001 49
ConfigurationsConfigurations Single Master with multiple read-only Single Master with multiple read-only
snapshot sitessnapshot sites Multiple MastersMultiple Masters Single Master with multiple updatable Single Master with multiple updatable
snapshot sitessnapshot sites Master at record-level granularityMaster at record-level granularity Hybrids of the aboveHybrids of the above
J.J.Bunn, Distributed Databases, 2001 50
Distributed QueriesDistributed Queries
SELECT AVG(E.Energy) FROM SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 Events E WHERE E.particles > 3 AND E.particles < 7AND E.particles < 7
Replicated: Copies of the complete Replicated: Copies of the complete Event table at Geneva and at Event table at Geneva and at IslamabadIslamabad
Choice of where to execute queryChoice of where to execute query Based on local costs, network costs, Based on local costs, network costs,
remote capacity, etc.remote capacity, etc.
ID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
ID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
IslamabadID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
Geneva
J.J.Bunn, Distributed Databases, 2001 51
Distributed Queries Distributed Queries (contd.)(contd.) SELECT AVG(E.Energy) FROM SELECT AVG(E.Energy) FROM
Events E WHERE E.particles > Events E WHERE E.particles > 3 AND 3 AND E.particles < 7E.particles < 7
Row-wise fragmented:Row-wise fragmented:
Particles < 5 at Geneva, Particles < 5 at Geneva, Particles > 4 at IslamabadParticles > 4 at Islamabad Need to compute SUM(E.Energy) and Need to compute SUM(E.Energy) and
COUNT(E.Energy) at COUNT(E.Energy) at bothboth sites sites If WHERE clause had E.particles > 4 If WHERE clause had E.particles > 4
then only need to compute at then only need to compute at IslamabadIslamabad
ID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
J.J.Bunn, Distributed Databases, 2001 52
Distributed Queries Distributed Queries (contd.)(contd.)
SELECT AVG(E.Energy) FROM Events E SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 AND E.particles WHERE E.particles > 3 AND E.particles < 7< 7
Column-wise Fragmented: Column-wise Fragmented:
ID, Energy and Event# Columns at ID, Energy and Event# Columns at Geneva, ID and remaining Columns at Geneva, ID and remaining Columns at Islamabad:Islamabad: Need to join on IDNeed to join on ID Select IDs satisfying Particles constraint at Select IDs satisfying Particles constraint at
IslamabadIslamabad SUM(Energy) and Count(Energy) for those SUM(Energy) and Count(Energy) for those
IDs at GenevaIDs at Geneva
ID #Particles Energy Event# Run# Date Time… … … … … … …
10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001
… … … … … … …
J.J.Bunn, Distributed Databases, 2001 53
JoinsJoins Joins are used to compare or Joins are used to compare or
combine relations (rows) from combine relations (rows) from two or more tables, when the two or more tables, when the relations share a common relations share a common attribute valueattribute value
Simple approach: for every Simple approach: for every relation in the first table “S”, relation in the first table “S”, loop over all relations in the loop over all relations in the other table “R”, and see if the other table “R”, and see if the attributes matchattributes match
N-way joins are evaluated as N-way joins are evaluated as a series of 2-way joinsa series of 2-way joins
Join Algorithms are a Join Algorithms are a continuing topic of intense continuing topic of intense research in Computer Scienceresearch in Computer Science
J.J.Bunn, Distributed Databases, 2001 54
Join AlgorithmsJoin Algorithms Need to run in memory for best Need to run in memory for best
performanceperformance Nested-Loops: efficient only if “R” Nested-Loops: efficient only if “R”
very small (can be stored in very small (can be stored in memory)memory)
Hash-Join: Build an in-memory Hash-Join: Build an in-memory hash table of “R”, then loop over hash table of “R”, then loop over “S” hashing to check for match“S” hashing to check for match
Hybrid Hash-Join: When “R” hash Hybrid Hash-Join: When “R” hash is too big to fit in memory, split is too big to fit in memory, split join into partitionsjoin into partitions
Merge-Join: Used when “R” and Merge-Join: Used when “R” and “S” are already sorted on the join “S” are already sorted on the join attribute, simply merging them in attribute, simply merging them in parallelparallel
Special versions of Join Algorithms Special versions of Join Algorithms needed for Distributed Database needed for Distributed Database query execution!query execution!
J.J.Bunn, Distributed Databases, 2001 55
Distributed Query Distributed Query OptimisationOptimisation
Cost-based:Cost-based: Consider all “plans”Consider all “plans” Pick cheapest: include Pick cheapest: include
communication costscommunication costs Need to use distributed join Need to use distributed join
methodsmethods Site that receives query Site that receives query
constructs Global Plan, hints constructs Global Plan, hints for local plansfor local plans Local plans may be changed at Local plans may be changed at
each siteeach site
J.J.Bunn, Distributed Databases, 2001 56
ReplicationReplication
Synchronous: All data that Synchronous: All data that have been changed must be have been changed must be propagated before the propagated before the Transaction commitsTransaction commits
Asynchronous: Changed data Asynchronous: Changed data are periodically sentare periodically sent Replicas may go out of sync.Replicas may go out of sync. Clients must be aware of thisClients must be aware of this
J.J.Bunn, Distributed Databases, 2001 57
Synchronous Synchronous Replication CostsReplication Costs
Before an update Transaction Before an update Transaction can commit, it obtains locks can commit, it obtains locks on all modified copieson all modified copies Sends lock requests to remote Sends lock requests to remote
sites, holds lockssites, holds locks If links or remote sites fail, If links or remote sites fail,
Transaction cannot commit until Transaction cannot commit until links/sites restoredlinks/sites restored
Even without failure, Even without failure, commit commit protocolprotocol is complex, and is complex, and involves many messagesinvolves many messages
J.J.Bunn, Distributed Databases, 2001 58
Asynchronous Asynchronous ReplicationReplication
Allows Transaction to commit Allows Transaction to commit before all copies have been before all copies have been modifiedmodified
Two methods:Two methods: Primary SitePrimary Site
Peer-to-PeerPeer-to-Peer
J.J.Bunn, Distributed Databases, 2001 59
Primary Site Primary Site ReplicationReplication
One copy designated as One copy designated as “Master” “Master”
Published to other sites who Published to other sites who subscribe to “Secondary” subscribe to “Secondary” copiescopies
Changes propagated to Changes propagated to “Secondary” copies“Secondary” copies
Done in two steps:Done in two steps: Capture changes made by Capture changes made by
committed Transactionscommitted Transactions Apply these changesApply these changes
J.J.Bunn, Distributed Databases, 2001 60
The Capture StepThe Capture Step
Procedural: A procedure, Procedural: A procedure, automatically invoked, does automatically invoked, does the capture (takes a the capture (takes a snapshot)snapshot)
Log-based: the log is used to Log-based: the log is used to generate a Change Data Tablegenerate a Change Data Table Better (cheaper and faster) but Better (cheaper and faster) but
relies on proprietary log detailsrelies on proprietary log details
J.J.Bunn, Distributed Databases, 2001 61
The Apply StepThe Apply Step
The Secondary site The Secondary site periodically obtains from the periodically obtains from the Primary site a snapshot or Primary site a snapshot or changes to the Change Data changes to the Change Data Table Table Updates its copyUpdates its copy Period can be timer-based or Period can be timer-based or
defined by the user/applicationdefined by the user/application Log-based capture with Log-based capture with
continuous Apply minimises continuous Apply minimises delays in propagating delays in propagating changeschanges
J.J.Bunn, Distributed Databases, 2001 62
Peer-to-Peer Peer-to-Peer ReplicationReplication
More than one copy can be More than one copy can be “Master”“Master”
Changes are somehow Changes are somehow propagated to other copiespropagated to other copies
Conflicting changes must be Conflicting changes must be resolvedresolved
So best when conflicts do not So best when conflicts do not or cannot arise:or cannot arise: Each “Master” owns a disjoint Each “Master” owns a disjoint
fragment or copyfragment or copy Update permission only granted Update permission only granted
to one “Master” at a timeto one “Master” at a time
J.J.Bunn, Distributed Databases, 2001 63
Replication ExamplesReplication Examples Master copy, many slave copies (SQL Master copy, many slave copies (SQL
Server)Server) always know the correct value (master)always know the correct value (master) change propagation can be change propagation can be
transactionaltransactional as soon as possibleas soon as possible periodicperiodic on demandon demand
Symmetric, and anytime (Access)Symmetric, and anytime (Access) allows mobile (disconnected) updatesallows mobile (disconnected) updates updates propagated ASAP, periodic, on updates propagated ASAP, periodic, on
demanddemand non-serializablenon-serializable colliding updates must be reconciled.colliding updates must be reconciled. hard to know “real” valuehard to know “real” value
J.J.Bunn, Distributed Databases, 2001 64
Data Warehousing Data Warehousing and Replicationand Replication
Build giant “warehouses” of data Build giant “warehouses” of data from many sitesfrom many sites Enable complex decision support Enable complex decision support
queries over data from across an queries over data from across an organisationorganisation
Warehouses can be seen as an Warehouses can be seen as an instance of asynchronous instance of asynchronous replicationreplication Source data is typically controlled by Source data is typically controlled by
different DBMS: emphasis on different DBMS: emphasis on “cleaning” data by removing “cleaning” data by removing mismatches while creating replicasmismatches while creating replicas
Procedural Capture and Procedural Capture and application Apply work best for application Apply work best for this environmentthis environment
J.J.Bunn, Distributed Databases, 2001 65
Distributed LockingDistributed Locking How to manage Locks across How to manage Locks across
many sites?many sites? Centrally: one site does all Centrally: one site does all
lockinglocking Vulnerable to single site Vulnerable to single site
failurefailure Primary Copy: all locking for an Primary Copy: all locking for an
object done at the primary copy object done at the primary copy site for the objectsite for the object Reading requires access to Reading requires access to
locking site as well as site locking site as well as site which stores objectwhich stores object
Fully Distributed: locking for a Fully Distributed: locking for a copy done at site where the copy done at site where the copy is storedcopy is stored Locks at all sites while writing Locks at all sites while writing
an objectan object
J.J.Bunn, Distributed Databases, 2001 66
Distributed Deadlock Distributed Deadlock DetectionDetection
Each site maintains a local “waits-Each site maintains a local “waits-for” graphfor” graph
Global deadlock might occur even Global deadlock might occur even if local graphs contain no cyclesif local graphs contain no cycles E.g. Site A holds lock on X, waits for E.g. Site A holds lock on X, waits for
lock on Ylock on Y Site B holds lock on Y, waits for lock Site B holds lock on Y, waits for lock
on Xon X Three solutions:Three solutions:
Centralised (send all local graphs to Centralised (send all local graphs to one site)one site)
Hierarchical (organise sites into Hierarchical (organise sites into hierarchy and send local graphs to hierarchy and send local graphs to parent)parent)
Timeout (abort Transaction if it waits Timeout (abort Transaction if it waits too long)too long)
J.J.Bunn, Distributed Databases, 2001 67
Distributed RecoveryDistributed Recovery
Links and Remote Sites may Links and Remote Sites may crash/failcrash/fail
If sub-transactions of a If sub-transactions of a Transaction execute at Transaction execute at different sites, all or none different sites, all or none must commitmust commit
Need a commit protocol to Need a commit protocol to achieve thisachieve this
Solution: Maintain a Log at Solution: Maintain a Log at each site of commit protocol each site of commit protocol actionsactions Two-Phase CommitTwo-Phase Commit
J.J.Bunn, Distributed Databases, 2001 68
Two-Phase CommitTwo-Phase Commit Site which originates Transaction is Site which originates Transaction is
coordinator, other sites involved in coordinator, other sites involved in Transaction are subordinatesTransaction are subordinates
When the Transaction needs to Commit:When the Transaction needs to Commit: Coordinator sends “prepare” message to Coordinator sends “prepare” message to
subordinatessubordinates Subordinates each force-writes an Subordinates each force-writes an abort abort or or
prepareprepare Log record, and sends “yes” or “no” Log record, and sends “yes” or “no” message to Coordinatormessage to Coordinator
If Coordinator gets unanimous “yes” If Coordinator gets unanimous “yes” messages, force-writes a messages, force-writes a commitcommit Log record, Log record, and sends “commit” message to all and sends “commit” message to all subordinatessubordinates
Otherwise, force-writes an Otherwise, force-writes an abortabort Log record, Log record, and sends “abort” message to all and sends “abort” message to all subordinatessubordinates
Subordinates force-write abort/commit Log Subordinates force-write abort/commit Log record accordingly, then send an “ack” record accordingly, then send an “ack” message to Coordinatormessage to Coordinator
Coordinator writes Coordinator writes endend Log record after Log record after receiving all acksreceiving all acks
J.J.Bunn, Distributed Databases, 2001 69
Notes on Two-Phase Notes on Two-Phase Commit (2PC)Commit (2PC)
First: voting, Second: termination First: voting, Second: termination – both initiated by Coordinator– both initiated by Coordinator
Any site can decide to abort the Any site can decide to abort the TransactionTransaction
Every message is recorded in the Every message is recorded in the local Log by the sender to ensure local Log by the sender to ensure it survives failuresit survives failures
All Commit Protocol log records All Commit Protocol log records for a Transaction contain the for a Transaction contain the Transaction ID and Coordinator ID. Transaction ID and Coordinator ID. The Coordinator’s abort/commit The Coordinator’s abort/commit record also includes the Site IDs of record also includes the Site IDs of all subordinatesall subordinates
J.J.Bunn, Distributed Databases, 2001 70
Restart after Site Restart after Site FailureFailure If there is a commit or abort Log If there is a commit or abort Log
record for Transaction T, but no record for Transaction T, but no end record, then must undo/redo Tend record, then must undo/redo T If the site is Coordinator for T, then If the site is Coordinator for T, then
keep sending keep sending commit/abortcommit/abort messages messages to Subordinates until to Subordinates until acksacks received received
If there is a prepare Log record, If there is a prepare Log record, but no commit or abort:but no commit or abort: This site is a Subordinate for TThis site is a Subordinate for T Contact Coordinator to find status of Contact Coordinator to find status of
T, then T, then write write commit/abortcommit/abort Log record Log record Redo/undoRedo/undo T T Write Write endend Log record Log record
J.J.Bunn, Distributed Databases, 2001 71
BlockingBlocking
If Coordinator for Transaction If Coordinator for Transaction T fails, then Subordinates T fails, then Subordinates who have voted “yes” cannot who have voted “yes” cannot decide whether to decide whether to commitcommit or or abort abort until Coordinator until Coordinator recovers!recovers!
T is T is blockedblocked Even if all Subordinates are Even if all Subordinates are
aware of one another (e.g. via aware of one another (e.g. via extra information in extra information in “prepare” message) they are “prepare” message) they are blocked blocked Unless one of them voted “no”Unless one of them voted “no”
J.J.Bunn, Distributed Databases, 2001 72
Link and Remote Site Link and Remote Site FailuresFailures
IfIf a Remote Site does not a Remote Site does not respond during the Commit respond during the Commit Protocol for TProtocol for T E.g. it crashed or the link is E.g. it crashed or the link is
downdown ThenThen
If current Site is Coordinator for If current Site is Coordinator for T: abortT: abort
If Subordinate and not yet voted If Subordinate and not yet voted “yes”: abort“yes”: abort
If Subordinate and has voted If Subordinate and has voted “yes”, it is blocked until “yes”, it is blocked until Coordinator back onlineCoordinator back online
J.J.Bunn, Distributed Databases, 2001 73
Observations on 2PCObservations on 2PC
Ack messages used to let Ack messages used to let Coordinator know when it can Coordinator know when it can “forget” a Transaction“forget” a Transaction Until it receives all acks, it must Until it receives all acks, it must
keep T in the Transaction Tablekeep T in the Transaction Table If Coordinator fails after If Coordinator fails after
sending “prepare” messages, sending “prepare” messages, but before writing but before writing commit/abort Log record, commit/abort Log record, when it comes back up, it when it comes back up, it aborts Taborts T
If a subtransaction does no If a subtransaction does no updates, its commit or abort updates, its commit or abort status is irrelevantstatus is irrelevant
J.J.Bunn, Distributed Databases, 2001 74
2PC with Presumed 2PC with Presumed AbortAbort When Coordinator aborts T, it When Coordinator aborts T, it
undoes T and removes it from the undoes T and removes it from the Transaction Table immediatelyTransaction Table immediately Doesn’t wait for “acks”Doesn’t wait for “acks” ““Presumes Abort” if T not in Presumes Abort” if T not in
Transaction TableTransaction Table Names of Subordinates not recorded Names of Subordinates not recorded
in in abortabort Log record Log record Subordinates do not send “ack” Subordinates do not send “ack”
on aborton abort If subtransaction does no updates, If subtransaction does no updates,
it responds to “prepare” message it responds to “prepare” message with “reader” (instead of with “reader” (instead of “yes”/”no”)“yes”/”no”)
Coordinator subsequently ignores Coordinator subsequently ignores “reader”s“reader”s
If all Subordinates are “reader”s, If all Subordinates are “reader”s, then 2then 2ndnd. Phase not required. Phase not required
J.J.Bunn, Distributed Databases, 2001 75
Replication and Replication and Partitioning ComparedPartitioning Compared
CentralCentralScaleupScaleup2x2x
more workmore work
Partition Partition ScaleupScaleup2x2x
more workmore work
ReplicationReplicationScaleupScaleup4x4x
more workmore work
ReplicationPartitioningTwo 1 TPS systems
1 TPS server100 Users
1 TPS server100 Users
O t
ps
O t
ps
1 TPS server100 Users
Base casea 1 TPS system
2 TPS server200 Users
Scaleupto a 2 TPS centralized system
Two 2 TPS systems
2 TPS server100 Users
2 TPS server100 Users
1 t
ps
1 t
ps
J.J.Bunn, Distributed Databases, 2001 76
““Porter” Agent-based Porter” Agent-based Distributed DatabaseDistributed Database
Charles Univ, PragueCharles Univ, Prague Based on “Aglets” SDK from IBMBased on “Aglets” SDK from IBM
Part 3Part 3
Distributed SystemsDistributed Systems..
Julian Bunn
California Institute of Technology
J.J.Bunn, Distributed Databases, 2001 78
What’s a Distributed What’s a Distributed System?System?
Centralized: Centralized: everything in one placeeverything in one place stand-alone PC or stand-alone PC or
MainframeMainframe
Distributed: Distributed: some parts remotesome parts remote
distributed usersdistributed users distributed executiondistributed execution distributed datadistributed data
J.J.Bunn, Distributed Databases, 2001 79
Why Distribute?Why Distribute?
No best organizationNo best organization
Organisations constantly swing Organisations constantly swing betweenbetween Centralized: focus, control, economyCentralized: focus, control, economy Decentralized: adaptive, responsive, Decentralized: adaptive, responsive,
competitivecompetitive
Why distribute?Why distribute? reflect organisation or application reflect organisation or application
structure structure empower users / producersempower users / producers improve service (response / improve service (response /
availability)availability) distribute loaddistribute load use PC technology (economics)use PC technology (economics)
J.J.Bunn, Distributed Databases, 2001 80
What What Should Be Should Be
Distributed? Distributed? Users and User Users and User InterfaceInterface Thin client Thin client
ProcessingProcessing Trim clientTrim client
DataData Fat clientFat client
Will discuss tradeoffs Will discuss tradeoffs later later
Database
Business Objects
workflow
Presentation
J.J.Bunn, Distributed Databases, 2001 81
Transparency Transparency in Distributed in Distributed
SystemsSystems Make distributed system as easy to Make distributed system as easy to use and manage as a centralized use and manage as a centralized systemsystem
Give a Single-System ImageGive a Single-System Image
Location transparency:Location transparency: hide fact that object is remotehide fact that object is remote hide fact that object has movedhide fact that object has moved hide fact that object is partitioned or hide fact that object is partitioned or
replicatedreplicated Name doesn’t change if object is Name doesn’t change if object is
replicated, partitioned or moved.replicated, partitioned or moved.
J.J.Bunn, Distributed Databases, 2001 82
Naming- The basicsNaming- The basics Objects haveObjects have
Globally Unique Identifier (GUIDs)Globally Unique Identifier (GUIDs) location(s) = address(es) location(s) = address(es) name(s) name(s) addresses can changeaddresses can change objects can have many namesobjects can have many names
Names are context dependent:Names are context dependent: (Jim @ KGB (Jim @ KGB Jim @ CIA)Jim @ CIA)
Many naming systemsMany naming systems UNC: \\node\device\dir\dir\dir\objectUNC: \\node\device\dir\dir\dir\object Internet: Internet:
http://node.domain.root/dir/dir/dir/objecthttp://node.domain.root/dir/dir/dir/object LDAP: LDAP:
ldap://ldap.domain.root/o=org,c=US,cn=dirldap://ldap.domain.root/o=org,c=US,cn=dir
guid
Jim
Address
James
J.J.Bunn, Distributed Databases, 2001 83
Name ServersName Serversin Distributed in Distributed
SystemsSystems Name servers translate Name servers translate names + context names + context
to address (+ GUID)to address (+ GUID) Name servers are partitioned Name servers are partitioned
(subtrees of name space)(subtrees of name space) Name servers replicate root Name servers replicate root
of name treeof name tree Name servers form a hierarchyName servers form a hierarchy Distributed data from hell: Distributed data from hell:
high read traffic high read traffic high reliability & availabilityhigh reliability & availability
autonomyautonomy
J.J.Bunn, Distributed Databases, 2001 84
Autonomy Autonomy in Distributed in Distributed
SystemsSystems Owner of site Owner of site (or node, or application, or (or node, or application, or database)database)Wants to control itWants to control it
If my part is working, If my part is working, must be able to access & manage itmust be able to access & manage it
(reorganize, upgrade, add user,(reorganize, upgrade, add user,…)…)
Autonomy isAutonomy is EssentialEssential Difficult to implement. Difficult to implement. Conflicts with global consistencyConflicts with global consistency
examples: naming, authentication, examples: naming, authentication, admin…admin…
J.J.Bunn, Distributed Databases, 2001 85
Security Security The BasicsThe Basics
Authentication server Authentication server subject + Authenticator => subject + Authenticator =>
(Yes + (Yes + token) | Notoken) | No
Security matrix:Security matrix: who can do what to who can do what to
whomwhom Access control list is Access control list is
column of matrixcolumn of matrix ““who” is who” is
authenticated IDauthenticated ID In a distributed system, In a distributed system,
“who” and “what” and “who” and “what” and “whom” are distributed “whom” are distributed objectsobjects
subject
Object
Permissions
J.J.Bunn, Distributed Databases, 2001 86
Security Security in Distributed in Distributed
SystemsSystems Security domainSecurity domain: :
nodes with a shared security server.nodes with a shared security server. Security domains can have trust relationships:Security domains can have trust relationships:
A trusts B: A “believes” B when it says this is Jim@BA trusts B: A “believes” B when it says this is Jim@B
Security domains form a hierarchy.Security domains form a hierarchy. Delegation: Delegation: passing authority to a server passing authority to a server
when A asks B to do something when A asks B to do something (e.g. print a file, read a (e.g. print a file, read a database)database)B may need A’s authorityB may need A’s authority
Autonomy requires:Autonomy requires: each node is an authenticatoreach node is an authenticator each node does own security checkseach node does own security checks
Internet Today: Internet Today: no trust among domains no trust among domains (fire walls, many (fire walls, many
passwords)passwords) trust based on digital signaturestrust based on digital signatures
J.J.Bunn, Distributed Databases, 2001 87
Clusters Clusters The Ideal Distributed System.The Ideal Distributed System.
Cluster is Cluster is distributed distributed system BUT singlesystem BUT single locationlocation managermanager security policysecurity policy
relatively relatively homogeneoushomogeneous
communications iscommunications is high bandwidthhigh bandwidth low latencylow latency low error ratelow error rate
Clusters use Clusters use distributed distributed
system system techniques fortechniques for load distributionload distribution
storage storage executionexecution
growthgrowth fault tolerancefault tolerance
J.J.Bunn, Distributed Databases, 2001 88
Cluster: Shared Cluster: Shared What?What? Shared Memory Shared Memory
MultiprocessorMultiprocessor Multiple processors, one Multiple processors, one
memorymemory all devices are localall devices are local HP V-classHP V-class
Shared Disk ClusterShared Disk Cluster an array of nodesan array of nodes all shared common disksall shared common disks VAXcluster + OracleVAXcluster + Oracle
Shared Nothing ClusterShared Nothing Cluster each device local to a nodeeach device local to a node ownership may changeownership may change Beowulf,Tandem, SP2, Beowulf,Tandem, SP2,
WolfpackWolfpack
J.J.Bunn, Distributed Databases, 2001 89
Distributed ExecutionDistributed ExecutionThreads and MessagesThreads and Messages
Thread is Execution unitThread is Execution unit(software analog of (software analog of
cpu+memory)cpu+memory)
Threads execute at a Threads execute at a nodenode
Threads communicate Threads communicate viavia Shared memory (local)Shared memory (local) Messages (local and Messages (local and
remote)remote)
threads
shared memory
messages
J.J.Bunn, Distributed Databases, 2001 90
Peer-to-Peer or Client-Peer-to-Peer or Client-ServerServer
Peer-to-Peer is Peer-to-Peer is symmetric:symmetric: Either side can sendEither side can send
Client-serverClient-server client sends requestsclient sends requests server sends responsesserver sends responses simple subset of peer-to-simple subset of peer-to-
peerpeer
requestresponse
J.J.Bunn, Distributed Databases, 2001 91
Connection-less Connection-less oror ConnectedConnected Connection-less Connection-less
request containsrequest contains client idclient id client contextclient context work requestwork request
client authenticated on client authenticated on each messageeach message
only a single response only a single response messagemessage
e.g. HTTP, NFS v1 e.g. HTTP, NFS v1
Connected Connected (sessions)(sessions)open - request/reply - closeopen - request/reply - closeclient authenticated onceclient authenticated onceMessages arrive in orderMessages arrive in orderCan send many replies Can send many replies (e.g. FTP)(e.g. FTP)
Server has client contextServer has client context (context sensitive) (context sensitive) e.g. Winsock and ODBC e.g. Winsock and ODBC HTTP adding connectionsHTTP adding connections
J.J.Bunn, Distributed Databases, 2001 92
Remote Procedure Remote Procedure Call: The key to Call: The key to transparencytransparency
Object may Object may be local or be local or remoteremote
Methods on Methods on object work object work wherever it wherever it is.is.
Local Local invocationinvocation
y = pObj->f(x);
f()
x
valy = val;
return val;
J.J.Bunn, Distributed Databases, 2001 93
Remote Procedure Remote Procedure Call: Call: The key to transparencyThe key to transparency
Remote invocationRemote invocation
Obj Local?x
valy = val;
f()
return val;
y = pObj->f(x);
marshal
unmarshal
marshal
unmarshal
x
proxy
unmarshal
pObj->f(x)
marshal
xstub
Obj Local?Obj Local?x
val
f()
return val;
val
val
Gee!! Nice pictures!
J.J.Bunn, Distributed Databases, 2001 94
Transaction
Object Request Broker Object Request Broker (ORB)(ORB)
Orchestrates RPCOrchestrates RPC Registers ServersRegisters Servers Manages pools of serversManages pools of servers Connects clients to serversConnects clients to servers Does Naming, request-level Does Naming, request-level
authorization,authorization, Provides transaction coordination Provides transaction coordination (new (new
feature)feature) Old names: Old names:
Transaction Processing Monitor, Transaction Processing Monitor, Web server, Web server, NetWareNetWare Object-Request Broker
J.J.Bunn, Distributed Databases, 2001 95
Send updates to correct Send updates to correct partitionpartition
Using RPC for Using RPC for TransparencyTransparency
Partition TransparencyPartition Transparency
part Local?xy = pfile->write(x);
sendto
correct partition
sendto
correct partition
x
unmarshal
pObj->write(x)
marshal
x
x
val
write()
return val;
val
val
J.J.Bunn, Distributed Databases, 2001 96
Send updates to EACH Send updates to EACH nodenode
Using RPC for Using RPC for TransparencyTransparency
Replication TransparencyReplication Transparency
xy = pfile->write(x);
Sendto
eachreplica
Sendto
eachreplica
x
val
J.J.Bunn, Distributed Databases, 2001 97
Client/Server Client/Server InteractionsInteractions
All can be done with RPCAll can be done with RPC Request-ResponseRequest-Responseresponse may be many response may be many
messagesmessages
ConversationalConversationalserver keeps client contextserver keeps client context
DispatcherDispatcherthree-tier: complex operation at three-tier: complex operation at
serverserver
QueuedQueuedde-couples client from serverde-couples client from serverallows disconnected operationallows disconnected operation
C S
C S
C S SS
C SS
J.J.Bunn, Distributed Databases, 2001 98
Queued Queued Request/ResponseRequest/Response Time-decouples client and serverTime-decouples client and server
Three TransactionsThree Transactions
Almost real time, ASAP processingAlmost real time, ASAP processing
Communicate at each other’s Communicate at each other’s convenienceconvenienceAllows mobile (disconnected) operationAllows mobile (disconnected) operation
Disk queues survive client & server Disk queues survive client & server failuresfailures
Client Server
SubmitPerform
Response
J.J.Bunn, Distributed Databases, 2001 99
Why Queued Why Queued Processing?Processing? Prioritize requestsPrioritize requests
ambulance dispatcher favors high-priority callsambulance dispatcher favors high-priority calls
Manage WorkflowsManage Workflows
Deferred processing in mobile appsDeferred processing in mobile apps
Interface heterogeneous systemsInterface heterogeneous systemsEDI, EDI, MOM: Message-Oriented-Middleware MOM: Message-Oriented-Middleware DAD: Direct Access to Data DAD: Direct Access to Data
Order Build Ship Invoice Pay
J.J.Bunn, Distributed Databases, 2001 100
Work Distribution Work Distribution SpectrumSpectrum
Presentation Presentation
and plug-insand plug-ins Workflow Workflow
manages manages session & session & invokes invokes objectsobjects
Business Business objectsobjects
DatabaseDatabase
Fat
ThinFat
Thin
Database
Business Objects
workflow
Presentation
J.J.Bunn, Distributed Databases, 2001 101
Transaction Processing Transaction Processing Evolution to Three TierEvolution to Three TierIntelligence migrated to clients Intelligence migrated to clients
Mainframe Batch Mainframe Batch processing (centralized)processing (centralized)
Dumb terminals &Dumb terminals & Remote Job Entry Remote Job Entry
Intelligent terminals Intelligent terminals database backendsdatabase backends
Workflow SystemsWorkflow SystemsObject Request BrokersObject Request BrokersApplication GeneratorsApplication Generators
Mainframe
cards
Active
green screen3270
Server
TP Monitor
ORB
J.J.Bunn, Distributed Databases, 2001 102
Web Evolution to Three Web Evolution to Three TierTier
Intelligence migrated to clients (like Intelligence migrated to clients (like TP)TP)
Character-mode clients, Character-mode clients, smart serverssmart servers
GUI Browsers - Web file GUI Browsers - Web file serversservers
GUI Plugins - Web dispatchers GUI Plugins - Web dispatchers - CGI- CGI
Smart clients - Web dispatcher Smart clients - Web dispatcher (ORB)(ORB)pools of app servers (ISAPI, pools of app servers (ISAPI, Viper)Viper)workflow scripts at client & workflow scripts at client & serverserver
archie ghophergreen screen
WebServer
Mosaic
WAIS
NS & IE
Active
J.J.Bunn, Distributed Databases, 2001 103
PC Evolution to Three PC Evolution to Three TierTier
Intelligence migrated to serverIntelligence migrated to server Stand-alone PC Stand-alone PC
(centralized)(centralized)
PC + File & print PC + File & print serverserver
message per I/Omessage per I/O
PC + Database PC + Database server server
message per SQL message per SQL statementstatement
PC + App server PC + App server message per message per
transactiontransaction
ActiveX Client, ORB ActiveX Client, ORB ActiveX server, ActiveX server, Xscript Xscript
disk I/OIO request
reply
SQL Statement
Transaction
J.J.Bunn, Distributed Databases, 2001 104
The Pattern: The Pattern: Three Tier ComputingThree Tier Computing
Clients do presentation, gather Clients do presentation, gather inputinput
Clients do some workflow Clients do some workflow (Xscript)(Xscript)
Clients send high-level requests Clients send high-level requests to ORB (Object Request Broker)to ORB (Object Request Broker)
ORB dispatches workflows and ORB dispatches workflows and business objects -- proxies for business objects -- proxies for client, orchestrate flows & client, orchestrate flows & queuesqueues
Server-side workflow scripts call Server-side workflow scripts call on distributed business objects on distributed business objects to execute taskto execute task
Database
Business Objects
workflow
Presentation
J.J.Bunn, Distributed Databases, 2001 105
The Three The Three TiersTiers
Web Client
HTML
VB or Java Script Engine
VB or Java Virt Machine
VBscritptJavaScrpt
VB Javaplug-ins
InternetORB
HTTP+DCOM
ObjectserverPool
MiddlewareORB
TP MonitorWeb Server...
DCOM (oleDB, ODBC,...)
Object & Dataserver.
LU6.2
IBMLegacy Gateways
J.J.Bunn, Distributed Databases, 2001 106
Why Did Everyone Go To Why Did Everyone Go To Three-Tier?Three-Tier?
ManageabilityManageability Business rules must be with dataBusiness rules must be with data Middleware operations toolsMiddleware operations tools
Performance (scaleability)Performance (scaleability) Server resources are preciousServer resources are precious ORB dispatches requests to server ORB dispatches requests to server
poolspools
Technology & PhysicsTechnology & Physics Put UI processing near userPut UI processing near user Put shared data processing near Put shared data processing near
shared datashared data Database
Business Objects
workflow
Presentation
J.J.Bunn, Distributed Databases, 2001 107
DAD’sRaw Data
Customer comes to storeTakes what he wantsFills out invoiceLeaves money for goods
Easy to buildNo clerks
Why Put Business Why Put Business Objects Objects
at Server?at Server?
Customer comes to store with list Gives list to clerk Clerk gets goods, makes invoiceCustomer pays clerk, gets goods
Easy to manageClerks controls accessEncapsulation
MOM’s Business Objects
J.J.Bunn, Distributed Databases, 2001 108
Why Server Pools?Why Server Pools? Server resources are precious.Server resources are precious.
Clients have 100x more power than Clients have 100x more power than server. server.
Pre-allocate everything on serverPre-allocate everything on server preallocate memorypreallocate memory pre-open filespre-open files pre-allocate threadspre-allocate threads pre-open and authenticate clientspre-open and authenticate clients
Keep high duty-cycle on objectsKeep high duty-cycle on objects (re-use them)(re-use them) Pool threads, not one per clientPool threads, not one per client
Classic example: Classic example: TPC-C benchmarkTPC-C benchmark 2 processes2 processes
everything pre-allocatedeverything pre-allocated
7,000 clients
IIS SQL
Pool ofDBC linksHTTP
N clients x N Servers x F files =N x N x F file opens!!!
IE
J.J.Bunn, Distributed Databases, 2001 109
Classic MistakesClassic Mistakes Thread per terminalThread per terminal
fix: DB server thread poolsfix: DB server thread poolsfix: server poolsfix: server pools
Process per request (CGI)Process per request (CGI)fix: ISAPI & NSAPI DLLs fix: ISAPI & NSAPI DLLs fix: connection poolsfix: connection pools
Many messages per Many messages per operationoperationfix: stored proceduresfix: stored proceduresfix: server-side objectsfix: server-side objects
File open per requestFile open per requestfix: cache hot filesfix: cache hot files
J.J.Bunn, Distributed Databases, 2001 110
Distributed Distributed Applications need Applications need
Transactions!Transactions! Transactions are key to Transactions are key to structuring distributed applications structuring distributed applications
ACID properties easeACID properties easeexception handlingexception handling AtomicAtomic: all or nothing: all or nothing ConsistentConsistent: state transformation: state transformation IsolatedIsolated: no concurrency anomalies: no concurrency anomalies DurableDurable: committed transaction effects : committed transaction effects
persistpersist
J.J.Bunn, Distributed Databases, 2001 111
Programming & Programming & TransactionsTransactions
The Application ViewThe Application View You Start You Start (e.g. in TransactSQL)(e.g. in TransactSQL):: Begin [Distributed] Transaction Begin [Distributed] Transaction
<name><name> Perform actionsPerform actions Optional Save Transaction Optional Save Transaction
<name><name> Commit or RollbackCommit or Rollback
You Inherit a XIDYou Inherit a XID Caller passes you a transactionCaller passes you a transaction You return or Rollback.You return or Rollback. You can Begin / Commit sub-You can Begin / Commit sub-
trans.trans. You can use save pointsYou can use save points
Begin
Commit
Begin
RollBack
Return
RollBackReturn
XID
J.J.Bunn, Distributed Databases, 2001 112
Transaction Save Transaction Save PointsPoints
Backtracking within a Backtracking within a transactiontransaction
Allows app to Allows app to cancel parts cancel parts of a of a transaction transaction prior to prior to commitcommit
This is in This is in most SQL most SQL productsproducts
BEGIN WORK:1
SAVE WORK:2
action
action
action
action
action
action
action
SAVE WORK:4
SAVE WORK:3
ROLLBACK WORK(2)
SAVE WORK:5
action
action
SAVE WORK:6
action
SAVE WORK:7
action
action
action
action
ROLLBACK WORK(7)
SAVE WORK:8
action
action
action
COMMIT WORK
J.J.Bunn, Distributed Databases, 2001 113
Chained TransactionsChained Transactions Commit of T1 implicitly begins T2. Carries context forward to next
transaction cursorscursors lockslocks other stateother state
Transaction #1 Transaction #2Commit
Begin
Processingcontext
establishedestablished
Processingcontextusedused
J.J.Bunn, Distributed Databases, 2001 114
Nested TransactionsNested TransactionsGoing Beyond Flat TransactionsGoing Beyond Flat Transactions
Need transactions within Need transactions within transactionstransactions
Sub-transactions commit only if Sub-transactions commit only if root doesroot does
Only root commit is durable.Only root commit is durable. Subtransactions may rollbackSubtransactions may rollback
if so, all its subtransactions if so, all its subtransactions rollbackrollback
Parallel version of nested Parallel version of nested transactionstransactions
T1
T114
T113
T112T111
T11
T123T122T121T12
T131 T132 T133T13
J.J.Bunn, Distributed Databases, 2001 115
Workflow: Workflow: A Sequence of TransactionsA Sequence of Transactions
Application transactions are Application transactions are multi-stepmulti-step order, build, ship & invoice, order, build, ship & invoice,
reconcilereconcile Each step is an ACID unitEach step is an ACID unit Workflow is a script describing Workflow is a script describing
stepssteps Workflow systems Workflow systems
Instantiate the scriptsInstantiate the scripts Drive the scriptsDrive the scripts Allow query against scriptsAllow query against scripts
Examples Examples Manufacturing Work In Process Manufacturing Work In Process
(WIP)(WIP)Queued processingQueued processingLoan application & approval,Loan application & approval,Hospital admissions…Hospital admissions…
Database
Business Objects
workflow
Presentation
J.J.Bunn, Distributed Databases, 2001 116
Workflow ScriptsWorkflow Scripts Workflow scripts are programsWorkflow scripts are programs
(could use VBScript or JavaScript)(could use VBScript or JavaScript)
If step fails, compensation action If step fails, compensation action handles errorhandles error
Events, messages, time, other steps Events, messages, time, other steps cause step.cause step.
Workflow controller drives flowsWorkflow controller drives flows
Source
Step
branchcase
fork
loopCompensationAction
join
J.J.Bunn, Distributed Databases, 2001 117
Workflow and ACIDWorkflow and ACID Workflow is not Atomic or IsolatedWorkflow is not Atomic or Isolated Results of a step visible to allResults of a step visible to all Workflow is Consistent and Workflow is Consistent and
DurableDurable Each flow may take hours, weeks, Each flow may take hours, weeks,
monthsmonths Workflow controller Workflow controller
keeps flows movingkeeps flows moving maintains context (state) for each flowmaintains context (state) for each flow provides a query and operator provides a query and operator
interfaceinterfacee.g.: “what is the status of Job # e.g.: “what is the status of Job # 72149?”72149?”
J.J.Bunn, Distributed Databases, 2001 118
ACID Objects Using ACID ACID Objects Using ACID DBsDBs
The easy way to build transactional The easy way to build transactional objectsobjects
Application uses transactional Application uses transactional objectsobjects(objects have ACID properties)(objects have ACID properties)
If object built on top of ACID If object built on top of ACID objects, objects, then object is ACID.then object is ACID. Example: New, EnQueue, DeQueue Example: New, EnQueue, DeQueue
on top of SQLon top of SQL SQL provides ACIDSQL provides ACID
SQL
Business Object: Customer
Business Object Mgr: CustomerMgr
SQL
dim c as Customerdim CM as CustomerMgr...set C = CM.get(CustID)...C.credit_limit = 1000...CM.update(C, CustID)..Persistent Programming languages automate this.
J.J.Bunn, Distributed Databases, 2001 119
ACID Objects From Bare ACID Objects From Bare Metal Metal The Hard Way to Build The Hard Way to Build
Transactional ObjectsTransactional Objects Object Class is a Object Class is a Resource Manager Resource Manager (RM)(RM) Provides ACID objects from persistent Provides ACID objects from persistent
storagestorage Provides Undo (on rollback)Provides Undo (on rollback) Provides Redo (on restart or media Provides Redo (on restart or media
failure)failure) Provides Isolation for concurrent opsProvides Isolation for concurrent ops
Microsoft SQL Server, IBM DB2, Microsoft SQL Server, IBM DB2, Oracle,…Oracle,…are Resource managers.are Resource managers.
Many more coming.Many more coming. RM implementation techniques described RM implementation techniques described
laterlater
J.J.Bunn, Distributed Databases, 2001 120
Transaction ManagerTransaction Manager Transaction Manager (TM): Transaction Manager (TM):
manages transaction manages transaction objects.objects. XID factoryXID factory tracks themtracks them coordinates themcoordinates them
App gets XID from TMApp gets XID from TM Transactional RPC Transactional RPC
passes XID on all callspasses XID on all calls manages XID inheritancemanages XID inheritance
TM manages commit & TM manages commit & rollbackrollback
AppAppRMRM
TMTM
begin
XID
call(..XID)
enlist
J.J.Bunn, Distributed Databases, 2001 121
TM Two-Phase TM Two-Phase CommitCommit
Dealing with multiple RMsDealing with multiple RMs If all use one RM, then all or none If all use one RM, then all or none commitcommit
If multiple RMs, then need If multiple RMs, then need coordinationcoordination
Standard technique:Standard technique: Marriage: Do you? I do. I Marriage: Do you? I do. I
pronounce…Kisspronounce…Kiss Theater: Ready on the set? Ready! Theater: Ready on the set? Ready!
Action! ActAction! Act Sailing: Ready about? Ready! Helm’s Sailing: Ready about? Ready! Helm’s
a-lee! Tacka-lee! Tack Contract law: Escrow agentContract law: Escrow agent
Two-phase commit:Two-phase commit: 1. Voting phase: can you do it?1. Voting phase: can you do it? 2. If all vote yes, then commit phase: 2. If all vote yes, then commit phase:
do it! do it!
J.J.Bunn, Distributed Databases, 2001 122
Two-Phase Commit In Two-Phase Commit In PicturesPictures Transactions managed by TM Transactions managed by TM
App gets unique ID (XID) from App gets unique ID (XID) from TM at Begin()TM at Begin()
XID passed on Transactional XID passed on Transactional RPCRPC
RMs Enlist when first do work RMs Enlist when first do work on XIDon XID
AppApp RM1RM1
TMTM
RM2RM2
Begin
XID
Call(..XID..)
Enlist
Call(..XID..)
Enlist
J.J.Bunn, Distributed Databases, 2001 123
When App Requests When App Requests CommitCommit
Two Phase Commit in PicturesTwo Phase Commit in Pictures TM tracks all RMs enlisted on an TM tracks all RMs enlisted on an XIDXID
TM calls enlisted RM’s Prepared() TM calls enlisted RM’s Prepared() callbackcallback
If all vote yes, TM calls RM’s Commit() If all vote yes, TM calls RM’s Commit() If any vote no, TM calls RM’s Rollback()If any vote no, TM calls RM’s Rollback()
AppApp RM1RM1
TMTM
RM2RM2
1. Application requests Commit1. Application requests Commit
Commit1
Prepare
Prepare2
2
2. TM broadcasts prepared?2. TM broadcasts prepared?
YesYes
3
3
3. RMs all vote Yes3. RMs all vote Yes
4. TM decides Yes, 4. TM decides Yes, broadcastsbroadcasts
4
Comm
it
Comm
it4
5. RMs 5. RMs acknowledgeacknowledge
5
Yes5
Yesyes
6. TM says6. TM saysyesyes
J.J.Bunn, Distributed Databases, 2001 124
Implementing Implementing TransactionsTransactions
AtomicityAtomicity The DO/UNDO/REDO protocolThe DO/UNDO/REDO protocol IdempotenceIdempotence Two-phase commitTwo-phase commit
DurabilityDurability Durable logsDurable logs Force at commitForce at commit
IsolationIsolation Locking or versioningLocking or versioning
Part 4Part 4
Distributed Distributed Databases for Databases for
PhysicsPhysics..
Julian Bunn
California Institute of Technology
J.J.Bunn, Distributed Databases, 2001 126
Distributed Distributed Databases in PhysicsDatabases in Physics
Virtual Observatories (e.g. Virtual Observatories (e.g. NVO)NVO)
Gravity Wave Data (e.g. LIGO)Gravity Wave Data (e.g. LIGO) Particle Physics (e.g. LHC Particle Physics (e.g. LHC
Experiments)Experiments)
J.J.Bunn, Distributed Databases, 2001 127
Distributed Particle Distributed Particle Physics DataPhysics Data
Next Generation of particle Next Generation of particle physics experiments are data physics experiments are data intensiveintensive Acquisition rates of 100 Acquisition rates of 100
MBytes/secondMBytes/second At least One PetaByte (10At least One PetaByte (101515
Bytes) of raw data per year, per Bytes) of raw data per year, per experimentexperiment
Another PetaByte of Another PetaByte of reconstructed datareconstructed data
More PetaBytes of simulated More PetaBytes of simulated datadata
Many TeraBytes of MetaDataMany TeraBytes of MetaData To be accessed by ~2000 To be accessed by ~2000
physicists sitting around the physicists sitting around the globeglobe
J.J.Bunn, Distributed Databases, 2001 128
An Ocean of ObjectsAn Ocean of Objects Access from anywhere to any Access from anywhere to any
object in an Ocean of many object in an Ocean of many PetaBytes of objectsPetaBytes of objects
Approach:Approach: Distribute collections of useful Distribute collections of useful
objects to where they will be objects to where they will be most usedmost used
Move applications Move applications toto the the collection locationscollection locations
Maintain an up-to-date Maintain an up-to-date catalogue of collection locationscatalogue of collection locations
Try to balance the global Try to balance the global compute resources with the compute resources with the task load from the global clientstask load from the global clients
J.J.Bunn, Distributed Databases, 2001 129
RDBMS vs. Object RDBMS vs. Object DatabaseDatabase
•DBMS functionality split between the client and server
•allowing computing resources to be used
•allowing scalability.
•clients added without slowing down others,
•ODBMS automatically establishes direct, independent, parallel communication paths between clients and servers
•servers added to incrementally increase performance without limit.
•Users send requests into the server queue
•all requests must first be serialized through this queue.
•to achieve serialization and avoid conflicts, all requests must go through the server queue.
•Once through the queue, the server may be able to spawn off multiple threads
J.J.Bunn, Distributed Databases, 2001 130
Designing the Designing the Distributed DatabaseDistributed Database
Problem is: how to handle Problem is: how to handle distributed clients and distributed distributed clients and distributed data whilst maximising client task data whilst maximising client task throughput and use of resourcesthroughput and use of resources
Distributed Databases for:Distributed Databases for: The physics dataThe physics data The metadataThe metadata
Use middleware that is conscious Use middleware that is conscious of the global state of the system:of the global state of the system: Where are the clients?Where are the clients? What data are they asking for?What data are they asking for? Where are the CPU resources?Where are the CPU resources? Where are the Storage resources?Where are the Storage resources? How does the global system measure How does the global system measure
up to it workload, in the past, now up to it workload, in the past, now and in the future?and in the future?
J.J.Bunn, Distributed Databases, 2001 131
Distributed Distributed Databases for HEPDatabases for HEP
Replica synchronisation usually based Replica synchronisation usually based on small transactionson small transactions But HEP transactions are large (and long-But HEP transactions are large (and long-
lived)lived) Replication at the Object level desiredReplication at the Object level desired
Objectivity DRO requires dynamic quorumObjectivity DRO requires dynamic quorum bad for unstable WAN linksbad for unstable WAN links
So too difficult – use file replication So too difficult – use file replication E.g. GDMP Subscription methodE.g. GDMP Subscription method
Which Replica to Select?Which Replica to Select? Complex decision tree, involvingComplex decision tree, involving
Prevailing WAN and Systems conditionsPrevailing WAN and Systems conditions Objects that the Query “touches” and Objects that the Query “touches” and
“needs”“needs” Where the compute power isWhere the compute power is Where the replicas areWhere the replicas are Existence of previously cached datasetsExistence of previously cached datasets
J.J.Bunn, Distributed Databases, 2001 132
Distributed LHC Distributed LHC Databases TodayDatabases Today
Architecture is Architecture is loosely loosely coupled, coupled, autonomous, autonomous, Object Object DatabasesDatabases
File-based File-based replication replication withwith
Globus Globus middlewaremiddleware
Efficient WAN Efficient WAN transporttransport