CS346: Advanced Databases
Graham Cormode, G.Cormode@warwick.ac.uk

Distributed Databases, BASE, CAP & NoSQL

Outline

Chapter: “Distributed Databases” in Elmasri and Navathe

• What are distributed databases?
• Architectural choices
• ACID vs BASE
• Consistency, Availability, Partition tolerance: CAP
• NoSQL systems

Why?
• As data gets larger, must move to distributed data management
• Tech companies (Google, Facebook etc.) rely on distributed data

Distributed Databases

• When data gets large and processing is slow, use distribution
  – A distributed database (DDB) managed by a distributed DBMS
  – Goal: split the processing into smaller pieces and spread them
• DDB technology combines databases with OS/Networks
  – Manage concurrent access to replicated data
• DDB is quite different to e.g. the world-wide web
  – Similarities: many machines, distributed around the world
  – Different: each website is (mostly) independent of others
      Facebook and YouTube are managed independently
  – However: many large websites use DDB technology
      Facebook can be seen as a massive distributed database

Distributed Databases: Pros and Cons

• DDB can be (in principle) more available
  – If one machine fails, others can take over
• DDB can (in principle) be faster
  – Parallelize computation, combine results
• DDB is (in principle) easier to expand
  – Just add more machines/storage
• “In principle” isn’t always the case
  – DDB is more complicated to manage
  – Performance/availability may worsen in unpredictable ways
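The "parallelize computation, combine results" point can be illustrated with a minimal scatter-gather sketch. It assumes each inner list stands in for the rows held by one machine; the data and the names `fragments`, `partial_sum_count` and `distributed_average` are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list plays the role of one machine's rows.
fragments = [
    [30000, 40000],
    [25000, 43000, 38000],
    [55000],
]

def partial_sum_count(rows):
    """Local work on one machine: return (sum, count) for its own rows."""
    return sum(rows), len(rows)

def distributed_average(frags):
    """Scatter the local computation, then gather and combine the partials."""
    with ThreadPoolExecutor(max_workers=len(frags)) as pool:
        partials = list(pool.map(partial_sum_count, frags))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

average_salary = distributed_average(fragments)
```

Each machine returns only a (sum, count) pair rather than its rows, so the combine step stays cheap however large the fragments are.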

Additional functionality of DDB

The DDB has additional or expanded roles to perform:
• Keeping track of data distribution: where’s my data?
• Distributed query processing: break up a query into pieces
• Distributed transaction management: data items are distributed
• Replicated data management: keep distributed copies of the data
• Distributed database recovery: manage machine failures
• Security: manage security of distributed data
• Distributed catalog management: keep the metadata

Saw some of these issues in Hadoop/MapReduce

Distributed Architectures

• Many possible levels of sharing:
  – Shared memory: multiple processors (cores) share disk, memory
  – Shared disk: multiple cores share disk, but have separate memory
  – Shared nothing: no common storage, communicate over network
• ‘Shared nothing’ is the model for large distributed systems
  – Hadoop follows a shared nothing architecture
• Shared nothing pros and cons:
  – Can be slower: network is slower than local disk (is it? fibre is fast)
  – Easy to expand: add more machines to the network
  – Allows fragmentation (sharding): breaking the database into pieces
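One common way to shard in a shared-nothing setting is to hash each key to a machine. The sketch below is an illustration under assumptions, not a description of any particular system: the node names and the helpers `shard_for` and `build_shards` are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names

def shard_for(key, nodes=NODES):
    """Map a key to one node using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and so would not agree across machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def build_shards(keys, nodes=NODES):
    """Assign every key to exactly one node's shard."""
    shards = {node: [] for node in nodes}
    for key in keys:
        shards[shard_for(key, nodes)].append(key)
    return shards

shards = build_shards([f"user{i}" for i in range(100)])
```

With this simple modulo scheme, adding a machine changes the target of most keys, which is one reason "just add more machines" is easier in principle than in practice.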

Fragmentation and Replication

• How to split the data up among sites?
  – Horizontal fragmentation: subset of tuples on each machine
      E.g. break up the EMPLOYEE relation by Dno
  – Vertical fragmentation: different columns on each machine
      Name, Bdate, Address on one, Ssn, Salary, Dno on another
  – Mixed: break up by both horizontal and vertical
• How to replicate data around the system?
  – No replication: a unique copy exists
  – Fully replicated: data is copied everywhere
  – Partial replication: in between these two extremes
      E.g. HDFS, default number of replicas is 3
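The two fragmentation styles can be sketched on a toy EMPLOYEE relation. The rows below are invented for illustration, and the function names are hypothetical. One assumption goes beyond the slide: the key Ssn is copied into both vertical fragments so that the original rows can be rebuilt by a join.

```python
# Invented sample rows for the EMPLOYEE relation.
EMPLOYEE = [
    {"Ssn": "111", "Name": "Ann", "Bdate": "1980-01-01", "Address": "Leeds",
     "Salary": 30000, "Dno": 4},
    {"Ssn": "222", "Name": "Bob", "Bdate": "1975-06-30", "Address": "York",
     "Salary": 40000, "Dno": 5},
    {"Ssn": "333", "Name": "Cat", "Bdate": "1990-11-12", "Address": "Hull",
     "Salary": 25000, "Dno": 4},
]

def horizontal_fragments(rows, attr):
    """Group whole tuples by the value of one attribute (e.g. Dno)."""
    frags = {}
    for row in rows:
        frags.setdefault(row[attr], []).append(row)
    return frags

def vertical_fragment(rows, columns, key="Ssn"):
    """Keep only the chosen columns, plus the key needed to rejoin later."""
    cols = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in cols} for row in rows]

def rejoin(left, right, key="Ssn"):
    """Rebuild full rows by joining two vertical fragments on the key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left]

by_dno = horizontal_fragments(EMPLOYEE, "Dno")
personal = vertical_fragment(EMPLOYEE, ["Name", "Bdate", "Address"])
payroll = vertical_fragment(EMPLOYEE, ["Ssn", "Salary", "Dno"])
```

Here `by_dno[4]` holds two employees and `by_dno[5]` one, while `rejoin(personal, payroll)` recovers the original rows.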

ACID vs BASE systems

• Recall the ACID properties of transactions
  – Atomicity, Consistency, Isolation, Durability
• Not every system requires this level of guarantee
  – Can trade off guarantees for performance
• “BASE”: Basically Available, Soft-State, Eventually Consistent (coined by Eric Brewer, founder of Inktomi, 2000)
  – A weaker set of requirements
  – Drop consistency and isolation to improve availability, performance
  – Suits distributed settings without much competition for resources
• ACID vs BASE is a spectrum of possible design points
  – “Real internet systems are a mixture of ACID and BASE subsystems”

CAP concepts

• Consistency: all processes/transactions see the same data
  – Equivalent to having a single, up to date copy of the data
  – Not easy to provide, hence much effort on concurrency
• Availability: is the system up and responsive to requests?
  – All processes can find some version of the data they need
  – Formally: does every request receive a response (allowing fails)
• Partition-tolerance: what happens when the network breaks?
  – Network partition: something breaks and the network divides
      E.g. a router fails/crashes: messages can’t traverse the router
  – Does the system still operate even if messages are lost?

Points of Comparison

• Consistency: strong (ACID) or weak consistency (BASE)?
  – Weak: processes can see operations in different orders
  – Weak: synchronization points bring processes into agreement
• Eventual consistency: system eventually reaches a consistent state
  – If no updates are made to an item, then reads will give same value
• Compared to ACID, the BASE approach is:
  – More focused on availability of resources
  – Tolerates approximate answers rather than exact
  – More aggressive (optimistic concurrency control)
  – Aims to be simpler, faster
  – Provides ‘best effort’ rather than guarantees
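Eventual consistency can be sketched with two replicas that accept writes independently and later exchange state at a synchronization point. This is a toy last-writer-wins scheme, not any specific system's protocol: the `Replica` class and `sync` function are hypothetical, and it assumes every write carries a unique, increasing version number.

```python
class Replica:
    """One copy of the data, holding key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

def sync(a, b):
    """Synchronization point: each replica offers all its entries to the other."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.store.items()):
            dst.write(key, value, version)

r1, r2 = Replica("r1"), Replica("r2")
r1.write("x", "old", version=1)
r2.write("x", "new", version=2)
before = (r1.read("x"), r2.read("x"))  # replicas disagree here
sync(r1, r2)
after = (r1.read("x"), r2.read("x"))   # replicas agree after syncing
```

Before the sync, the two replicas return different values for the same key. Once updates stop and a sync has run, every read returns the same value.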

The CAP Conjecture / Theorem

• Brewer made a famous “CAP conjecture” in 2000
  – Consistency, Availability, Partition Tolerance: pick any two
  – I.e. it is impractical to build a distributed system with all three
• Lynch and Gilbert “proved” a CAP theorem in 2002
  – For a specific set of distributed scenarios
• An example of a ‘pick two’ (from three) choice
  – For university: Good grades, enough sleep or a social life
  – For products: fast, good or cheap

Consequences of CAP Theorem

Obtain different results from different choices:
• Forfeit partition tolerance (obtain consistency and availability)
  – E.g. traditional centralized DBMS
• Forfeit availability (obtain partition tolerance and consistency)
  – E.g. distributed databases, protocols based on majority agreement
• Forfeit consistency (obtain partition tolerance and availability)
  – E.g. Emerging NoSQL systems
• These concepts cut across many aspects of computer science:
  – The OS and network provide availability, but no consistency
  – Databases are better at consistency than availability
  – Distributed databases want both
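The "forfeit availability" choice can be sketched with a toy majority-agreement store: a request is refused unless it can reach more than half of the replicas. The class names and the `reachable` parameter are hypothetical, and the sketch assumes the caller supplies strictly increasing version numbers, which a real protocol would have to arrange itself.

```python
class Unavailable(Exception):
    """Raised when a request cannot reach a majority of replicas."""

class MajorityStore:
    def __init__(self, n):
        self.majority = n // 2 + 1
        # One dict per replica, each mapping key -> (version, value).
        self.replicas = [{} for _ in range(n)]

    def _check(self, reachable):
        if len(reachable) < self.majority:
            raise Unavailable("no majority reachable")

    def write(self, key, value, version, reachable):
        """Write to every reachable replica, but only if they form a majority."""
        self._check(reachable)
        for i in reachable:
            self.replicas[i][key] = (version, value)

    def read(self, key, reachable):
        """Read from a majority and return the value with the highest version."""
        self._check(reachable)
        found = [self.replicas[i][key] for i in reachable if key in self.replicas[i]]
        return max(found, key=lambda entry: entry[0])[1] if found else None

store = MajorityStore(5)
store.write("x", "v1", version=1, reachable={0, 1, 2, 3, 4})

# A network partition splits the replicas into {0, 1, 2} and {3, 4}.
majority_side, minority_side = {0, 1, 2}, {3, 4}
store.write("x", "v2", version=2, reachable=majority_side)
try:
    store.read("x", reachable=minority_side)
    minority_answered = True
except Unavailable:
    minority_answered = False

# After the partition heals, any majority overlaps the last write.
healed_read = store.read("x", reachable={2, 3, 4})
```

The minority side gives no answer rather than a stale one, which is the loss of availability. Any two majorities share at least one replica, so the read after healing sees `"v2"`.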

NoSQL systems

• NoSQL systems drop support for the full relational model
  – Do not provide same level of reliability/availability
  – Do not necessarily support rich languages like SQL
  – Aim to have simpler design, better scaling via distribution
• Often support analysis via query language or MapReduce on top
  – Systems primarily support data storage and retrieval

Types of NoSQL systems

• Key-value store: stores and retrieves data in the form (key, value)
  – E.g. store demographic data (values) for each user (by key)
  – Data is distributed, and replicated for resilience, e.g. Memcached
• Column store: stores data organized by column (instead of row)
  – Allows faster access to particular entries when data is sparse
  – Implemented in HBase (database component of Hadoop system); Apache Cassandra is a similar wide-column system
• Document store: to store and retrieve document data
  – E.g. to store information for very large websites (Amazon, eBay)
  – Each “document” can be an arbitrary collection of information
  – Examples include MongoDB
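The difference between row and column organization can be sketched with plain dictionaries. This only illustrates the layout idea and is not the storage format of HBase or any other system; the sample rows and the `to_columns` helper are hypothetical.

```python
# Row layout: one record per entity; a missing attribute is simply absent.
rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cat", "city": "York"},
]

def to_columns(records, key="id"):
    """Regroup the same data by attribute: column -> {row key -> value}."""
    columns = {}
    for record in records:
        for attr, value in record.items():
            if attr != key:
                columns.setdefault(attr, {})[record[key]] = value
    return columns

columns = to_columns(rows)
cities = columns["city"]  # touches only the entries that actually have a city
```

Reading one attribute now involves only the rows that have a value for it, which is the sense in which access is faster when data is sparse.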

NoSQL systems: pros and cons

• NoSQL systems are highly popular at the moment
  – Scale to truly massive amounts of data
  – Allow analytics on top via MapReduce/Hadoop
  – Can be very fast to retrieve data
• But they also have limitations
  – Systems still under development, hard to make use of
  – Some quite primitive: just provide data storage/retrieval
  – Currently have to write and debug code to implement applications
  – Can be overkill when your data is not massive

Summary

• Motivations for Distributed Databases
• Architectural choices for distributed databases
  – What is shared? How much replication?
• ACID/BASE (Basically Available, Soft-State, Eventually Consistent)
• Consistency, Availability, Partition tolerance: CAP
  – Pick any two
• NoSQL systems: key-value, column, document store

Recommended reading: Brewer’s PODC’00 Keynote
Chapter: “Distributed Databases” in Elmasri and Navathe