code camp2012
DESCRIPTION
This is the presentation that I gave at Silicon Valley Code Camp 2012. The deck covers various aspects of big data and the NoSQL solutions available to handle it.
TRANSCRIPT
Sanjeev Mishra SVCC 2012
Big Data and NoSQL Landscape
Sanjeev Mishra, Silicon Valley Code Camp 2012
Timeline
• 1970s – Genesis of the modern database
  o Modeling the world based on relational calculus: best for managing uniform data
• 1980s
  o RDBMS takes over the world
• 1990s – 2000+
  o Invention of HTML
  o Spread of Web-based technologies
Need for Modern Data Storage
• Amazon
  o Managing shopping carts, seller lists, customer preferences, sales rank, recommendations
• Google
  o Storing and managing web-scale data
• Facebook
  o Managing social graphs
• LinkedIn, Twitter and others
Data Explosion Current
• Every two days now we create as much information as we did from the dawn of civilization up until 2003 - about 5 exabytes (1 EB = 1K PB) of data: Eric Schmidt *
Data Explosion Future
• A telescope planned to be finished in 2024 will generate more data in a single day than the entire Internet.*
What is Big Data?
• Terabytes (TB) are not big data; petabytes (PB, 1 PB = 1000 TB) may be.
• Current definition of big data: zettabytes (1 ZB = 1M PB or 1G TB)
Nature of Big Data
Web 2.0 kind of data
• Different from traditional RDBMS/warehouse data (more reads, fewer updates)
• User-generated content: tweets, reviews, comments, etc.
• Lots of updates and lots of reads
• Scale to millions of users
• Not necessarily transactional
• Compromised consistency
Data Explosion, So What?
• Structural issues
  o The dynamic nature of data
• Performance issues
  o Insertion
  o Search
• Scaling horizontally
  o Dozens or hundreds of machines operating as a single server
What is NoSQL?
Not Only SQL or Not Relational
• Carlo Strozzi used the term in 1998, then Eric Evans in 2009
• Simple call-level interface (SQL not supported)
• Flexible schema
• Efficient use of distributed indexes
• Horizontal scaling of operations over many servers
• No ACID but BASE (Basically Available, Soft state*, Eventually consistent**)
CAP Theorem (Brewer’s Theorem)*
A distributed system can satisfy at most two of the following three guarantees at the same time:
  o Consistency (all nodes see the same data at the same time)
  o Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  o Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Eventual Consistency Flavors
• Causal consistency
  o Changes are notified through events; the receiving session will always see the updated value.
• Read your own writes
  o A session that updates the db will immediately see its own changes.
• Monotonic consistency*
  o Once a session reads a value, it will never see an earlier value.
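The session guarantees above can be sketched in a few lines. This is a minimal illustration, not how any particular store implements them: the `Replica` and `Session` classes and the integer version numbers are hypothetical. The session remembers the highest version it has read or written for each key and refuses to accept an answer from a replica that is behind that version, which gives both "read your own writes" and monotonic reads.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical replica holding one versioned value per key."""
    data: dict = field(default_factory=dict)  # key -> (version, value)

    def write(self, key, version, value):
        current_version, _ = self.data.get(key, (0, None))
        if version > current_version:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


class Session:
    """Tracks the highest version seen per key to enforce session guarantees."""

    def __init__(self):
        self.seen = {}  # key -> highest version this session has read or written

    def write(self, replica, key, version, value):
        replica.write(key, version, value)
        self.seen[key] = max(self.seen.get(key, 0), version)

    def read(self, replicas, key):
        # Accept the first replica that is at least as new as what we have seen.
        for replica in replicas:
            version, value = replica.read(key)
            if version >= self.seen.get(key, 0):
                self.seen[key] = version
                return value
        raise RuntimeError("no replica is fresh enough for this session")


fresh, stale = Replica(), Replica()
session = Session()
session.write(fresh, "cart", 1, ["book"])      # only `fresh` has the write so far
result = session.read([stale, fresh], "cart")  # `stale` is skipped
print(result)
```

A different session with an empty `seen` map would happily accept the stale replica's answer, which is exactly the window that eventual consistency leaves open.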
Consistency Tradeoffs
Where,
  o N is the number of copies of each data item that the db maintains
  o R is the number of copies that are read for each read
  o W is the number of copies that must be written for each write
• Most NoSQL systems use N>W>1: more than one write must complete, but not all nodes need to update immediately.
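A small worked check of how N, R and W interact. The rule used here, R + W > N, is not stated on the slide; it is the usual quorum condition from Dynamo-style systems, under which every read set overlaps every write set and so a read sees the latest completed write. The function name `quorums_overlap` is made up for this sketch.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


for n, r, w in [(3, 2, 2), (3, 1, 2), (3, 1, 3), (3, 1, 1)]:
    kind = "overlapping quorums" if quorums_overlap(n, r, w) else "eventual consistency only"
    print(f"N={n} R={r} W={w}: {kind}")
```

With N=3, the common N>W>1 choice of W=2 needs R=2 to get overlapping quorums; dropping to R=1 trades that away for faster reads.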
Column Vs Row Storage
Row vs. Column Oriented DB
Id | First name | Last name | SSN | DOB
1 | John | Doe | 111-222-3333 | 8/12/1968
2 | Jane | Doe | 111-332-3408 | 4/3/1972
Row oriented: 1, John, Doe, 111-222-3333, 8/12/1968, 2, Jane, Doe, 111-332-3408, 4/3/1972
Column oriented: 1, 2, John, Jane, Doe, Doe, 111-222-3333, 111-332-3408, 8/12/1968, 4/3/1972
Contrasting Operations on Row vs Col DB
Insert a new tuple
Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 2, Jane, Doe, 111-32-3408, 4/3/1972, 3, Foo, Bar, 237-23-3924, 2/3/1978
Column oriented: 1, 2, 3, John, Jane, Foo, Doe, Doe, Bar, 111-22-3333, 111-32-3408, 237-23-3924, 8/12/1968, 4/3/1972, 2/3/1978
Row vs. Column Oriented DB
Create a new attribute
Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 408-555-1212, 2, Jane, Doe, 111-32-3408, 4/3/1972, 650-555-2323
Column oriented: 1, 2, John, Jane, Doe, Doe, 111-22-3333, 111-32-3408, 8/12/1968, 4/3/1972, 408-555-1212, 650-555-2323
Row vs. Column Oriented DB
Operation | Row oriented | Column oriented
Get all who were born in a given year | Easy: just pick all rows where the year of DOB matches the given year | Not so simple: scan the years, remember the indexes of all occurrences that match the given year, and extract the other attributes based on these indexes
Get sum of all years | A little difficult: the data does not live consecutively, so scanning through the entire dataset is needed | Easy: the data is found consecutively
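The two layouts and the two queries above can be tried out with plain Python lists. This is only a sketch of the idea, not a storage engine: the variable names are invented, and the dates are read as month/day/year, which is an assumption about the slide's format.

```python
from datetime import date

records = [
    (1, "John", "Doe", "111-222-3333", date(1968, 8, 12)),
    (2, "Jane", "Doe", "111-332-3408", date(1972, 4, 3)),
]
names = ("id", "first", "last", "ssn", "dob")

# Row-oriented: all fields of one record sit next to each other.
row_store = [value for record in records for value in record]

# Column-oriented: all values of one attribute sit next to each other.
column_store = {name: [record[i] for record in records] for i, name in enumerate(names)}

# Aggregate over one attribute: a single contiguous list in the column store.
sum_of_years = sum(d.year for d in column_store["dob"])


def born_in(year):
    """Find matching positions in the dob column, then pull those positions from every column."""
    hits = [i for i, d in enumerate(column_store["dob"]) if d.year == year]
    return [tuple(column_store[name][i] for name in names) for i in hits]


print(row_store[:5])
print(sum_of_years)
print(born_in(1972))
```

Inserting a tuple appends one run of values to `row_store` but touches every list in `column_store`; adding an attribute is the reverse, a single new list in `column_store` but a rewrite of every record in `row_store`.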
Glossary
• Consistent Hashing (Cassandra, Dynamo)
  o The output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value)
• Vector Clock (Riak, Dynamo, Voldemort)
  o An algorithm for generating a partial ordering of events in a distributed system and detecting causality violations
• Quorum (Cassandra, Dynamo (sloppy))
• Merkle Tree (Cassandra, Riak, Dynamo)
  o A hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire data set
• Anti-Entropy Gossip Protocol (Cassandra, Dynamo)
  o Comparing all the replicas of each piece of data that exist and updating each replica to the newest version
• Order-preserving partitioning (Cassandra, MongoDB)
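To make the Merkle tree entry concrete, here is a small sketch of two replicas comparing their trees. It is an illustration under simplifying assumptions, not the layout any of the listed systems uses: the helper names are invented, SHA-256 is an arbitrary choice of hash, and an odd level is padded by repeating its last hash.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(items: dict) -> list:
    """Return the tree levels, leaves first; items maps key -> value (both str)."""
    level = [_h(f"{k}={v}".encode()) for k, v in sorted(items.items())] or [_h(b"")]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # pad an odd level by repeating its last hash
            level = level + [level[-1]]
        # Each parent is the hash of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
replica_b = {**replica_a, "k3": "stale"}
levels_a, levels_b = merkle_levels(replica_a), merkle_levels(replica_b)

roots_match = levels_a[-1] == levels_b[-1]
left_branch_match = levels_a[1][0] == levels_b[1][0]    # covers k1, k2
right_branch_match = levels_a[1][1] == levels_b[1][1]   # covers k3, k4
print(roots_match, left_branch_match, right_branch_match)
```

The roots differ, so the replicas know they are out of sync; the left branches match, so only the keys under the right branch need to be exchanged.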
Glossary
• MVCC
  o Multi-version concurrency control
• Atomicity
  o All or nothing
• Consistency
  o Each transaction leaves the db in a valid state
• Isolation
  o Concurrent execution of transactions results in the same state as if they had been executed serially
• Durability
  o Committed transactions remain so even in the event of power loss, crashes or errors
• WAL
  o Write-ahead logging: changes are written to a log before they are applied (Durability)
• Eventually consistent
  o Given a sufficiently long quiet period, all updates can be expected to propagate through the system and all replicas will be consistent
Glossary
• Sharding
  o Horizontal partitioning of data, storing records on different servers according to some key
• Tuple
  o Row in an RDBMS, predefined schema.
• Document
  o Contains nested documents or lists as well as scalar values. No predefined schema.
• Extensible Record
  o Hybrid between Tuple and Document: families of attributes are defined in a schema, but attributes can be added on a per-record basis.
• Key-value Stores
  o Store values indexed by a user-defined key.
• Document Stores
  o Indexed document store
• Extensible Record Stores, aka Wide Column Stores
  o Store extensible records partitioned vertically and horizontally across nodes.
NoSQL Categories
• Key-value Stores
  o Store values indexed by a user-defined key.
• Document Stores
  o Indexed document store
• Extensible Record Stores (Column Stores)
  o Store extensible records partitioned vertically and horizontally across nodes.
• Graph Databases
Key-Value Stores
Key-Value Stores
• A distributed cache/hashtable
  o Inspired by Amazon Dynamo
  o Like memcached with persistence, replication, versioning, locking, transactions, sorting, etc.
  o get/put and lookups
  o No secondary indices or keys
  o Values are BLOBs or in some cases JSON documents
  o Scalability through key distribution over nodes
Key-Value Stores
• Riak (Erlang/Basho/Apache)
• Membase (C+Erlang/Couchbase/Apache)
• Project Voldemort (Java/LinkedIn/Apache)
• Redis (C/VMWare/BSD)
• Scalaris (Erlang/Zuse+onScale/Apache)
• Tokyo Cabinet (C/Fal Labs/LGPL)
• Dynamo (Java/For Amazon internal use)
Other key-value / tuple stores are listed at http://nosql-database.org/
Amazon Dynamo
• KV store developed by Amazon to support
  o Best seller lists
  o Shopping carts
  o Customer preferences
  o Session management
  o Sales rank
  o Product catalog, etc.
• Variation of consistent-hashing-based data partitioning and replication
• Dynamic add/delete of storage nodes
• Each service uses a distinct instance of Dynamo
Amazon Dynamo Cont...
• Keys and values are opaque byte[]. ID = 128-bit MD5 hash of the key
• “Always writeable”: no updates are rejected due to failures or concurrent writes
• Simple read/write (get/put) operations on data uniquely identified by a key; the value is a binary object (BLOB)
  o get(key): returns a single object, or a list of conflicting versions, along with a context
  o put(key, context, object)
• Eventual consistency with no isolation guarantees
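The consistent-hashing idea behind Dynamo's partitioning can be sketched as follows. This is the plain textbook ring, not Dynamo's own variation (which adds virtual nodes and replication); the `HashRing` class and node names are invented for the example. Keys are placed on the ring by the 128-bit MD5 hash mentioned above, and each key belongs to the first node clockwise from its position.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """128-bit position on the ring, taken from the MD5 hash of the key."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def node_for(self, key):
        # First node clockwise from the key; wrap to the start past the largest hash.
        positions = [p for p, _ in self._ring]
        i = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")  # dynamic add of a storage node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Only the keys that fall on the new node's arc change owner; every other key stays where it was, which is what makes adding and removing storage nodes cheap compared with `hash(key) % node_count`.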
Riak
• Developed in Erlang by Basho
• Clients: Python, JavaScript, Java, PHP, Erlang
• Dynamo-inspired, open source
  o Advanced K/V store and
  o Document store (not a full-featured document store)
• Replication and sharding by primary key hash
  o Consistent hashing
  o Decentralized (no master node)
• Eventually consistent
  o Tunable number of replicas for read and write
  o Tunable per read and per write
  o Different parts of an application can choose different trade-offs
Project Voldemort
• Java based advanced Key/Value store
• Developed at LinkedIn
• Open source, Apache license
• Supports MVCC for updates
• Replicas are updated asynchronously - up-to-date view guaranteed if majority of replicas read
• Uses optimistic locking for consistent multi-record updates
• Versions are ordered based on Vector clocks
• More info: http://www.project-voldemort.com/voldemort/
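Ordering versions with vector clocks can be sketched in a few lines. This is a generic illustration, not Voldemort's actual classes: a clock is modeled as a plain dict from node name to counter, and the function names are invented. One version descends from another when it has seen at least as many updates from every node; if neither descends from the other, the versions are concurrent and the application has to reconcile them.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True when version a has seen everything version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # neither descends from the other: a conflict to reconcile


v1 = increment({}, "node-a")   # first write, handled by node-a
v2 = increment(v1, "node-a")   # later write through node-a
v3 = increment(v1, "node-b")   # write through node-b that only saw v1

print(compare(v2, v1))
print(compare(v2, v3))
```

Here `v2` simply supersedes `v1`, while `v2` and `v3` are siblings: both build on `v1` but neither has seen the other.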
Document Stores
Document Stores
• Data more complex than that in K/V stores
• Data encapsulated and encoded in
  o JSON, XML, YAML, BSON or some other standard format
• Multiple types of documents per database
  o Documents of similar type grouped together
  o Optional metadata/schema for the document
  o Less rigid schema than that of an RDBMS
• Nested documents or collections
• Secondary indexes
• Complex query/update support
  o Multiple attributes, collections, etc.
Document Example
{
"when": "2011-09-19T02:10:11.3Z",
"author": "alex",
"title": "No Free Lunch",
"text": "This is the text of the post. It could be very long.",
"tags": [ "business", "ramblings" ],
"votes": 5,
"voters": ["jane", "joe", "spencer", "phyllis", "li"],
"comments": [
{
"who": "jane",
"when": "2011-09-19T04:00:10.112Z",
"comment": "I agree."
},
{
"who": "meghan",
"when": "2011-09-20T14:36:06.958Z",
"comment": "You must be joking. etc etc ..."
}
]
}
Document Stores
• MongoDB (C++/10Gen/AGPL)
• Apache CouchDB (Erlang/Apache)
• Amazon SimpleDB (Erlang/Amazon)
• Terrastore (Java/Terracotta/Apache)
• RavenDB (C#/Hibernating Rhinos/AGPL)
Other document stores are listed at http://nosql-database.org/
MongoDB
MongoDB (huMongous)
• Document format: BSON (Binary JSON)
• Supports nested documents
• Documents are grouped in Collections
• Supports secondary indexes
• Scalability: auto sharding
• Consistency: tunable per request (WriteConcerns)
• Replication: replica sets, master-slave
• Atomicity: document level
MongoDB
SQL | MongoDB
Database | Database
Table | Collection
Index | Index
Row | Document
Column | Field
Join | Embedding or Linking
Primary Key | _id
SQL statement | MongoDB statement
create table users (name varchar(128), age number) | db.createCollection("users")
insert into users values ('bob', 32) | db.users.insert({name: "bob", age: 32})
select * from users | db.users.find()
select name, age from users | db.users.find({}, {name: 1, age: 1, _id: 0})
select name, age from users where age = 32 | db.users.find({age: 32}, {name: 1, age: 1, _id: 0})
select * from users order by name asc | db.users.find().sort({name: 1})
select * from users limit 10 offset 20 | db.users.find().skip(20).limit(10)
select distinct name from users | db.users.distinct("name")
select count(*) from users | db.users.count()
update users set age = 33 where name = 'bob' | db.users.update({name: "bob"}, {$set: {age: 33}}, false, true)
delete from users where name = 'bob' | db.users.remove({name: "bob"})
Data Types
String, Integer, Boolean, Double, Null, Array, Object, ObjectId, Binary, Regex, Code
Extensible Record Stores
aka Column Stores
Extensible Record Stores (Column Stores)
• Motivated by Google BigTable
• Basic data model: rows and columns
• Scale by splitting rows and columns over multiple nodes
  o Rows are split by sharding on the primary key, by range rather than by hash function
  o Columns are split by column groups
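The difference between splitting rows by key range and by hash function can be seen in a short sketch. The split points, shard count and function names below are hypothetical and chosen only for illustration. With range sharding, neighbouring keys land on the same shard, so a scan over a key range touches few nodes; with hash sharding, placement depends only on the hash, so neighbouring keys are generally scattered.

```python
import bisect
import hashlib

# Hypothetical split points: shard 0 holds keys < "h", shard 1 holds "h" <= key < "q",
# shard 2 holds everything from "q" upward.
SPLIT_POINTS = ["h", "q"]


def range_shard(key: str) -> int:
    """Shard chosen by where the key falls between the sorted split points."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hash_shard(key: str, shards: int = 3) -> int:
    """Shard chosen by a hash of the key, ignoring key order."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % shards


keys = ["adams", "baker", "clark", "jones", "smith"]
by_range = [range_shard(k) for k in keys]
by_hash = [hash_shard(k) for k in keys]
print(by_range)
print(by_hash)
```

A scan for every key starting with "a" through "c" needs only shard 0 under range sharding; the price is that a burst of writes to adjacent keys also lands on a single shard.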
Extensible Record Stores
• Cassandra (Java/Facebook/Apache)
  o Marriage of Dynamo and BigTable
• HBase (Java/Yahoo/Apache)
  o Inspired by BigTable, uses HDFS for storage
• Hypertable (C++/Zvents/GPL)
  o Similar to HBase/BigTable
• Accumulo (Java/NSA/Apache)
  o Uses Hadoop, ZooKeeper, and Thrift; cell-level access control
• Google BigTable (internal to Google)
Other wide column stores are listed at http://nosql-database.org/
Cassandra
Cassandra Features
• Decentralized
  o Data is distributed across a cluster of nodes
  o No master, any node can serve any request
  o No single point of failure
• Fault-tolerant (configurable replication strategies)
  o Simple Strategy: the first replica is placed by the partitioner, the rest on the next nodes clockwise around the ring
  o Network Topology Strategy: multi-datacenter strategy
Cassandra Features Cont…
• Failure detection and recovery
  o Based on the Gossip protocol
  o Node state is updated based on the gossip message version
  o Per-node heartbeat threshold
• Tunable consistency
  o Can be configured per read/write
Cassandra
SQL | Cassandra
Database | Keyspace
Table | Column Family
Index | Index
Row | Row
Column | Column
Join | (not supported)
Primary Key | Primary Key
SQL statement | Cassandra QL statement
create database codecamp | CREATE KEYSPACE codecamp WITH strategy_class = 'NetworkTopologyStrategy' AND strategy_options:DC1 = 3
create table users (key varchar(128), name varchar(128), age number) | CREATE COLUMNFAMILY users (key varchar PRIMARY KEY, name varchar, age int)
create index idx_name ON users(name) | CREATE INDEX idx_name ON users(name)
insert into users values ('bob', 'Bob', 32) | INSERT INTO users (KEY, name, age) VALUES ('jdoe', 'Jane Doe', 39)
select name, age from users where age > 30 | SELECT name, age FROM users WHERE age > 30
update users set age = 35 where key = 'bob' | UPDATE users SET age = 35 WHERE KEY = 'bob'
delete from users where key = 'bob' | DELETE FROM users WHERE KEY = 'bob'; DELETE age FROM users WHERE KEY = 'alice' (deletes a single column)
drop table users | DROP COLUMNFAMILY users
drop database codecamp | DROP KEYSPACE codecamp
Data Types
ascii, int, float, decimal, boolean, bigint, double, varchar, counter, timestamp, uuid, text, blob, varint
Cassandra Column and Column Family
Column
  o name: byte[]
  o value: byte[]
  o timestamp
Row
  o A Row Key followed by its Columns
ColumnFamily (example)
Row Key | Column | Column | Column
jdoe | name: "userid", value: "jdoe", timestamp: … | name: "name", value: "Jane Doe", timestamp: … | name: "age", value: 33, timestamp: …
ladams | name: "userid", value: "ladams", timestamp: … | name: "name", value: "Larry Adam", timestamp: … | name: "age", value: 47, timestamp: …
bdole | name: "userid", value: "bdole", timestamp: … | name: "name", value: "Bob Dole", timestamp: … | name: "age", value: 67, timestamp: …
Super Column
  o name: byte[]
  o value: collection of Columns
Super Column (example)
Column | name: "userid", value: "jdoe", timestamp: …
Super Column | name: homeaddress, value:
  o name: "street", value: "555 Homestead Rd", timestamp: …
  o name: "city", value: "Sunnyvale", timestamp: …
  o name: "zip", value: "95051", timestamp: …
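The column / row / column family nesting can be modeled with a few Python objects. This is a toy in-memory sketch, not Cassandra's storage format: the `Column` and `ColumnFamily` classes are invented, names and values are ordinary Python objects rather than byte[], and timestamps are small integers. The one behavior it adds beyond the slide is the general Cassandra rule that, when two writes hit the same column, the one with the newer timestamp wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: int  # supplied by the writer; larger means newer


class ColumnFamily:
    """Hypothetical in-memory model: row key -> {column name -> Column}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        existing = row.get(column.name)
        # Keep whichever write carries the newer timestamp.
        if existing is None or column.timestamp > existing.timestamp:
            row[column.name] = column

    def get(self, row_key, name):
        return self.rows[row_key][name].value


users = ColumnFamily()
users.insert("jdoe", Column("name", "Jane Doe", timestamp=1))
users.insert("jdoe", Column("age", 33, timestamp=1))
users.insert("ladams", Column("name", "Larry Adam", timestamp=1))  # no "age" column
users.insert("jdoe", Column("age", 34, timestamp=3))
users.insert("jdoe", Column("age", 30, timestamp=2))  # older write arrives late, ignored

print(users.get("jdoe", "age"))
print(sorted(users.rows["ladams"]))
```

Note that the two rows carry different sets of columns; nothing in the column family forces them to match.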
Cassandra Keyspace
Analogous to a database in an RDBMS
• Contains one or more Column Families, analogous to tables in an RDBMS
• A Column Family contains columns
• A Row Key identifies a set of related columns
• Rows are not required to have the same set of columns
• No join between two column families:
  o Each column family is self-contained to serve a query
  o A rule of thumb: one column family per query for better performance
• Replication is controlled on a per-keyspace basis
Cassandra In Enterprise
• Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and many more
• The largest Cassandra cluster has over 300 TB of data on over 400 machines
HBase
• Design influenced by Google BigTable
• A type of NoSQL: more a data store than a database; it lacks many RDBMS features such as typed columns, secondary indexes, triggers, and an advanced query language
• Built on top of HDFS: data is stored in HDFS as indexed “StoreFiles”
• Strongly consistent reads/writes, not “eventually consistent”: suitable for counter aggregation
• Auto sharding
• Automatic Region Server failover
• Out-of-the-box support for Hadoop/HDFS
• Can be used as a source and/or sink for MapReduce
• Java, Thrift and REST clients
• Supports Block Cache and Bloom Filters for high-volume query optimization
• Web management tool and JMX support
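For readers who have not met Bloom filters, here is a minimal sketch of the data structure. It is a generic textbook version, not HBase's implementation: the class, its default sizes and the SHA-256 based hashing are assumptions made for the example. The filter answers "is this key possibly in the file?" from a small bit array, so a store can skip reading files that definitely do not contain the requested row.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


row_keys = ["jdoe", "ladams", "bdole"]
bloom = BloomFilter()
for key in row_keys:
    bloom.add(key)

print(all(bloom.might_contain(key) for key in row_keys))
```

A key that was added is never reported absent; a key that was never added is usually, but not always, reported absent, and the false-positive rate falls as the bit array grows.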
NoSQL Growth Trends
Big Data and NoSQL Landscape