Big Data and NoSQL Landscape

Sanjeev Mishra, Silicon Valley Code Camp (SVCC) 2012


Timeline

• 1970s – Genesis of the modern database
o Modeling the world based on relational calculus: best for managing uniform data
• 1980s
o RDBMS takes over the world
• 1990s – 2000+
o Invention of HTML
o Spread of Web-based technologies


Need for Modern Data Storage

• Amazon
o Managing shopping carts, seller lists, customer preferences, sales rank, recommendations
• Google
o Storing and managing web-scale data
• Facebook
o Managing social graphs
• LinkedIn, Twitter and others


Data Explosion Current

• Every two days we now create as much information as we did from the dawn of civilization up until 2003, about 5 exabytes (1 EB = 1K PB) of data: Eric Schmidt *


Data Explosion Future

• A telescope planned to be finished in 2024 will generate more data in a single day than the entire Internet.*


What is Big Data?

• Terabytes (TB) are not big data; petabytes (PB, 1,000 TB each) may be.

• Current definition of big data: zettabytes (1 ZB = 1M PB = 1G TB)
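The unit arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, and the helper name `convert` is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: powers of 10, not 2).
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

def convert(amount, from_unit, to_unit):
    """Convert an amount from one unit to another; units are given as byte counts."""
    return amount * from_unit // to_unit

print(convert(1, PB, TB))  # terabytes in one petabyte
print(convert(1, ZB, PB))  # petabytes in one zettabyte
print(convert(1, ZB, TB))  # terabytes in one zettabyte
```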


Nature of Big Data

Web 2.0 kind of data

• Different from traditional RDBMS/warehouse data, which sees more reads and fewer updates
• User-generated content – tweets, reviews, comments, etc.
• Lots of updates and lots of reads
• Scale to millions of users
• Not necessarily transactional
• Compromised consistency


Data Explosion, So What?

• Structural issues
o The dynamic nature of data
• Performance issues
o Insertion
o Search
• Scaling horizontally
o Dozens or hundreds of machines operating as a single server


What is NoSQL?

Not Only SQL or Not Relational

• Carlo Strozzi used the term in 1998, and then Eric Evans in 2009
• Simple call-level interface (SQL not supported)
• Flexible schema
• Efficient use of distributed indexes
• Horizontal scaling of operations over many servers
• No ACID but BASE (Basically Available, Soft state*, Eventually consistent**)


CAP Theorem (Brewer’s Theorem)*

A distributed system can satisfy at most two of the following three guarantees at the same time

o Consistency (all nodes see the same data at the same time)

o Availability (a guarantee that every request receives a response about whether it was successful or failed)

o Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)


Eventual Consistency Flavors

• Causal consistency
o Changes are notified through events; the receiving session will always see the updated value.
• Read your own writes
o A session that updates the db will immediately see its own changes.
• Monotonic consistency*
o Once a session reads a value, it will never see an earlier value.


Consistency Tradeoffs

Where
o N is the number of copies of each piece of data that the db maintains
o R is the number of copies read for each read
o W is the number of copies that must be written for each write

• Most NoSQL stores use N>W>1: more than one write must complete, but not all nodes need to update immediately.
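A small sketch of how N, R and W interact. The slide does not state it, but the usual rule of thumb (an assumption here, taken from Dynamo-style systems) is that a read is guaranteed to touch at least one up-to-date copy when R + W > N. The function names below are illustrative; the brute-force check simply enumerates every possible read set and write set.

```python
from itertools import combinations

def quorum_overlaps(n, r, w):
    """Rule of thumb: read and write sets must intersect when R + W > N."""
    return r + w > n

def always_overlap(n, r, w):
    """Brute force: does every choice of R replicas share a node with every choice of W?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

for n, r, w in [(3, 2, 2), (3, 1, 2), (5, 3, 3), (5, 1, 3)]:
    label = "read sees latest write" if quorum_overlaps(n, r, w) else "read may be stale"
    print(f"N={n} R={r} W={w}: {label}")
```

With N=3, choosing W=2 and R=2 gives overlapping quorums, while W=2 and R=1 trades that guarantee for faster reads.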


Column Vs Row Storage


Row vs. Column Oriented DB

Id | First name | Last name | SSN | DOB
1 | John | Doe | 111-22-3333 | 8/12/1968
2 | Jane | Doe | 111-32-3408 | 4/3/1972

Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 2, Jane, Doe, 111-32-3408, 4/3/1972

Column oriented: 1, 2, John, Jane, Doe, Doe, 111-22-3333, 111-32-3408, 8/12/1968, 4/3/1972


Contrasting Operations on Row vs Col DB

Insert a new tuple
o Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 2, Jane, Doe, 111-32-3408, 4/3/1972, 3, Foo, Bar, 237-23-3924, 2/3/1978
o Column oriented: 1, 2, 3, John, Jane, Foo, Doe, Doe, Bar, 111-22-3333, 111-32-3408, 237-23-3924, 8/12/1968, 4/3/1972, 2/3/1978

Create a new attribute
o Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 408-555-1212, 2, Jane, Doe, 111-32-3408, 4/3/1972, 650-555-2323
o Column oriented: 1, 2, John, Jane, Doe, Doe, 111-22-3333, 111-32-3408, 8/12/1968, 4/3/1972, 408-555-1212, 650-555-2323

Get all who were born in a given year
o Row oriented: easy, just pick all rows where the year of DOB matches the given year
o Column oriented: not so simple, scan the years, remember the indexes of all occurrences that match the given year, and extract based on these indexes

Get sum of all years
o Row oriented: a little difficult, the data does not live consecutively, so scanning through the entire dataset is needed
o Column oriented: easy, the data is found consecutively
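The operations contrasted above can be sketched with plain Python lists. This is an in-memory illustration only, not how any particular database lays out its pages; the names `row_store`, `col_store` and the helper functions are made up for the example.

```python
FIELDS = ("id", "first", "last", "ssn", "dob")

people = [
    (1, "John", "Doe", "111-22-3333", "8/12/1968"),
    (2, "Jane", "Doe", "111-32-3408", "4/3/1972"),
]

# Row oriented: one complete tuple after another.
row_store = list(people)

# Column oriented: one list per attribute; position i belongs to tuple i.
col_store = {name: [row[i] for row in people] for i, name in enumerate(FIELDS)}

def year(dob):
    """Extract the year from an M/D/YYYY string."""
    return int(dob.rsplit("/", 1)[1])

def insert_tuple(new_row):
    row_store.append(new_row)                 # row store: one append
    for name, value in zip(FIELDS, new_row):  # column store: one append per column
        col_store[name].append(value)

def born_in_rows(y):
    return [row for row in row_store if year(row[4]) == y]

def born_in_cols(y):
    # Scan only the DOB column, remember matching indexes, then extract by index.
    hits = [i for i, dob in enumerate(col_store["dob"]) if year(dob) == y]
    return [tuple(col_store[name][i] for name in FIELDS) for i in hits]

def sum_years_rows():
    return sum(year(row[4]) for row in row_store)      # touches every tuple

def sum_years_cols():
    return sum(year(dob) for dob in col_store["dob"])  # touches one column

insert_tuple((3, "Foo", "Bar", "237-23-3924", "2/3/1978"))
print(born_in_cols(1972))
print(sum_years_cols())
```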


Glossary

• Consistent Hashing (Cassandra, Dynamo)
o The output range of a hash function is treated as a fixed circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value)
• Vector Clock (Cassandra, Riak, Dynamo)
o An algorithm for generating a partial ordering of events in a distributed system and detecting causality violations
• Quorum (Cassandra, Dynamo (sloppy))
• Merkle Tree (Cassandra, Riak, Dynamo)
o A hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire data set
• Anti-Entropy Gossip Protocol (Cassandra, Dynamo)
o Comparing all the replicas of each piece of data that exist and updating each replica to the newest version
• Order preserving partitioning (Cassandra, MongoDB)
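To make the Merkle tree entry concrete, here is a minimal sketch of computing a root hash and comparing two replicas branch by branch. SHA-256 and the rule of duplicating the last hash on odd-sized levels are assumptions chosen for the example; real systems differ in both.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(values):
    """Root hash of a tree whose leaves are hashes of the individual values."""
    level = [h(v) for v in values]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:            # odd count: pair the last hash with itself
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v1", b"v2", b"v3-stale", b"v4"]

# Roots differ, so the replicas are out of sync somewhere.
print(merkle_root(replica_a) == merkle_root(replica_b))
# The left branches match, so only the right branch needs to be exchanged.
print(merkle_root(replica_a[:2]) == merkle_root(replica_b[:2]))
print(merkle_root(replica_a[2:]) == merkle_root(replica_b[2:]))
```

Only hashes travel between replicas until a differing branch is found, which is the advantage the glossary entry describes.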


Glossary

• MVCC
o Multi-version concurrency control
• Atomicity
o All or nothing
• Consistency
o Each transaction leaves the db in a valid state
• Isolation
o Concurrent execution of transactions results in the same state as if the transactions were executed serially
• Durability
o Committed transactions remain so even in the event of power loss, crashes or errors
• WAL
o Write-ahead logging – changes are written to a log before they are applied (Durability)
• Eventually consistent
o Given a sufficiently long quiet period, all updates can be expected to propagate through the system, and all replicas will be consistent
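The write-ahead logging entry above can be illustrated with a toy store. This is a sketch under a big simplification: the log is a Python list standing in for an append-only file that would be flushed to disk, and `TinyStore` is a hypothetical name.

```python
class TinyStore:
    def __init__(self, log=None):
        # The list stands in for an append-only log file on disk (assumption).
        self.log = log if log is not None else []
        self.data = {}
        for key, value in self.log:   # recovery: replay the log in order
            self.data[key] = value

    def put(self, key, value):
        self.log.append((key, value))  # 1. record the change in the log first
        self.data[key] = value         # 2. only then apply it to the data

store = TinyStore()
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)

# Simulate a crash: the in-memory data is lost, the log survives.
recovered = TinyStore(log=list(store.log))
print(recovered.data)
```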


Glossary

• Sharding
o Horizontal partitioning of data, storing records on different servers according to some key
• Tuple
o Row in RDBMS, predefined schema.
• Document
o Contains nested documents or lists as well as scalar values. No predefined schema.
• Extensible Record
o Hybrid between Tuple and Document: families of attributes are defined in a schema, but attributes can be added on a per-record basis.
• Key-value Stores
o Store values indexed by a user-defined key.
• Document Stores
o Indexed document store
• Extensible Record Stores aka Wide Column Stores
o Store extensible records partitioned vertically and horizontally across nodes.


NoSQL Categories

• Key-value Stores
o Store values indexed by a user-defined key.
• Document Stores
o Indexed document store
• Extensible Record Stores (Column Stores)
o Store extensible records partitioned vertically and horizontally across nodes.
• Graph Databases


Key-Value Stores


Key-Value Stores

• A distributed cache/hashtable
o Inspired by Amazon Dynamo
o Like memcached with persistence, replication, versioning, locking, transactions, sorting etc.
o get/put and lookups
o No secondary indices or keys
o Values are BLOBs or in some cases JSON documents
o Scalability through key distribution over nodes


Key-Value Stores

• Riak (Erlang/Basho/Apache)

• Membase (C+Erlang/Couchbase/Apache)

• Project Voldemort (Java/LinkedIn/Apache)

• Redis (C/VMWare/BSD)

• Scalaris (Erlang/Zuse+onScale/Apache)

• Tokyo Cabinet (C/Fal Labs/LGPL)

• Dynamo (Java/For Amazon internal use)

Other key-value/tuple stores are listed at http://nosql-database.org/


Amazon Dynamo

• KV store developed by Amazon to support
o Best seller lists
o Shopping carts
o Customer preferences
o Session management
o Sales rank
o Product catalog etc...

• Variation of Consistent Hashing based Data Partitioning and Replication

• Dynamic add/delete of Storage Nodes

• Each service uses distinct instance of Dynamo
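A minimal sketch of consistent-hashing-based partitioning, showing why storage nodes can be added dynamically: when a node joins, only the keys that now fall to it change owner. MD5 is used here for illustration, the node and key names are hypothetical, and refinements that production systems such as Dynamo add (virtual nodes, replication to successor nodes) are deliberately left out.

```python
import bisect
import hashlib

def ring_position(key):
    """Map a string onto the ring as a 128-bit integer taken from its MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

class Ring:
    def __init__(self, nodes):
        self._points = sorted((ring_position(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._points, (ring_position(node), node))

    def owner(self, key):
        positions = [p for p, _ in self._points]
        # First node at or after the key's position; past the end, wrap to index 0.
        i = bisect.bisect_left(positions, ring_position(key)) % len(positions)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
keys = ["cart:%d" % i for i in range(200)]

before = {k: ring.owner(k) for k in keys}
ring.add("node-d")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed owner")
```

With a plain `hash(key) % number_of_nodes` scheme, by contrast, adding a node would reassign most keys.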


Amazon Dynamo Cont...

• Key/Value are opaque byte[]. ID= 128-bit MD5 hash of the Key

• “always writeable” where no updates are rejected due to failures or concurrent writes

• Simple read/write (get/put) operations on data uniquely identified by a key; the value is a binary object (BLOB)
o get(key): returns a single object, or a list of conflicting versions, along with a context
o put(key, context, object)

• Eventual consistency with no isolation guarantees


RIAK

• Developed in Erlang by Basho
• Clients: Python, JavaScript, Java, PHP, Erlang
• Dynamo-inspired, open source
o Advanced K/V and
o Document store (not a full-featured document store)
• Replication and sharding by primary key hash
o Consistent hashing
o Decentralized (no master node)
• Eventually consistent
o Tunable number of replicas for read and write
o Tunable per read and per write
o Different parts of an application can choose different trade-offs


Project Voldemort

• Java based advanced Key/Value store

• Developed at LinkedIn

• Open source, Apache license

• Supports MVCC for updates

• Replicas are updated asynchronously - up-to-date view guaranteed if majority of replicas read

• Uses optimistic locking for consistent multi-record updates

• Versions are ordered based on Vector clocks

• More info: http://www.project-voldemort.com/voldemort/
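Ordering versions by vector clocks can be sketched as follows. This is a generic illustration of the technique, not Voldemort's actual classes; clocks are modeled as plain dictionaries from node name to counter, and the function names are made up.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock

def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"   # neither saw the other: a conflict to reconcile

v1 = increment({}, "A")    # written via node A
v2 = increment(v1, "B")    # v1 updated via node B
v3 = increment(v1, "C")    # v1 updated independently via node C

print(compare(v2, v1))
print(compare(v2, v3))
```

Here `v2` supersedes `v1`, while `v2` and `v3` are concurrent, so a store would keep both versions until a client reconciles them.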



Document Stores


Document Stores

• Data more complex than that in K/V stores
• Data encapsulated and encoded in
o JSON, XML, YAML, BSON or some other standard format
• Multiple types of documents per database
o Documents of similar type grouped together
o Optional metadata/schema for the document
o Less rigid schema than that of RDBMS
• Nested documents or collections
• Secondary indexes
• Complex query/update support
o Multiple attributes, collections etc.


Document Example

{
  "when": "2011-09-19T02:10:11.3Z",
  "author": "alex",
  "title": "No Free Lunch",
  "text": "This is the text of the post. It could be very long.",
  "tags": [ "business", "ramblings" ],
  "votes": 5,
  "voters": ["jane", "joe", "spencer", "phyllis", "li"],
  "comments": [
    {
      "who": "jane",
      "when": "2011-09-19T04:00:10.112Z",
      "comment": "I agree."
    },
    {
      "who": "meghan",
      "when": "2011-09-20T14:36:06.958Z",
      "comment": "You must be joking. etc etc ..."
    }
  ]
}
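A short sketch of why nested documents are convenient to work with: once parsed, lists and sub-documents are reachable without joins. The snippet uses a trimmed copy of the example document and only the Python standard library; it does not talk to any document store.

```python
import json

# Trimmed copy of the example document (some fields omitted for brevity).
post = json.loads("""
{
  "author": "alex",
  "title": "No Free Lunch",
  "tags": ["business", "ramblings"],
  "votes": 5,
  "comments": [
    {"who": "jane", "comment": "I agree."},
    {"who": "meghan", "comment": "You must be joking. etc etc ..."}
  ]
}
""")

# Nested lists and sub-documents are plain Python lists and dicts after parsing.
commenters = [c["who"] for c in post["comments"]]
is_business_post = "business" in post["tags"]

print(commenters)
print(is_business_post)
```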


Document Stores

• MongoDB (C++/10gen/AGPL)

• Apache CouchDB (Erlang/Apache)

• Amazon SimpleDB (Erlang/Amazon)

• Terrastore (Java/Terracotta/Apache)

• RavenDB (C#/Hibernating Rhinos/AGPL)

Other document stores are listed at http://nosql-database.org/


MongoDB


MongoDB huMongous

• Document format: BSON (Binary JSON)
• Supports nested documents
• Documents are grouped in Collections
• Supports secondary indexes
• Scalability – auto sharding
• Consistency – tunable based on request (WriteConcerns)
• Replication – replica set – master–slave
• Atomicity – document level


MongoDB

SQL | MongoDB
Database | Database
Table | Collection
Index | Index
Row | Document
Column | Field
Join | Embedding or Linking
Primary Key | _id

SQL | MongoDB
create table users (name varchar(128), age number) | db.createCollection("users")
insert into users values ('bob', 32) | db.users.insert({name: "bob", age: 32})
select * from users | db.users.find()
select name, age from users | db.users.find({}, {name: 1, age: 1, _id: 0})
select name, age from users where age = 32 | db.users.find({age: 32}, {name: 1, age: 1, _id: 0})
select * from users order by name asc | db.users.find().sort({name: 1})
select * from users limit 10 offset 20 | db.users.find().skip(20).limit(10)
select distinct name from users | db.users.distinct("name")
select count(*) from users | db.users.count()
update users set age = 33 where name = 'bob' | db.users.update({name: "bob"}, {$set: {age: 33}}, false, true)
delete from users where name = 'bob' | db.users.remove({name: "bob"})

Data Type

String, Integer, Boolean, Double, Null, Array, Object, ObjectId, Binary, Regex, Code


Extensible Record Stores

aka Column Stores


Extensible Record Stores Column Stores

• Motivated by Google BigTable
• Basic data model – rows and columns
• Scale by splitting rows and columns over multiple nodes
o Rows split by sharding on primary key – split by range rather than hash function
o Columns split by column groups
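Splitting rows by key range rather than by hash can be sketched as below. The split points and node names are hypothetical; the point of the example is that a range scan only needs to visit the nodes whose ranges overlap the query, which a hash-based split cannot offer.

```python
import bisect

# Hypothetical split points: node-0 holds keys < "g", node-1 holds "g" <= key < "n",
# node-2 holds "n" <= key < "t", node-3 holds keys >= "t".
SPLIT_POINTS = ["g", "n", "t"]
NODES = ["node-0", "node-1", "node-2", "node-3"]

def shard_for(row_key):
    """Node responsible for a single row key."""
    return NODES[bisect.bisect_right(SPLIT_POINTS, row_key)]

def shards_for_range(start, end):
    """Nodes that must be contacted to scan start <= key <= end."""
    first = bisect.bisect_right(SPLIT_POINTS, start)
    last = bisect.bisect_right(SPLIT_POINTS, end)
    return NODES[first:last + 1]

print(shard_for("alice"))
print(shards_for_range("h", "m"))
print(shards_for_range("a", "p"))
```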


Extensible Record Stores

• Cassandra (Java/Facebook/Apache)
o Marriage of Dynamo and BigTable
• HBase (Java/Yahoo/Apache)
o Inspired by BigTable, uses HDFS for storage
• HyperTable (C++/Zvents/GPL)
o Similar to HBase/BigTable
• Accumulo (Java/NSA/Apache)
o Uses Hadoop, ZooKeeper, and Thrift; cell-level access control
• Google BigTable (internal to Google)

Other wide column stores are listed at http://nosql-database.org/


Cassandra


Cassandra Features

• Decentralized
o Data is distributed across a cluster of nodes
o No master, any node can address any request
o No single point of failure
• Fault-tolerant (configurable replication strategies)
o Simple Strategy (first replica determined by the partitioner, the rest placed on the next nodes clockwise)
o Network Topology Strategy: multi-datacenter strategy


Cassandra Features Cont…

• Failure detection and recovery
o Based on Gossip protocol
o Node state updated based on gossip message version
o Per-node heartbeat threshold
• Tunable consistency
o Can be configured per read/write


Cassandra

SQL | Cassandra
Database | Keyspace
Table | Column Family
Index | Index
Row | Row
Column | Column
Join | (no equivalent)
Primary Key | Primary Key

SQL | Cassandra QL
create database codecamp | CREATE KEYSPACE codecamp WITH strategy_class = 'NetworkTopologyStrategy' AND strategy_options:DC1 = 3
create table users (key varchar(128), name varchar(128), age number) | CREATE COLUMNFAMILY users (key varchar PRIMARY KEY, name varchar, age int)
create index idx_name on users(name) | CREATE INDEX idx_name ON users(name)
insert into users values ('bob', 'Bob', 32) | INSERT INTO users (KEY, name, age) VALUES ('jdoe', 'Jane Doe', 39)
select name, age from users where age > 30 | SELECT name, age FROM users WHERE age > 30
update users set age = 35 where key = 'bob' | UPDATE users SET age = 35 WHERE KEY = 'bob'
delete from users where key = 'bob' | DELETE FROM users WHERE KEY = 'bob'
(delete a single column) | DELETE age FROM users WHERE KEY = 'alice'
drop table users | DROP COLUMNFAMILY users
drop database codecamp | DROP KEYSPACE codecamp

Data Type

ascii, int, float, decimal, boolean, bigint, double, varchar, counter, timestamp, uuid, text, blob, varint


Cassandra Column and Column Family

Column
o name: byte[]
o value: byte[]
o timestamp

Row
o Row Key = Column, Column, Column

ColumnFamily
o jdoe = {name: "userid", value: "jdoe", timestamp: …}, {name: "name", value: "Jane Doe", timestamp: …}, {name: "age", value: 33, timestamp: …}
o ladams = {name: "userid", value: "ladams", timestamp: …}, {name: "name", value: "Larry Adam", timestamp: …}
o bdole = {name: "userid", value: "bdole", timestamp: …}, {name: "name", value: "Bob Dole", timestamp: …}

Super Column
o Name: byte[]
o Value: Collection of Columns

Example
o Column: name: "userid", value: "jdoe", timestamp:
o Super Column: name: homeaddress, value:
  {name: "street", value: "555 Homestead Rd", timestamp: …}
  {name: "city", value: "Sunnyvale", timestamp: …}
  {name: "zip", value: "95051", timestamp: …}
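The column / row / column family structure above can be modeled with nested dictionaries. This is a toy in-memory model with made-up class and method names, not Cassandra's storage engine. It assumes that when two writes target the same column, the one with the newer timestamp wins, which is how Cassandra reconciles column values.

```python
from collections import namedtuple

Column = namedtuple("Column", "name value timestamp")

class ColumnFamily:
    def __init__(self):
        self.rows = {}   # row key -> {column name -> Column}

    def insert(self, row_key, name, value, timestamp):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(name)
        # Keep the write with the newest timestamp (assumed reconciliation rule).
        if current is None or timestamp >= current.timestamp:
            columns[name] = Column(name, value, timestamp)

    def get(self, row_key, name):
        return self.rows[row_key][name].value

users = ColumnFamily()
users.insert("jdoe", "name", "Jane Doe", timestamp=1)
users.insert("jdoe", "age", 33, timestamp=2)
users.insert("ladams", "name", "Larry Adam", timestamp=1)
users.insert("jdoe", "age", 32, timestamp=1)   # older write arrives late and is ignored

print(users.get("jdoe", "age"))
print(sorted(users.rows["ladams"]))   # rows need not share the same set of columns
```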


Cassandra Keyspace

Analogous to database in RDBMS

• Contains one or more Column Families, analogous to tables in RDBMS
• Column Family contains columns
• A Row Key identifies a set of related columns
• A Row is not required to have the same set of columns
• No join between two column families:
o Each column family is self-contained to serve a query
o A rule of thumb: one column family per query for better performance

• Replication is controlled on per-keyspace basis


Cassandra in the Enterprise

• Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Rackspace, Ooyala, and many more

• The largest Cassandra cluster has over 300 TB of data in over 400 machines


HBase

• Design influenced by Google BigTable
• A type of NoSQL – more a data store than a database; lacks many RDBMS features such as typed columns, secondary indexes, triggers, advanced query language etc.
• Built on top of HDFS: data is stored in HDFS as indexed "StoreFiles"
• Strongly consistent reads/writes, not "eventually consistent" – suitable for counter aggregation
• Auto sharding
• Automatic Region Server failover
• Out-of-the-box support for Hadoop/HDFS
• Can be used as source and/or sink for MapReduce
• Java, Thrift/REST clients
• Supports Block Cache and Bloom Filters for high-volume query optimization
• Web management tool and JMX support
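The Bloom filters mentioned in the list above answer "might this file contain the key?" cheaply, so files that cannot contain it are skipped. The sketch below is a generic Bloom filter, not HBase's implementation; the bit-array size, the number of hashes and the use of SHA-256 are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

store_file_keys = BloomFilter()
for i in range(100):
    store_file_keys.add(f"row-{i}")

print(store_file_keys.might_contain("row-42"))
```

A filter never reports a stored key as absent; it may occasionally report an absent key as present, which only costs an unnecessary file read.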


NoSQL Growth Trends

