no sql landscape_nosqltips

The NoSQL Landscape

Objective – A reasonable understanding of non-relational (NoSQL) data stores and how they relate to the relational databases (RDBMS) we are all used to working with.

About Me

Chief Architect – youwho.com

Former dot com CTO

NoSQL advocate

nosqltips.blogspot.com

@nosqltips on Twitter

Agenda

What is NoSQL?

Landscape

Vocabulary and concepts

CAP Theorem

SQL vs NoSQL comparison

Overview of each type w/ examples

Question and Answer

Vocabulary

CAP Theorem – Consistency, Availability, Partition tolerance

ACID – Atomic, Consistent, Isolated, Durable

BASE – Basically Available, Soft state, Eventually consistent

RDF – Resource Description Framework

Sharding – Partitioning, distributed

Web Scale – Google, Twitter, Facebook, etc.

CAP Tuning

NRW

N: Number of Data Copies

R: Read Quorum

W: Write Quorum

Hard Consistency – RDBMS

Soft Consistency – No Guarantees

Eventual Consistency – Most NoSQL

CAP Tuning Chart

NRW | Outcome
N=3 | Magic number of data replicas
W=N, R=1 | Read optimized – strong consistency
W=1, R=N | Write optimized – strong consistency
W+R > N | Strong consistency on read and write
W+R <= N | Weak (eventual) consistency; a read may not see the latest data
N > W > 1 | Eventual consistency – most NoSQL data stores live here
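As a rough illustration of the chart above, the overlap rule W+R > N can be checked in a few lines of Python. The function name `quorum_outcome` and the labels it returns are assumptions made for this sketch and do not come from any data store's API.

```python
def quorum_outcome(n: int, r: int, w: int) -> str:
    """Classify an N/R/W setting using the W+R > N overlap rule."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    if r + w > n:
        # Every read quorum shares at least one replica with every write quorum,
        # so a read always reaches a replica holding the latest write.
        if w == n and r == 1:
            return "strong (read optimized)"
        if w == 1 and r == n:
            return "strong (write optimized)"
        return "strong"
    # Read and write quorums can miss each other, so a read may be stale.
    return "eventual"


if __name__ == "__main__":
    for n, r, w in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 1), (3, 1, 2)]:
        print(f"N={n} R={r} W={w}: {quorum_outcome(n, r, w)}")
```

With N=3, the common R=2, W=2 setting lands in the strong row, while R=1, W=2 falls into the eventual row where most NoSQL stores operate by default.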

Eventual Consistency

All replicas have same data – eventually

Milliseconds to seconds

Not all applications are compatible

Various ways to ensure latest data

Vector Clocks, Read Repair, Gossiping

Application determines correct data
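A minimal sketch of how vector clocks let an application decide whether one version supersedes another or whether the two conflict. The dict-based clock and the helper names (`increment`, `compare`, `merge`) are illustrative assumptions; real implementations differ in detail.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for clock a versus b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither clock dominates: the writes happened independently.
    return "concurrent"


def merge(a: dict, b: dict) -> dict:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

When `compare` reports `concurrent`, neither replica can be discarded automatically; this is the case where the application has to determine the correct data and write it back under the merged clock.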

Comparison

SQL | NoSQL
Prefers big-box, self-redundant | Prefers commodity hardware, distributed
Keep things from breaking | Assume things break or are broken
Solidly in CA land | Mostly AP, some tunable
P is difficult and expensive | P generally easy
Query by SQL | Custom API, SQLish
Stored procedures | Map/Reduce

Comparison

SQL | NoSQL
ACID transactions | BASE transactions
Advanced indexing | Indexing ranges from key-only to advanced
Foreign key support | Usually none
Strong lock support | Usually none
Schema centric | Usually schema-less
API – usually JPA or JDBC | API depends on implementation
Strong access control | Usually none

Comparison

SQL | NoSQL
Complex disk store, random access | Usually append only, 1 seek, 1 read
Easy for dev with JPA/Hibernate/SQL | Puts more work on application dev
Multi-platform | Favors Linux/Unix
General purpose | More special purpose
Strong commercial support | Strong to no commercial support
Great tool support | Tool support is limited

Column Stores

Data stored by column instead of row

Schema-less

Non-relational, data is de-normalized

Column format stores sparse data efficiently

Column families cannot change

10,000+ columns by 100 million+ rows

Easy sharding (partitioning)

Usually not ACID compliant
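The points above about fixed column families and sparse data can be sketched with nested dictionaries. This is a toy in-memory model, not how any listed product stores data on disk; the class name `ColumnFamilyTable` and its methods are hypothetical.

```python
class ColumnFamilyTable:
    """Toy column-family table: families are fixed, columns inside them are free-form."""

    def __init__(self, families):
        # One independent map per column family, fixed at creation time.
        self._store = {family: {} for family in families}

    def put(self, row_key, family, qualifier, value):
        if family not in self._store:
            raise KeyError(f"unknown column family: {family}")
        # Only cells that are actually written take up space (sparse storage).
        self._store[family].setdefault(row_key, {})[qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        return self._store[family].get(row_key, {}).get(qualifier, default)

    def cell_count(self):
        return sum(len(cols) for rows in self._store.values() for cols in rows.values())
```

A row that never writes a given column costs nothing in this model, which is why very wide tables with mostly empty cells remain practical.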

Column stores

BigTable – Google, 2006 paper

Hadoop/HBase – Part of Apache Hadoop

Cassandra – Facebook, LAN/WAN replication

Hypertable – Pluggable DFS, HQL

Vertica – Full SQL implementation

Amazon SimpleDB – Cloud store

Document Stores

CAP tunable

Either key/value or bucket/key/value

Easy/Auto sharding - Consistent hashing

Usually ACID compliant

Not SQL compliant, maybe custom query

Easy implementation via map or custom API
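Consistent hashing, mentioned above as the basis for easy or automatic sharding, can be sketched as a sorted ring of hash points. The class name `HashRing`, the use of SHA-256, and the default of 64 virtual nodes per server are assumptions chosen for this example rather than details taken from any particular store.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash point, node name)
        for node in nodes:
            self.add(node, vnodes)

    def add(self, node, vnodes=64):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing a node only reassigns the keys that node owned; every other key keeps its placement, which is what makes adding and removing shards cheap.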

Document stores

Amazon – Dynamo and S3 (cloud based)

Riak – CAP tunable, built in map/reduce

CouchDB – ACID, REST API

MongoDB – Indexing, query support

Voldemort – Java, pluggable serialization

MySQL – Key access, denormalize schema, kill indexes

Memory Stores

Mostly in the CA realm

P can be tough depending on implementation

Some are distributed, some local only

Usually key-value stores

Many are disk backed, append only files

Designed for very high-speed access
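The combination of an in-memory map with an append-only file on disk can be sketched as follows. The class name `AppendOnlyStore`, the JSON-lines log format, and the `compact` method are hypothetical choices for this illustration and do not reflect the file format of any product named here.

```python
import json
import os


class AppendOnlyStore:
    """In-memory key-value map backed by an append-only log file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Replay the log; later records for a key override earlier ones.
            with open(path, "r", encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Writes only ever append, so no seek is needed on the write path.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.data[key] = value

    def get(self, key, default=None):
        # Reads are served from memory.
        return self.data.get(key, default)

    def compact(self):
        """Rewrite the log so it holds one record per live key."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as log:
            for key, value in self.data.items():
                log.write(json.dumps({"key": key, "value": value}) + "\n")
        os.replace(tmp, self.path)
```

Reopening the store replays the log to rebuild the map, and compaction keeps the file from growing without bound when the same keys are overwritten repeatedly.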

Memory stores

Couchbase – Membase + CouchDB

Memcached – Local map

Coherence – Commercial Oracle, distributed

Redis – Supports hash, list, set, and sorted set, data structure server

Tokyo/Kyoto Cabinet – disk backed map

Infinispan – JSR-107 (JCache) implementation

Scalaris – Erlang, strong consistency

Graph/Triple Store

Model relationships well, bi-directional

Node/edges – edges can be weighted or not

RDF Triple – subject -> predicate -> object, W3C standard for semantic web

Many implement SPARQL, object API

Sharding can be difficult because of graph nature

Schema-less – nodes, edges, properties

Fast set operations
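The subject -> predicate -> object model above can be sketched as a set of tuples with wildcard matching. This is a toy stand-in for the kind of pattern a SPARQL query expresses, not a SPARQL implementation; the class name `TripleStore` and the sample data in the usage note are made up.

```python
class TripleStore:
    """Toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )
```

For example, after adding `("alice", "knows", "bob")` and `("bob", "knows", "carol")`, calling `match(predicate="knows", obj="bob")` returns the triples that point at bob, which shows how relationships can be followed in either direction.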

Graph/Triple Stores

Neo4j – ACID transactions, object API

AllegroGraph – Reference impl of SPARQL

Bigdata – dynamic sharding

Trinity – Microsoft Research

InfiniteGraph – Distributed, cross-platform

FlockDB – Twitter, fast set operations

InfoGrid – Object based, REST API

Interesting Integrations

Lucene - Document Store with Search as Query Language

Solr and Elasticsearch – Scalable Lucene

Riak Search – Erlang impl of Lucene APIs

Solandra – Lucene on Cassandra backend

Couchdb-lucene – Integration

DistributedLucene – Lucene on Hadoop

Neo4j – Full Text Search on Graph Store

Worth Mentioning

Configuration DBs – ZooKeeper, Doozer

Distributed configuration, locks, synchronization

Used to make other apps scalable

XML DBs – eXist, BaseX, Xindice

XML only, XQuery, XPath, ACID, GUI support

non-distributed

Case Study - HBase

Apache – part of Hadoop/HDFS

Requires ZooKeeper

Java based

Runs well on Amazon EC2

Excellent language support

Supports REST interface

HBase continued

Map/Reduce via Hadoop

Schema-less, column families fixed

Nearly unlimited columns and rows

HBQL – partial SQL + JDBC support

Some ACID support, atomicity, durability

Integration with Hive for data warehousing, ad-hoc query support - HiveQL

Case Study - Riak

Data Model – Bucket/Key/Value

Value has MIME type, byte[]

Value supports one-way Links, basic graph

Erlang, Protocol Buffers, REST interfaces

Pre/Post Commit Hooks

CAP Tunable per bucket

Map/Reduce – Erlang and JavaScript

Riak Continued

Vector Clocks

Read repair for R < N

Peer-to-Peer, Shared-Nothing Architecture

Replication across data centers

Pluggable storage

API for Most Languages + REST

Commercial Support
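Read repair can be sketched as: read from R replicas, pick the newest version seen, and push it to any replica that is behind. To keep the example short, versions are plain integers rather than vector clocks, replicas are dicts, and the first R replicas are assumed to be the ones that answer; the function name `read_with_repair` is hypothetical and this is not Riak's implementation.

```python
def read_with_repair(replicas, key, r):
    """Read key from the first r replicas and repair replicas that are behind.

    Each replica is a dict mapping key -> (version, value).
    """
    responses = [replica.get(key) for replica in replicas[:r]]
    found = [entry for entry in responses if entry is not None]
    if not found:
        return None
    latest = max(found, key=lambda entry: entry[0])
    for replica in replicas:
        current = replica.get(key)
        # Only overwrite missing or older copies, never a newer one.
        if current is None or current[0] < latest[0]:
            replica[key] = latest
    return latest[1]
```

Because R < N, a read may not contact the replica holding the newest write; the guard in the loop keeps such a replica from being overwritten with an older version.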

Case Study - Redis

Supports hash, list, set, and sorted set

Fast set operations

Atomic updates

Everything stored in memory

Persistence to disk – periodic save, append only file, can be compacted

Good API support, JDBC subset driver

Redis Continued

Master-slave replication, read scalability, redundancy, slave can sync to disk

Can swap out values, keys must be in memory

Can be used as pub/sub messaging system

Can send multiple commands in single request

Built to be extremely fast

Supports very high speed atomic counters

Case Study - Neo4j

Java based – cross platform

ACID transactions

Durable persistence

Handles billions of nodes/edges on a single machine

Supports bulk data loading

Good language support

Neo4j Continued

Spatial index support

RDF triples/OWL/SPARQL support

Replication and HA – commercial version

Object oriented API

Sharding at client level

Dual open source and commercial license

Resources

fallabs.com/tokyocabinet

fallabs.com/kyotocabinet

redis.io

www.membase.org

neo4j.org

en.wikipedia.org/wiki/Triplestore

en.wikipedia.org/wiki/Graph_theory

research.microsoft.com/en-us/projects/trinity

Resources

www.jboss.org/infinispan

basho.com

nosqlpedia.com/wiki/Consistency_models_in_nonrelational_dbs

www.hypertable.org

project-voldemort.com

www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Resources

nosql-database.org

couchdb.apache.org

engineering.twitter.com/2010/05/introducing-flockdb.html

infinitegraph.com

http://www.w3.org/TR/rdf-concepts/
