Copyright © 2011-2013 Curt Hill
NoSQL Databases
No SQL or Not Only SQL
Historically …
• Typical relational databases live in a pleasant niche
– Their data is relatively small and usually on one machine
– The meaning of the data is well understood
– Schemas are tightly defined
– Transactional consistency (ACID) is maintained
– Results of transactions are accurate
Things can be different
• Extremely large amounts of data
• Data is spread over many machines, possibly geographically distant
• Change to this data is continuous
• Data quality may be poor, obtained from many sources
• Schemas are fuzzy and uncertain
– Or completely lacking
Relaxing principles
• Classic database principles have been left behind
• Locking is usually absent
• Schemas are often inconsistent or lacking
• Data comes from many sources
– How does this get integrated into a rigid schema?
• Accuracy of the data is missing
– By the time we update, it has already changed
People and Business
• A normal relational database gives us accuracy
– The limitation is the accuracy of the data
• People are used to making decisions without all the facts
• Businesses often make decisions without all the facts or complete analysis
– Otherwise the window of opportunity has passed
CAP Theorem
• A distributed database or web service cannot simultaneously guarantee all three of the following:
• Consistency
– All nodes see the same data at the same time
• Availability
– Every operation must terminate in an intended response
• Partition tolerance
– Operations will complete even if individual components fail or cannot communicate
ACID absent
• ACID, in particular, is in danger
• The goal of a transaction is to make it look like it occurs by itself without considering other transactions
• When multiple computers are communicating and each has its own data, this is in danger
• Locking and unlocking is a problem
– Things are changing too fast to let one transaction lock data
– Without locking, serializability is in danger
Now and Then
• Suppose a transaction is made
• One computer messages all the others
• By the time that message arrives it reflects a past state
• By the time it is processed that state may have changed
• Virtually everything on the Internet represents a past state and not the current one
Now and Then Again
• A single computer may think of its data as current
• It must accept all messages from other computers as in the past
• Absolute consistency cannot be obtained
• Eventual consistency is now the norm
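One way to picture eventual consistency is a last-write-wins merge: each replica stamps its writes and, when messages finally arrive, keeps whichever value carries the newer stamp. The sketch below is only an illustration. The Replica class, its method names, and the integer timestamps are assumptions made for this example and are not taken from any particular database.

```python
class Replica:
    """A toy replica that resolves conflicts by last-write-wins."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # A local write is treated like any other timestamped update.
        self.merge({key: (timestamp, value)})

    def merge(self, updates):
        # Messages describe a past state: apply one only if it is newer
        # than what this replica already holds.
        for key, (timestamp, value) in updates.items():
            current = self.data.get(key)
            if current is None or timestamp > current[0]:
                self.data[key] = (timestamp, value)

    def read(self, key):
        return self.data[key][1]


a, b = Replica("a"), Replica("b")
a.write("stock", 10, timestamp=1)
b.write("stock", 7, timestamp=2)

# Before any messages are exchanged the replicas disagree.
disagree_before = a.read("stock") != b.read("stock")

# Exchange state in both directions; afterwards both hold the newest write.
a.merge(dict(b.data))
b.merge(dict(a.data))
agree_after = a.read("stock") == b.read("stock") == 7
```

Between the two writes and the exchange, a reader of replica a sees a stale value. That window is what "eventual" means here.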
BASE not ACID
• BASE is an alternative to ACID
• Basically Available, Soft state, Eventually consistent
– Clearly contrived to complement ACID
• This is acknowledging that when the data becomes too widely distributed something has to give
Not the only relaxation of requirements
• NoSQL databases usually abandon the whole relational format
• They may also include the relational database as a subset of the entire database
• The most common form is the data store
– AKA key-value store
NoSQL Databases
• Must provide APIs to various programming languages
• Must scale well to very large sizes
• Indexing is the key to rapid access
• These NoSQL databases are targeted at different niches
• Generally not interchangeable
– Unlike most RDBMS
Kinds of NoSQL
• Key Value
• Columnar
• Document
• Graph
Key Value
• Simplest model
• There is a key (which must be unique) linked to a group of values
• It gets more interesting if the values may include key value pairs as well
• Often not much of a schema
• Think of a database with one table
– Unlimited string as key
– Unlimited string as second field
• Two examples: Riak and Redis
Key-Value stores
• A relational table is a restricted form of key-value
– The key is the primary key
– The data is all the fields associated with that key
– However, a key-value store may not even be in First Normal Form
• There is only one table
– Key is an unrestricted-size string
– Data is whatever needs to be there
– The values may be completely different
KV Picture
Key Value Again
• In a relational database we always know what the value extracted from a cell is
• It has the same meaning as everything else in the column
• This is no longer the case in key value stores
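The point can be illustrated with an ordinary Python dictionary standing in for a key-value store. The keys and values below are made up for illustration, and nothing about them comes from Riak or Redis.

```python
# A plain dict stands in for a key-value store (hypothetical data).
store = {
    "user:1": "Ada Lovelace",                      # a simple string
    "user:2": {"name": "Alan", "langs": ["ML"]},   # nested key-value pairs
    "cart:9": [3, 14, 15],                         # a list of item ids
}

# Lookup by key is the one operation the store is good at.
value = store["user:2"]

# Unlike a relational column, the values need not share a type or meaning,
# so the application must inspect each value before using it.
kinds = sorted({type(v).__name__ for v in store.values()})
```

Here the "value column" holds a string, a nested mapping, and a list at once, which a relational column would not allow.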
Columnar
• Also known as a column store
• A lot of similarity to relational, but the dominant item is the column not the row
• We lack the rectangularity that relational has
• Columns are stored together
• Halfway between Relational and Key Value
• HBase, Cassandra, Hypertable, Calpont, MonetDB are examples
Columnar
Columnar again
• Often used in Data warehouses
• Since the columns are stored together (rather than the rows) and since the columns have only one data type, there is an opportunity to compress a column that is absent in relational DBs
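The compression opportunity can be sketched with run-length encoding, one simple scheme that benefits from a column's repeated, same-typed values. The sample rows and function names below are hypothetical. Real column stores use a variety of encodings, and this is not a description of any specific product.

```python
from itertools import groupby

# Hypothetical rows of (order_id, country, status).
rows = [
    (1, "US", "shipped"),
    (2, "US", "shipped"),
    (3, "US", "pending"),
    (4, "DE", "pending"),
    (5, "DE", "pending"),
]

# Column layout: one list per column instead of one tuple per row.
order_ids, countries, statuses = (list(col) for col in zip(*rows))


def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


encoded_countries = run_length_encode(countries)
encoded_statuses = run_length_encode(statuses)
```

Stored row by row, the same values are interleaved with the other fields, so the runs that make this encoding effective never appear.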
Document
• The basic object is now a document instead of a simple field like a number
– Document is often XML or JSON
• Each document has an ID and other identifying values
• A document is an arbitrary and complicated item
– As if every field were a BLOB
• Examples: MongoDB, CouchDB, Oracle NoSQL, Amazon’s SimpleDB
Graph
• A mathematical graph consists of nodes (the data) and links between these
– This is the network model revisited
• Used for highly interconnected data
• Processing rides the links
• Neo4J and Zope are examples
Commentary
• These classifications are incomplete
• Many examples exist that are combinations of several
• We next look at some example databases
– Most of these are open source
Riak
• Key value store designed to be distributed over many nodes
• Designed to be fault-tolerant
– Peer-to-peer architecture, no master
– All the data is scattered over many servers and disks
– One or more failures do not compromise the data
• Everything is done through web queries
• Used by a quarter of the Fortune 50
• Includes Best Buy, Github, Comcast
Redis
• Key value store, optimized for speed
• Creator is Salvatore Sanfilippo who calls it a data structure server
– Data could be more than a string or number linked to a key
• Data may also be a sorted or unsorted set of strings
– This enables set operations on keys
• Keeps data in memory and occasionally updates disk
– No ACID guarantees in that
• Used by Craigslist, Flickr
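The idea of set operations on keys can be sketched with Python's built-in set type. The dictionary below merely stands in for sets stored under keys. The key names and members are hypothetical, and this is not a Redis client.

```python
# A dict of Python sets stands in for sets stored under keys
# (hypothetical keys and members; this is not a Redis client).
store = {
    "tags:post:1": {"nosql", "redis", "cache"},
    "tags:post:2": {"nosql", "mongodb"},
}

common = store["tags:post:1"] & store["tags:post:2"]      # intersection
all_tags = store["tags:post:1"] | store["tags:post:2"]    # union
only_first = store["tags:post:1"] - store["tags:post:2"]  # difference
```

Redis exposes comparable operations as the SINTER, SUNION, and SDIFF commands, each of which takes the keys of the sets to combine.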
MongoDB
• Designed to be a very scalable document model database
– Used by CERN for Large Hadron Collider data
• Data is formatted as JavaScript objects, that is, JavaScript Object Notation (JSON)
• Attributes are indexed
• Queries now become JavaScript functions
• APIs in the major languages
• Who is Mongo?
JSON
• A lightweight data interchange format
• Derived from JavaScript but used outside of JavaScript
• Most languages have a subroutine to parse and assimilate JSON
• A short JSON presentation
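As an example of such a subroutine, Python's standard json module parses JSON text into native objects and serializes them back. The document contents below are hypothetical.

```python
import json

# A hypothetical document in JSON text form.
text = '{"id": 42, "title": "NoSQL", "tags": ["kv", "graph"], "draft": false}'

# Parse the text into native Python objects.
doc = json.loads(text)

# Modify it and serialize it back to JSON text.
doc["tags"].append("document")
round_trip = json.loads(json.dumps(doc))
```

JSON objects become dictionaries, arrays become lists, and true/false become Python booleans, so the parsed document can be handled like any other in-memory data.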
MongoDB and ACID
• Atomicity: yes
• Consistency: no schema, so no consistency or inconsistency
• Isolation: good, but not perfect
• Durability: yes
Terms
RDBMS       MongoDB
Table       Collection
Row         JSON Document
Index       Index
Join        Embedding and linking
Partition   Shard
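The "Embedding and linking" entry can be sketched with plain Python dictionaries in place of documents. No MongoDB driver is used here, and the field names (customer, customer_id, and so on) are hypothetical.

```python
# Embedding: the related data lives inside the parent document.
order_embedded = {
    "_id": 1,
    "customer": {"name": "Ada", "city": "London"},
    "items": ["disk", "cable"],
}

# Linking: the document stores an id that refers to another collection.
customers = {7: {"_id": 7, "name": "Ada", "city": "London"}}
order_linked = {"_id": 1, "customer_id": 7, "items": ["disk", "cable"]}


def customer_city(order):
    """Return the customer's city for either style of order document."""
    if "customer" in order:
        return order["customer"]["city"]            # one read, no lookup
    return customers[order["customer_id"]]["city"]  # second lookup by id
```

Embedding answers the question in one read at the cost of duplicating customer data in every order. Linking avoids the duplication but leaves the application to perform the second lookup that a join would do in an RDBMS.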
CouchDB
• Document based with JSON content
• Each document has a set of keys that link to it
• Written in Erlang, but with JavaScript API
– Other languages interface to that
• Very fault tolerant
• Used by LinkedIn, Orbitz
HBase
• A columnar database
• Very scalable, designed for big data
• Each field is versioned, making it 3D rather than 2D
– Columns are stored together
– Rows are the related data
– Depth is the older versions
• Used by Facebook, Twitter, Yahoo, eBay
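The three dimensions (row, column, version) can be sketched as a table whose cells each keep a history. The VersionedTable class and its methods are assumptions made for illustration and do not reproduce the HBase API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table in which every (row, column) cell keeps its history."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), oldest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        self.cells[(row, column)].append((timestamp, value))
        self.cells[(row, column)].sort(key=lambda pair: pair[0])

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest at or before `as_of`."""
        versions = self.cells.get((row, column), [])
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None


table = VersionedTable()
table.put("user1", "email", "old@example.com", timestamp=1)
table.put("user1", "email", "new@example.com", timestamp=5)

latest = table.get("user1", "email")
earlier = table.get("user1", "email", as_of=3)
```

A write never overwrites the previous value. It adds depth to the cell, and a read chooses how far back to look.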
Cassandra
• Project started by Facebook to power inbox search
• Became an Apache project
• Intended to create a network of equal nodes
• Eventual consistency, not perfect consistency
• Mostly written in Java but provides APIs in Python, Ruby, PHP among others
• Used by IBM, HP, Netflix among others
Neo4J
• Graph database
– Network of nodes and links
• Data is information on a person or thing
• Links are the connections between one datum and another
• Numerous graph algorithms have been implemented
– Consider Facebook connections
• Used by Adobe, Lufthansa, Mozilla
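A typical graph algorithm on social connections is breadth-first search for the shortest chain of links between two people. The sketch below uses a plain adjacency list with hypothetical names. It does not use Neo4J or its query language.

```python
from collections import deque

# Hypothetical friendship graph as an adjacency list (undirected).
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: number of links on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbor in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


hops = degrees_of_separation(friends, "ann", "eve")
```

The search follows links from node to node, which is what "processing rides the links" means. The same question asked of a relational schema would need one self-join per hop.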
CAP
• Several of these are distributed
• Since they cannot do all three, they generally are good at two of the three
• See the following picture
CAP
[Figure: diagram relating Consistency, Availability, and Partition tolerance, with Riak, MongoDB, HBase, and CouchDB placed on it]
Niches
• For a product to be successful it must find one or more niches where it may do well
• A niche is a particular set of circumstances and requirements
• Next we want to consider some of these products and what they do well and what they do poorly
Relational
• Layout and form of the data is well known in advance and relatively stable
– We do not need to know in advance what will be done with the data, but we do need to know the form
– Most business processes have this kind of requirement
• Not as effective for deeply hierarchical and widely varying data
Key Value
• Easy to make fast or horizontally scalable or both
• Useful where data does not conform to a well known schema or the data is not very well related
• Searches are easy but more complicated queries are not
– No indices
– No linkages, i.e., foreign keys
Columnar
• Horizontal scalability is based on storing columns in different nodes
– Thus good for big data
• Allows for versioning
• Like relational, schema needs to be done in advance
– Based on what queries are needed
– Does poorly with ad hoc data and queries
Document
• Works well with data that is highly variable and not known in advance
• Content is often JSON, so these are object oriented databases
• No normalization is possible, so redundancies are mostly unavoidable
• Most interesting queries are not possible
Graph
• Particularly useful for modeling networking
• For social networking applications
– Nodes are people and edges their relationships
– Hard to model this in other models
• Not easy to partition, so not easy to scale
• No common query language
Déjà vu?
• In the early 1970s the database world was in some disarray
• There were several models
• None had achieved dominance
• Commercial offerings were present, but theoretical foundation was lacking
• There was no uniformity to these products
• Interchanging products was very difficult
The End or Start of an Era
• Codd changed that by the development of a theoretical foundation for relational databases
• SQL became the common language
• For several decades now Relational Databases have been the undisputed king
• RDBMS is a 32 billion dollar industry
• The products are to some degree interchangeable
Again
• The situation around NoSQL databases has a lot of the same feel as in the 1970s
• They are not interchangeable and not even directed towards the same ends
• Is this the end of the RDBMS era?
• Unlikely we will soon get rid of RDBMS, but it is not likely to be as exclusive as it has been
Finally
• Some of the motivations of the NoSQL movement are:
– Big Data
– Requirements to be distributed
– Volatility of data, largely caused by the web
• Check out the following link
– DB-Engines.com rates popularity of databases