nosql
NoSQL: Overview of the main features and approaches
This presentation has been developed in the context of the Databases course at the DISIM Department of the University of L’Aquila (Italy). http://www.di.univaq.it/malavolta
DISIM - University of L’Aquila
Why, When, Who NOSQL (now)?
The CAP Theorem
NOSQL Approaches
Case Study 1: Instagram
Case Study 2: Twitter
Case Study 3: tumblr
Summary
References
ACID
Atomicity
Consistency
Isolation
Durability
Based on Relational Algebra
Select, Projection, Set Operators, Renaming, Joins
Concept of Schema
Standard
The term was reintroduced in its current meaning in 2009 by Eric Evans (then at Rackspace, and an Apache Cassandra committer), as the name of a meetup on open-source distributed databases
Class of non-relational data storage systems
Usually do not require a fixed schema
Many NoSQL offerings relax one or more of the ACID properties
NoSQL does not mean "No to SQL": we are not against SQL!
NoSQL means "Not only SQL": it’s about recognizing that for some problems other storage solutions are better suited!
http://goo.gl/gWIoy
Each NOSQL approach addresses some limitations of relational databases, like:
• horizontal scalability
• read/write performance
• schema limitations
• difficult query patterns
• parallel data processing
• etc.
Reason about sharding and master-slave replicas
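As a minimal sketch of what sharding with master-slave replicas involves, the snippet below routes each key to a shard by hashing it, sends writes to that shard's master and serves reads from one of its replicas. The names `Shard` and `shard_for`, the three shards and the synchronous replication are illustrative assumptions, not the behavior of any specific product.

```python
import hashlib
import random


class Shard:
    """One partition of the data: a master plus read-only replicas (all in memory)."""

    def __init__(self, n_replicas=2):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        self.master[key] = value
        # Assumption: replication is synchronous here for simplicity;
        # real systems often replicate asynchronously.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads are spread over the replicas to offload the master.
        return random.choice(self.replicas).get(key)


def shard_for(key, shards):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


shards = [Shard() for _ in range(3)]
shard_for("user:42", shards).write("user:42", {"name": "Ada"})
print(shard_for("user:42", shards).read("user:42"))
```

Because the shard is computed from the key alone, any client can locate the data without a central lookup; the price is that changing the number of shards moves most keys.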
Massive read/write performance
usually fast key-value access
High Availability
Data can be stored in multiple nodes: data can be partitioned
Helps in avoiding a single point of failure: fault tolerance
http://goo.gl/PVpoh
http://goo.gl/DAxmN
Flexible schema and data types
easy to develop the application layer
(JSON, HTTP access, JS functions, etc.)
Ease of maintenance, administration
many vendors are spending a lot of effort on ease of use, minimal administration, and automated operations
Promotes parallel computing
tremendously performant!
see Map-Reduce
http://goo.gl/PVpoh http://goo.gl/DAxmN
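To make the Map-Reduce pointer above concrete, here is a small word-count sketch with the three classic phases (map, shuffle, reduce). It runs the map phase on a local thread pool only to show the structure; real Map-Reduce frameworks distribute the work across machines. All function names are assumptions made for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, which is what makes the model parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["nosql scales out", "sql scales up", "nosql is not only sql"])
print(counts)
```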
Supporting large data sets with room to grow
thanks to partitioning, data structures and dedicated algorithms
Tunable for deployment size or functionality
can be used for medium to large datasets, in terms of both size and complexity
CHEAP (open-source)
http://goo.gl/PVpoh
http://goo.gl/DAxmN
What are we giving up?
• joins
• group by
• order by
• indexes
• ACID transactions
• complex relationships
• powerful and standard query language (SQL)
• data independence (mainly for data integrity)
• maturity
http://goo.gl/PVpoh
Some NOSQL approaches provide some (but not all) of the features listed here
– Storage of large amounts of non-transactional data
   • log analysis, web statistics, etc.
– Caching results from slower databases (see Twitter)
– Denormalization of data to avoid expensive join queries
– Managing data that is not easily analyzed in an RDBMS, such as time- or location-based data
– Real-time systems
   • games, financial data, chats, etc.
Do you have, somewhere, a large set of uncontrolled, unstructured data that you are trying to fit into an RDBMS?
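The "caching results from slower databases" use case can be sketched with the cache-aside pattern: look in a fast key-value cache first, and query the slow database only on a miss. The class `CacheAside`, the function `slow_query` and the 60-second expiry are hypothetical names and values chosen for this illustration.

```python
import time


class CacheAside:
    """Serve reads from an in-memory cache, falling back to a slower loader."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._cache = {}  # key -> (value, expiry time)
        self.db_hits = 0  # how many times the slow database was queried

    def get(self, key):
        entry = self._cache.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the database is not touched
        # Cache miss or expired entry: query the database and remember the result.
        self.db_hits += 1
        value = self._load(key)
        self._cache[key] = (value, now + self._ttl)
        return value


def slow_query(user_id):
    # Stands in for an expensive relational query (for example, a multi-way join).
    return {"id": user_id, "followers": 3}


cache = CacheAside(slow_query)
first = cache.get(7)
second = cache.get(7)
print(first, second, cache.db_hits)
```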
Slides courtesy of Tobias Lindaaker http://www.thobe.org/
CAP Theorem
formulated by computer scientist Eric Brewer in 2000
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency: every client always has the same view of the data
• Availability: every request received by a non-failing node must result in a response
• Partition Tolerance: the system keeps working even though some messages between the nodes may be lost
Demonstration...
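To scale out, you have to partition, so you have to choose between consistency and availability
[Figure: Venn diagram of Consistency, Availability and Partition Tolerance; the pairwise overlaps are labeled CA, CP and AP, and the intersection of all three is empty (∅)]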
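The trade-off can be illustrated with a toy simulation of two replicas during a network partition. In "CP" mode the cluster rejects writes it cannot replicate (it stays consistent but is not available); in "AP" mode it accepts them locally (it stays available but the replicas diverge). The classes `Replica` and `Cluster` are assumptions made for this sketch and do not model any real database.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Cluster:
    """Two replicas; `mode` is "CP" or "AP" (a simplification for illustration)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency is preserved by refusing the request.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # AP: answer the request locally and let the replicas diverge.
            replica.data[key] = value
            return
        # No partition: the write reaches both replicas.
        replica.data[key] = value
        other.data[key] = value


cp = Cluster("CP")
cp.partitioned = True
try:
    cp.write(cp.a, "x", 1)
    cp_available = True
except RuntimeError:
    cp_available = False

ap = Cluster("AP")
ap.partitioned = True
ap.write(ap.a, "x", 1)
ap.write(ap.b, "x", 2)
ap_consistent = ap.a.data == ap.b.data

print(cp_available, ap_consistent)
```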
BASE = Basically Available, Soft state, Eventual consistency
Consistency model weaker than ACID (Atomicity, Consistency, Isolation, Durability)
• Basically Available: if a node fails, part of the data will not be available, but the entire data layer stays operational
• Soft state: the state of the system may change over time, even without input
• Eventual consistency: the system becomes consistent at some later time
http://queue.acm.org/detail.cfm?id=1394128
BASE example
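One possible way to picture BASE behavior is the sketch below: two nodes accept writes independently, a read may return stale data, and a later synchronization step makes both nodes converge on the newest version (last write wins). The shared logical clock `_clock` and the function `anti_entropy` are simplifying assumptions of this example; real systems have no global clock and use more elaborate reconciliation.

```python
import itertools

# Assumption: a single logical clock shared by both nodes, used only to order writes.
_clock = itertools.count(1)


class Node:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, value):
        self.store[key] = (next(_clock), value)

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def anti_entropy(x, y):
    """Exchange state: for each key, both nodes keep the version with the newest timestamp."""
    for key in set(x.store) | set(y.store):
        newest = max(node.store[key] for node in (x, y) if key in node.store)
        x.store[key] = newest
        y.store[key] = newest


n1, n2 = Node(), Node()
n1.put("cart", ["book"])
stale = n2.get("cart")            # n2 has not seen the write yet
n2.put("cart", ["book", "pen"])   # a later write lands on the other node
anti_entropy(n1, n2)              # "at some later time" the nodes converge
print(stale, n1.get("cart"), n2.get("cart"))
```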
NOSQL Approaches
Four genres of NOSQL databases:
• Key-value
• Columnar
• Document
• Graph
Here the focus is on SCALABILITY
designed to handle massive load
stores a collection of Key-Value pairs
think about maps (or associative arrays) in classical programming languages
http://goo.gl/LfG1N
KEY = a string value
VALUE = any kind of element, such as strings, videos, XML files, etc.
Key namespaces are used to avoid collisions
Implementations: Riak, Redis, Voldemort, Dynamo
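The key-value model described above can be sketched as a thin wrapper around a dictionary, with a namespace prefix on every key to avoid collisions. The class `KeyValueStore` and the `namespace:key` naming scheme are assumptions of this example (the sketch also assumes namespaces never contain a colon).

```python
class KeyValueStore:
    """A string key maps to an arbitrary value; nothing is known about the value's structure."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _full_key(namespace, key):
        # Assumption: namespaces do not contain ":", so prefixed keys cannot clash.
        return f"{namespace}:{key}"

    def put(self, namespace, key, value):
        self._data[self._full_key(namespace, key)] = value

    def get(self, namespace, key, default=None):
        return self._data.get(self._full_key(namespace, key), default)

    def delete(self, namespace, key):
        self._data.pop(self._full_key(namespace, key), None)


store = KeyValueStore()
store.put("session", "42", {"user": "ada", "logged_in": True})
store.put("avatar", "42", b"\x89PNG...")  # same key, different namespace, no collision
print(store.get("session", "42"))
print(store.get("avatar", "42"))
```

Note that the only access path is the key: there is no way to ask for "all sessions of logged-in users" without scanning every entry, which is the "no complex queries" limitation of this family.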
PROS
• easy to use
• extreme performance
• no need to maintain indices
• large horizontal data
CONS
• no complex queries (no SQL)
• no transactions
  – actually, Redis has transactions
• many data structures cannot be easily modeled as key-value pairs
• must fit in memory
http://goo.gl/PGfjU
• Stock prices
• Analytics
• Real-time data collection
• Real-time communication
• User sessions storage
• Caching Data from other DBs
SEE CASE STUDIES LATER IN THIS LECTURE
Midway between relational and KV stores
Values are queried by matching keys; like relational DBs, their values are groups of zero or more columns
Unlike relational DBs, data from a given column is stored together
adding columns is quite inexpensive
Each row can have a different set of columns, or none at all: this allows tables to remain sparse without additional storage cost for null values
Implementations: HBase, BigTable, Cassandra, Vertica
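A rough in-memory sketch of these ideas follows: values of the same column are kept together, a new column costs only a new dictionary entry, and a row simply has no entry for the columns it lacks, so nulls take no space. The class `ColumnarTable` and its methods are hypothetical and much simpler than the storage layout of the systems listed above.

```python
from collections import defaultdict


class ColumnarTable:
    def __init__(self):
        # column name -> {row key -> value}: data of one column is stored together
        self._columns = defaultdict(dict)

    def put(self, row_key, column, value):
        # Writing to an unseen column creates it on the fly.
        self._columns[column][row_key] = value

    def get_row(self, row_key):
        """Assemble a row from the columns that actually hold a value for it."""
        return {
            column: values[row_key]
            for column, values in self._columns.items()
            if row_key in values
        }

    def scan_column(self, column):
        """Read one column for all rows without touching the other columns."""
        return dict(self._columns.get(column, {}))


table = ColumnarTable()
table.put("row1", "name", "Ada")
table.put("row1", "email", "ada@example.org")
table.put("row2", "name", "Bob")  # row2 has no email: nothing is stored for it
print(table.get_row("row2"))
print(table.scan_column("email"))
```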
PROS
• Easy to Distribute Tasks
• Solving ‘Big Data’ issues
• High Availability
• Garbage collection for expired data
• Scanning is very easy
CONS
• De-normalization
• Expensive to insert
• Requires heavy pre-planning of queries
• Search engines
• Logging
• Analysing log data
• When you need to scan huge, two-dimensional, join-less tables
• Banking (consistency enforcement)
• Many implementations provide versioning facilities
• in Cassandra writing is faster than reading values (!)
SEE CASE STUDIES LATER IN THIS LECTURE
Superset of key-value DBs: you can also query on the value part
the document portion is structured
Think about documents as tuples with any number of fields (JSON)
Documents can contain nested structures
Documents are often versioned
Different document databases take different approaches to indexing, querying, replication, consistency, etc.: choose wisely!
Implementations: MongoDB, CouchDB, RavenDB
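The difference from a plain key-value store can be sketched as follows: documents are still stored by key, but because their content is structured, they can also be queried by field values, including nested ones. The class `DocumentStore` and the dotted-path syntax (`"address.city"`) are assumptions of this example, not the query language of any of the products above.

```python
class DocumentStore:
    def __init__(self):
        self._docs = {}  # document id -> document (a nested dict, as in JSON)

    def insert(self, doc_id, document):
        self._docs[doc_id] = document

    @staticmethod
    def _lookup(document, path):
        """Follow a dotted path such as "address.city"; return None if any step is missing."""
        current = document
        for part in path.split("."):
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current

    def find(self, criteria):
        """Return the ids of all documents whose fields match every (path, value) pair."""
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if all(self._lookup(doc, path) == value for path, value in criteria.items())
        ]


db = DocumentStore()
# Documents in the same store need not share the same fields.
db.insert("u1", {"name": "Ada", "address": {"city": "L'Aquila", "zip": "67100"}})
db.insert("u2", {"name": "Bob", "tags": ["admin"]})
db.insert("u3", {"name": "Eve", "address": {"city": "L'Aquila"}})
print(db.find({"address.city": "L'Aquila"}))
```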
PROS
• Variable data
• Object Oriented Paradigms
• Concurrency
• Works well with de-normalized data
CONS
• Hard to do complex queries
• No Joins
• Enforcing Structured Data
• When you don’t know in advance what exactly your data will look like
• They map well to object-oriented programming models
• For accumulating, occasionally changing data, on which pre-defined queries are to be run
• Places where versioning is important
• Services that handle age difference, geographic location, tastes and dislikes, etc.
• A leaderboard system that depends on many variables
SEE CASE STUDIES LATER IN THIS LECTURE
Focus on modeling the structure of data & interconnectivity
Inspired by mathematical Graph Theory ( G = (V, E), with V the set of vertices and E the set of edges )
Data model is the Property Graph:
• Entities are nodes
• Relationships are edges between Nodes
• Key-Value pairs on both
Excels in dealing with highly interconnected data
Relational DBs can model graphs, but an edge requires a join, which is expensive
Implementations: Neo4J, OrientDB, FlockDB, Trinity
[Figure: example graph with nodes A, B, C, D, E connected by edges a, b, c, d, e]
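A property graph can be sketched in a few lines: nodes and edges both carry key-value pairs, and following a relationship is a direct lookup on the node rather than a join. The class `PropertyGraph`, the `type` property on edges and the sample data are illustrative assumptions.

```python
from collections import deque


class PropertyGraph:
    def __init__(self):
        self.nodes = {}  # node id -> properties (key-value pairs)
        self.edges = {}  # node id -> list of (target id, edge properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, target, **props):
        self.edges.setdefault(source, []).append((target, props))

    def reachable(self, start, edge_type):
        """Breadth-first traversal that follows only edges of the given type."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for target, props in self.edges.get(current, []):
                if props.get("type") == edge_type and target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen - {start}


g = PropertyGraph()
for person in ("A", "B", "C", "D"):
    g.add_node(person, kind="person")
g.add_edge("A", "B", type="KNOWS", since=2010)
g.add_edge("B", "C", type="KNOWS", since=2012)
g.add_edge("A", "D", type="BLOCKS")
print(g.reachable("A", "KNOWS"))
```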
PROS
• Easy match with the problem domain
  – with relational, you have to create an ER diagram, then normalize, etc.
• Ability to quickly traverse nodes and relationships to find relevant data
  – you can apply Dijkstra's algorithm for querying the DB
• Fit well with object-oriented concepts
• Neo4J has full ACID conformity
CONS
• generally hard to partition across the network
  – due to the high interconnectedness
• No Joins
• Enforcing Structured Data
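Since the PROS above mention Dijkstra's algorithm, here is a compact sketch of it over a weighted adjacency map, computing the cheapest distance from one node to every reachable node. The function name, the `roads` data and the representation of the graph as nested dictionaries are assumptions of this example; a graph database would run such traversals over its own storage.

```python
import heapq


def dijkstra(graph, source):
    """Return the smallest total edge weight from `source` to each reachable node.

    `graph` maps a node to {neighbor: non-negative edge weight}.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry: a shorter path to this node was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


roads = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 7},
    "D": {},
}
distances = dijkstra(roads, "A")
print(distances)
```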
• Social networks
• Recommendation engines
• Geographic data
• Public transport links
• Road maps
• Network topologies
SEE CASE STUDIES LATER IN THIS LECTURE
http://goo.gl/0JoW8
Case Study 1: Instagram
http://goo.gl/xpPac
http://goo.gl/mkfQN
key-value
key-value (in the cloud)
relational
Case Study 2: Twitter
http://goo.gl/2kdvm
key-value
graph
columnar
plus Blobstore!
Case Study 3: tumblr
http://goo.gl/CrC0P
key-value
columnar
relational
Summary
SCALABILITY - SCALABILITY - SCALABILITY
with respect to both size and complexity
...usually at the cost of consistency
NOSQL is not a silver bullet for everything
Polyglot data is the new main trend...
...but in 10 years the majority of IT solutions will still be based on RDBMS
Chapters 1 and 9
http://goo.gl/ThO63
http://nosql-database.org/
check out my blog for these slides
www.ivanomalavolta.com
Neo4j - http://neo4j.org
OrientDB - http://www.orientdb.org
VoltDB - http://www.voltdb.com
CouchDB - http://couchdb.apache.org
Cassandra - http://cassandra.apache.org
Riak - http://www.basho.com
HBase - http://hbase.apache.org
MongoDB - http://www.mongodb.org
Redis - http://code.google.com/p/redis
Oracle Berkeley DB - http://www.oracle.com/database/berkeley-db
FlockDB - http://github.com/twitter/flockdb