Overview of NoSQL

Introduction to NoSQL 2011.02 Quang Nguyen

Upload: nguyen-quang

Post on 22-Nov-2014


DESCRIPTION

An introduction to NoSQL: why it is needed, what the drawbacks of RDBMS are, and how NoSQL can overcome them.

TRANSCRIPT

Page 1: Overview of NoSQL

Introduction to NoSQL

2011.02 Quang Nguyen

Page 2: Overview of NoSQL

Communicating Knowledge

Agenda

- New Challenges for RDBMS
- Introduction to NoSQL
- MongoDB Sharding

Page 3: Overview of NoSQL

Relational DBMS

- In use since 1970
- Use SQL to manipulate data
- Easy to use
- Easy to integrate with other systems
- Excellent for applications such as management (accounting, reservations, staff management, etc.)

Page 4: Overview of NoSQL

ACID Properties of RDBMS

Database transactions always satisfy these four properties:

- Atomic: "all or nothing"; when a transaction is executed, it either succeeds completely or fails with no effect
- Consistent: data moves from one correct state to another correct state
- Isolated: two concurrent transactions will not become entangled with each other
- Durable: once a transaction has succeeded, the change will not be lost
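A minimal sketch of atomicity, using Python's built-in sqlite3 module as a stand-in for an RDBMS. The `account` table, the account names A and B, and the helper functions are assumptions made for illustration; they are not taken from the slides. The credit is applied before the debit on purpose, so that a failing debit has to undo a change that was already made.

```python
import sqlite3

def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO account VALUES (?, ?)", [("A", 3000), ("B", 2000)]
        )
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True if the transfer committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failed debit must undo an applied change.
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit was rolled back

def balances(conn):
    """Return {account name: balance}."""
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = open_bank()
print(transfer(conn, "A", "B", 1000), balances(conn))
print(transfer(conn, "A", "B", 5000), balances(conn))
```

The second transfer asks for more than A holds, so the CHECK constraint rejects the debit and the whole transaction, including the credit to B, is rolled back.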

Page 5: Overview of NoSQL

What Are the Problems of RDBMS?

- Schemas aren't designed for sparse data
- Normalization creates a lot of tables
- Joins can be prohibitively expensive
- Most importantly, databases are simply not designed to be distributed

Page 6: Overview of NoSQL

An Example of a Distributed DB

- A banking system consists of four branches in four different cities
- Each branch maintains its accounts locally: Account = (account-number, branch, balance)
- One single site maintains information about the branches: Branch = (branch-name, city, assets)

Page 7: Overview of NoSQL

An Example of a Distributed DB

[Diagram: client, transaction coordinator, Bank A, Bank B. Transfer $1000 from A ($3000) to B ($2000).]

- Clients want all-or-nothing transactions
- The transfer either happens or does not happen at all

Page 8: Overview of NoSQL

An Example of a Distributed DB: Simple solution

[Diagram: messages between client, transaction coordinator, bank A, and bank B: start, A=A-1000, B=B+1000, done.]

What can go wrong?

- A does not have enough money
- B's account no longer exists
- B has crashed
- The coordinator crashes

Page 9: Overview of NoSQL

An Example of a Distributed DB: Two-phase Commit Protocol (2PC)

[Diagram: messages between client, transaction coordinator, bank A, and bank B: start, prepare (to A and to B), replies rA and rB, outcome (to A and to B), result. The diagram also carries a "Locked" label.]

- If rA == yes && rB == yes, outcome = "commit"; else outcome = "abort"
- B commits upon receiving "commit"
- Loss of availability and higher latency!
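The protocol can be sketched in a few lines of Python. This is a simplified, single-process simulation under stated assumptions: the `Participant` class, its `alive` flag, and the in-memory balances are hypothetical stand-ins for the banks, and real 2PC also needs timeouts, durable logs, and recovery, none of which is modelled here.

```python
class Participant:
    """A bank that can stage a balance change and later commit or abort it."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.pending = None  # staged change, held between prepare and outcome
        self.alive = True

    def prepare(self, delta):
        """Phase 1: vote yes only if the change could be applied."""
        if not self.alive or self.balance + delta < 0:
            return False
        self.pending = delta
        return True

    def finish(self, outcome):
        """Phase 2: apply the staged change on commit, discard it on abort."""
        if outcome == "commit" and self.pending is not None:
            self.balance += self.pending
        self.pending = None


def two_phase_commit(changes):
    """changes is a list of (participant, delta); returns 'commit' or 'abort'."""
    votes = [participant.prepare(delta) for participant, delta in changes]
    outcome = "commit" if all(votes) else "abort"
    for participant, _ in changes:
        participant.finish(outcome)
    return outcome


bank_a = Participant("A", 3000)
bank_b = Participant("B", 2000)
print(two_phase_commit([(bank_a, -1000), (bank_b, +1000)]), bank_a.balance, bank_b.balance)

bank_b.alive = False  # simulate "B has crashed"
print(two_phase_commit([(bank_a, -1000), (bank_b, +1000)]), bank_a.balance, bank_b.balance)
```

Every participant has to answer before anyone can commit, which is where the loss of availability and the extra latency come from.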

Page 10: Overview of NoSQL

Schemas vs. Schema-free

- Tables are used to represent real objects
- The join operation is expensive and difficult to execute in a horizontal scale-out

Name  | Surname | Home              | Mobile | Telephone | Office | Marital Status | -
Quang | Nguyen  | null              | 398    | null      | null   | null           | null
Cuong | Trinh   | Nguyen Dinh Chieu | 999    | 555       | null   | null           | null
-     | -       | -                 | -      | -         | -      | -              | -

user: { name: quang, surname: Nguyen, mobile: 398 }

user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }

Page 11: Overview of NoSQL

New Trends and Requirements

Page 12: Overview of NoSQL

Information amount is growing fast

- In 2010, the amount of information created and replicated exceeded one zettabyte (a trillion gigabytes) for the first time
- In 2011, it surpassed 1.8 zettabytes

Page 13: Overview of NoSQL

Google: BigTable

- Web indexing
- Google Earth
- YouTube
- Google Books
- Google Mail

High Scalability, High Availability

Page 14: Overview of NoSQL

Amazon: DynamoDB

- RDBMS doesn't fit the requirements
- Tens of thousands of servers around the world
- 10 million customers

High Reliability, High Availability

Page 15: Overview of NoSQL

Facebook: Cassandra, HBase

People

- More than 800 million active users
- More than 50% of active users log on to Facebook on any given day
- The average user has 130 friends

Activity

- More than 900 million objects that people interact with (pages, groups, events and community pages)
- On average, more than 250 million photos are uploaded per day
- The messaging system, including chat, wall posts, and email, handles 135+ billion messages per month

High Scalability, High Availability

Page 16: Overview of NoSQL

Twitter

High Availability

Page 17: Overview of NoSQL

CAP Theorem

It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

- Consistency: all nodes see the same data at the same time
- Availability: every request receives a response about whether it succeeded or failed
- Partition tolerance: the system continues to operate despite arbitrary message loss

You can choose only two. For the large web systems discussed here, the usual choice is availability over consistency.

Page 18: Overview of NoSQL

Consistency Level

- Strong (sequential): after an update completes, any subsequent access will return the updated value.
- Weak (weaker than sequential): the system does not guarantee that subsequent accesses will return the updated value.
- Eventual: all updates propagate to all of the replicas in a distributed system, but this may take some time. Eventually, all replicas will be consistent.
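A toy illustration of eventual consistency, assuming a single replicated value and a hypothetical `ReplicatedRegister` class (not a real library API). A write lands on one replica immediately and is queued for the others, so a read from another replica can return a stale value until propagation catches up.

```python
from collections import deque

class ReplicatedRegister:
    """One value stored on several replicas, with delayed propagation."""

    def __init__(self, replicas=3):
        self.values = [None] * replicas
        self.pending = deque()  # (replica index, value) updates not yet delivered

    def write(self, value):
        """Apply the write on replica 0 and queue it for the other replicas."""
        self.values[0] = value
        for index in range(1, len(self.values)):
            self.pending.append((index, value))

    def read(self, replica):
        """Return whatever this replica currently holds (possibly stale)."""
        return self.values[replica]

    def propagate(self, steps=None):
        """Deliver up to `steps` queued updates, or all of them if steps is None."""
        count = len(self.pending) if steps is None else min(steps, len(self.pending))
        for _ in range(count):
            index, value = self.pending.popleft()
            self.values[index] = value


register = ReplicatedRegister(replicas=3)
register.write("v1")
print([register.read(i) for i in range(3)])  # replicas 1 and 2 are still stale
register.propagate()
print([register.read(i) for i in range(3)])  # all replicas agree
```

Under strong consistency the first read from replica 1 or 2 would already have to return the new value; here that is only guaranteed after `propagate()` has drained the queue.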

Page 19: Overview of NoSQL

What is NoSQL

- Stands for Not Only SQL
- A class of non-relational data storage systems
- They usually do not require a fixed table schema, nor do they use the concept of joins
- Most NoSQL offerings relax one or more of the ACID properties

NoSQL !=

Page 20: Overview of NoSQL

What is NoSQL

Page 21: Overview of NoSQL

NoSQL Features

- Key/value stores, or "the big hash table"
  - Amazon S3 (Dynamo)
  - Memcached
- Schema-less stores, which come in multiple flavors
  - Document-based (MongoDB, CouchDB)
  - Column-based (Cassandra, HBase)
  - Graph-based (Neo4j)

Page 22: Overview of NoSQL

Key/Value

Advantages

- Very fast
- Very scalable
- Simple model
- Able to distribute horizontally

Disadvantages

- Many data structures (objects) can't be easily modeled as key/value pairs
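To make "simple model" and "able to distribute horizontally" concrete, here is a rough sketch of a key/value store whose keys are spread over several nodes by hashing. The `PartitionedKVStore` class is hypothetical, and the hash-modulo placement is one simple option chosen for illustration; production systems typically use schemes such as consistent hashing so that adding a node moves less data.

```python
import hashlib

class PartitionedKVStore:
    """A 'big hash table' split across several in-memory nodes."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        """Pick a node deterministically from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = PartitionedKVStore(node_count=4)
store.put("user:quang", {"surname": "Nguyen", "mobile": 398})
print(store.get("user:quang"))
print(store.get("user:unknown"))
```

The only operations are `put` and `get` by key, which is why the model is fast and easy to distribute, and also why richer structures and queries are hard to express in it.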

Page 23: Overview of NoSQL

Schema-less

Advantages

- The schema-less data model is richer than key/value pairs
- Eventual consistency
- Many are distributed
- Still provide excellent performance and scalability

Disadvantages

- No ACID transactions

Page 24: Overview of NoSQL

Memcached

Page 25: Overview of NoSQL


Page 26: Overview of NoSQL

Introduction to MongoDB

- MongoDB is a document-oriented database
- Key -> Document
- Structured document
- Schema-free

Key = quang
user: { name: quang, surname: Nguyen, mobile: 398 }

Key = cuong
user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }

Page 27: Overview of NoSQL

Introduction to MongoDB

Query = quang
Result count: 1
user: { name: quang, surname: Nguyen, mobile: 398 }

Query = cuong
Result count: 1
user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }
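The "key -> document" idea can be modelled with plain Python dictionaries. This is only an in-memory sketch, not MongoDB's API: the `find` helper and the `users` collection are made up for illustration. It shows why each query above returns a result count of 1: the whole document comes back as one unit, however many fields it has.

```python
# Each document is a free-form dict; the two users have different fields.
users = {
    "quang": {"name": "quang", "surname": "Nguyen", "mobile": 398},
    "cuong": {
        "name": "Cuong",
        "surname": "Trinh",
        "Home": "Nguyen Dinh Chieu",
        "mobile": 999,
        "Telephone": 555,
    },
}

def find(collection, **criteria):
    """Return every document whose fields match all the given criteria."""
    return [
        document
        for document in collection.values()
        if all(document.get(field) == value for field, value in criteria.items())
    ]

print(len(find(users, name="quang")))     # one document
print(find(users, mobile=999)[0]["Home"])  # fields that only some documents have
print(find(users, Home="nowhere"))         # no match
```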

Page 28: Overview of NoSQL

Features of MongoDB

- Indexing
- Stored JavaScript
- Aggregation
- File storage
- Makes scaling out easier
  - Scaling out vs. scaling up
  - Scaling out is done automatically, with data balanced across a cluster

Page 29: Overview of NoSQL

Some applications of MongoDB

- Large-scale applications
- Archiving and event logging
- Document and content management systems

Examples:

- foursquare uses MongoDB to store venues and user "check-ins" into venues, sharding the data over more than 25 machines on Amazon EC2
- Craigslist uses MongoDB to archive billions of records
- Disney built a common set of tools and APIs for all games within the Interactive Media Group, using MongoDB as a common object repository to persist state information

Page 30: Overview of NoSQL


Page 31: Overview of NoSQL

Introduction to Cassandra

- Column family: a logical division that associates similar data, e.g., a User column family or a Hotel column family.
- Row-oriented: each row doesn't need to have all the same columns as other rows like it (unlike in a relational model).
- Schema-free

Page 32: Overview of NoSQL

Introduction to Cassandra

Query = quang
Result count: 3
- (column=name, value=quang, timestamp=32345632)
- (column=surname, value=Nguyen, timestamp=12345678)
- (column=mobile, value=398, timestamp=33592839)

Query = cuong
Result count: 5
- (column=name, value=Cuong, timestamp=33434343)
- (column=surname, value=Trinh, timestamp=34568258)
- (column=Home, value=Nguyen Dinh Chieu, timestamp=54542368)
- (column=mobile, value=999, timestamp=23445486)
- (column=Telephone, value=555, timestamp=34314642)
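The column model can be sketched as a dictionary of rows, where each row maps column names to (value, timestamp) pairs. The `ColumnFamily` class below is a hypothetical in-memory model, not Cassandra's client API. It shows why the result counts are 3 and 5: a row lookup returns one entry per column. The rule that a newer timestamp wins when the same column is written twice is included as an assumption about how the timestamps are used.

```python
class ColumnFamily:
    """Rows of timestamped columns; different rows may have different columns."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, row_key, column, value, timestamp):
        """Store a column, keeping the version with the newest timestamp."""
        row = self.rows.setdefault(row_key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)

    def get(self, row_key):
        """Return the row as a list of (column, value, timestamp) triples."""
        row = self.rows.get(row_key, {})
        return [(column, value, ts) for column, (value, ts) in row.items()]


users = ColumnFamily()
users.insert("quang", "name", "quang", 32345632)
users.insert("quang", "surname", "Nguyen", 12345678)
users.insert("quang", "mobile", 398, 33592839)
users.insert("cuong", "name", "Cuong", 33434343)
users.insert("cuong", "surname", "Trinh", 34568258)
users.insert("cuong", "Home", "Nguyen Dinh Chieu", 54542368)
users.insert("cuong", "mobile", 999, 23445486)
users.insert("cuong", "Telephone", 555, 34314642)

print(len(users.get("quang")), len(users.get("cuong")))
```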

Page 33: Overview of NoSQL

Features of Cassandra

- Distributed and decentralized: in many other systems, some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves. In Cassandra every node is identical, so there is no single point of failure.
- High availability and fault tolerance: you can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood.
- Tunable consistency: it allows you to easily decide the level of consistency you require, in balance with the level of availability.
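One common way to reason about tunable consistency in replicated stores is the quorum rule: with N replicas, a write acknowledged by W of them and a read that consults R of them must overlap in at least one replica whenever R + W > N. The helper below is a worked form of that arithmetic; the function name and the example settings are illustrative assumptions, not Cassandra configuration.

```python
def read_overlaps_latest_write(n, r, w):
    """True if any R replicas read must include one of the W replicas written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n

# N = 3 replicas: compare a few read/write settings.
for r, w in [(1, 1), (2, 2), (1, 3), (3, 1)]:
    print(f"R={r} W={w} overlap={read_overlaps_latest_write(3, r, w)}")
```

Small R and W favor availability and latency, because fewer replicas have to respond; larger values favor consistency.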

Page 34: Overview of NoSQL

Features of Cassandra

- Elastic scalability: elastic scalability is a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down.

Page 35: Overview of NoSQL

Some Applications of Cassandra

- Large deployments
- Lots of writes, statistics, and analysis
- Geographical distribution

Examples:

- Facebook used Cassandra to power Inbox Search, with over 200 nodes deployed
- AppScale uses Cassandra as a back-end for Google App Engine applications
- Twitter announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time

Page 36: Overview of NoSQL


Page 37: Overview of NoSQL

Neo4j – Graph Database

- Data is stored as a graph/network
- Nodes and relationships with properties
- Schema-free

[Diagram: an example graph. The nodes are listed below; the relationship labels on the edges are KNOWS, OWNS, and WORKS.]

people: { name: quang, surname: Nguyen }
people: { name: Cuong, surname: Trinh, hobbies: uncountable }
people: { name: Thanh, surname: Nguyen }
Company: { name: TechMaster, area: IT Education, founded: 2011 }
Company: { name: Fami, area: Furniture }
Company: { name: Saltlux Vietnam, area: SearchEngine }

Page 38: Overview of NoSQL

Neo4j – Graph Database

Find all persons that KNOW a friend that KNOWS someone called "Larry Ellison":

SELECT ?person WHERE {
    ?person neo4j:KNOWS ?friend .
    ?friend neo4j:KNOWS ?foe .
    ?foe neo4j:name "Larry Ellison" .
}
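The same two-hop traversal can be written against a small in-memory graph in Python. This is a sketch of the idea only, not Neo4j's API; the node ids (`p1` ... `p4`) and the KNOWS edges are invented for the example.

```python
def knows_a_friend_of(graph, target_name):
    """Return ids of nodes that KNOW someone who KNOWS a node named target_name."""

    def knows(node_id):
        return graph[node_id].get("KNOWS", [])

    return {
        person
        for person in graph
        for friend in knows(person)
        for foe in knows(friend)
        if graph[foe]["name"] == target_name
    }

# Hypothetical graph: p4 -> p1 -> p2 -> p3 (each arrow is a KNOWS relationship).
graph = {
    "p1": {"name": "quang", "KNOWS": ["p2"]},
    "p2": {"name": "Cuong", "KNOWS": ["p3"]},
    "p3": {"name": "Larry Ellison", "KNOWS": []},
    "p4": {"name": "Thanh", "KNOWS": ["p1"]},
}

print(knows_a_friend_of(graph, "Larry Ellison"))
```

Only `p1` qualifies: `p2` knows Larry Ellison directly rather than through a friend, and `p4` is three hops away.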

Page 39: Overview of NoSQL

Features of Neo4j

- Disk-based
- Fully transactional like a real database (ACID is satisfied)
- Scale-up, massive scalability: Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine
- No sharding

Page 40: Overview of NoSQL

Some Applications of Neo4j

Ideal for any application that relies on the relationships between records:

- Social networks
- Recommendations

Page 41: Overview of NoSQL

MongoDB Sharding

Page 42: Overview of NoSQL

Some Considerations

- What if you want to store a large volume of data, or access it at a higher rate than a single server can handle?
- As more servers are added, what are the dependencies between servers?
- Can your application cope if one server, or a subset of servers, crashes?
- What if communication has problems?

Page 43: Overview of NoSQL

What is sharding

- Sharding is the method MongoDB uses to split a large collection across several servers (called a cluster)
- MongoDB does almost everything automatically; it lets your application grow easily, robustly, and naturally by:
  - Making the cluster "invisible"
  - Making the cluster always available for reads and writes
  - Letting the cluster grow easily

Page 44: Overview of NoSQL

A Shard

- A shard is one or more servers in a cluster that are responsible for some subset of the data
- A shard can consist of many servers. If there is more than one server in a shard, each server has an identical copy of the subset of the data

[Diagram: a shard holding the data "abc", drawn once as a single server and once as several servers that each hold a copy of "abc".]

Page 45: Overview of NoSQL

Distributing Data – One range per shard

- One range per shard
- Data movement issue

Before:
Shard 1: ["a", "f")
Shard 2: ["f", "n")
Shard 3: ["n", "t")
Shard 4: ["t", "{")

After moving ["c", "f") from Shard 1 to Shard 2:
Shard 1: ["a", "c")
Shard 2: ["c", "n")
Shard 3: ["n", "t")
Shard 4: ["t", "{")
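With one range per shard, routing a key to its shard is a lookup in a sorted list of lower bounds. The sketch below uses Python's `bisect` module; the `RangeRouter` class is hypothetical and only mirrors the ranges on the slide. For simplicity it checks the lower bound only and treats the last range as open-ended, rather than enforcing the "{" upper bound.

```python
import bisect

class RangeRouter:
    """Map a key to the shard whose [lower, next lower) range contains it."""

    def __init__(self, lower_bounds, shards):
        # lower_bounds[i] is the inclusive lower bound of shards[i]; must be sorted.
        self.lower_bounds = list(lower_bounds)
        self.shards = list(shards)

    def shard_for(self, key):
        index = bisect.bisect_right(self.lower_bounds, key) - 1
        if index < 0:
            raise KeyError(f"{key!r} is below the first range")
        return self.shards[index]


shards = ["Shard 1", "Shard 2", "Shard 3", "Shard 4"]
before = RangeRouter(["a", "f", "n", "t"], shards)
after = RangeRouter(["a", "c", "n", "t"], shards)  # ["c", "f") moved to Shard 2

for key in ["cat", "fox", "zebra"]:
    print(key, before.shard_for(key), after.shard_for(key))
```

Moving the boundary from "f" to "c" changes the home of every key in ["c", "f"), which is exactly the data that has to be copied from Shard 1 to Shard 2.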

Page 46: Overview of NoSQL

Distributing Data – One range per shard

Data has to be moved across the cluster.

[Figure: four shards holding 500 GB, 500 GB, 300 GB, and 300 GB are evened out to 400 GB each. Data is passed along from shard to shard in moves of 100 GB, 200 GB, and 100 GB: 400 GB of data movement in total.]

Page 47: Overview of NoSQL

Distributing Data – One range per shard

It's worse when a new shard is added.

[Figure: four shards holding 500 GB each, plus a new shard holding 0 GB, are evened out to 400 GB each. Data is passed along from shard to shard in moves of 100 GB, 200 GB, 300 GB, and 400 GB: 1 TB of data movement in total.]

Page 48: Overview of NoSQL

Distributing Data – Multi range shards

- Each shard can contain multiple ranges
- Each range of data is called a chunk

Before:
Shard 1: 500 GB, ["a", "f")
Shard 2: 500 GB, ["f", "n")
Shard 3: 300 GB, ["n", "t")
Shard 4: 300 GB, ["t", "{")

Chunks moved: 100 GB, ["d", "f") and 100 GB, ["j", "n")

After:
Shard 1: 400 GB, ["a", "d")
Shard 2: 400 GB, ["f", "j")
Shard 3: 400 GB, ["n", "t"); ["d", "f")
Shard 4: 400 GB, ["t", "{"); ["j", "n")
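The difference between the two layouts can be checked with a little arithmetic. In the sketch below (function names are illustrative), the one-range-per-shard case assumes data can only be handed to a neighbouring shard, so every boundary carries the running surplus of all shards before it; the multi-range case assumes any chunk can move directly to any shard, so only the surplus above the average has to move.

```python
def movement_one_range_per_shard(sizes):
    """Total GB moved when data can only shift between neighbouring shards."""
    target = sum(sizes) / len(sizes)
    moved = 0
    carry = 0  # surplus that must cross the boundary to the next shard
    for size in sizes[:-1]:
        carry += size - target
        moved += abs(carry)
    return moved

def movement_multi_range(sizes):
    """Total GB moved when any chunk can go directly to any shard."""
    target = sum(sizes) / len(sizes)
    return sum(size - target for size in sizes if size > target)

existing = [500, 500, 300, 300]       # sizes in GB, as in the figures
with_new_shard = [500, 500, 500, 500, 0]

print(movement_one_range_per_shard(existing), movement_multi_range(existing))
print(movement_one_range_per_shard(with_new_shard), movement_multi_range(with_new_shard))
```

For the shard sizes used on the slides this gives 400 GB versus 200 GB for the existing cluster. For the new-shard case it gives 1 TB with one range per shard; the 400 GB multi-range figure for that case is computed here and does not appear on the slides.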

Page 49: Overview of NoSQL

Sharding a collection

- A key (the shard key) is used for chunk ranges
- The shard key can be of any type; values of different types are ordered as: null < numbers < strings < objects < arrays < binary data < ObjectIds < boolean < dates < regular expression
- MongoDB first creates a (-∞, +∞) chunk for a collection
- If we add more data, MongoDB splits existing chunks to create new ones
- Every chunk range must be distinct and must not overlap with any other chunk range
- Data movement is resource-consuming, so a chunk is only 200 MB by default
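The splitting behaviour can be sketched as follows. This is a simplified in-memory model, not MongoDB's implementation: chunks are limited by a number of keys instead of a size in MB, the split point is a key near the middle of the chunk, and the class names are hypothetical. `None` stands for -∞ or +∞.

```python
import bisect

class Chunk:
    def __init__(self, low, high):
        self.low = low    # inclusive lower bound; None means -infinity
        self.high = high  # exclusive upper bound; None means +infinity
        self.keys = []    # sorted shard-key values stored in this chunk

    def contains(self, key):
        return ((self.low is None or self.low <= key)
                and (self.high is None or key < self.high))


class ShardedCollection:
    def __init__(self, max_keys_per_chunk=4):
        self.max_keys = max_keys_per_chunk
        self.chunks = [Chunk(None, None)]  # starts as one (-inf, +inf) chunk

    def insert(self, key):
        index = next(i for i, c in enumerate(self.chunks) if c.contains(key))
        chunk = self.chunks[index]
        bisect.insort(chunk.keys, key)
        if len(chunk.keys) > self.max_keys:
            self._split(index)

    def _split(self, index):
        chunk = self.chunks[index]
        middle = len(chunk.keys) // 2
        # The split key must be greater than the smallest key in the chunk.
        split_key = next((k for k in chunk.keys[middle:] if k > chunk.keys[0]), None)
        if split_key is None:
            return  # every key is identical: the chunk cannot be split
        left, right = Chunk(chunk.low, split_key), Chunk(split_key, chunk.high)
        cut = bisect.bisect_left(chunk.keys, split_key)
        left.keys, right.keys = chunk.keys[:cut], chunk.keys[cut:]
        self.chunks[index:index + 1] = [left, right]

    def ranges(self):
        return [(c.low, c.high) for c in self.chunks]


collection = ShardedCollection(max_keys_per_chunk=4)
for key in "abcdefghij":
    collection.insert(key)
print(collection.ranges())
```

The resulting ranges are contiguous and never overlap, and the first and last chunks stay open-ended, so every possible key still belongs to exactly one chunk.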

Page 50: Overview of NoSQL

Balancing

- MongoDB automatically moves chunks from one shard to another in order to keep the data evenly distributed while minimizing data movement
- Balancing is triggered only when a shard has at least 9 more chunks than the least populous shard
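A rough sketch of the threshold rule. The `balance` function is hypothetical: it takes chunk counts per shard and does nothing unless the fullest shard has at least `threshold` more chunks than the emptiest one. What happens after the threshold is reached is an assumption made for this example: chunks are moved one at a time from the fullest to the emptiest shard until the counts differ by at most one.

```python
def balance(chunk_counts, threshold=9):
    """Return (new counts, list of (from_shard, to_shard) moves)."""
    counts = dict(chunk_counts)
    moves = []
    fullest = max(counts, key=counts.get)
    emptiest = min(counts, key=counts.get)
    if counts[fullest] - counts[emptiest] < threshold:
        return counts, moves  # not imbalanced enough to justify moving data
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] <= 1:
            break
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((fullest, emptiest))
    return counts, moves


print(balance({"shard1": 10, "shard2": 3}))  # difference of 7: nothing moves
print(balance({"shard1": 12, "shard2": 2}))  # difference of 10: chunks migrate
```

The threshold is what keeps data movement low: small imbalances are tolerated instead of triggering a migration after every split.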

Page 51: Overview of NoSQL

Choose a Sharding Key

- Avoid a low-cardinality shard key
  - Example: a continent value of "Asia", "Australia", "Europe", "North America", or "South America"
  - MongoDB can't split these chunks any further! The chunks will just keep getting bigger and bigger.
- An ascending key does not work as well as we might expect
  - Example: using a timestamp as the shard key
  - Everything is added to the last chunk

Page 52: Overview of NoSQL

Choose a Sharding Key

- Random shard key: wastes the index, because random access has no data locality
- So, we want to choose a shard key with nice data locality, but not so local that we end up with a hot spot.
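The hot-spot effect of an ascending key can be shown by counting which chunk each new insert lands in. In this sketch the chunk split points, the timestamp range, and the use of a SHA-256 hash to produce "random" keys are all assumptions chosen for illustration.

```python
import bisect
import hashlib
from collections import Counter

def chunk_hits(keys, split_points):
    """Count inserts per chunk; chunk i covers [split_points[i-1], split_points[i])."""
    hits = Counter()
    for key in keys:
        hits[bisect.bisect_right(split_points, key)] += 1
    return dict(hits)

def hashed(key):
    """Turn a key into a pseudo-random hex string."""
    return hashlib.sha256(str(key).encode("utf-8")).hexdigest()

# Chunks were split at 250, 500, and 750 while older data was loaded;
# the new inserts carry later timestamps.
new_timestamps = range(1000, 2000)

ascending = chunk_hits(new_timestamps, [250, 500, 750])
scattered = chunk_hits([hashed(t) for t in new_timestamps], ["4", "8", "c"])

print("ascending key:", ascending)
print("random key:   ", scattered)
```

With the ascending key every insert goes to the last chunk, so one shard takes all the writes. The hashed key spreads writes across all chunks, but consecutive timestamps end up far apart, which is the loss of locality that makes a purely random key a poor fit for the index.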

Page 53: Overview of NoSQL

When to shard

In general, you should start with a nonsharded setup and convert it to a sharded one if and when you need to. Reasons to shard:

- You have run out of disk space on your current machine
- You want to write data faster than a single process can handle
- You want to keep a larger proportion of data in memory to improve performance

Page 54: Overview of NoSQL

Thank you!

Page 55: Overview of NoSQL
