no sql introduction_v1.1.1

Technical overview of cloud storage

NoSQL: Not Only SQL

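Disclaimer:
1. This document is intended only for personal study and exchange. Corrections of any errors or omissions are welcome.
2. Most of the content was completed while I was working at Nokia Siemens Networks (诺西), but it does not involve any Nokia Siemens Networks products or technologies. My thanks to the company.
3. Much of the content in this document comes from the Internet. If it infringes any copyright, please notify me: [email protected]; @胖悟空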

Agenda
• Background
  – What's NoSQL
  – Why NoSQL
• How to make a selection of NoSQL
  – Data type
  – Data model
  – Architecture
  – Key technologies
• Summary

What is NoSQL

Definition
• NoSQL, sometimes expanded to "not only SQL", is a broad class of database management systems that differ from classic relational database management systems (RDBMSs).
• These data stores may NOT require FIXED table schemas, usually avoid join operations, and typically scale horizontally.
• Academia typically refers to these databases as structured storage, a term that would include classic relational databases as a subset.

Refer to the Wikipedia page: http://en.wikipedia.org/wiki/NoSQL


SQL vs. NoSQL

NoSQL is not good at everything, neither is SQL.

Aspect | SQL | NoSQL
Transactional semantics | ACID | Restricted ACID
Query model | Complex & functionality | Simple & app oriented
Data model | Relational & row storage | Key-value, column oriented, document oriented & graph
Schemas | Fixed | Schema free / schema-less
Data storage | Limited & costly | Horizontal scalability & massive
Hardware | Reliable & expensive | Commodity & inexpensive
Failure tolerance | Slow failure recovery | Native & fast recovery

Why NoSQL?

NoSQL is notable among Internet players and apps, as some of their requirements could not be met by RDBMSs.

Where the requirements come from: fast growth & development
• Increasing number of servers
  – Scale out
  – Inexpensive & unreliable servers
• Increasing data volume
  – Big Data
  – Scalability
• Increasing user number
  – High throughput
  – High workload
All about INCREASING

Where the requirements come from: different applications & ecosystem
• Rapid change
  – Always beta
  – Flexible data schema
• Abundant web applications
  – Complex data
  – Larger record size
  – Typically more reads than writes
  – Low transaction and consistency requirements
• Online services
  – Failure tolerance
  – Fast recovery
  – High availability

How to select a NoSQL system?

What kinds of data can I store?

Data type classification: what kind of data should be stored?
• Structured
• Unstructured
• Semi-structured

Structured data
• Entities that belong to the same class have the same attributes in the same order
• The data structure is predefined and cannot be changed

Unstructured data
• Does not have a predefined data model
• And/or does not fit well into relational tables

Semi-structured data
• Is a form of structured data
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
• Entities that belong to the same class may have different attributes even though they are grouped together, and the attribute order is not important
• Is also known as schema-less or self-describing structure

[Figure: systems positioned by how they store and query each kind of data: Dynamo, Voldemort, Tokyo Cabinet, Redis, Berkeley DB, MemcacheDB, MySQL, Oracle, MongoDB, Cassandra, HBase, CouchDB, Hypertable, BigTable]

• STRUCTURED, e.g. CRM, ERP
• SEMI-STRUCTURED, e.g. logs, mails, web pages, blogs
• UNSTRUCTURED, e.g. documents, videos, audio, images

Summary

[Figure: radar chart comparing unstructured, structured, and semi-structured data on five axes: flexibility, record size, efficiency, transactional support, and scalability]

How can I express my business model?

Data model classification
• Key-Value pair based
• Column Oriented store
• Document Oriented store
• Graph database

Key-Value pair based

• Simple reads and writes: each data item is uniquely identified by a key.
• Key-value stores allow the application to store its data in a schema-less way. The data can be stored as a data type of a programming language or as an object.
• A key identifies a unique value.
• Anything can be stored in a value: an image, a document, even a complex data structure (array, list, …).

Advantages
• Efficiency
• Easy to use
• Flexible data storage

Disadvantages
• Simple query model

Many cloud-based databases can be classified as key-value stores, including most column-oriented databases.
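A minimal sketch of the key-value model described above, assuming a plain in-memory dictionary as the storage engine. The class name KeyValueStore and the sample keys are illustrative only and do not come from any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store: values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
# The store does not care what a value is: a nested structure or raw bytes.
store.put("user:42", {"name": "joe", "tags": ["a", "b"]})
store.put("logo.png", b"\x89PNG...")
```

The only access path is the key, which is what makes the model efficient and easy to use, and also why the query model is so simple.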

Column Oriented store: a simple comparison, column store vs. row store

Example table:

Name | Language | Notes
Neo4j | Java | High-performance, scalable, distributed graph database
OrientDB | Java |
FlockDB | Scala |
Sones GraphDB | C# | Graph database with query language called GraphQL

[Figure: the example table laid out as a row store and as a column store, with two sample queries (Query 1, Query 2). In the row store, empty cells are stored; in the column store, null is free.]

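Column Oriented store: BigTable data model

[Figure: a BigTable row with row key "com.cnn.www". The "Content:" column family holds versioned page contents ("<HTML>…"); the "Anchor:" column family holds the columns "Anchor: cnnsi.com" ("CNN") and "Anchor: my.look.ca" ("CNN.COM"). Cells are versioned by timestamps (t3, t5, t6, t7, t8, t9).]

• Cell contents are addressed by (row, column, timestamp).
• Row keys are sorted, so pages from the same domain (e.g. "com.cnn.www" and "com.cnn.www/index.htm") are stored near each other.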

Column Oriented store: one-to-many relationship

RDBMS model (two tables in a 1 to 0…n relationship, combined with a JOIN; vertical extension):

Row Key | Content
com.cnn.www | <HTML>…
… | …

Row Key | Anchor | Reference text
com.cnn.www | cnnsi.com | CNN
com.cnn.www | my.look.ca | CNN.COM
com.cnn.www | … | …

BigTable model (one row with column families content and anchor; horizontal extension):

Row Key | content: | anchor:cnnsi.com | anchor:my.look.ca | anchor:…
com.cnn.www | <HTML>… | CNN | CNN.COM | …

Column Oriented store: BigTable-like data model

• Stores content by column rather than by row.
• A key identifies a row, which contains data stored in one or more Column Families (CF).
• Within a CF, each row can contain multiple columns.
• Columns can be added dynamically.
• Distributed multi-dimensional sparse map: (row, column, timestamp) → cell contents

Advantages
– Versioned
– Query oriented
– Good for OLAP applications
– Null is free
– Compression efficient
– Dynamic columns

Disadvantages
– Reading an entire row is not efficient
– Contains tags or other markers to separate semantic elements
– Not well-suited for OLTP-like workloads
– Simple query model
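The (row, column, timestamp) → cell contents map can be sketched with nested dictionaries. This is only an illustration of the data model, not how BigTable or HBase are implemented; the class SparseTable, the timestamps, and the cell values are assumptions made for the example.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return one version of a cell; the newest when no timestamp is given."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None  # absent cells occupy no space ("null is free")
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, prefix):
        """Row keys are kept in sorted order, so a prefix scan is a contiguous range."""
        return [row for row in sorted(self._rows) if row.startswith(prefix)]


table = SparseTable()
table.put("com.cnn.www", "content:", 5, "<HTML>v1")
table.put("com.cnn.www", "content:", 6, "<HTML>v2")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("com.cnn.www/index.htm", "content:", 7, "<HTML>index")
table.put("org.example", "content:", 3, "<HTML>other")
```

A column such as "anchor:cnnsi.com" is created simply by writing to it, which is the dynamic-column behaviour listed above.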




Document Oriented store

• The idea is to replace the concept of a "row" with a more flexible model, the "document". By allowing embedded documents and arrays, the document-oriented approach makes it possible to represent complex hierarchical relationships with a single record.
• Documents share some information and differ in other information.
• Documents are usually stored in a JSON or JSON-like format.

Advantages
– Rich RDBMS-like functions
– Freedom in modeling documents

Disadvantages
– Query logic is complex
– Documents are limited in size


Document Oriented store: examples

RDBMS model (two tables in a 1 to 0…n relationship):

Row Key | Content
com.cnn.www | <HTML>…
… | …

Row Key | Anchor | Reference text
com.cnn.www | cnnsi.com | CNN
com.cnn.www | my.look.ca | CNN.COM
com.cnn.www | … | …

Document 1
{
  "Rowkey": "com.cnn.www",
  "content": "<HTML>…",
  "Anchor": {
    "cnnsi.com": "CNN",
    "my.look.ca": "CNN.COM"
  }
}

// Rowkey == "com.cnn.www"
find({"Rowkey" : "com.cnn.www"})
// 20 < age < 30
find({"age" : {"$lt" : 30, "$gt" : 20}})
// id_num % 5 == 1
find({"id_num" : {"$mod" : [5, 1]}})
// id_num % 5 != 1
find({"id_num" : {"$not" : {"$mod" : [5, 1]}}})
// regular expression: name matches "joe", case insensitive
find({"name" : /joe/i})

Graph database

TBD

Summary

Aspect | Key-Value | Column oriented | Document oriented | Graph
Schema | Schema-less | Dynamic columns | Complex and hierarchical data model, JSON-like format | Graph
Query model | Key-value pair | Key-value | Rich and complex |
Data type | Unstructured | Semi-structured | Semi-structured |
Advantage | Efficiency, easy to use | Query oriented, null is free | Functionality and freedom in modeling |
Disadvantage | Simple | Simple query model | Complex |

How can I deploy and administer the system?

Architecture classification
• Master-Slave architecture
• P2P architecture
• Hierarchy architecture

Master-Slave architecture: an example, HBase architecture

[Figure: HBase architecture with an HMaster, three Region Servers, ZooKeeper, and HDFS, showing control flow and data flow]

• One master and many slaves
• The master manages metadata, is in charge of all slaves, dispatches tasks, does load balancing, and so on
• Slaves report status to the master and take over the real data management
• Usually the data flow and the control flow are separated
• Typically uses a global storage system (e.g. a DFS) for data durability and fast recovery
• Some systems add a distributed coordination mechanism for master election, configuration maintenance, failure detection, and synchronization

Master-Slave architecture

Master-slave is a model of communication where one device or process has unidirectional control over one or more other devices. In some systems a master is elected from a group of eligible devices, with the other devices acting in the role of slaves.

Advantages
– Clear architecture
– Easy to provide strong consistency
– Easy to manage
– Easy to scale

Disadvantages
– Risk of a single point of failure
– Hotspot problems


P2P architecture: an example, Cassandra

[Figure: a ring of numbered nodes, with a client connecting to one of them]

• Peers are equally privileged
• The number of node replicas is set as a factor
• Gossip protocol for failure detection and cluster maintenance (nodes joining and leaving)
• Every member acts as a proxy for one-hop routing

P2P architecture

• Peer-to-peer (P2P) computing or networking is a distributed application architecture.
• Peers are equally privileged, equipotent participants in the application.
• Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts.
• Usually used in conjunction with consistent hashing.

Advantages
– High availability
– Efficient for random reads/writes
– Natural data distribution
– Usually one-hop lookup
– Minimal administration

Disadvantages
– Weak global status
– More network communication to maintain the cluster (log(n))

Hierarchy architecture: an example, MongoDB architecture

[Figure: clients connect to mongos routing servers; three config servers (config server 1, 2, 3) hold the metadata; data is split across shard1, shard2, and shard3, each a replica set made of a mongod primary, a mongod secondary, and a mongod arbiter]

• Clients send queries to mongos servers.
• The mongos servers act as routing servers; queries are automatically routed to the appropriate shard.
• Each shard consists of multiple replicated servers to ensure availability and automated failover. The set of servers within a shard forms a replica set.
• The config servers store the cluster's metadata. Each config server has a complete copy of all metadata; if the metadata changes, it is sent to the mongos servers so they can update their routing information.

Hierarchy architecture: an example, MongoDB architecture (2)

[Figure: four layers, from client to routing servers to metadata storage to data storage, with several routing servers and several metadata storage servers]

• Distinct hierarchy dependency between the layers.
• Routing servers are scalable and store nothing. They can be deployed up with the client/app, or down with the data storage.
• Metadata storage is not a single point of failure: two-phase commit is used, and the responsibilities of the metadata servers decrease.
• The data storage layer is grouped into replica sets (mongod primary, mongod secondary, mongod arbiter), which act not only as data servers but also as the data and service availability mechanism.

Hierarchy architecture

• Distinct hierarchy dependency
• Especially with a routing layer
• Less responsibility for the client
• No clear separation of data flow and control flow

Advantages
– High availability
– No single point of failure
– Each layer scales on its own
– Flexible routing layer

Disadvantages
– Lower efficiency
– Complex administration

Summary

[Figure: radar chart comparing the Master-Slave, P2P, and Hierarchy architectures on availability, scalability, efficiency, functionality, and administration]

Summary: failover

• Master-slave architecture
  – Master fails → master election
  – Slave fails → reassignment by the master
• P2P architecture
  – Replica factor
  – Hinted handoff
• Hierarchy architecture
  – Master election & hinted handoff
  – Multi-routing process

What about the performance of the system? What about the key features of the system?

Key features classification
• CAP classification
• Consistency mechanism
• Availability mechanism
• Partitioning & scalability mechanism
• Data durability mechanism

CAP classification

• Consistency: all nodes see the same data at the same time
• Availability: a guarantee that every request receives a response about whether it was successful or failed
• Partition tolerance: the system continues to operate despite arbitrary message loss







All about redundancy: where do the problems come from?

Redundancy is everywhere in distributed systems, especially with commodity hardware. It raises questions of:
• Consistency
• Availability
• Partitioning
• Reliability
• Concurrency
• Throughput
• …

[Figure: several service instances, each receiving requests and backed by its own data storage]

Consistency mechanism

Mechanism | Consistency provided
Two-phase commit | Strong consistency
Master-slave | Eventual consistency or strong consistency
Quorum | Eventual consistency or strong consistency
Paxos | Strong consistency

• Consistency trades off against performance and availability.
• Master-slave architecture systems (such as HBase, BigTable) accept lower availability in exchange for strong consistency.
• Hierarchy & P2P systems can choose strong consistency at the expense of read performance.

Two-phase commit: an example, GFS lease implementation

• The commit-request phase: the client pushes all data to the replicas (step 3) and sends a commit request to the primary replica (step 4).

• The commit phase: the primary replica asks replica A and replica B to commit the data (step 5); replica A and replica B respond "yes" (step 6); the commit is successful (step 7).
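The two phases can be sketched generically as follows. This is a textbook-style illustration, not the GFS implementation; the Participant class and the two_phase_commit function are hypothetical names introduced for the example.

```python
class Participant:
    """A replica that can stage data (phase 1) and then commit or abort (phase 2)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.staged = None
        self.committed = None

    def prepare(self, data):
        """Commit-request phase: stage the data and vote yes or no."""
        if self.can_commit:
            self.staged = data
        return self.can_commit

    def commit(self):
        """Commit phase: make the staged data visible."""
        self.committed, self.staged = self.staged, None

    def abort(self):
        """Discard anything staged in phase 1."""
        self.staged = None


def two_phase_commit(participants, data):
    """Commit on all participants only if every one of them votes yes."""
    votes = [p.prepare(data) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the whole write, which is why this mechanism gives strong consistency at a cost in availability.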

Master-slave: an example, MongoDB replica sets

[Figure: two configurations of a master with two replicas kept in sync with the master]

Configuration 1: reads served by the master and the replicas
• The master can be read and written
• Replicas/slaves are read only
• Eventual consistency, but higher performance and availability

Configuration 2: reads served by the master only
• Only the master can be read and written
• Replicas/slaves are used only for backup
• Strong consistency

Quorum

• Configurable consistency
• Usually combined with anti-entropy using Merkle trees for replica synchronization, and with read repair to keep replicas consistent
• (N, R, W): a tradeoff between consistency and performance
  – N: number of replicas
  – R: minimum number of successful reads
  – W: minimum number of successful writes
  – Typical configuration: R(2) + W(2) > N(3)
  – R + W > N yields a quorum-like system and ensures an application can always read the newest data
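The R + W > N rule can be checked by brute force: if every possible read set shares at least one replica with every possible write set, a read always reaches a replica that holds the latest write. The function names below are illustrative.

```python
from itertools import combinations


def is_quorum(n, r, w):
    """The rule from the slide: read and write sets must overlap."""
    return r + w > n


def always_overlaps(n, r, w):
    """Brute-force check that every R-subset intersects every W-subset of N replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

With the typical configuration (N=3, R=2, W=2) both functions agree that reads and writes overlap; with R=1 and W=2 a read can land on the one replica the write skipped.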

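Quorum: an example, Cassandra read repair

[Figure: a client sends a query to the Cassandra cluster. The closest replica (Replica A) returns the full result; Replica B and Replica C receive a digest query and return digest responses. The result is returned to the client, and read repair is performed if the digests differ.]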
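A simplified sketch of the read-repair flow, assuming each replica is a dictionary that maps a key to a (timestamp, value) pair and that the newest timestamp wins. The function names, the SHA-256 digest, and the conflict rule are assumptions for illustration, not Cassandra's actual implementation.

```python
import hashlib


def digest(entry):
    """Short fingerprint of a replica's entry, cheaper to ship than the value."""
    return hashlib.sha256(repr(entry).encode()).hexdigest()


def read_with_repair(replicas, key):
    """Full read from the first (closest) replica, digest reads from the others."""
    result = replicas[0].get(key)
    digests = [digest(replica.get(key)) for replica in replicas[1:]]
    if all(d == digest(result) for d in digests):
        return result[1] if result else None
    # Digests differ: fetch full versions, keep the newest, and push it back.
    versions = [replica[key] for replica in replicas if key in replica]
    newest = max(versions, key=lambda entry: entry[0])
    for replica in replicas:
        replica[key] = newest
    return newest[1]
```

After a repairing read, every replica holds the newest version, so the next read of the same key finds matching digests.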

Availability mechanism

• Routing mechanism
  – Typically used in hierarchy architectures
  – See the MongoDB mongos implementation, which hides back-end server changes
• Failure detection
  – Distributed coordination: usually used in master-slave architectures, such as ZooKeeper in HBase and Chubby in BigTable
  – Gossip protocol: usually used in P2P architectures, e.g. Dynamo & Cassandra
• Master election
• Hinted handoff

Availability mechanism: master election

• Is used for failover
• Applies when a cluster consists of a group of n nodes and one of them acts as the master/primary node
• If that node fails, the cluster elects a new master/primary node

MongoDB replica set
• Each data node can become primary
• The remaining nodes act as secondary data nodes or as an arbiter

[Figure: a MongoDB replica set (mongod primary, mongod secondary, mongod arbiter). After a mongod goes down, the remaining members (primary, recovering, secondary) negotiate a new master.]

HBase master election

[Figure: ZooKeeper, HMaster, secondary HMaster, and Region Server]

• ZooKeeper acts as an arbiter and keeps a "token" for the HBase master. The node that gets the "token" acts as master.

• If the HMaster fails, the "token" that it took from ZooKeeper is released, and the secondary HMaster takes over as HMaster.

• Then ZooKeeper sends the change to every node in the cluster.
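The token idea can be sketched with a small in-process stand-in for the coordination service. The Coordinator class below is purely hypothetical: it does not use or imitate the real ZooKeeper API, and failure detection is reduced to an explicit release call.

```python
class Coordinator:
    """Stand-in for a coordination service that hands out one master token."""

    def __init__(self):
        self.holder = None
        self.watchers = []

    def watch(self, callback):
        """Register a callback that is told about every new master."""
        self.watchers.append(callback)

    def try_acquire(self, node):
        """Grant the token if it is free; report whether `node` is now master."""
        if self.holder is None:
            self.holder = node
            for callback in self.watchers:
                callback(node)
        return self.holder == node

    def release(self, node):
        """Called when the current holder fails or its session ends."""
        if self.holder == node:
            self.holder = None


coordinator = Coordinator()
announcements = []
coordinator.watch(announcements.append)

first = coordinator.try_acquire("hmaster-1")   # becomes master
second = coordinator.try_acquire("hmaster-2")  # stays on standby
coordinator.release("hmaster-1")               # master fails, token released
takeover = coordinator.try_acquire("hmaster-2")
```

The watcher list plays the role of the notification step: every registered node learns which master currently holds the token.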

Hinted handoff: for temporary failures

[Figure: a ring of nodes A to G with a key placed at Hash(k); a failed node is marked ×]

• Writes are performed on the first N healthy nodes found by the coordinator.
• If a node is down, data will be sent to the next node in the ring.
• This node will keep track of the intended recipient and send the data later.
• Replicas are stored at multiple data centers to handle the failure of a whole data center.
• So-called "always writeable" in Cassandra
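A minimal sketch of hinted handoff on a ring represented as a list, under simplifying assumptions: the Node class, write, and replay_hints are hypothetical names, node health is a plain flag, and the stand-in node keeps its own copy of the data as well as the hint.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (intended recipient, key, value)


def write(ring, start, n, key, value):
    """Write to the first n healthy nodes found walking the ring from `start`."""
    intended = [ring[(start + i) % len(ring)] for i in range(n)]
    missed = [node for node in intended if not node.up]
    written = 0
    step = 0
    while written < n and step < len(ring):
        node = ring[(start + step) % len(ring)]
        step += 1
        if not node.up:
            continue
        node.data[key] = value
        written += 1
        if node not in intended and missed:
            # Stand-in replica: remember who should have received this write.
            node.hints.append((missed.pop(0), key, value))
    return written


def replay_hints(node):
    """Deliver stored hints to recipients that are back up; keep the rest."""
    remaining = []
    for target, key, value in node.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    node.hints = remaining
```

For example, with nodes A to E and B down, a write for replicas A, B, C lands on A, C, and D; D holds a hint for B and hands the value over once B is reachable again.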

Data partitioning & scalability mechanism: hierarchical structure

• Multi-level hierarchical organization
  – 3 levels in BigTable, HBase and Hypertable (root → meta → user)
  – 2 levels in MongoDB (meta → user)
• Key range split/auto sharding for data partitioning

Advantages
– Automatic balancing for changes in data distribution
– High performance in range queries
– Nearly unlimited data storage

Disadvantages
– Sequential writes are not efficient
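Key range splitting can be sketched with sorted shards that split in half when they grow past a threshold. The class name RangePartitionedTable and the threshold are assumptions for the example; real systems split on size in bytes and move shards between servers.

```python
import bisect


class RangePartitionedTable:
    """Toy key-range partitioning: shards hold contiguous, sorted key ranges."""

    def __init__(self, max_keys_per_shard=4):
        self.max_keys = max_keys_per_shard
        self.shards = [[]]      # each shard is a sorted list of keys
        self.split_points = []  # split_points[i] is the smallest key of shards[i + 1]

    def _shard_index(self, key):
        return bisect.bisect_right(self.split_points, key)

    def insert(self, key):
        i = self._shard_index(key)
        shard = self.shards[i]
        bisect.insort(shard, key)
        if len(shard) > self.max_keys:
            # Split the overfull shard in the middle of its key range.
            mid = len(shard) // 2
            self.shards[i:i + 1] = [shard[:mid], shard[mid:]]
            self.split_points.insert(i, shard[mid])

    def range_query(self, low, high):
        """Keys with low <= key < high, touching only the shards that can hold them."""
        result = []
        for i in range(self._shard_index(low), self._shard_index(high) + 1):
            result.extend(k for k in self.shards[i] if low <= k < high)
        return result


table = RangePartitionedTable(max_keys_per_shard=2)
for key in ["a", "b", "c", "d", "e"]:
    table.insert(key)
```

Inserting keys in increasing order always hits the last shard, which illustrates why sequential writes are listed as a weakness, while a range query only visits neighbouring shards.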

Scalability mechanism: consistent hash

[Figure: a hash ring from 0 to 1 (midpoint 1/2) with nodes A to F; keys are placed at h(key1) and h(key2) and replicated on N=3 nodes]

Advantages
– Natural balancing for data partitioning & distribution
– High performance in random operations

Disadvantages
– Non-uniform data/load distribution
– Disregards the heterogeneity of node performance
– Data must be moved when a node joins or leaves
– Not good for sequential operations and range queries
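A minimal consistent-hash ring, assuming MD5 as the hash function and one point per node (no virtual nodes). The class ConsistentHashRing and its method names are illustrative and not taken from Dynamo or Cassandra.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._points = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._points, (_hash(node), node))

    def remove(self, node):
        self._points.remove((_hash(node), node))

    def replicas(self, key, n=3):
        """The first n nodes found clockwise from h(key)."""
        if not self._points:
            return []
        start = bisect.bisect_left(self._points, (_hash(key), ""))
        count = min(n, len(self._points))
        return [self._points[(start + i) % len(self._points)][1] for i in range(count)]
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is the property that limits data movement when nodes join or leave.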

Data durability mechanism

• Write-ahead log (WAL)
  – A family of techniques for providing atomicity and durability (two of the ACID properties) in database systems.
  – In a system using WAL, all modifications are written to a log before they are applied.
  – Usually both redo and undo information is stored in the log.
• Data replicas
  – DFS (HBase, Hypertable, BigTable)
  – Embedded redundancy (Cassandra, MongoDB)


Data durability mechanism: an example, HBase WAL

• Log flushing: data streams are written to a file system.
• Log rolling: check database persistence against the logs, then remove all logs written before the last database persistence operation.
• Log replaying: replaying a log is done by reading the log, adding its entries to the database, and then flushing the data to disk. It can be used for fault recovery.
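Log flushing and log replaying can be sketched with a JSON-lines file. This is a generic write-ahead-log illustration, not HBase's WAL format; the WalStore class and the file name are assumptions, and log rolling is left out to keep the example short.

```python
import json
import os
import tempfile


class WalStore:
    """Toy store that logs every write before applying it in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Log replaying: rebuild the in-memory state from the log."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # Log flushing: the entry reaches the file system before it is applied.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalStore(log_path)
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)

# Simulate a restart: a fresh instance recovers its state from the log alone.
recovered = WalStore(log_path)
```

Because the log is written first, a crash between the log write and the in-memory update loses nothing: the entry is applied again during replay.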

Summary

• Consistency: two-phase commit, master-slave, quorum
• Availability: routing mechanism, failure detection, master election, hinted handoff, replica set/group
• Data partitioning: table split/auto sharding, consistent hash
• Data durability: DFS, data redundancy
• Scalability: hierarchical structure, consistent hash
• Failover: reassign, master election, multi-routing process, hinted handoff, replica factor
