Distributed Database Architecture


Page 1: Distributed Database Architecture - UNIMIB

Distributed Database Architecture

Page 2: Distributed Database Architecture - UNIMIB

Outline

• Data distribution

• Data replication

Page 3: Distributed Database Architecture - UNIMIB

Distributed data: summary (@source IBM)

[Diagram: six configurations of applications, DBMSs and databases]

• Data move: NO

– Basic (single db): one application connected to one DBMS/DB

– Distributed Access: one application connected to several DBMSs/DBs

– Federation: applications access several DBMSs through a federation server (Fed Srv)

• Data move: YES

– Replication: a replication server (Repl Srv) copies data between DBMSs

– Event Publishing: an event publishing server (EP Srv) publishes database changes to applications

– Extract, Transform & Load: an ETL server loads data from source DBMSs into a data warehouse (DW)

Page 4: Distributed Database Architecture - UNIMIB

Data distribution

Page 5: Distributed Database Architecture - UNIMIB

Type of architecture

• Shared everything

• Shared disk

• Shared nothing

Page 6: Distributed Database Architecture - UNIMIB

Shared everything

[Diagram: a mainframe hosting presentation logic, business logic and the database, accessed by dumb terminals; alternatively, a database server hosting presentation logic, business logic and the database, accessed by dumb terminals]

Page 7: Distributed Database Architecture - UNIMIB

Shared everything

[Diagram: web browsers running the presentation logic (JavaScript) connect to application servers hosting business and presentation logic, which access a single database server]

Page 8: Distributed Database Architecture - UNIMIB

Shared disk

Page 9: Distributed Database Architecture - UNIMIB

• The solution adopted by NoSQL database architectures to support scale-out

Shared nothing

Page 10: Distributed Database Architecture - UNIMIB

• http://www.mullinsconsulting.com/db2arch-sd-sn.html

Evaluation

Page 11: Distributed Database Architecture - UNIMIB

Scalability or availability?

• What is high availability? It is a mix of

– architecture design

– people!

– process

– technology

• What is NOT high availability

– A pure technology solution

– The same thing as scalability or manageability

Page 12: Distributed Database Architecture - UNIMIB

How many 9s?

Availability   Downtime (in one year)

100%           never

99.999%        < 5.26 minutes

99.99%         5.26 – 52 minutes

99.9%          52 minutes – 8 hours and 45 minutes

99%            8 hours and 45 minutes – 87 hours and 36 minutes

90%            788 hours and 24 minutes – 875 hours and 54 minutes
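The downtime bounds follow from simple arithmetic on the availability percentage; a minimal Python sketch (the helper name is ours, not from the slides):

# Converts an availability percentage into the maximum downtime over one year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60

for a in (99.999, 99.99, 99.9, 99.0, 90.0):
    print(f"{a}% -> {max_downtime_minutes(a):.1f} minutes/year")
# 99.999% -> 5.3, 99.99% -> 52.6, 99.9% -> 525.6, 99% -> 5256.0, 90% -> 52560.0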

Page 13: Distributed Database Architecture - UNIMIB

Replication

Page 14: Distributed Database Architecture - UNIMIB

System log

• A log is a sequential file stored in stable memory (that is, a "conceptual" storage that will never fail)

• It stores all the activities performed by all transactions, in chronological order

• Two types of records are stored:

– Transaction log records

• Operations on tables

– System events

• Checkpoint

• Dump

Page 15: Distributed Database Architecture - UNIMIB

Transaction log

• Possible operations in a transaction

– begin, B(T)

– insert, I(T,O,AS)

– delete, D(T,O,BS)

– update, U(T,O,BS,AS)

– commit, C(T), or abort, A(T)

• The log record format depends on the specific relational operation

• Legend: O = object, AS = After State, BS = Before State
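A minimal Python sketch (not from the slides) of how such records could be appended to a sequential log file; the JSON encoding and file name are illustrative assumptions:

import json

LOG_FILE = "system.log"  # assumed file name

def write_record(record: dict) -> None:
    """Append one log record as a line of JSON to the sequential log file."""
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(record) + "\n")

# A transaction T1 that inserts and then updates an object O1:
write_record({"type": "B", "tx": "T1"})                                # begin, B(T1)
write_record({"type": "I", "tx": "T1", "obj": "O1", "AS": {"x": 1}})   # insert, I(T1,O1,AS)
write_record({"type": "U", "tx": "T1", "obj": "O1",
              "BS": {"x": 1}, "AS": {"x": 2}})                         # update, U(T1,O1,BS,AS)
write_record({"type": "C", "tx": "T1"})                                # commit, C(T1)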

Page 16: Distributed Database Architecture - UNIMIB

A log example

[Timeline: interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…), U(T1,…) for three transactions]

Page 17: Distributed Database Architecture - UNIMIB

Checkpoint

• A checkpoint records the set of transactions T1, …, Tn that are running at a given point in time
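A minimal sketch of what a checkpoint record could contain, continuing the illustrative log format above (not from the slides):

import json, time

def write_checkpoint(log_path: str, running_transactions: list) -> None:
    """Append a CK record listing the transactions active at this point in time."""
    record = {"type": "CK",
              "active": running_transactions,   # e.g. ["T1", "T3"]
              "ts": time.time()}
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

write_checkpoint("system.log", ["T1", "T3"])  # T2 has already committed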

Page 18: Distributed Database Architecture - UNIMIB

Example of transaction log and checkpoint

[Timeline: a dump record, a checkpoint (CK), and interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…) for transactions T1, T2, T3]

At the time considered: T2 is committed, T1 is uncommitted, T3 has not started yet

Page 19: Distributed Database Architecture - UNIMIB

Dump

• A dump is a full copy of the entire state of a DB in stable memory

• It is executed offline

• It generates a backup

• After the backup is completed, a dump record is written to the log

Page 20: Distributed Database Architecture - UNIMIB

Log example

[Timeline: a dump record, a checkpoint (CK), and interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…) for transactions T1, T2, T3]

Page 21: Distributed Database Architecture - UNIMIB

Replica architecture (@source IBM)

[Diagram: replication topologies built from source, staging and target servers]

• Data Distribution (1:many): one source, many targets

• Data Consolidation (many:1): many sources, one target

• Multi-Tier Staging: a source feeds staging servers, which feed the targets

• Bi-directional: a primary and a secondary replicate in both directions

• Peer-to-Peer: every server is both source and target

• Bi-directional and peer-to-peer replication require conflict detection/resolution

Page 22: Distributed Database Architecture - UNIMIB

How to create a replica

1. Detach  2. Copy  3.–4. Attach (on the original server and on the replica)

Page 23: Distributed Database Architecture - UNIMIB


How to create a replica

1. Backup (2. Copy) 3. Restore

Page 24: Distributed Database Architecture - UNIMIB

How to create a replica

Full backup + transaction log

Page 25: Distributed Database Architecture - UNIMIB

How to create a replica

[Diagram: the Capture program reads the DB log and keeps in-memory transactions before publishing them]

• The DB log contains interleaved records for TX1 (INSERT S1, UPDATE S1, COMMIT), TX2 (INSERT S2) and TX3 (DELETE S1, ROLLBACK)

• TX1 is committed: an MQ put to the send queue is done when its commit record is found

• TX2 is still "in-flight": nothing has been sent yet

• TX3 is rolled back: it is "zapped" at abort and never makes it to the send queue

• Capture also maintains a restart queue and the Q-SUBS / Q-PUBS control tables for the source tables SOURCE1 and SOURCE2

Page 26: Distributed Database Architecture - UNIMIB

Event Publishing

• From a conceptual viewpoint, it is replication without the apply step

[Diagram: Capture reads the DB log of the source tables (SOURCE1, SOURCE2) and publishes change events to consumers (user applications, an SOA layer, or a WBI Event Broker) rather than applying them to target tables]

Page 27: Distributed Database Architecture - UNIMIB

Replica execution

• Initialization: full backup on the primary, copy, full restore on the secondary

• Synchronization: log backup on the primary, copy, log restore on the secondary

• Monitor
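A minimal Python sketch of this initialization/synchronization flow (not from the slides); the primary/secondary objects and their backup/restore methods are hypothetical placeholders:

import shutil, time

def initialize(primary, secondary):
    backup = primary.full_backup()                       # full backup on the primary
    copy = shutil.copy(backup, secondary.staging_dir)    # copy to the secondary
    secondary.full_restore(copy)                         # full restore on the secondary

def synchronize_forever(primary, secondary, interval_s=60):
    while True:                                          # monitored synchronization loop
        log_backup = primary.log_backup()                # transaction log backup
        copy = shutil.copy(log_backup, secondary.staging_dir)
        secondary.log_restore(copy)                      # apply the log on the secondary
        time.sleep(interval_s)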

Page 28: Distributed Database Architecture - UNIMIB

Another architecture

Publisher → Distributor → Subscribers

Page 29: Distributed Database Architecture - UNIMIB

Distribution in NoSQL

Page 30: Distributed Database Architecture - UNIMIB

MongoDB's Approach to Sharding

Page 31: Distributed Database Architecture - UNIMIB

Partitioning

• User defines shard key

• Shard key defines range of data

• Key space is like points on a line

• Range is a segment of that line
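A minimal Python sketch (not MongoDB code) of the idea: the shard-key space is split into contiguous ranges (chunks) and each range is owned by one shard; the split points and shard names are illustrative:

from bisect import bisect_right

# Chunk boundaries over the shard-key space (illustrative values).
split_points = [100, 200, 300]          # chunks: (-inf,100) [100,200) [200,300) [300,+inf)
chunk_to_shard = ["shard0", "shard1", "shard2", "shard0"]  # one owner per chunk

def route(shard_key_value: int) -> str:
    """Return the shard that owns the chunk containing this key value."""
    chunk_index = bisect_right(split_points, shard_key_value)
    return chunk_to_shard[chunk_index]

print(route(42), route(150), route(9999))   # shard0 shard1 shard0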

Page 32: Distributed Database Architecture - UNIMIB

Data Distribution

• Initially 1 chunk

• Default max chunk size: 64 MB

• MongoDB automatically splits & migrates chunks when the max is reached

Page 33: Distributed Database Architecture - UNIMIB

Queries routed to specific shards

MongoDB balances cluster

MongoDB migrates data to new nodes

Routing and Balancing

Page 34: Distributed Database Architecture - UNIMIB

MongoDB Auto-Sharding

• Minimal effort required

– Same interface as single mongod

• Two steps

– Enable Sharding for a database

– Shard collection within database

Page 35: Distributed Database Architecture - UNIMIB

Architecture

Page 36: Distributed Database Architecture - UNIMIB

What is a Shard?

• A shard is a node of the cluster

• A shard can be a single mongod or a replica set

Page 37: Distributed Database Architecture - UNIMIB

Meta Data Storage

• Config Server

– Stores cluster chunk ranges and locations

– Can have only 1 or 3 (production must have 3)

– Not a replica set

Page 38: Distributed Database Architecture - UNIMIB

Routing and Managing Data

• Mongos

– Acts as a router / balancer

– No local data (persists to config database)

– Can have 1 or many

Page 39: Distributed Database Architecture - UNIMIB

Sharding infrastructure

Page 40: Distributed Database Architecture - UNIMIB

Configuration

Page 41: Distributed Database Architecture - UNIMIB

Example Cluster

Page 42: Distributed Database Architecture - UNIMIB

mongod --configsvr

Starts a configuration server on the default port (27019)

Starting the Configuration Server

Page 43: Distributed Database Architecture - UNIMIB

mongos --configdb <hostname>:27019

For 3 configuration servers:

mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>

This is always how to start a new mongos, even if the cluster is already running

Start the mongos Router

Page 44: Distributed Database Architecture - UNIMIB

mongod --shardsvr

Starts a mongod with the default shard port (27018)

Shard is not yet connected to the rest of the cluster

Shard may have already been running in production

Start the shard database

Page 45: Distributed Database Architecture - UNIMIB

On mongos:

– sh.addShard('<host>:27018')

Adding a replica set:

– sh.addShard('<rsname>/<seedlist>')

Add the Shard

Page 46: Distributed Database Architecture - UNIMIB

db.runCommand({ listshards: 1 })

{ "shards" :
    [ { "_id" : "shard0000", "host" : "<hostname>:27018" } ],
  "ok" : 1
}

Verify that the shard was added

Page 47: Distributed Database Architecture - UNIMIB

Enabling Sharding

• Enable sharding on a database

sh.enableSharding("<dbname>")

• Shard a collection with the given key

sh.shardCollection("<dbname>.people", {"country": 1})

• Use a compound shard key to prevent duplicates

sh.shardCollection("<dbname>.cars", {"year": 1, "uniqueid": 1})

Page 48: Distributed Database Architecture - UNIMIB

Tag Aware Sharding

• Tag aware sharding allows you to control the distribution of your data

• Tag a range of shard keys

– sh.addTagRange(<collection>,<min>,<max>,<tag>)

• Tag a shard

– sh.addShardTag(<shard>,<tag>)

Page 49: Distributed Database Architecture - UNIMIB

Mechanics

Page 50: Distributed Database Architecture - UNIMIB

Partitioning

• Remember it's based on ranges

Page 51: Distributed Database Architecture - UNIMIB

Chunk is a section of the entire range

Page 52: Distributed Database Architecture - UNIMIB

A chunk is split once it exceeds the maximum size

There is no split point if all documents have the same shard key

Chunk split is a logical operation (no data is moved)

Chunk splitting

Page 53: Distributed Database Architecture - UNIMIB

Balancer is running on mongos

Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts

Balancing

Page 54: Distributed Database Architecture - UNIMIB

The balancer on mongos takes out a “balancer lock”

To see the status of these locks:

use config
db.locks.find({ _id: "balancer" })

Acquiring the Balancer Lock

Page 55: Distributed Database Architecture - UNIMIB

The mongos sends a moveChunk command to source shard

The source shard then notifies destination shard

Destination shard starts pulling documents from source shard

Moving the chunk

Page 56: Distributed Database Architecture - UNIMIB

When complete, the destination shard updates the config server

– Provides the new locations of the chunks

Committing Migration

Page 57: Distributed Database Architecture - UNIMIB

Source shard deletes moved data

– Must wait for open cursors to either close or time out

– NoTimeout cursors may prevent the release of the lock

The mongos releases the balancer lock after old chunks are deleted

Cleanup

Page 58: Distributed Database Architecture - UNIMIB

Routing Requests

Page 59: Distributed Database Architecture - UNIMIB

Cluster Request Routing

• Targeted Queries

• Scatter Gather Queries

• Scatter Gather Queries with Sort

Page 60: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Targeted Query

Page 61: Distributed Database Architecture - UNIMIB

Routable request received

Page 62: Distributed Database Architecture - UNIMIB

Request routed to appropriate shard

Page 63: Distributed Database Architecture - UNIMIB

Shard returns results

Page 64: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 65: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Non-Targeted Query

Page 66: Distributed Database Architecture - UNIMIB

Non-Targeted Request Received

Page 67: Distributed Database Architecture - UNIMIB

Request sent to all shards

Page 68: Distributed Database Architecture - UNIMIB

Shards return results to mongos

Page 69: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 70: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Non-Targeted Query with Sort

Page 71: Distributed Database Architecture - UNIMIB

Non-Targeted request with sort received

Page 72: Distributed Database Architecture - UNIMIB

Request sent to all shards

Page 73: Distributed Database Architecture - UNIMIB

Query and sort performed locally

Page 74: Distributed Database Architecture - UNIMIB

Shards return results to mongos

Page 75: Distributed Database Architecture - UNIMIB

Mongos merges sorted results

Page 76: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 77: Distributed Database Architecture - UNIMIB

Shard Key

Page 78: Distributed Database Architecture - UNIMIB

Shard Key

• Shard key is immutable

• Shard key values are immutable

• Shard key must be indexed

• Shard key limited to 512 bytes in size

• Shard key used to route queries

– Choose a field commonly used in queries

• Only shard key can be unique across shards

– `_id` field is only unique within individual shard

Page 79: Distributed Database Architecture - UNIMIB

Shard Key Considerations

• Cardinality

• Write Distribution

• Query Isolation

• Reliability

• Index Locality

Page 80: Distributed Database Architecture - UNIMIB

HBase Architecture


Page 81: Distributed Database Architecture - UNIMIB

Three Major Components


• The HBaseMaster

– One master

• The HRegionServer

– Many region servers

• The HBase client

Page 82: Distributed Database Architecture - UNIMIB

HBase Components

• Region

– A subset of a table's rows, like horizontal range partitioning

– Automatically done

• RegionServer (many slaves)

– Manages data regions

– Serves data for reads and writes (using a log)

• Master

– Responsible for coordinating the slaves

– Assigns regions, detects failures

– Admin functions

Page 83: Distributed Database Architecture - UNIMIB

Big Picture


Page 84: Distributed Database Architecture - UNIMIB

HBase architecture

Page 85: Distributed Database Architecture - UNIMIB

ZooKeeper

• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance

– E.g., it starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Page 86: Distributed Database Architecture - UNIMIB

Cassandra Architecture

Page 87: Distributed Database Architecture - UNIMIB

Cassandra Architecture Overview

○ Cassandra was designed with the understanding that system/hardware failures can and do occur

○ Peer-to-peer, distributed system

○ All nodes are the same

○ Data partitioned among all nodes in the cluster

○ Custom data replication to ensure fault tolerance

○ Read/Write-anywhere design

○ Google BigTable - data model

○ Column Families

○ Memtables

○ SSTables

○ Amazon Dynamo - distributed systems technologies

○ Consistent hashing

○ Partitioning

○ Replication

○ One-hop routing

Page 88: Distributed Database Architecture - UNIMIB

Transparent Elasticity

Nodes can be added and removed from Cassandra online, with no downtime being experienced.

[Diagram: a 6-node ring grows to a 12-node ring while staying online]

Page 89: Distributed Database Architecture - UNIMIB

Transparent Scalability

Adding Cassandra nodes increases performance linearly and the ability to manage TBs to PBs of data.

[Diagram: doubling the ring from 6 to 12 nodes doubles throughput (performance throughput = N, then N x 2)]

Page 90: Distributed Database Architecture - UNIMIB

High Availability

Cassandra, with its peer-to-peer architecture, has no single point of failure.

Page 91: Distributed Database Architecture - UNIMIB

Multi-Geography/Zone Aware

Cassandra allows a single logical database to span 1-N datacenters that are geographically dispersed. Also supports a hybrid on-premise/Cloud implementation.

Page 92: Distributed Database Architecture - UNIMIB

Data Redundancy

Cassandra allows for customizable data redundancy so that data is completely protected. Also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures).

Cassandra uses ZooKeeper to choose a leader, which tells nodes the range they are replicas for

Page 93: Distributed Database Architecture - UNIMIB

Partitioning

• Nodes are logically structured in a ring topology.

• The hashed value of the key associated with a data item is used to assign it to a node in the ring.

• Hashing wraps around after a certain value to support the ring structure.

• Lightly loaded nodes move position to alleviate highly loaded nodes.

Page 94: Distributed Database Architecture - UNIMIB

Partitioning & Replication

[Diagram: a ring of nodes A–F over the hash space [0, 1); keys are placed at h(key1) and h(key2) and each is stored on N=3 consecutive nodes]

Page 95: Distributed Database Architecture - UNIMIB

Gossip Protocols

• Used to discover location and state information about the other nodes participating in a Cassandra cluster

• A network communication protocol inspired by real-life rumor spreading

• Periodic, pairwise, inter-node communication

• Low-frequency communication ensures low cost

• Random selection of peers

• Example – node A wishes to search for a pattern in data (a sketch follows)

– Round 1 – node A searches locally and then gossips with node B

– Round 2 – nodes A and B gossip with C and D

– Round 3 – nodes A, B, C and D gossip with 4 other nodes …

• Round-by-round doubling makes the protocol very robust
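A minimal Python sketch (not Cassandra code) of the idealized round-by-round doubling in the example above:

def gossip_rounds(nodes, start, rounds):
    informed = {start}
    uninformed = [n for n in nodes if n != start]
    for r in range(1, rounds + 1):
        k = len(informed)                     # each informed node contacts one new peer
        new_peers, uninformed = uninformed[:k], uninformed[k:]
        informed |= set(new_peers)
        print(f"round {r}: {len(informed)} nodes informed")
    return informed

gossip_rounds([chr(ord('A') + i) for i in range(16)], start='A', rounds=4)
# round 1: 2, round 2: 4, round 3: 8, round 4: 16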

Page 96: Distributed Database Architecture - UNIMIB

Failure Detection

• The gossip process tracks heartbeats from other nodes, both directly and indirectly

• The node fail state is given by a variable Φ

– It tells how likely it is that a node might fail (a suspicion level) instead of a simple binary value (up/down)

• This type of system is known as an Accrual Failure Detector (a sketch follows)

• It takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate

• A threshold on Φ is used to decide whether a node is dead

• If the node is correct, Φ stays below a constant threshold set by the application; generally Φ(t) = 0
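A minimal Python sketch of an accrual failure detector (not Cassandra's implementation; an exponential model of heartbeat inter-arrival times is assumed):

import math, time

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []          # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        """Suspicion level: grows with the time elapsed since the last heartbeat."""
        if not self.intervals:
            return 0.0
        now = time.time() if now is None else now
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        p_later = math.exp(-elapsed / mean)   # P(next heartbeat arrives later than now)
        return -math.log10(p_later)           # stays near 0 while heartbeats keep coming

# A node is suspected dead once phi() exceeds a threshold set by the application.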

Page 97: Distributed Database Architecture - UNIMIB

Write Operation Stages

• Logging data in the commit log

• Writing data to the memtable

• Flushing data from the memtable

• Storing data on disk in SSTables

Page 98: Distributed Database Architecture - UNIMIB

Write Operations

• Commit Log

– First place a write is recorded

– Crash recovery mechanism

– A write is not successful until it is recorded in the commit log

– Once recorded in the commit log, data is written to the Memtable

• Memtable

– Data structure in memory

– Once the memtable size reaches a threshold, it is flushed (appended) to an SSTable

– Several may exist at once (1 current, any others waiting to be flushed)

– First place read operations look for data

• SSTable

– Kept on disk

– Immutable once written

– Periodically compacted for performance
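A minimal Python sketch (not Cassandra code) of the commit log, memtable, SSTable write path described above; the flush threshold is illustrative:

class Node:
    MEMTABLE_LIMIT = 4  # flush threshold (illustrative)

    def __init__(self):
        self.commit_log = []   # crash-recovery record of every write
        self.memtable = {}     # in-memory structure, sorted on flush
        self.sstables = []     # immutable on-disk tables (lists of sorted items)

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log (write not successful before this)
        self.memtable[key] = value             # 2. memtable
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # 3. flush: append the sorted memtable contents as a new SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = Node()
for i in range(6):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)   # 1 SSTable flushed, 2 keys still in memory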

Page 99: Distributed Database Architecture - UNIMIB

Write Operations

Page 100: Distributed Database Architecture - UNIMIB

Consistency

• Read Consistency

– Number of nodes that must agree before a read request returns

– From ONE to ALL

• Write Consistency

– Number of nodes that must be updated before a write is considered successful

– From ANY to ALL

– At ANY, a hinted handoff is all that is needed to return

• QUORUM

– Commonly used middle-ground consistency level

– Defined as (replication_factor / 2) + 1
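For example, with replication_factor = 3 a QUORUM is (3 / 2) + 1 = 2 replicas (integer division); a one-line sketch:

def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1   # (replication_factor / 2) + 1, integer division

print([quorum(rf) for rf in (1, 2, 3, 5)])   # [1, 2, 2, 3]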

Page 101: Distributed Database Architecture - UNIMIB

Write Consistency (ONE)

[Diagram: a 6-node ring, replication_factor = 3; the client's write returns as soon as one of the replicas R1, R2, R3 acknowledges it]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Page 102: Distributed Database Architecture - UNIMIB

Write Consistency (QUORUM)

[Diagram: a 6-node ring, replication_factor = 3; the client's write returns once a quorum (2 of the 3 replicas R1, R2, R3) has acknowledged it]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM

Page 103: Distributed Database Architecture - UNIMIB

• Write intended for a node that’s offline

• An online node, processing the request, makes a note to carry out the write once the node comes back online.

Write Operations: Hinted Handoff

Page 104: Distributed Database Architecture - UNIMIB

Hinted Handoff

[Diagram: a 6-node ring, replication_factor = 3 and hinted_handoff_enabled = true; one replica is offline, so the coordinating node writes the hint locally in system.hints and replays it when the replica comes back]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY

Note: a hint does not count toward the consistency level (except ANY)

Page 105: Distributed Database Architecture - UNIMIB

• Tombstones

– On delete request, records are marked for deletion.

– Similar to “Recycle Bin.”

– Data is actually deleted on major compaction or configurable timer

Delete Operations

Page 106: Distributed Database Architecture - UNIMIB

Compaction

• Compaction runs periodically to merge multiple SSTables

– Reclaims space

– Creates a new index

– Merges keys

– Combines columns

– Discards tombstones

– Improves performance by minimizing disk seeks

• Two types

– Major

– Read-only
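A minimal Python sketch (not Cassandra code) of a major compaction: merge sorted SSTables, keep the newest version of each key, and drop tombstones; the (key, timestamp, value) layout is an illustrative simplification:

import heapq

TOMBSTONE = object()   # marker used here to represent a deleted row/column

def compact(sstables):
    """Each SSTable is a list of (key, timestamp, value) tuples sorted by key."""
    merged = {}
    for key, ts, value in heapq.merge(*sstables, key=lambda t: (t[0], t[1])):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)          # newest version wins
    # Write one new SSTable, dropping keys whose newest version is a tombstone.
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items()) if v is not TOMBSTONE]

old = [("a", 1, "x"), ("b", 1, "y")]
new = [("a", 2, TOMBSTONE), ("c", 2, "z")]
print(compact([old, new]))   # [('b', 1, 'y'), ('c', 2, 'z')]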

Page 107: Distributed Database Architecture - UNIMIB

Compaction

Page 108: Distributed Database Architecture - UNIMIB

• Ensures synchronization of data across nodes

• Compares data checksums against neighboring nodes

• Uses Merkle trees (hash trees)

• Snapshot of data sent to neighboring nodes

• Created and broadcasted on every major compaction

• If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, snapshots are compared and data is synced.

Anti-Entropy

Page 109: Distributed Database Architecture - UNIMIB

Merkle Tree
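A minimal Python sketch (not Cassandra code) of a Merkle (hash) tree as used for anti-entropy: two replicas compare root hashes and only diverging ranges need to be synced:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build the hash tree bottom-up and return the root hash."""
    level = [h(leaf) for leaf in leaves] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"rowX", b"row4"]     # one divergent range
print(merkle_root(replica_a) == merkle_root(replica_b))   # False -> repair needed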

Page 110: Distributed Database Architecture - UNIMIB

• Read Repair

– On read, nodes are queried until the number of nodes which respond with the most recent value meet a specified consistency level from ONE to ALL.

– If the consistency level is not met, nodes are updated with the most recent value which is then returned.

– If the consistency level is met, the value is returned and any nodes that reported old values are then updated.

Read Operations

Page 111: Distributed Database Architecture - UNIMIB

Read Repair

[Diagram: a 6-node ring, replication_factor = 3; the client reads at CONSISTENCY ONE and any replicas holding stale values are updated in the background]

SELECT * FROM table USING CONSISTENCY ONE

Page 112: Distributed Database Architecture - UNIMIB

• Bloom filters provide a fast way of checking if a value is not in a set.

Read Operations: Bloom Filters
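A minimal Python sketch of a Bloom filter (not Cassandra's implementation): membership tests can give false positives but never false negatives, so a read can safely skip SSTables whose filter answers "no":

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for simplicity

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row42")
print(bf.might_contain("row42"), bf.might_contain("row99"))   # True, (almost surely) False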

Page 113: Distributed Database Architecture - UNIMIB

Read

[Diagram: the read path. In memory: Bloom filter, key cache, partition summary; on disk: compression offsets, partition index, data (the Bloom filter, partition summary and compression offsets are off-heap). On a key-cache hit the read goes straight to the data through the compression offsets; on a cache miss it goes through the partition summary and the partition index first. Configuration: key_cache_size_in_mb > 0, index_interval = 128 (default)]