Distributed Database Architecture


Page 1: Distributed Database Architecture - UNIMIB

Distributed Database Architecture

Page 2: Distributed Database Architecture - UNIMIB

Outline

• Data distribution

• Data replication

Page 3: Distributed Database Architecture - UNIMIB

Distributed data: summary (@source IBM)

[Diagram: six configurations of applications, DBMSs and databases]

• Data move: NO

– Basic (single db): one application connected to one DBMS/DB

– Distributed Access: one application connected to several DBMSs/DBs

– Federation: applications access several DBMSs through a federation server (Fed Srv)

• Data move: YES

– Replication: a replication server (Repl Srv) copies data between DBMSs

– Event Publishing: an event publishing server (EP Srv) publishes database changes to applications

– Extract, Transform & Load: an ETL server loads data from source DBMSs into a data warehouse (DW)

Page 4: Distributed Database Architecture - UNIMIB

Data distribution

Page 5: Distributed Database Architecture - UNIMIB

Type of architecture

• Shared everything

• Shared disk

• Shared nothing

Page 6: Distributed Database Architecture - UNIMIB

Shared everything

[Diagram: a mainframe hosting presentation logic, business logic and the database, accessed by dumb terminals; alternatively, a database server hosting presentation logic, business logic and the database, accessed by dumb terminals]

Page 7: Distributed Database Architecture - UNIMIB

Shared everything

[Diagram: web browsers running the presentation logic (JavaScript) connect to application servers hosting business and presentation logic, which access a single database server]

Page 8: Distributed Database Architecture - UNIMIB

Shared disk

Page 9: Distributed Database Architecture - UNIMIB

• The solution adopted by NoSQL database architectures to support scale-out

Shared nothing

Page 10: Distributed Database Architecture - UNIMIB

• http://www.mullinsconsulting.com/db2arch-sd-sn.html

Evaluation

Page 11: Distributed Database Architecture - UNIMIB

Scalability or availability?

• What is high availability? It is a mix of

– architecture design

– people!

– process

– technology

• What is NOT high availability

– A pure technology solution

– The same thing as scalability or manageability

Page 12: Distributed Database Architecture - UNIMIB

How many 9s?

Availability   Downtime (in one year)

100%           never

99.999%        < 5.26 minutes

99.99%         5.26 – 52 minutes

99.9%          52 minutes – 8 hours and 45 minutes

99%            8 hours and 45 minutes – 87 hours and 36 minutes

90%            788 hours and 24 minutes – 875 hours and 54 minutes
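The downtime bounds follow from simple arithmetic on the availability percentage; a minimal Python sketch (the helper name is ours, not from the slides):

# Converts an availability percentage into the maximum downtime over one year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60

for a in (99.999, 99.99, 99.9, 99.0, 90.0):
    print(f"{a}% -> {max_downtime_minutes(a):.1f} minutes/year")
# 99.999% -> 5.3, 99.99% -> 52.6, 99.9% -> 525.6, 99% -> 5256.0, 90% -> 52560.0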

Page 13: Distributed Database Architecture - UNIMIB

Replication

Page 14: Distributed Database Architecture - UNIMIB

System log

• A log is a sequential file stored in stable memory (that is, a "conceptual" storage that will never fail)

• It stores all the activities performed by all transactions, in chronological order

• Two types of records are stored:

– Transaction log records

• Operations on tables

– System events

• Checkpoint

• Dump

Page 15: Distributed Database Architecture - UNIMIB

Transaction log

• Possible operations in a transaction

– begin, B(T)

– insert, I(T,O,AS)

– delete, D(T,O,BS)

– update, U(T,O,BS,AS)

– commit, C(T), or abort, A(T)

• The log record format depends on the specific relational operation

• Legend: O = object, AS = After State, BS = Before State
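A minimal Python sketch (not from the slides) of how such records could be appended to a sequential log file; the JSON encoding and file name are illustrative assumptions:

import json

LOG_FILE = "system.log"  # assumed file name

def write_record(record: dict) -> None:
    """Append one log record as a line of JSON to the sequential log file."""
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(record) + "\n")

# A transaction T1 that inserts and then updates an object O1:
write_record({"type": "B", "tx": "T1"})                                # begin, B(T1)
write_record({"type": "I", "tx": "T1", "obj": "O1", "AS": {"x": 1}})   # insert, I(T1,O1,AS)
write_record({"type": "U", "tx": "T1", "obj": "O1",
              "BS": {"x": 1}, "AS": {"x": 2}})                         # update, U(T1,O1,BS,AS)
write_record({"type": "C", "tx": "T1"})                                # commit, C(T1)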

Page 16: Distributed Database Architecture - UNIMIB

A log example

[Timeline: interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…), U(T1,…) for three transactions]

Page 17: Distributed Database Architecture - UNIMIB

Checkpoint

• A checkpoint records the set of transactions T1, …, Tn that are running at a given point in time
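A minimal sketch of what a checkpoint record could contain, continuing the illustrative log format above (not from the slides):

import json, time

def write_checkpoint(log_path: str, running_transactions: list) -> None:
    """Append a CK record listing the transactions active at this point in time."""
    record = {"type": "CK",
              "active": running_transactions,   # e.g. ["T1", "T3"]
              "ts": time.time()}
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

write_checkpoint("system.log", ["T1", "T3"])  # T2 has already committed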

Page 18: Distributed Database Architecture - UNIMIB

Example of transaction log and checkpoint

[Timeline: a dump record, a checkpoint (CK), and interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…) for transactions T1, T2, T3]

At the time considered: T2 is committed, T1 is uncommitted, T3 has not started yet

Page 19: Distributed Database Architecture - UNIMIB

Dump

• A dump is a full copy of the entire state of a DB in stable memory

• It is executed offline

• It generates a backup

• After the backup is completed, a dump record is written to the log

Page 20: Distributed Database Architecture - UNIMIB

Log example

[Timeline: a dump record, a checkpoint (CK), and interleaved log records B(T1), B(T2), U(T1,…), U(T2,…), C(T2), B(T3), U(T3,…) for transactions T1, T2, T3]

Page 21: Distributed Database Architecture - UNIMIB

Replica architecture (@source IBM)

[Diagram: replication topologies built from source, staging and target servers]

• Data Distribution (1:many): one source, many targets

• Data Consolidation (many:1): many sources, one target

• Multi-Tier Staging: a source feeds staging servers, which feed the targets

• Bi-directional: a primary and a secondary replicate in both directions

• Peer-to-Peer: every server is both source and target

• Bi-directional and peer-to-peer replication require conflict detection/resolution

Page 22: Distributed Database Architecture - UNIMIB

How to create a replica

1. Detach  2. Copy  3.–4. Attach (on the original server and on the replica)

Page 23: Distributed Database Architecture - UNIMIB


How to create a replica

1. Backup (2. Copy) 3. Restore

Page 24: Distributed Database Architecture - UNIMIB

How to create a replica

Full backup + transaction log

Page 25: Distributed Database Architecture - UNIMIB

How to create a replica

[Diagram: the Capture program reads the DB log and keeps in-memory transactions before publishing them]

• The DB log contains interleaved records for TX1 (INSERT S1, UPDATE S1, COMMIT), TX2 (INSERT S2) and TX3 (DELETE S1, ROLLBACK)

• TX1 is committed: an MQ put to the send queue is done when its commit record is found

• TX2 is still "in-flight": nothing has been sent yet

• TX3 is rolled back: it is "zapped" at abort and never makes it to the send queue

• Capture also maintains a restart queue and the Q-SUBS / Q-PUBS control tables for the source tables SOURCE1 and SOURCE2

Page 26: Distributed Database Architecture - UNIMIB

Event Publishing

• From a conceptual viewpoint, it is replication without the apply step

[Diagram: Capture reads the DB log of the source tables (SOURCE1, SOURCE2) and publishes change events to consumers (user applications, an SOA layer, or a WBI Event Broker) rather than applying them to target tables]

Page 27: Distributed Database Architecture - UNIMIB

Replica execution

• Initialization: full backup on the primary, copy, full restore on the secondary

• Synchronization: log backup on the primary, copy, log restore on the secondary

• Monitor
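A minimal Python sketch of this initialization/synchronization flow (not from the slides); the primary/secondary objects and their backup/restore methods are hypothetical placeholders:

import shutil, time

def initialize(primary, secondary):
    backup = primary.full_backup()                       # full backup on the primary
    copy = shutil.copy(backup, secondary.staging_dir)    # copy to the secondary
    secondary.full_restore(copy)                         # full restore on the secondary

def synchronize_forever(primary, secondary, interval_s=60):
    while True:                                          # monitored synchronization loop
        log_backup = primary.log_backup()                # transaction log backup
        copy = shutil.copy(log_backup, secondary.staging_dir)
        secondary.log_restore(copy)                      # apply the log on the secondary
        time.sleep(interval_s)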

Page 28: Distributed Database Architecture - UNIMIB

Another architecture

Publisher → Distributor → Subscribers

Page 29: Distributed Database Architecture - UNIMIB

Distribution in NoSQL

Page 30: Distributed Database Architecture - UNIMIB

MongoDB's Approach to Sharding

Page 31: Distributed Database Architecture - UNIMIB

Partitioning

• User defines shard key

• Shard key defines range of data

• Key space is like points on a line

• Range is a segment of that line
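A minimal Python sketch (not MongoDB code) of the idea: the shard-key space is split into contiguous ranges (chunks) and each range is owned by one shard; the split points and shard names are illustrative:

from bisect import bisect_right

# Chunk boundaries over the shard-key space (illustrative values).
split_points = [100, 200, 300]          # chunks: (-inf,100) [100,200) [200,300) [300,+inf)
chunk_to_shard = ["shard0", "shard1", "shard2", "shard0"]  # one owner per chunk

def route(shard_key_value: int) -> str:
    """Return the shard that owns the chunk containing this key value."""
    chunk_index = bisect_right(split_points, shard_key_value)
    return chunk_to_shard[chunk_index]

print(route(42), route(150), route(9999))   # shard0 shard1 shard0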

Page 32: Distributed Database Architecture - UNIMIB

Data Distribution

• Initially 1 chunk

• Default max chunk size: 64 MB

• MongoDB automatically splits & migrates chunks when the max is reached

Page 33: Distributed Database Architecture - UNIMIB

Queries routed to specific shards

MongoDB balances cluster

MongoDB migrates data to new nodes

Routing and Balancing

Page 34: Distributed Database Architecture - UNIMIB

MongoDB Auto-Sharding

• Minimal effort required

– Same interface as single mongod

• Two steps

– Enable Sharding for a database

– Shard collection within database

Page 35: Distributed Database Architecture - UNIMIB

Architecture

Page 36: Distributed Database Architecture - UNIMIB

What is a Shard?

• A shard is a node of the cluster

• A shard can be a single mongod or a replica set

Page 37: Distributed Database Architecture - UNIMIB

Meta Data Storage

• Config Server

– Stores cluster chunk ranges and locations

– Can have only 1 or 3 (production must have 3)

– Not a replica set

Page 38: Distributed Database Architecture - UNIMIB

Routing and Managing Data

• Mongos

– Acts as a router / balancer

– No local data (persists to config database)

– Can have 1 or many

Page 39: Distributed Database Architecture - UNIMIB

Sharding infrastructure

Page 40: Distributed Database Architecture - UNIMIB

Configuration

Page 41: Distributed Database Architecture - UNIMIB

Example Cluster

Page 42: Distributed Database Architecture - UNIMIB

mongod --configsvr

Starts a configuration server on the default port (27019)

Starting the Configuration Server

Page 43: Distributed Database Architecture - UNIMIB

mongos --configdb <hostname>:27019

For 3 configuration servers:

mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>

This is always how to start a new mongos, even if the cluster is already running

Start the mongos Router

Page 44: Distributed Database Architecture - UNIMIB

mongod --shardsvr

Starts a mongod with the default shard port (27018)

Shard is not yet connected to the rest of the cluster

Shard may have already been running in production

Start the shard database

Page 45: Distributed Database Architecture - UNIMIB

On mongos:

– sh.addShard('<host>:27018')

Adding a replica set:

– sh.addShard('<rsname>/<seedlist>')

Add the Shard

Page 46: Distributed Database Architecture - UNIMIB

db.runCommand({ listshards: 1 })

{ "shards" :
    [ { "_id" : "shard0000", "host" : "<hostname>:27018" } ],
  "ok" : 1
}

Verify that the shard was added

Page 47: Distributed Database Architecture - UNIMIB

Enabling Sharding

• Enable sharding on a database

sh.enableSharding("<dbname>")

• Shard a collection with the given key

sh.shardCollection("<dbname>.people", {"country": 1})

• Use a compound shard key to prevent duplicates

sh.shardCollection("<dbname>.cars", {"year": 1, "uniqueid": 1})

Page 48: Distributed Database Architecture - UNIMIB

Tag Aware Sharding

• Tag aware sharding allows you to control the distribution of your data

• Tag a range of shard keys

– sh.addTagRange(<collection>,<min>,<max>,<tag>)

• Tag a shard

– sh.addShardTag(<shard>,<tag>)

Page 49: Distributed Database Architecture - UNIMIB

Mechanics

Page 50: Distributed Database Architecture - UNIMIB

Partitioning

• Remember it's based on ranges

Page 51: Distributed Database Architecture - UNIMIB

Chunk is a section of the entire range

Page 52: Distributed Database Architecture - UNIMIB

A chunk is split once it exceeds the maximum size

There is no split point if all documents have the same shard key

Chunk split is a logical operation (no data is moved)

Chunk splitting

Page 53: Distributed Database Architecture - UNIMIB

Balancer is running on mongos

Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts

Balancing

Page 54: Distributed Database Architecture - UNIMIB

The balancer on mongos takes out a “balancer lock”

To see the status of these locks:

use config
db.locks.find({ _id: "balancer" })

Acquiring the Balancer Lock

Page 55: Distributed Database Architecture - UNIMIB

The mongos sends a moveChunk command to source shard

The source shard then notifies destination shard

Destination shard starts pulling documents from source shard

Moving the chunk

Page 56: Distributed Database Architecture - UNIMIB

When complete, the destination shard updates the config server

– Provides the new locations of the chunks

Committing Migration

Page 57: Distributed Database Architecture - UNIMIB

Source shard deletes moved data

– Must wait for open cursors to either close or time out

– NoTimeout cursors may prevent the release of the lock

The mongos releases the balancer lock after old chunks are deleted

Cleanup

Page 58: Distributed Database Architecture - UNIMIB

Routing Requests

Page 59: Distributed Database Architecture - UNIMIB

Cluster Request Routing

• Targeted Queries

• Scatter Gather Queries

• Scatter Gather Queries with Sort

Page 60: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Targeted Query

Page 61: Distributed Database Architecture - UNIMIB

Routable request received

Page 62: Distributed Database Architecture - UNIMIB

Request routed to appropriate shard

Page 63: Distributed Database Architecture - UNIMIB

Shard returns results

Page 64: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 65: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Non-Targeted Query

Page 66: Distributed Database Architecture - UNIMIB

Non-Targeted Request Received

Page 67: Distributed Database Architecture - UNIMIB

Request sent to all shards

Page 68: Distributed Database Architecture - UNIMIB

Shards return results to mongos

Page 69: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 70: Distributed Database Architecture - UNIMIB

Cluster Request Routing: Non-Targeted Query with Sort

Page 71: Distributed Database Architecture - UNIMIB

Non-Targeted request with sort received

Page 72: Distributed Database Architecture - UNIMIB

Request sent to all shards

Page 73: Distributed Database Architecture - UNIMIB

Query and sort performed locally

Page 74: Distributed Database Architecture - UNIMIB

Shards return results to mongos

Page 75: Distributed Database Architecture - UNIMIB

Mongos merges sorted results

Page 76: Distributed Database Architecture - UNIMIB

Mongos returns results to client

Page 77: Distributed Database Architecture - UNIMIB

Shard Key

Page 78: Distributed Database Architecture - UNIMIB

Shard Key

• Shard key is immutable

• Shard key values are immutable

• Shard key must be indexed

• Shard key limited to 512 bytes in size

• Shard key used to route queries

– Choose a field commonly used in queries

• Only shard key can be unique across shards

– `_id` field is only unique within individual shard

Page 79: Distributed Database Architecture - UNIMIB

Shard Key Considerations

• Cardinality

• Write Distribution

• Query Isolation

• Reliability

• Index Locality

Page 80: Distributed Database Architecture - UNIMIB

HBase Architecture


Page 81: Distributed Database Architecture - UNIMIB

Three Major Components


• The HBaseMaster

– One master

• The HRegionServer

– Many region servers

• The HBase client

Page 82: Distributed Database Architecture - UNIMIB

HBase Components

• Region

– A subset of a table's rows, like horizontal range partitioning

– Automatically done

• RegionServer (many slaves)

– Manages data regions

– Serves data for reads and writes (using a log)

• Master

– Responsible for coordinating the slaves

– Assigns regions, detects failures

– Admin functions

Page 83: Distributed Database Architecture - UNIMIB

Big Picture


Page 84: Distributed Database Architecture - UNIMIB

HBase architecture

Page 85: Distributed Database Architecture - UNIMIB

ZooKeeper

• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance

– E.g., it starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Page 86: Distributed Database Architecture - UNIMIB

Cassandra Architecture

Page 87: Distributed Database Architecture - UNIMIB

Cassandra Architecture Overview

○ Cassandra was designed with the understanding that system/hardware failures can and do occur

○ Peer-to-peer, distributed system

○ All nodes are the same

○ Data partitioned among all nodes in the cluster

○ Custom data replication to ensure fault tolerance

○ Read/Write-anywhere design

○ Google BigTable - data model

○ Column Families

○ Memtables

○ SSTables

○ Amazon Dynamo - distributed systems technologies

○ Consistent hashing

○ Partitioning

○ Replication

○ One-hop routing

Page 88: Distributed Database Architecture - UNIMIB

Transparent Elasticity

Nodes can be added and removed from Cassandra online, with no downtime being experienced.

[Diagram: a 6-node ring grows to a 12-node ring while staying online]

Page 89: Distributed Database Architecture - UNIMIB

Transparent Scalability

Adding Cassandra nodes increases performance linearly and the ability to manage TBs to PBs of data.

[Diagram: doubling the ring from 6 to 12 nodes doubles throughput (performance throughput = N, then N x 2)]

Page 90: Distributed Database Architecture - UNIMIB

High Availability

Cassandra, with its peer-to-peer architecture, has no single point of failure.

Page 91: Distributed Database Architecture - UNIMIB

Multi-Geography/Zone Aware

Cassandra allows a single logical database to span 1-N datacenters that are geographically dispersed. Also supports a hybrid on-premise/Cloud implementation.

Page 92: Distributed Database Architecture - UNIMIB

Data Redundancy

Cassandra allows for customizable data redundancy so that data is completely protected. Also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures).

Cassandra uses ZooKeeper to choose a leader, which tells nodes the range they are replicas for

Page 93: Distributed Database Architecture - UNIMIB

Partitioning

• Nodes are logically structured in a ring topology.

• The hashed value of the key associated with a data item is used to assign it to a node in the ring.

• Hashing wraps around after a certain value to support the ring structure.

• Lightly loaded nodes move position to alleviate highly loaded nodes.

Page 94: Distributed Database Architecture - UNIMIB

Partitioning & Replication

[Diagram: a ring of nodes A–F over the hash space [0, 1); keys are placed at h(key1) and h(key2) and each is stored on N=3 consecutive nodes]

Page 95: Distributed Database Architecture - UNIMIB

Gossip Protocols

• Used to discover location and state information about the other nodes participating in a Cassandra cluster

• A network communication protocol inspired by real-life rumor spreading

• Periodic, pairwise, inter-node communication

• Low-frequency communication ensures low cost

• Random selection of peers

• Example – node A wishes to search for a pattern in data (a sketch follows)

– Round 1 – node A searches locally and then gossips with node B

– Round 2 – nodes A and B gossip with C and D

– Round 3 – nodes A, B, C and D gossip with 4 other nodes …

• Round-by-round doubling makes the protocol very robust
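A minimal Python sketch (not Cassandra code) of the idealized round-by-round doubling in the example above:

def gossip_rounds(nodes, start, rounds):
    informed = {start}
    uninformed = [n for n in nodes if n != start]
    for r in range(1, rounds + 1):
        k = len(informed)                     # each informed node contacts one new peer
        new_peers, uninformed = uninformed[:k], uninformed[k:]
        informed |= set(new_peers)
        print(f"round {r}: {len(informed)} nodes informed")
    return informed

gossip_rounds([chr(ord('A') + i) for i in range(16)], start='A', rounds=4)
# round 1: 2, round 2: 4, round 3: 8, round 4: 16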

Page 96: Distributed Database Architecture - UNIMIB

Failure Detection

• The gossip process tracks heartbeats from other nodes, both directly and indirectly

• The node fail state is given by a variable Φ

– It tells how likely it is that a node might fail (a suspicion level) instead of a simple binary value (up/down)

• This type of system is known as an Accrual Failure Detector (a sketch follows)

• It takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate

• A threshold on Φ is used to decide whether a node is dead

• If the node is correct, Φ stays below a constant threshold set by the application; generally Φ(t) = 0
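A minimal Python sketch of an accrual failure detector (not Cassandra's implementation; an exponential model of heartbeat inter-arrival times is assumed):

import math, time

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []          # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        """Suspicion level: grows with the time elapsed since the last heartbeat."""
        if not self.intervals:
            return 0.0
        now = time.time() if now is None else now
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        p_later = math.exp(-elapsed / mean)   # P(next heartbeat arrives later than now)
        return -math.log10(p_later)           # stays near 0 while heartbeats keep coming

# A node is suspected dead once phi() exceeds a threshold set by the application.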

Page 97: Distributed Database Architecture - UNIMIB

Write Operation Stages

• Logging data in the commit log

• Writing data to the memtable

• Flushing data from the memtable

• Storing data on disk in SSTables

Page 98: Distributed Database Architecture - UNIMIB

Write Operations

• Commit Log

– First place a write is recorded

– Crash recovery mechanism

– A write is not successful until it is recorded in the commit log

– Once recorded in the commit log, data is written to the Memtable

• Memtable

– Data structure in memory

– Once the memtable size reaches a threshold, it is flushed (appended) to an SSTable

– Several may exist at once (1 current, any others waiting to be flushed)

– First place read operations look for data

• SSTable

– Kept on disk

– Immutable once written

– Periodically compacted for performance
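A minimal Python sketch (not Cassandra code) of the commit log, memtable, SSTable write path described above; the flush threshold is illustrative:

class Node:
    MEMTABLE_LIMIT = 4  # flush threshold (illustrative)

    def __init__(self):
        self.commit_log = []   # crash-recovery record of every write
        self.memtable = {}     # in-memory structure, sorted on flush
        self.sstables = []     # immutable on-disk tables (lists of sorted items)

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log (write not successful before this)
        self.memtable[key] = value             # 2. memtable
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # 3. flush: append the sorted memtable contents as a new SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = Node()
for i in range(6):
    node.write(f"k{i}", i)
print(len(node.sstables), node.memtable)   # 1 SSTable flushed, 2 keys still in memory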

Page 99: Distributed Database Architecture - UNIMIB

Write Operations

Page 100: Distributed Database Architecture - UNIMIB

Consistency

• Read Consistency

– Number of nodes that must agree before a read request returns

– From ONE to ALL

• Write Consistency

– Number of nodes that must be updated before a write is considered successful

– From ANY to ALL

– At ANY, a hinted handoff is all that is needed to return

• QUORUM

– Commonly used middle-ground consistency level

– Defined as (replication_factor / 2) + 1
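For example, with replication_factor = 3 a QUORUM is (3 / 2) + 1 = 2 replicas (integer division); a one-line sketch:

def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1   # (replication_factor / 2) + 1, integer division

print([quorum(rf) for rf in (1, 2, 3, 5)])   # [1, 2, 2, 3]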

Page 101: Distributed Database Architecture - UNIMIB

Write Consistency (ONE)

[Diagram: a 6-node ring, replication_factor = 3; the client's write returns as soon as one of the replicas R1, R2, R3 acknowledges it]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Page 102: Distributed Database Architecture - UNIMIB

Write Consistency (QUORUM)

[Diagram: a 6-node ring, replication_factor = 3; the client's write returns once a quorum (2 of the 3 replicas R1, R2, R3) has acknowledged it]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM

Page 103: Distributed Database Architecture - UNIMIB

• Write intended for a node that’s offline

• An online node, processing the request, makes a note to carry out the write once the node comes back online.

Write Operations: Hinted Handoff

Page 104: Distributed Database Architecture - UNIMIB

Hinted Handoff

[Diagram: a 6-node ring, replication_factor = 3 and hinted_handoff_enabled = true; one replica is offline, so the coordinating node writes the hint locally in system.hints and replays it when the replica comes back]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY

Note: a hint does not count toward the consistency level (except ANY)

Page 105: Distributed Database Architecture - UNIMIB

• Tombstones

– On delete request, records are marked for deletion.

– Similar to “Recycle Bin.”

– Data is actually deleted on major compaction or configurable timer

Delete Operations

Page 106: Distributed Database Architecture - UNIMIB

Compaction

• Compaction runs periodically to merge multiple SSTables

– Reclaims space

– Creates a new index

– Merges keys

– Combines columns

– Discards tombstones

– Improves performance by minimizing disk seeks

• Two types

– Major

– Read-only
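A minimal Python sketch (not Cassandra code) of a major compaction: merge sorted SSTables, keep the newest version of each key, and drop tombstones; the (key, timestamp, value) layout is an illustrative simplification:

import heapq

TOMBSTONE = object()   # marker used here to represent a deleted row/column

def compact(sstables):
    """Each SSTable is a list of (key, timestamp, value) tuples sorted by key."""
    merged = {}
    for key, ts, value in heapq.merge(*sstables, key=lambda t: (t[0], t[1])):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)          # newest version wins
    # Write one new SSTable, dropping keys whose newest version is a tombstone.
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items()) if v is not TOMBSTONE]

old = [("a", 1, "x"), ("b", 1, "y")]
new = [("a", 2, TOMBSTONE), ("c", 2, "z")]
print(compact([old, new]))   # [('b', 1, 'y'), ('c', 2, 'z')]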

Page 107: Distributed Database Architecture - UNIMIB

Compaction

Page 108: Distributed Database Architecture - UNIMIB

• Ensures synchronization of data across nodes

• Compares data checksums against neighboring nodes

• Uses Merkle trees (hash trees)

• Snapshot of data sent to neighboring nodes

• Created and broadcasted on every major compaction

• If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, snapshots are compared and data is synced.

Anti-Entropy

Page 109: Distributed Database Architecture - UNIMIB

Merkle Tree
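A minimal Python sketch (not Cassandra code) of a Merkle (hash) tree as used for anti-entropy: two replicas compare root hashes and only diverging ranges need to be synced:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build the hash tree bottom-up and return the root hash."""
    level = [h(leaf) for leaf in leaves] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"rowX", b"row4"]     # one divergent range
print(merkle_root(replica_a) == merkle_root(replica_b))   # False -> repair needed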

Page 110: Distributed Database Architecture - UNIMIB

• Read Repair

– On read, nodes are queried until the number of nodes which respond with the most recent value meet a specified consistency level from ONE to ALL.

– If the consistency level is not met, nodes are updated with the most recent value which is then returned.

– If the consistency level is met, the value is returned and any nodes that reported old values are then updated.

Read Operations

Page 111: Distributed Database Architecture - UNIMIB

Read Repair

[Diagram: a 6-node ring, replication_factor = 3; the client reads at CONSISTENCY ONE and any replicas holding stale values are updated in the background]

SELECT * FROM table USING CONSISTENCY ONE

Page 112: Distributed Database Architecture - UNIMIB

• Bloom filters provide a fast way of checking if a value is not in a set.

Read Operations: Bloom Filters
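A minimal Python sketch of a Bloom filter (not Cassandra's implementation): membership tests can give false positives but never false negatives, so a read can safely skip SSTables whose filter answers "no":

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for simplicity

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row42")
print(bf.might_contain("row42"), bf.might_contain("row99"))   # True, (almost surely) False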

Page 113: Distributed Database Architecture - UNIMIB

Read

[Diagram: the read path. In memory: Bloom filter, key cache, partition summary; on disk: compression offsets, partition index, data (the Bloom filter, partition summary and compression offsets are off-heap). On a key-cache hit the read goes straight to the data through the compression offsets; on a cache miss it goes through the partition summary and the partition index first. Configuration: key_cache_size_in_mb > 0, index_interval = 128 (default)]