Introduction to HBase · 2012-07-17 · András Garzó (garzo@ilab.sztaki.hu)

2012.04.27.

Introduction to HBase

András Garzó (garzo@ilab.sztaki.hu)

Google BigTable (2006)

● Distributed multi-level map

● Fault tolerant

● Horizontal Scalability

● Runs on commodity hardware

● Self managing

● Large number of reads and writes

● Tight integration with MapReduce

HBase

● Open source BigTable implementation

● Based on Hadoop stack

● Data stored on HDFS

● Integrated with MapReduce

HBase Data Model

Column-oriented database

HBase is column-oriented only at the column-family level!
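
The "distributed multi-level map" idea can be sketched in a few lines. This is a minimal in-memory model, not HBase code: the helper names (`make_table`, `put`, `get`) and the sample rows are assumptions made for illustration. It only shows the nesting row key → column family → column qualifier → timestamp → value.

```python
from collections import defaultdict

def make_table():
    # row key -> column family -> column qualifier -> timestamp -> value
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, row, family, qualifier, value, ts):
    # Each write adds a new version instead of overwriting the old one.
    table[row][family][qualifier][ts] = value

def get(table, row, family, qualifier):
    """Return the newest version of a cell, or None if the cell is absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]

table = make_table()
put(table, "row1", "info", "name", "Alice", ts=1)
put(table, "row1", "info", "name", "Alicia", ts=2)
put(table, "row2", "info", "email", "bob@example.org", ts=1)

print(get(table, "row1", "info", "name"))  # Alicia
print(sorted(table))                       # ['row1', 'row2']
```

Rows can carry different qualifiers inside the same family ("name" vs. "email" above), which is what makes adding columns on the fly cheap.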

HBase Architecture

Auto Sharding

Distribution

Auto Sharding and Distribution

● Region is the unit of scalability

● Sorted, contiguous range of rows

● Load balancing and failover

● Split automatically or manually to scale with growing data

● Capacity is solely a factor of cluster nodes vs. regions per node
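
A toy sketch of regions as sorted, contiguous row ranges that split as data grows. The class `RegionMap` and the `max_rows` threshold are hypothetical; real HBase decides splits from store size, not from a row count, and region assignment to servers is left out entirely.

```python
import bisect

class RegionMap:
    """Toy model: regions are sorted, contiguous row-key ranges."""

    def __init__(self):
        # start_keys[i] is the inclusive start key of region i. The first
        # region starts at the empty key; the last one is open-ended.
        self.start_keys = [""]
        self.rows = [[]]  # sorted row keys held by each region

    def locate(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, max_rows=4):
        i = self.locate(row_key)
        region = self.rows[i]
        pos = bisect.bisect_left(region, row_key)
        if pos == len(region) or region[pos] != row_key:
            region.insert(pos, row_key)
        if len(region) > max_rows:  # assumed split trigger
            self.split(i)

    def split(self, i):
        """Split region i at its middle row key into two daughter regions."""
        region = self.rows[i]
        mid = len(region) // 2
        self.start_keys.insert(i + 1, region[mid])
        self.rows[i:i + 1] = [region[:mid], region[mid:]]

regions = RegionMap()
for key in ["a", "b", "c", "d", "e", "f"]:
    regions.put(key)

print(regions.start_keys)    # ['', 'c']
print(regions.locate("b"))   # 0
print(regions.locate("zzz")) # 1
```

Because start keys stay sorted, finding the region for a row is a binary search, and a split only touches the one region that grew too large.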

Regions and Splitting

HMaster

● Coordinates region splitting

● Load balancing

● Table management

● Multiple masters for failover

Zookeeper

● Master election

● Locate the -ROOT- region

● Region Server membership

Where is my row?

Inside the Region

Writing to HBase

Reading from HBase

Merge Reads
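
The diagrams for the write and read paths did not survive in this transcript. As a rough stand-in, here is a minimal sketch of the general pattern: a write goes to a write-ahead log, then to an in-memory MemStore, which is flushed to immutable store files; a read merges the MemStore with all store files. The class `Store` and its `flush_threshold` parameter are assumptions for illustration, not HBase APIs.

```python
class Store:
    """Toy store for one column family: WAL + MemStore + immutable files."""

    def __init__(self, flush_threshold=2):
        self.wal = []          # append-only log, written first
        self.memstore = {}     # in-memory buffer: row -> (ts, value)
        self.store_files = []  # flushed, immutable, sorted snapshots
        self.flush_threshold = flush_threshold

    def put(self, row, value, ts):
        self.wal.append((row, value, ts))  # 1. log the edit for durability
        self.memstore[row] = (ts, value)   # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # 3. flush when the buffer is full

    def flush(self):
        # Write the buffer out in row-key order and start a fresh MemStore.
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        """Merge read: check the MemStore and every file, newest timestamp wins."""
        sources = [self.memstore, *self.store_files]
        candidates = [src[row] for src in sources if row in src]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]

store = Store()
store.put("r1", "v1", ts=1)
store.put("r2", "x", ts=2)   # second row triggers a flush
store.put("r1", "v2", ts=3)  # newer version stays in the MemStore

print(store.get("r1"))        # v2
print(len(store.store_files)) # 1
```

The more store files accumulate, the more sources a read has to merge, which is why real systems compact files in the background.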

Logical and physical layout

● Logical layout does not match physical one

● All values are stored with their full coordinates, including Row Key, Column Family, Column Qualifier and Timestamp

● Folds columns into "row per column"

● NULLs are cost-free

● Versions are multiple "rows" in the folded table
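
The folding described above can be sketched as follows. The function `fold` and the sample data are assumptions for illustration; the point is only that every stored cell carries its full coordinates, that a missing column produces no cell at all, and that each version becomes its own entry.

```python
def fold(logical_rows):
    """Flatten {row: {family: {qualifier: {ts: value}}}} into sorted cells."""
    cells = []
    for row, families in logical_rows.items():
        for family, columns in families.items():
            for qualifier, versions in columns.items():
                for ts, value in versions.items():
                    cells.append((row, family, qualifier, ts, value))
    # Sort by coordinates; within one column the newest version comes first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells

logical = {
    "row1": {"info": {"name": {1: "Alice", 2: "Alicia"}}},
    "row2": {"info": {"email": {1: "bob@example.org"}}},
}

for cell in fold(logical):
    print(cell)
# ('row1', 'info', 'name', 2, 'Alicia')
# ('row1', 'info', 'name', 1, 'Alice')
# ('row2', 'info', 'email', 1, 'bob@example.org')
```

Note that row2 has no "name" cell and row1 has no "email" cell: nothing is stored for them, which is why NULLs cost nothing.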

Key Cardinality

● Best performance is gained from using row keys

● Selecting column families reduces the amount of data to be scanned

● Time-range-bound reads can skip store files (Bloom filters can too!)

● Pure value-based filtering is a full table scan!
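
Why a time-range-bound read can skip store files: if each file records the smallest and largest timestamp it contains, files that cannot overlap the requested range never need to be opened. A minimal sketch, assuming a hypothetical `StoreFile` record with `min_ts` and `max_ts` fields and a half-open range [start_ts, end_ts):

```python
from dataclasses import dataclass

@dataclass
class StoreFile:
    name: str
    min_ts: int  # smallest cell timestamp in the file
    max_ts: int  # largest cell timestamp in the file

def files_to_read(store_files, start_ts, end_ts):
    """Keep only files whose timestamp range overlaps [start_ts, end_ts)."""
    return [f.name for f in store_files
            if f.min_ts < end_ts and f.max_ts >= start_ts]

files = [
    StoreFile("f1", 0, 99),
    StoreFile("f2", 100, 199),
    StoreFile("f3", 200, 299),
]

print(files_to_read(files, 150, 250))  # ['f2', 'f3']
```

A filter on cell values has no such metadata to lean on, so every file and every cell has to be inspected.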

Tall-Narrow or Flat-Wide tables?

● Same storage footprint

● Atomicity only at the row level

● A row is never split across regions

● Put more details into the row keys

● Tall with scans, wide with gets
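
A small sketch of the two designs, using a made-up message store (the user names, message ids and the `prefix_scan` helper are assumptions for illustration). In the tall-narrow design the message id moves into the row key and a user's messages are fetched with a range scan; in the flat-wide design each user is one row and is fetched with a single get.

```python
import bisect

# Tall-narrow: one row per (user, message); the detail lives in the row key.
tall_keys = sorted(["alice-0001", "alice-0002", "bob-0001", "carol-0001"])

def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with prefix via a range scan on sorted keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]

# Flat-wide: one row per user; message ids become column qualifiers.
wide_rows = {
    "alice": {"0001": "hi", "0002": "bye"},
    "bob": {"0001": "yo"},
}

print(prefix_scan(tall_keys, "alice-"))  # ['alice-0001', 'alice-0002']
print(sorted(wide_rows["alice"]))        # ['0001', '0002']
```

The wide row for "alice" can never be split across regions and is updated atomically; the tall layout gives up that atomicity across messages but spreads one user's data over as many rows as needed.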

Example

Sequential read and write

HBase and MapReduce

How is data locality achieved?

● The Region Server and the DataNode run on the same node

● HBase is shut down very rarely

● HDFS helps: when a Region Server writes to HDFS, one replica of each data block is stored on the local DataNode

HBase vs. RDBMS

HBase                                      RDBMS
Column-oriented                            Row-oriented
Flexible schema, add columns on the fly    Fixed schema
Good with sparse tables                    Not optimized for sparse tables
No query language (just Scan and Get)      SQL
No joins (but possible with MapReduce)     Optimized for joins
Horizontal scalability                     Hard to scale
No transactions                            Transactional
Consistent                                 Consistent

It’s a really wrong comparison!

HBase on the CAP triangle

Q&A
