Google - Bigtable

Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.


TRANSCRIPT

Page 1: Google - Bigtable


Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Google, Inc.

Page 2: Google - Bigtable


Index
◦ Introduction
◦ Data Model
◦ API
◦ Building Blocks
◦ Implementation
◦ Refinements
◦ Real Applications
◦ Conclusions

Page 3: Google - Bigtable


Introduction
1. Motivation
2. What is a Bigtable?
3. Why not a DBMS?

Page 4: Google - Bigtable


Introduction : Motivation
Lots of structured data at Google

◦ Web pages, geographic info, user data, mail

Millions of machines
Many different projects/applications

Page 5: Google - Bigtable


Introduction : Why not a DBMS?
A DBMS provides more than Google needs
Google required a DB with wide scalability, wide applicability, high performance, and high availability

Low-level storage optimizations help performance significantly

Cost would be very high
◦ Most DBMSs require very expensive infrastructure

Page 6: Google - Bigtable


Introduction : What is a Bigtable?
Bigtable is a distributed storage system for managing structured data
Achieves several goals
◦ Wide applicability, scalability, high performance

Scalable
◦ Terabytes of in-memory data
◦ Petabytes of disk-based data
◦ Millions of reads/writes per second, efficient scans

Self-managing
◦ Servers can be added/removed dynamically
◦ Servers adjust to load imbalance

Page 7: Google - Bigtable


Data Model
1. Row
2. Column families
3. Timestamps

Page 8: Google - Bigtable


Data Model : Row
The row keys in a table are arbitrary strings
Data is maintained in lexicographic order by row key
A row range is called a "tablet", which is the unit of distribution and load balancing
Rows are sorted by row key within a tablet
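To make the ordering concrete, here is a minimal sketch (our own toy code, not Google's): C++'s std::map already keeps string keys in lexicographic order, so a tablet can be modeled as a contiguous half-open range of row keys. All names are illustrative.

```cpp
#include <map>
#include <string>
#include <vector>

// Rows sorted lexicographically by row key, as in Bigtable's data model.
using Rows = std::map<std::string, std::string>;  // row key -> value

// Collect the row keys belonging to the tablet [start, end).
std::vector<std::string> TabletKeys(const Rows& rows,
                                    const std::string& start,
                                    const std::string& end) {
  std::vector<std::string> keys;
  for (auto it = rows.lower_bound(start);
       it != rows.end() && it->first < end; ++it) {
    keys.push_back(it->first);
  }
  return keys;
}
```

Because related row keys (e.g. reversed URLs under one domain) sort next to each other, they land in the same tablet, which is what makes short scans efficient.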

Page 9: Google - Bigtable


Data Model : Column Families
Column keys are grouped into sets called "column families"
A column family is the basic unit of access control
A column key is named using the syntax "family:qualifier"
Access control and disk/memory accounting are performed at the column-family level
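A small illustrative helper (not part of the paper's API) shows the column-key syntax: the family is everything before the first ':', and that is the part access control and accounting key off.

```cpp
#include <string>
#include <utility>

// Split a column key of the form "family:qualifier" at the first ':'.
// A key with no ':' is treated as a bare family name.
std::pair<std::string, std::string> SplitColumnKey(const std::string& key) {
  auto pos = key.find(':');
  if (pos == std::string::npos) return {key, ""};
  return {key.substr(0, pos), key.substr(pos + 1)};
}
```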

Page 10: Google - Bigtable


Data Model : Timestamps
Each cell in a Bigtable can contain multiple versions of the same data
Versions are sorted by timestamp in descending order
Timestamps are 64-bit integers
They represent real time in microseconds, or are assigned by the client application
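A sketch of one versioned cell under this model (class name and layout are ours): versions are keyed by a 64-bit timestamp and iterate newest-first, which std::greater gives us directly.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// One cell holding multiple timestamped versions, newest first.
class Cell {
 public:
  void Put(int64_t ts_micros, const std::string& value) {
    versions_[ts_micros] = value;
  }
  // The most recent version is the first entry in descending order.
  const std::string& Latest() const { return versions_.begin()->second; }

 private:
  std::map<int64_t, std::string, std::greater<int64_t>> versions_;
};
```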

Page 11: Google - Bigtable


Data Model : Example
[Figure: example table showing a row, column families, and timestamped cells]

Page 12: Google - Bigtable


API
The Bigtable API provides functions to:

◦ Create/delete tables and column families
◦ Change table and column-family metadata
◦ Look up values from individual rows
◦ Iterate over a subset of the data

Supports single-row transactions
Can be used with MapReduce (HBase)

Page 13: Google - Bigtable


API : Example
Uses a Scanner to iterate over all anchors in a particular row:

Table *T = OpenOrDie("/bigtable/web/webtable");
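The real client library is not public, so here is a toy stand-in for the same idea: one row's columns held in a sorted map, from which we collect every column in the "anchor" family. None of these names come from the actual Bigtable API.

```cpp
#include <map>
#include <string>
#include <vector>

// One row's columns, keyed "family:qualifier" -> value.
using RowColumns = std::map<std::string, std::string>;

// Return the qualifiers of all columns in the "anchor" family.
std::vector<std::string> AnchorsInRow(const RowColumns& row) {
  std::vector<std::string> anchors;
  // Columns of one family are contiguous in the sorted map, so we can
  // scan just the range whose keys start with "anchor:".
  for (auto it = row.lower_bound("anchor:"); it != row.end(); ++it) {
    if (it->first.compare(0, 7, "anchor:") != 0) break;
    anchors.push_back(it->first.substr(7));
  }
  return anchors;
}
```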

Page 14: Google - Bigtable


Building Blocks
Uses the distributed Google File System (GFS) to store log and data files

A Bigtable cluster typically operates in a shared pool of machines

Depends on a cluster management system

The Google SSTable file format is used internally to store Bigtable data

Relies on a highly available and persistent distributed lock service called Chubby

Page 15: Google - Bigtable


Building Blocks : GFS & SSTable & Chubby
Google File System:

◦ Google File System grew out of an earlier Google effort, "BigFiles"

◦ Chosen for high data throughput

Page 16: Google - Bigtable


Building Blocks : GFS & SSTable & Chubby
SSTable:

◦ Provides a persistent, ordered map from keys to values

◦ Contains a sequence of blocks, located via a block index
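The block-index idea can be sketched as follows (our own simplification of the SSTable format): the index records the last key of each data block, so a binary search over it names the single block that could hold a given key.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified SSTable block index: last key of each block, in order.
struct BlockIndex {
  std::vector<std::string> last_keys;

  // Block number to read for `key`, or -1 if the key is past the end
  // of the file (so no block needs to be read at all).
  int BlockFor(const std::string& key) const {
    auto it = std::lower_bound(last_keys.begin(), last_keys.end(), key);
    if (it == last_keys.end()) return -1;
    return static_cast<int>(it - last_keys.begin());
  }
};
```

With the index held in memory, a point lookup costs at most one disk read: one seek into the chosen block.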

Page 17: Google - Bigtable


Building Blocks : GFS & SSTable & Chubby
Chubby:

◦ Ensures that there is at most one active master at any time

◦ Stores the bootstrap location of Bigtable data

◦ Discovers tablet servers and finalizes tablet server deaths

◦ Stores Bigtable schema information (the column-family information for each table)

Page 18: Google - Bigtable


Implementation
1. Tablet Location
2. Tablet Assignment
3. Tablet Serving

Page 19: Google - Bigtable


Implementation
Three major components:

◦ A library that is linked into every client
◦ One master server
◦ Many tablet servers

Page 20: Google - Bigtable


Implementation : Tablet Location
Uses a three-level hierarchy, analogous to that of a B+-tree, to store tablet location information (at most three levels)

The first level is a file stored in Chubby that contains the location of the root tablet

Page 21: Google - Bigtable


Implementation : Tablet Location
Root tablet
◦ The first tablet in the METADATA table
◦ Never split, to ensure that the tablet location hierarchy has no more than three levels

METADATA tablet
◦ Stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row

Page 22: Google - Bigtable

Implementation : Tablet Assignment

Master server
◦ Assigns tablets to tablet servers
◦ Detects the presence or absence (expiration) of tablet servers
◦ Balances tablet-server load
◦ Handles schema changes such as table and column-family creations

Tablet server
◦ Manages a set of tablets (ten to a thousand tablets per tablet server)
◦ Handles read/write requests to the tablets
◦ Splits tablets that have grown too large

Page 23: Google - Bigtable


Implementation : Tablet Serving
Updates are committed to a commit log that stores redo records

Recently committed updates are stored in memory in the memtable

Older updates are stored in a sequence of SSTables

Page 24: Google - Bigtable


Refinements
1. Locality groups
2. Compression
3. Caching for read performance
4. Bloom filters
5. Commit-log implementation

Page 25: Google - Bigtable


Refinements
Locality groups
◦ Clients can group multiple column families together into a locality group

Compression
◦ Small portions of an SSTable can be read without decompressing the entire file
◦ Encodes at 100-200 MB/s
◦ Decodes at 400-1000 MB/s
◦ Achieves roughly a 10-to-1 reduction in space

Page 26: Google - Bigtable


Refinements
Caching for read performance
◦ Tablet servers use two levels of caching: the Scan Cache and the Block Cache

Bloom filters
◦ Can be created for the SSTables in a particular locality group

Commit-log implementation
◦ Mutations for different tablets are co-mingled in the same physical log file

Page 27: Google - Bigtable


Real Applications
1. Google Analytics
2. Personalized Search

Page 28: Google - Bigtable


Real Applications
Google Analytics
◦ Uses two of the tables: the raw click table (~200 TB) and the summary table (~20 TB)
◦ Uses MapReduce

Personalized Search
◦ Stores each user's history
◦ Uses MapReduce

Page 29: Google - Bigtable


Conclusions
Bigtable clusters have been in production use at Google since April 2005

Provides performance and high availability

Google found significant advantages to building its own storage solution

Apache HBase is based on Bigtable

Page 30: Google - Bigtable


Thank you!