Introduction to Apache HBase
DESCRIPTION
This presentation gives an introduction to HBase, the Hadoop database.
TRANSCRIPT
![Page 1: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/1.jpg)
Introduction to HBase
Gokuldas K Pillai (@gokool)
![Page 2: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/2.jpg)
HBase - The Hadoop Database
• Based on Google’s BigTable (OSDI’06)
• Runs on top of Hadoop but provides real-time read/write access
• Distributed, column-oriented database
![Page 3: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/3.jpg)
HBase Strengths
• Can scale to billions of rows × millions of columns
• Relatively cheap and easy to scale
• Random, real-time read/write access to very large data sets
• Support for update and delete
![Page 4: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/4.jpg)
Who is using it
• StumbleUpon / su.pr
– Uses HBase as a real-time data storage and analytics platform
• Twitter
– Distributed read/write backup of all MySQL instances. Powers “people search”.
• Powerset (now part of Microsoft)
• Adobe
• Yahoo
• Ning
• Meetup
• More at http://wiki.apache.org/hadoop/Hbase/PoweredBy
![Page 5: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/5.jpg)
Key features
• Column-oriented store
– A table costs only as much as the data stored in it
– NULLs in rows are free
• Rows are stored in sorted order
• Can scale to petabytes (at Google)
![Page 6: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/6.jpg)
Comparing to RDBMS
• No joins
• No query engine
• No transactions
• No column typing
• No SQL, no ODBC/JDBC (Hbql is there now)
![Page 7: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/7.jpg)
Data Model - Tables
• Tables consist of rows and columns
• Table cells are versioned (by timestamp)
• Tables are sorted by row key
• Table access is via the primary key (the row key)
• Row updates lock the row no matter how many columns are involved
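The table semantics above (sorted row keys, timestamp-versioned cells, access by row key) can be illustrated with a small in-memory model. This is only a sketch and not HBase code: `ToyTable` and its `put`, `get` and `scan` methods are hypothetical names chosen for illustration, and row locking is left out.

```python
import bisect
import time


class ToyTable:
    """Toy in-memory model of the slide's table semantics (not the HBase API)."""

    def __init__(self):
        self._row_keys = []  # row keys, always kept in sorted order
        self._rows = {}      # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row_key, column, value, timestamp=None):
        # Every write adds a new version of the cell, tagged with a timestamp.
        ts = time.time() if timestamp is None else timestamp
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        versions = self._rows[row_key].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row_key, column, max_versions=1):
        # Access is by row key; the newest versions are returned first.
        versions = self._rows.get(row_key, {}).get(column, [])
        return [value for _, value in versions[:max_versions]]

    def scan(self, start_row, stop_row):
        # Sorted row keys make a range scan a pair of binary searches.
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        return self._row_keys[lo:hi]
```

Because the keys are sorted, `scan(start_row, stop_row)` returns the keys in the half-open range without touching unrelated rows, which is why row-key design matters so much for scans.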
![Page 8: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/8.jpg)
Column Families
• A row’s columns are grouped into families
• Column family members are identified by a common ‘printable’ prefix
• Column families should be predefined
– but column family members can be added dynamically
– member names can be arbitrary bytes
• All column family members are collocated on disk
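A rough sketch of these rules, assuming the usual `family:qualifier` column naming: families are fixed when the store is created, qualifiers are free-form bytes, and each family keeps its cells in its own store. `ToyFamilyStore` is a hypothetical name for illustration, not part of HBase.

```python
class ToyFamilyStore:
    """Toy model of column families (illustrative only, not HBase code)."""

    def __init__(self, families):
        # Families are declared up front; each one gets its own store,
        # mimicking "all column family members are collocated on disk".
        self._stores = {family: {} for family in families}

    def put(self, row_key, column, value):
        # The column name is b"family:qualifier"; the qualifier may be any bytes.
        family, _, qualifier = column.partition(b":")
        if family not in self._stores:
            raise KeyError(f"unknown column family: {family!r}")
        self._stores[family][(row_key, qualifier)] = value

    def family_cells(self, family):
        # Return a copy of everything stored for one family.
        return dict(self._stores[family])
```

New qualifiers such as `b"info:nickname"` can be written at any time, while a write to an undeclared family is rejected.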
![Page 9: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/9.jpg)
![Page 10: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/10.jpg)
![Page 11: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/11.jpg)
Server Architecture
• Similar to HDFS
– HBaseMaster ~ NameNode
– RegionServer ~ DataNode
• HBase stores state via the Hadoop FS API
• Can persist to:
– Local
– Amazon S3
– HDFS (default)
![Page 12: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/12.jpg)
HBaseMaster
What it does:
• Bootstraps a new instance
• Assigns regions and handles RegionServer problems
– Each region from every table is assigned to a RegionServer
• When machines fail, moves regions
• When regions split, moves regions to balance the load
What it does NOT do:
– Handle write requests (not a DB master)
– Handle location-finding requests (handled by RegionServer)
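The assignment duties can be sketched as a toy model. The least-loaded placement policy below is an assumption made for illustration; the slides do not say how the real Master picks a RegionServer, and `ToyMaster` is a hypothetical name.

```python
class ToyMaster:
    """Toy model of region assignment (assumed least-loaded policy, not HBase code)."""

    def __init__(self, servers):
        # server name -> set of regions currently assigned to it
        self.assignments = {server: set() for server in servers}

    def _least_loaded(self):
        # Ties are broken by server name so the result is deterministic.
        return min(sorted(self.assignments), key=lambda s: len(self.assignments[s]))

    def assign(self, region):
        server = self._least_loaded()
        self.assignments[server].add(region)
        return server

    def handle_server_failure(self, failed):
        # When a machine fails, its regions are moved to the surviving servers.
        orphaned = self.assignments.pop(failed)
        for region in sorted(orphaned):
            self.assign(region)
```

Note that the model never sees a read or write request, matching the point that the Master is not on the data path.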
![Page 13: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/13.jpg)
RegionServer
• Carry the regions
• Handle client read/write requests
• Manage region splits (inform the Master)
![Page 14: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/14.jpg)
Regions
• Horizontal partitioning
• Every region has a subset of the table’s rows
• Region identified as
– [table, first row(+), last row(-)], i.e. first row inclusive, last row exclusive
• Table starts on a single region
• Splits into two equal-sized regions as the original region grows bigger, and so on
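Horizontal partitioning with splits can be sketched as follows. Two simplifications are assumed: the split threshold counts rows rather than bytes, and an empty string stands for the open lower bound of the first region. `ToyRegionedTable` is a hypothetical name, not HBase code.

```python
import bisect


class ToyRegionedTable:
    """Toy model of regions and splits (illustrative only)."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]  # first row key of each region, sorted
        self.regions = [[]]     # sorted row keys held by each region

    def locate(self, row_key):
        # A region owns [start key, next region's start key).
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        idx = self.locate(row_key)
        rows = self.regions[idx]
        pos = bisect.bisect_left(rows, row_key)
        if pos < len(rows) and rows[pos] == row_key:
            return  # row already present
        rows.insert(pos, row_key)
        if len(rows) > self.max_rows:
            self._split(idx)

    def _split(self, idx):
        # Split at the middle row key into two roughly equal regions.
        rows = self.regions[idx]
        mid = len(rows) // 2
        left, right = rows[:mid], rows[mid:]
        self.regions[idx] = left
        self.regions.insert(idx + 1, right)
        self.start_keys.insert(idx + 1, right[0])

    def boundaries(self):
        # (first row inclusive, last row exclusive); None means unbounded.
        ends = self.start_keys[1:] + [None]
        return list(zip(self.start_keys, ends))
```

With a limit of four rows, inserting a fifth row splits the single starting region in two, and `locate` then routes each key to the region whose range contains it.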
![Page 15: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/15.jpg)
ZooKeeper
• Master election and server availability
• Cluster management
– Assignment transaction state management
• Client contacts ZooKeeper to bootstrap its connection to the HBase cluster
• Region key ranges, region server addresses
• Guarantees consistency of data across clients
![Page 16: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/16.jpg)
Workflow (Client connecting first time)
• Client → ZooKeeper (returns -ROOT-)
• Client → -ROOT- (returns .META.)
• Client → .META. (returns RegionServer)
• To avoid three lookups every time, the client caches this info
– Re-cache on fault
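The three-step lookup and the client-side cache can be sketched like this. `ToyCluster` and `ToyClient` are hypothetical stand-ins that only count hops, and the cache is keyed by row key for brevity; a real client would cache per region.

```python
class ToyCluster:
    """Stands in for ZooKeeper, -ROOT- and .META.; counts lookup hops."""

    def __init__(self, region_locations):
        self.region_locations = dict(region_locations)  # row key -> server name
        self.hops = 0

    def ask_zookeeper(self):
        self.hops += 1
        return "-ROOT-"

    def ask_root(self):
        self.hops += 1
        return ".META."

    def ask_meta(self, row_key):
        self.hops += 1
        return self.region_locations[row_key]


class ToyClient:
    """Caches locations so the three-hop lookup happens only once per key."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}

    def locate(self, row_key):
        if row_key not in self.cache:
            self.cluster.ask_zookeeper()  # hop 1: where is -ROOT-?
            self.cluster.ask_root()       # hop 2: where is .META.?
            self.cache[row_key] = self.cluster.ask_meta(row_key)  # hop 3
        return self.cache[row_key]

    def on_fault(self, row_key):
        # A request to a stale location failed: drop the entry and look it up again.
        self.cache.pop(row_key, None)
        return self.locate(row_key)
```

After the first call, `locate` answers from the cache with no extra hops; only a fault (for example, a region that moved) triggers the full lookup again.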
![Page 17: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/17.jpg)
Write/Read Operation
• Write request from Client → RegionServer → commit log (on HDFS), memstore
• Flush to filesystem when memstore fills
• Read request from Client → RegionServer
– Look up the memstore first
– If not found there, look up the flush files (in reverse chronological order)
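The write and read paths can be modelled in a few lines. This is a sketch under simplifying assumptions: the commit log is a plain list rather than a file on HDFS, the memstore "fills" after a fixed number of keys, and `ToyRegionStore` is a hypothetical name.

```python
class ToyRegionStore:
    """Toy model of the commit log, memstore and flush files (not HBase code)."""

    def __init__(self, memstore_limit=2):
        self.memstore_limit = memstore_limit
        self.commit_log = []   # stands in for the commit log on HDFS
        self.memstore = {}     # recent writes held in memory
        self.flush_files = []  # immutable snapshots, oldest first

    def write(self, key, value):
        # Append to the commit log first, then apply to the memstore.
        self.commit_log.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # Persist the memstore as a new flush file and start a fresh memstore.
        if self.memstore:
            self.flush_files.append(dict(self.memstore))
            self.memstore = {}

    def read(self, key):
        # Check the memstore first, then flush files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for flushed in reversed(self.flush_files):
            if key in flushed:
                return flushed[key]
        return None
```

Searching the flush files newest-first is what makes a later write to the same key shadow an earlier one, even after both have been flushed.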
![Page 18: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/18.jpg)
Integration
• Java HBase client API
• High-performance Thrift gateway
• A RESTful web service gateway (Stargate)
– Supports XML and binary data encoding options
• Cascading, Hive and Pig integration
• HBase shell (JRuby)
• TableInputFormat/TableOutputFormat for MapReduce
![Page 19: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/19.jpg)
Main Classes
• HBaseAdmin
– Create table, drop table, list and alter table
• HTable
– Put
– Get
– Scan
![Page 20: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/20.jpg)
Alternatives to HBase
• Cassandra (from Facebook)
– Based on Amazon’s Dynamo
– No master-slave, but P2P
– Tunable: consistency vs. latency
• Yahoo’s PNUTS
– Not open source
– Works well for multi-DC / geographically dispersed servers
![Page 21: Introduction to Apache HBase](https://reader033.vdocuments.mx/reader033/viewer/2022061613/554bc4dfb4c90594278b5468/html5/thumbnails/21.jpg)
References
• Hadoop: The Definitive Guide
• Cloudera website
• http://wiki.hbase.apache.org
• Lars George
– http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
• Comparing HBase, Cassandra and PNUTS
– http://blog.amandeepkhurana.com/2010/05/comparing-pnuts-hbase-and-cassandra.html
• ACID compliance of HBase
– http://hbase.apache.org/docs/r0.89.20100621/acid-semantics.html