![Page 1: CX4242: Data & Visual Analytics Scaling Uppoloclub.gatech.edu/.../CSE6242-13-ScalingUp-hbase.pdf · 2016-10-25 · CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd583c539fc95edb18eb4f/html5/thumbnails/1.jpg)
CSE6242 / CX4242: Data & Visual Analytics
Scaling Up HBase
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
http://poloclub.gatech.edu/cse6242
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray
What if you need real-time read/write for large datasets?
Lecture based on these two books.
http://goo.gl/YNCWN http://goo.gl/svzTV
Built on top of HDFS
Supports real-time read/write random access
Scales to very large datasets, many machines
Not relational, does NOT support SQL (“NoSQL” = “not only SQL”) http://en.wikipedia.org/wiki/NoSQL
Supports billions of rows, millions of columns (e.g., serving Facebook’s Messaging Platform)
Written in Java; works with other APIs/languages (REST, Thrift, Scala)
Where does HBase come from?
http://hbase.apache.org
http://radar.oreilly.com/2014/04/5-fun-facts-about-hbase-that-you-didnt-know.html
http://wiki.apache.org/hadoop/Hbase/PoweredBy
HBase’s “history”
Hadoop & HDFS based on ... (designed for batch processing)
• 2003 Google File System (GFS) paper
• 2004 Google MapReduce paper
HBase based on ... (designed for random access)
• 2006 Google Bigtable paper
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
http://cracking8hacking.com/cracking-hacking/Ebooks/Misc/pdf/The%20Google%20filesystem.pdf
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
How does HBase work?
Column-oriented
Column is the most basic unit (instead of row)
• Multiple columns form a row
• A column can have multiple versions, each version stored in a cell
Rows form a table
• Row key locates a row
• Rows sorted by row key lexicographically (~= alphabetically)
Row key is unique
Think of row key as the “index” of an HBase table
• You look up a row using its row key
Only one “index” per table (via row key)
HBase does not have built-in support for multiple indices; support enabled via extensions
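A minimal sketch in plain Python (not the HBase API) of what a single row-key “index” implies: lookup by row key is direct, while lookup by any other attribute needs either a full scan or a second index that you maintain yourself. The table contents and the names `users` and `email_index` are hypothetical.

```python
# Toy model: a table is a dict keyed by row key (the only built-in "index").
users = {
    "user-001": {"info:name": "Ann", "info:email": "ann@example.com"},
    "user-002": {"info:name": "Bob", "info:email": "bob@example.com"},
}

def get_by_row_key(table, row_key):
    """Direct lookup through the row key."""
    return table.get(row_key)

def find_by_email_scan(table, email):
    """Without a second index, every row has to be examined."""
    return [key for key, row in table.items() if row.get("info:email") == email]

def build_email_index(table):
    """A hand-maintained secondary index: email -> row key."""
    return {row["info:email"]: key for key, row in table.items()}

email_index = build_email_index(users)
```

The secondary index is just another mapping that must be kept in sync on every write, which is roughly the work an indexing extension would take over.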
Rows sorted lexicographically (~= alphabetically)
hbase(main):001:0> scan 'table1'
ROW       COLUMN+CELL
row-1     column=cf1:, timestamp=1297073325971 ...
row-10    column=cf1:, timestamp=1297073337383 ...
row-11    column=cf1:, timestamp=1297073340493 ...
row-2     column=cf1:, timestamp=1297073329851 ...
row-22    column=cf1:, timestamp=1297073344482 ...
row-3     column=cf1:, timestamp=1297073333504 ...
row-abc   column=cf1:, timestamp=1297073349875 ...
7 row(s) in 0.1100 seconds
“row-10” comes before “row-2”. How to fix?
Pad “row-2” with a “0”, i.e., “row-02”.
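The ordering problem and the padding fix can be reproduced with ordinary Python strings, which also compare character by character. This is only an illustration of the sort order; the helper `pad_key` and its `width` parameter are made up for the example.

```python
keys = ["row-1", "row-10", "row-11", "row-2", "row-22", "row-3", "row-abc"]

# String comparison goes character by character, so "row-10" sorts before "row-2".
lexicographic = sorted(keys)

def pad_key(prefix, number, width=2):
    """Build a row key whose numeric part is zero-padded to a fixed width."""
    return f"{prefix}-{number:0{width}d}"

# With fixed-width numbers, lexicographic order matches numeric order.
padded = sorted(pad_key("row", n) for n in [1, 10, 11, 2, 22, 3])
print(lexicographic)
print(padded)
```

The width has to be chosen for the largest number you expect; keys padded to two digits will sort incorrectly again once a three-digit number appears.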
Columns grouped into column families
• Why?
  • Helps with organization, understanding, optimization, etc.
• In detail...
  • Columns in the same family are stored in the same file, called an HFile
    • Inspired by Google’s SSTable = large map whose keys are sorted
  • Compression can be applied to the whole family
  • ...
More on column family, column
Column family
• An HBase table supports only a few families (e.g., <10)
  • Due to limitations in implementation
• Family name must be printable
• Should be defined when table is created
• Should not be changed often
Each column referenced as “family:qualifier”
• Can have millions of columns
• Values can be anything, and arbitrarily long
Cell Value
Timestamped
• Implicitly by system
• Or set explicitly by user
Lets you store multiple versions of a value
• = values over time
Values stored in decreasing time order
• Most recent value can be read first
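A small sketch of timestamped versions, assuming a toy `Cell` class written for this example (it is not part of any HBase client). It keeps versions newest first, so the most recent value is the first one read; it does not model what HBase does when two writes share a timestamp.

```python
import time

class Cell:
    """Toy model of the versions held by one cell; not the HBase API."""

    def __init__(self):
        self._versions = []  # list of (timestamp, value), newest first

    def put(self, value, timestamp=None):
        # Timestamp is set implicitly (current time in ms) unless given explicitly.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, max_versions=1):
        """Return up to max_versions entries, most recent first."""
        return self._versions[:max_versions]

cell = Cell()
cell.put("v1", timestamp=100)
cell.put("v3", timestamp=300)
cell.put("v2", timestamp=200)
```

Here `cell.get()` returns only the newest entry, and `cell.get(max_versions=3)` returns all three in decreasing time order regardless of the order in which they were written.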
Time-oriented view of a row
Concise way to describe all these?
HBase data model (= Bigtable’s model)
• Sparse, distributed, persistent, multidimensional map
• Indexed by row key + column key + timestamp
(Table, RowKey, Family, Column, Timestamp) → Value
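The mapping can be pictured as nested dictionaries. The sketch below is a simplified in-memory model, assuming made-up helper names (`make_table`, `put`, `get`); it ignores distribution and persistence and only shows the indexing by row key, column key and timestamp.

```python
from collections import defaultdict

def make_table():
    # row key -> family -> qualifier -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, row, family, qualifier, timestamp, value):
    table[row][family][qualifier][timestamp] = value

def get(table, row, family, qualifier, timestamp=None):
    """Return the value at the given timestamp, or the newest one by default."""
    # .get() is used so that reading a missing cell does not create it (sparse).
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)

t = make_table()
put(t, "row-1", "cf1", "a", 1, "old")
put(t, "row-1", "cf1", "a", 2, "new")
```

Only cells that were written occupy space in the map, which is the “sparse” part of the description.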
An exercise
How would you use HBase to create a webtable that stores snapshots of every webpage on the planet, over time?
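One possible answer, sketched in plain Python and loosely following the Webtable example in the Bigtable paper: use the URL with its hostname reversed as the row key (so pages of one domain sort next to each other), store the page in a `contents` family, and use the fetch time as the timestamp. The function names, the `contents:html` column and the sample pages are assumptions for illustration.

```python
from urllib.parse import urlsplit

def webtable_row_key(url):
    """Reverse the hostname so pages from the same domain are adjacent."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")

webtable = {}  # row key -> {column -> {timestamp: value}}

def store_snapshot(url, html, fetched_at):
    """Each fetch becomes a new timestamped version of the same cell."""
    row = webtable.setdefault(webtable_row_key(url), {})
    row.setdefault("contents:html", {})[fetched_at] = html

store_snapshot("http://www.cnn.com/index.html", "<html>v1</html>", 1)
store_snapshot("http://www.cnn.com/index.html", "<html>v2</html>", 2)
store_snapshot("http://maps.google.com", "<html>map</html>", 1)
```

Other designs are possible; the point is that row key, column family and timestamp each carry part of the answer.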
Details: How does HBase scale up storage & balance load?
Automatically divide contiguous ranges of rows into regions
Start with one region, split into two when getting too large
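A toy illustration of region splitting, assuming a made-up threshold `MAX_ROWS_PER_REGION` counted in rows. Real HBase decides when to split based on stored data size and configuration, so this only shows the idea of contiguous, sorted key ranges being cut in two.

```python
import bisect

MAX_ROWS_PER_REGION = 4  # hypothetical threshold, in rows

class Region:
    """A contiguous, sorted range of row keys."""
    def __init__(self, keys=None):
        self.keys = sorted(keys or [])

def insert(regions, key):
    """Put key into the region covering it; split that region if it grows too large."""
    # Regions are kept in row-key order; choose the last region starting at or before key.
    starts = [r.keys[0] if r.keys else "" for r in regions]
    idx = max(bisect.bisect_right(starts, key) - 1, 0)
    region = regions[idx]
    bisect.insort(region.keys, key)
    if len(region.keys) > MAX_ROWS_PER_REGION:
        mid = len(region.keys) // 2
        regions[idx:idx + 1] = [Region(region.keys[:mid]), Region(region.keys[mid:])]

regions = [Region()]  # start with one region
for n in range(1, 11):
    insert(regions, f"row-{n:02d}")
```

After the ten inserts the single starting region has been split several times, and reading the regions in order still yields the row keys in sorted order.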
http://blog.cloudera.com/blog/2013/04/how-scaling-really-works-in-apache-hbase/
How to use HBase
Interactive shell
• Will show you an example, locally (on your computer, without using HDFS)
Programmatically
• e.g., via Java, Python, etc.
Example, using interactive shell
Start HBase
Start Interactive Shell
Check HBase is running
Example: Create table, add values
Example: Scan (show all cell values)
Example: Get (look up a row)
Can also look up a particular cell value, with a certain timestamp, etc.
Example: Delete a value
Example: Disable & drop table
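The shell screenshots for these examples did not survive in this transcript. As a stand-in, here is an in-memory Python sketch of what create, put, scan, get, delete, disable and drop do. The class `ToyHBase` and the table, row and column names are invented for illustration; this is not the HBase shell or client API.

```python
class ToyHBase:
    """In-memory stand-in for a handful of shell operations; not the HBase API."""

    def __init__(self):
        self.tables = {}     # table name -> {row key -> {column -> value}}
        self.families = {}   # table name -> set of column families
        self.disabled = set()

    def create(self, name, *families):
        self.tables[name] = {}
        self.families[name] = set(families)

    def put(self, name, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families[name]:
            raise KeyError(f"unknown column family: {family}")
        self.tables[name].setdefault(row, {})[column] = value

    def get(self, name, row):
        return self.tables[name].get(row, {})

    def scan(self, name):
        # Rows come back sorted by row key.
        return sorted(self.tables[name].items(), key=lambda kv: kv[0])

    def delete(self, name, row, column):
        cells = self.tables[name].get(row, {})
        cells.pop(column, None)
        if not cells:
            self.tables[name].pop(row, None)

    def disable(self, name):
        self.disabled.add(name)

    def drop(self, name):
        # A table has to be disabled before it can be dropped.
        if name not in self.disabled:
            raise RuntimeError(f"table {name} must be disabled before it is dropped")
        del self.tables[name]
        del self.families[name]
        self.disabled.discard(name)

db = ToyHBase()
db.create("testtable", "colfam1")
db.put("testtable", "myrow-1", "colfam1:q1", "value-1")
db.put("testtable", "myrow-2", "colfam1:q2", "value-2")
db.put("testtable", "myrow-2", "colfam1:q3", "value-3")
db.delete("testtable", "myrow-2", "colfam1:q2")
```

Calling `db.drop("testtable")` at this point raises an error; calling `db.disable("testtable")` first and then `db.drop("testtable")` removes the table.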
RDBMS vs HBase
RDBMS (= Relational Database Management System)
• MySQL, Oracle, SQLite, Teradata, etc.
• Really great for many applications
  • Ensures strong data consistency, integrity
  • Supports transactions (ACID guarantees)
  • ...
RDBMS vs HBase
How are they different? When to use what?
• Use HBase when you don’t know the structure/schema
• HBase supports sparse data (many columns, most values are not there)
• Use RDBMS if you only work with a small number of columns
• Relational databases are good for getting “whole” rows
• HBase: Multiple versions of data
• RDBMS support multiple indices, minimize duplications
• Generally a lot cheaper to deploy HBase, for same size of data (petabytes)
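A conceptual sketch of the sparse-data point, using plain Python dictionaries. The 1000-column schema and the column names are hypothetical, and a real RDBMS may store NULLs more compactly than this suggests; the sketch only contrasts “a slot for every column” with “only the cells that exist”.

```python
columns = [f"attr{i}" for i in range(1000)]  # hypothetical wide schema

# Relational-style row: every column has a slot, even when it is empty.
dense_row = {col: None for col in columns}
dense_row["attr7"] = "x"
dense_row["attr42"] = "y"

# HBase-style row: only the cells that actually hold a value are stored.
sparse_row = {"cf:attr7": "x", "cf:attr42": "y"}

stored_dense = len(dense_row)
stored_sparse = len(sparse_row)
print(stored_dense, stored_sparse)
```

With only two of the thousand attributes present, the sparse representation holds two entries while the dense one still carries all thousand slots.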
More topics to learn about
Other ways to get, put, delete... (e.g., programmatically via Java)
• Doing them in batch
Maintaining your cluster
• Configurations, specs for “master” and “slaves”?
• Administering cluster
• Monitoring cluster’s health
Key design (http://hbase.apache.org/book/rowkey.design.html)
• Bad keys can decrease performance
Integrating with MapReduce
Cassandra, MongoDB, etc.
http://db-engines.com/en/system/Cassandra%3BHBase%3BMongoDB
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis