Low Latency “OLAP” with HBase - HBaseCon 2012

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Cosmin Lehene | Adobe Low Latency “OLAP” with HBase

Upload: cosmin-lehene

Post on 29-Nov-2014

Page 1: Low Latency “OLAP” with HBase - HBaseCon 2012

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Cosmin Lehene | Adobe

Low Latency “OLAP” with HBase

Page 2: Low Latency “OLAP” with HBase - HBaseCon 2012

What we needed … and built

OLAP Semantics

Low Latency Ingestion

High Throughput

Real-time Query API

Not hardcoded to web analytics or x-, y-, z- analytics, but extensible

Page 3: Low Latency “OLAP” with HBase - HBaseCon 2012

Building Blocks

Dimensions, Metrics

Aggregations

Roll-up, drill-down, slicing and dicing, sorting

Page 4: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP 101 – Queries example

Date        Country  City     OS       Browser  Sale
2012-05-21  USA      NY       Windows  FF       0.0
2012-05-21  USA      NY       Windows  FF       10.0
2012-05-22  USA      SF       OSX      Chrome   25.0
2012-05-22  Canada   Ontario  Linux    Chrome   0.0
2012-05-23  USA      Chicago  OSX      Safari   15.0

5 visits, 3 days

2 countries (USA: 4, Canada: 1)

4 cities (NY: 2, SF: 1)

3 OS-es (Win: 2, OSX: 2)

3 browsers (FF: 2, Chrome: 2)

Sales: 50.0 total, 3 sales

Page 5: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP 101 – Queries example

Rolling up to country level:

SELECT COUNT(visits), SUM(sales)
GROUP BY country

Country  Visits  Sales
USA      4       $50
Canada   1       $0

“Slicing” by browser:

SELECT COUNT(visits), SUM(sales)
GROUP BY country
HAVING browser = 'FF'

Country  Visits  Sales
USA      2       $10
Canada   0       $0

Top browsers by sales:

SELECT SUM(sales), COUNT(visits)
GROUP BY browser
ORDER BY sales DESC

Browser  Sales  Visits
Chrome   $25    2
Safari   $15    1
FF       $10    2
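The three queries can be reproduced over the five sample visits with a few lines of Python. This is only a sketch of the semantics, not of how SaasBase evaluates them; the `aggregate` helper, its `where` parameter, and the column names are assumptions made for illustration.

```python
from collections import defaultdict

# The five sample visits: (date, country, city, os, browser, sale).
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def aggregate(rows, group_by, where=None):
    """Return {group value: [visit count, sales sum]} for one dimension."""
    idx = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if where is not None and not where(record):
            continue  # the row is filtered out by the slice
        totals[row[idx]][0] += 1
        totals[row[idx]][1] += row[-1]
    return dict(totals)


# Roll-up to country level.
rollup = aggregate(VISITS, "country")
# Slice by browser, then roll up to country level.
ff_slice = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")
# Top browsers by sales, highest first.
by_browser = aggregate(VISITS, "browser")
top_browsers = sorted(by_browser.items(), key=lambda kv: kv[1][1], reverse=True)
```

Unlike the result table for the slice, this filter simply omits Canada instead of reporting it with zero visits.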

Page 6: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP – Runtime Aggregation vs. Pre-aggregation

Aggregate at runtime:

Most flexible

Fast – scatter gather

Space efficient

But: I/O and CPU intensive, slow for larger data, low throughput

Pre-aggregate:

Fast

Efficient – O(1)

High throughput

But: more effort to process (latency), combinatorial explosion (space), no flexibility
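To get a feel for the combinatorial explosion, the sketch below enumerates every dimension group that could be pre-aggregated for the five dimensions of the sample table. The `dimension_groups` helper is an illustrative assumption; it only counts groupings and ignores the cardinality of each dimension, which multiplies the space cost further.

```python
from itertools import combinations

DIMENSIONS = ["date", "country", "city", "os", "browser"]


def dimension_groups(dimensions):
    """Every non-empty subset of dimensions that could be pre-aggregated."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


groups = dimension_groups(DIMENSIONS)
# n dimensions give 2**n - 1 possible GROUP BY combinations.
print(len(groups))
```

Each extra dimension doubles the number of candidate groups, which is why only the relevant ones are worth materializing.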

Page 7: Low Latency “OLAP” with HBase - HBaseCon 2012

Pre-aggregation

Data needs to be summarized

Can’t visualize 1B data points (no, not even with Retina display)

Difficult to comprehend correlations among more than 3 dimensions

Not all dimension groups are relevant

Index on an as-needed basis (view selection problem)

Runtime aggregation == TeraSort for every query?

Pre-aggregate to reduce cardinality

Page 8: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase

We tune both

pre-aggregation level vs. runtime post-aggregation

(ingestion speed + space) vs. (query speed)

Think materialized views from RDBMS

Page 9: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase Domain Model Mapping

Page 10: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase - Domain Model Mapping

Page 11: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase - Ingestion, Processing, Indexing, Querying

Page 12: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase - Ingestion, Processing, Indexing, Querying

Page 13: Low Latency “OLAP” with HBase - HBaseCon 2012

Ingestion

Page 14: Low Latency “OLAP” with HBase - HBaseCon 2012

Ingestion throughput vs. latency

Historical data (large batches): optimize for throughput

Increments (latest data, smaller): optimize for latency

Page 15: Low Latency “OLAP” with HBase - HBaseCon 2012

Large, granular input strategies

Slow listing in HDFS: archive processed files

Filtering input:

FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)

TableInputFormat start/stop row

File index in HBase (track processed/new files)

Map task overhead - stitching input splits:

400K files => 400K map tasks => overhead, slow reduce copy

CombineFileInputFormat – 2GB splits => 500 splits for 1TB

FixedMappersTableInputFormat (e.g. 5-region splits)
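The effect of stitching small files into larger splits can be sketched as a greedy packing step. This is not how CombineFileInputFormat is implemented (it also takes node and rack locality into account); the `combine_into_splits` function and the uniform file size are assumptions chosen so that 400K files add up to roughly 1TB.

```python
GB = 1024 ** 3


def combine_into_splits(file_sizes, max_split_bytes=2 * GB):
    """Greedily pack files, in listing order, into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current split if this file would push it over the limit.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: 400K small log files adding up to about 1TB.
files = [(f"log-{i:06d}.log", 2_750_000) for i in range(400_000)]
splits = combine_into_splits(files)
print(len(files), "files ->", len(splits), "splits")
```

With one map task per split, the job runs a few hundred mappers instead of 400K.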

Page 16: Low Latency “OLAP” with HBase - HBaseCon 2012

Ingestion – Bulk Import

HFileOutputFormat (HFOF)

100s of times faster than writing through the HBase API

No need to recover from failed jobs

No unnecessary load on machines

* No shuffle - global reduce order required!

e.g. first reduce key needs to be in the first region, last one in the last region

Watch for uneven partitions
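The global ordering requirement means each reducer must receive exactly the key range of one region. Hadoop ships a TotalOrderPartitioner for this role; the sketch below only illustrates the idea with a binary search over region start keys. The `REGION_START_KEYS` values and the `partition_for` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical sorted region start keys; the first region starts at the empty key.
REGION_START_KEYS = ["", "2011-01-01", "2012-01-01"]


def partition_for(row_key, start_keys=REGION_START_KEYS):
    """Index of the region (and so the reducer) whose key range contains row_key."""
    return bisect_right(start_keys, row_key) - 1


keys = ["2012-05-22/USA", "2010-12-04/USA", "2011-06-30/Canada"]
assignment = {key: partition_for(key) for key in keys}
print(assignment)
```

Because reducer i only ever sees keys of region i, every reducer can write files that already fit the table's region boundaries.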

Page 17: Low Latency “OLAP” with HBase - HBaseCon 2012

HFOF – FileSizeDatePartitioner

1 partition (reduce) / day for initial import

Uneven reduce partitions due to data growth over time:

Reduce k: 2010-12-04 = 500MB

Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region

Balance reduce buckets based on input file sizes and the reduce key

Generate sub-partitions based on predefined size (e.g. 1GB)
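The sub-partitioning rule can be sketched as a ceiling division of each day's input size by the target bucket size. This is an assumption about the general approach, not the actual FileSizeDatePartitioner code; `sub_partitions` and its parameters are hypothetical names.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def sub_partitions(bytes_per_day, target_bytes=1 * GB):
    """Map each day to the number of reduce buckets it should be split into."""
    return {
        day: max(1, math.ceil(size / target_bytes))
        for day, size in bytes_per_day.items()
    }


# Daily input sizes taken from the example above.
sizes = {"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}
buckets = sub_partitions(sizes)
print(buckets)
```

The 500MB day keeps a single reducer, while the 5GB day is spread over five reducers of about 1GB each.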

Page 18: Low Latency “OLAP” with HBase - HBaseCon 2012

Processing

Page 19: Low Latency “OLAP” with HBase - HBaseCon 2012

Processing

Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality), and generating tables that can be queried in real time

1 year: 1B events => 100B data points indexed

Query => scan 365 data points (e.g. daily page views)

Processing could be either MR or real-time (e.g. Storm)
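A minimal sketch of the idea, assuming events are simple (date, page) pairs: processing collapses the raw events into one data point per day, and a query then reads only those daily points. The `pre_aggregate` and `page_views` helpers are hypothetical and stand in for the MR or real-time job and the query engine.

```python
from collections import Counter


def pre_aggregate(events):
    """Collapse raw (date, page) events into one page-view count per day."""
    return Counter(date for date, _page in events)


def page_views(daily, start, end):
    """Answer a date-range query by reading only the pre-aggregated daily points."""
    return sum(count for day, count in daily.items() if start <= day <= end)


# Hypothetical raw events; a real input would hold many events per day.
events = [
    ("2012-05-21", "/home"),
    ("2012-05-21", "/pricing"),
    ("2012-05-22", "/home"),
]
daily_views = pre_aggregate(events)
print(page_views(daily_views, "2012-05-21", "2012-05-22"))
```

However many events arrive in a year, a daily page-view query touches at most 365 pre-aggregated points.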

Page 20: Low Latency “OLAP” with HBase - HBaseCon 2012

Processing for OLAP semantics

GROUP BY (process, query)

COUNT, SUM, AVG, etc. (process, query)

SORT (process, query)

HAVING (mostly query, can define pre-process constraints)

Page 21: Low Latency “OLAP” with HBase - HBaseCon 2012

SaasBase vs. SQL Views Comparison

Page 22: Low Latency “OLAP” with HBase - HBaseCon 2012

reports.json entities definition

Page 23: Low Latency “OLAP” with HBase - HBaseCon 2012

Processing Performance

read, map, partition, combine, copy, sort, reduce, write

Read:

Scan.setCaching() (I/O ~ buffer)

Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)

Even region distribution across cluster (distributes CPU, I/O)

Map:

No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)

Avoid GC: new X() (CPU, memory)

Avoid system calls (context switching)

Stripping unnecessary data (I/O)

Page 24: Low Latency “OLAP” with HBase - HBaseCon 2012

Processing Performance

Hot (in memory) vs. Cold (on disk, on network) data

Minimize I/O from disk/network

Single shot MR job: SuperProcessor

Emit all groups from one map() call

Incremental processing

Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
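With date-prefixed row keys, an incremental run only needs to scan the key range of the new day. The sketch below imitates a start/stop-row scan over a sorted list of keys; the key layout after the date prefix and the `scan` helper are assumptions for illustration.

```python
from bisect import bisect_left

# Hypothetical row keys, kept sorted the way HBase stores them.
ROW_KEYS = sorted([
    "2012-05-21/USA/NY",
    "2012-05-22/Canada/Ontario",
    "2012-05-22/USA/SF",
    "2012-05-23/USA/Chicago",
])


def scan(row_keys, start_row, stop_row):
    """Return the keys in [start_row, stop_row), like a start/stop-row scan."""
    lo = bisect_left(row_keys, start_row)
    hi = bisect_left(row_keys, stop_row)
    return row_keys[lo:hi]


# Process only the increment for 2012-05-22: stop at the next day's prefix.
increment = scan(ROW_KEYS, "2012-05-22", "2012-05-23")
print(increment)
```

Adding HH:mm to the prefix narrows the same range scan to an hour or a minute.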

Page 25: Low Latency “OLAP” with HBase - HBaseCon 2012

Indexing

Page 26: Low Latency “OLAP” with HBase - HBaseCon 2012

HBase natural order: hierarchical representation

Page 27: Low Latency “OLAP” with HBase - HBaseCon 2012

Indexing - Why

Example: top 10 cities ~50K [country, city] combinations per day

Top 10 cities for 1 year =>

365 (days) X 50K ~=15M data points scanned

If you add gender => 30M

If you add Device, OS, Browser …

Might compress well, but think about the environment

How much energy would you spend for just top 10 cities?

* Image from: http://my.neutralexistence.com/images/Green-Earth.jpg

Page 28: Low Latency “OLAP” with HBase - HBaseCon 2012

Indexing with HBase

“10” < “2”

GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

Lexicographic sorting

2012/05/USA/0000000000/

2012/05/USA/4294961296/San Francisco = 1000 visits*

2012/05/USA/4294961396/New York = 900 visits*

. . .

2012/05/USA/9999999999/

scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

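* Inverting and padding numbers for lexicographic sorting:

1000 -> Long.MAX_VALUE – 1000, written as a fixed-width, zero-padded string so that higher counts sort first (the 10-digit keys shown here appear to be shortened; the full value is 9223372036854774807)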
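The inverted, zero-padded key trick can be checked in a few lines. This sketch uses a 10-digit maximum as a stand-in for Long.MAX_VALUE to match the width of the keys on the slide; the `index_key` helper and the Chicago row are assumptions added for illustration.

```python
WIDTH = 10
MAX_VALUE = 10 ** WIDTH - 1  # 9999999999, a stand-in for Long.MAX_VALUE


def index_key(prefix, visits, city):
    """Build a row key whose lexicographic order is descending by visits."""
    inverted = MAX_VALUE - visits
    return f"{prefix}/{inverted:0{WIDTH}d}/{city}"


# Sorting the keys as strings mimics HBase's natural row order.
rows = sorted([
    index_key("2012/05/USA", 900, "New York"),
    index_key("2012/05/USA", 1000, "San Francisco"),
    index_key("2012/05/USA", 20, "Chicago"),  # hypothetical extra row
])
# A scan from the prefix with a limit of 2 returns the top 2 cities.
top_2 = [key.rsplit("/", 1)[1] for key in rows[:2]]
print(top_2)
```

Without the fixed width, string comparison would break the order, since "10" sorts before "2".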

Page 29: Low Latency “OLAP” with HBase - HBaseCon 2012

Query Engine

Always reads indexed, compact data

Query parsing

Scan strategy

Single vs. multiple scans

Start/stop rows (prefixes, index positions, etc.)

Index selection (volatile indexes with incremental processing)

Deserialization

Post-aggregation, sorting, fuzzy-sorting etc.

Paging

Custom dimension/metric class loading

Page 30: Low Latency “OLAP” with HBase - HBaseCon 2012

Conclusions

OLAP semantics on a simple data model

Data as first class citizen

Domain Specific “Language” for Dimensions, Metrics, Aggregations

Tunable performance, resource allocation

Framework for vertical analytics systems

Page 31: Low Latency “OLAP” with HBase - HBaseCon 2012

Thank you!

Cosmin Lehene @clehene

http://hstack.org

Credits:

Andrei Dragomir

Adrian Muraru

Andrei Dulvac

Raluca Podiuc

Tudor Scurtu

Bogdan Dragu

Bogdan Drutu

Page 32: Low Latency “OLAP” with HBase - HBaseCon 2012

Page 33: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP 101 - Rollup

Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country

Country  Visits  Sale
USA      4       $50
Canada   1       $0

Page 34: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP 101 - Slicing

Filter or Segment or Slice (WHERE or HAVING)

Date        Country  City     OS       Browser  Sale
2012-03-02  USA      NY       Windows  FF       0.0
2012-03-02  USA      NY       Windows  FF       10.0
2012-03-03  USA      SF       OSX      Chrome   25.0
2012-03-03  Canada   Ontario  Linux    Chrome   0.0
2012-03-04  USA      Chicago  OSX      Safari   15.0

5 visits, 3 days

2 countries (USA: 4, Canada: 1)

4 cities (NY: 2, SF: 1)

3 OS-es (Win: 2, OSX: 2)

3 browsers (FF: 2, Chrome: 2)

Sales: 50.0 total, 3 sales

Page 35: Low Latency “OLAP” with HBase - HBaseCon 2012

OLAP 101 – Sorting, TOP n

SELECT SUM(sales) AS total GROUP BY browser ORDER BY total DESC

Browser  Sale
Chrome   $25
Safari   $15
Firefox  $10