Batch processing and Stream processing by SQL. @tagomoris (TAGOMORI Satoshi), 2014/07/08, Hadoop Conference Japan 2014 #hcj2014


Page 1: Batch processing and Stream processing by SQL

Batch processing and Stream processing by SQL

@tagomoris (TAGOMORI Satoshi)
2014/07/08
Hadoop Conference Japan 2014 #hcj2014

Tuesday, July 8, 2014

Page 2: Batch processing and Stream processing by SQL

TAGOMORI Satoshi (@tagomoris)

LINE Corporation

Analytics Platform Team

Page 3: Batch processing and Stream processing by SQL

Page 4: Batch processing and Stream processing by SQL

Page 5: Batch processing and Stream processing by SQL

Page 6: Batch processing and Stream processing by SQL

SQL


Page 7: Batch processing and Stream processing by SQL

BATCH

and/or

STREAM

Page 8: Batch processing and Stream processing by SQL

Analytics data flow overview

[Diagram; components: servers, Fluentd Cluster, archive, Hadoop / Hive, Presto, Fluentd, Norikra, visualization, notifications, application metrics]

“Log analysis systems and its designs in LINE corp. 2014 early”
http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early

Page 9: Batch processing and Stream processing by SQL

[Diagram; components: servers, Fluentd Cluster, archive, Hadoop / Hive, Presto, Fluentd, Norikra, visualization, notifications, application metrics; regions labeled STREAM and BATCH]

Page 10: Batch processing and Stream processing by SQL

[Diagram; components: servers, Fluentd Cluster, archive, Hadoop / Hive, Presto, Fluentd, Norikra, visualization, notifications, application metrics; regions labeled STREAM and BATCH, plus the label SQL]

Page 11: Batch processing and Stream processing by SQL

SQL is NOT the best.

But, SQL is better than NONE.

Page 12: Batch processing and Stream processing by SQL

What supports SQL:

RDBMS

Apache Hive (on MR/Spark/Tez)

Facebook Presto, Cloudera Impala, Apache Drill

Google BigQuery, ......

Page 13: Batch processing and Stream processing by SQL


Page 14: Batch processing and Stream processing by SQL

[Figure: the word "SQL" repeated across the slide; one label reads "SQL (2/6)"]

Page 15: Batch processing and Stream processing by SQL

          DB      Batch            Short Batch
non-SQL   NoSQL   Hadoop MR, Pig   ----
SQL       RDBMS   Hive             Presto, Impala, Drill

Page 16: Batch processing and Stream processing by SQL

Batch processing

OR

Stream processing?

Page 17: Batch processing and Stream processing by SQL

Batch processing

Hadoop/Hive

Target window: hours - weeks (or more)

Total throughput: HIGHEST

Query latency: LARGEST (20sec - mins - hours)

Page 18: Batch processing and Stream processing by SQL

Short Batch processing

Presto, Impala, Drill

Target window: seconds - hours (- days)

Total throughput: Normal

Query latency: Small (seconds - mins)


Page 19: Batch processing and Stream processing by SQL

Stream processing

Storm, Kafka, Esper, Norikra, Fluentd, ....

Spark streaming(?)

Target window: seconds - hours

Total throughput: Normal

Query latency: SMALLEST (milliseconds)

Queries must be written BEFORE DATA

Once registered, runs forever

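
A minimal Python sketch of what "queries must be written BEFORE DATA" can look like in practice. `TinyStreamEngine`, `register`, and `send` are hypothetical names made up for this illustration; they are not the API of Storm, Esper, or Norikra. The point is only that an event arriving before a query is registered is never seen by that query.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, int]


class TinyStreamEngine:
    """Toy engine: queries are registered first, then events are pushed through them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[Event], bool]] = {}
        self.matches: Dict[str, List[Event]] = defaultdict(list)

    def register(self, name: str, predicate: Callable[[Event], bool]) -> None:
        # A registered query only sees events that arrive after registration.
        self._queries[name] = predicate

    def send(self, event: Event) -> None:
        # Every incoming event is evaluated at once by every standing query.
        for name, predicate in self._queries.items():
            if predicate(event):
                self.matches[name].append(event)


engine = TinyStreamEngine()
engine.send({"age": 45})  # arrives before any query exists, so it is lost
engine.register("over_30", lambda e: e["age"] > 30)
for age in (31, 20, 52):
    engine.send({"age": age})

print(engine.matches["over_30"])  # [{'age': 31}, {'age': 52}]
```

The query keeps running for as long as the engine lives, which is the "once registered, runs forever" behavior in miniature.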

Page 20: Batch processing and Stream processing by SQL

Data flow and latency

[Diagram comparing Batch, Short Batch, and Stream; labels: data window, query execution, incremental query execution]

Page 21: Batch processing and Stream processing by SQL

Data window

Target time (or size) range of queries

Batch (or short-batch)

FROM-TO: WHERE dt >= '2014-07-07 00:00:00'
AND dt <= '2014-07-08 23:59:59'

Stream

“Calculate this query for every 3 minutes”

Extended SQL required

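
The two kinds of data window can be sketched in Python. `in_batch_range` mirrors the FROM-TO condition; `window_start` assigns a timestamp to a fixed 3-minute window. Both function names are hypothetical, and aligning windows to midnight is an assumption of this sketch, not a statement about how any particular engine aligns them.

```python
from datetime import datetime, timedelta


def in_batch_range(dt: datetime, start: datetime, end: datetime) -> bool:
    """Batch style: the query names an explicit, closed FROM-TO range."""
    return start <= dt <= end


def window_start(dt: datetime, size: timedelta) -> datetime:
    """Stream style: map a timestamp to the start of its fixed-size window.

    Windows are aligned to midnight of the same day (an assumption of this sketch).
    """
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    index = (dt - midnight) // size  # number of whole windows since midnight
    return midnight + index * size


event_time = datetime(2014, 7, 8, 10, 7, 30)
print(in_batch_range(event_time, datetime(2014, 7, 7), datetime(2014, 7, 8, 23, 59, 59)))  # True
print(window_start(event_time, timedelta(minutes=3)))  # 2014-07-08 10:06:00
```

A batch query states its range once per execution; a stream query has no natural end, so the window has to be part of the query itself.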

Page 22: Batch processing and Stream processing by SQL

Stream processing with SQL

Esper: Java library to process Stream

With schema

Page 23: Batch processing and Stream processing by SQL

Stream processing with SQL

Esper: Java library to process Stream

Esper EPL

SELECT param1, param2
FROM tbl
WHERE age > 30

Page 24: Batch processing and Stream processing by SQL

Stream processing with SQL

Esper: Java library to process Stream

Esper EPL

SELECT param, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY param

Page 25: Batch processing and Stream processing by SQL

Stream processing with SQL

Esper: Java library to process Stream

Esper EPL

SELECT param, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY param
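
As a rough illustration of what the windowed query computes, here is a Python sketch that groups events into one-hour batches and counts them per `param`. It is not how Esper evaluates EPL: `hourly_counts` is a hypothetical helper, events are assumed to be `(epoch_seconds, param, age)` tuples, and windows are assumed to start on whole hours. A real engine would also emit each window's result when the window closes instead of holding everything in memory.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 3600  # time_batch(1 hour)


def hourly_counts(events: Iterable[Tuple[int, str, int]]) -> Dict[int, Dict[str, int]]:
    """Return {window_start_epoch: {param: count}} for events with age > 30."""
    windows: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, param, age in events:
        if age > 30:  # WHERE age > 30
            start = timestamp - timestamp % WINDOW_SECONDS  # window the event falls in
            windows[start][param] += 1  # GROUP BY param, COUNT(*)
    return {start: dict(counts) for start, counts in windows.items()}


events = [
    (0, "a", 40),
    (10, "b", 25),    # filtered out by the age condition
    (3599, "a", 31),
    (3600, "a", 50),  # first event of the second window
    (4000, "b", 35),
]
print(hourly_counts(events))  # {0: {'a': 2}, 3600: {'a': 1, 'b': 1}}
```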

Page 26: Batch processing and Stream processing by SQL


Page 27: Batch processing and Stream processing by SQL

Norikra: Schema-less Stream Processing with SQL

OSS, based on Esper EPL, GPLv2

Without pre-defined schema

Complex event processing (w/ nested hash/array) w/ SQL

HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)

Dynamic query registration/removing

Ultra fast bootstrap (in 3 minutes!)

UDF plugins by Java/Ruby

http://norikra.github.io/
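
To give a feel for querying nested events without a pre-defined schema, the Python sketch below resolves a dotted field path against a nested hash/array. `get_path` is a hypothetical helper, and using a plain number as an array index is a convention invented for this sketch; it is not Norikra's query syntax or implementation.

```python
from typing import Any, Optional


def get_path(event: Any, path: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts/lists; return None if any step is missing."""
    current = event
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]  # numeric step indexes into an array
        else:
            return None  # no schema: a missing field is simply absent
    return current


event = {
    "type": "click",
    "campaign": {"id": 7},
    "member": {"region": "JP", "tags": ["a", "b"]},
}
print(get_path(event, "campaign.id"))    # 7
print(get_path(event, "member.tags.1"))  # b
print(get_path(event, "member.lang"))    # None
```

Because fields are looked up per event, two events in the same stream can carry different sets of fields without any table definition being changed.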

Page 28: Batch processing and Stream processing by SQL

Distributed processing OR NOT?

Norikra is NOT a distributed processing platform.

Of course, SCALE OUT IS FANTASTIC.

Is non-distributed software useless?

MySQL

MySQL Cluster

Norikra can handle 10k events/sec

on a 2-CPU (8-core) server
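
For a sense of scale, the arithmetic below converts the quoted rate into a daily volume. It assumes the 10k events/sec rate is sustained for a full day, which the slide does not claim; the variable names are only for this sketch.

```python
EVENTS_PER_SEC = 10_000          # rate quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Assumption: the rate holds constantly for 24 hours.
events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY
print(f"{events_per_day:,} events/day")  # 864,000,000 events/day
```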

Page 29: Batch processing and Stream processing by SQL

          DB      Batch            Short Batch             Stream
non-SQL   NoSQL   Hadoop MR, Pig   ----                    Storm, Kafka, Dataflow(G)
SQL       RDBMS   Hive             Presto, Impala, Drill   Norikra

Page 30: Batch processing and Stream processing by SQL

Lambda architecture

Just the same 2 processes on:

Stream processing
Batch processing

http://lambda-architecture.net/

Page 31: Batch processing and Stream processing by SQL

Replayable processing

Stream processing MUST NOT be replayable

Queries on stream processing SHOULD be replayable

Page 32: Batch processing and Stream processing by SQL

Hybrid processing: for fault-tolerance

Stream processing: executes queries in normal operation

Batch processing: executes recovery queries
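
One way to picture the fault-tolerance hybrid, sketched in Python: find the windows for which the stream side produced no result and fill only those by running a batch query. `find_missing_hours`, `recover`, and the `run_batch_query` callback are hypothetical names for this sketch; in a real setup the callback would submit a recovery query to something like Hive.

```python
from typing import Callable, Dict, List


def find_missing_hours(expected: List[str], stream_results: Dict[str, int]) -> List[str]:
    """Hours for which the stream side produced no result (e.g. it was down)."""
    return [hour for hour in expected if hour not in stream_results]


def recover(
    expected: List[str],
    stream_results: Dict[str, int],
    run_batch_query: Callable[[str], int],
) -> Dict[str, int]:
    """Keep stream results as-is and fill only the gaps from the batch side."""
    recovered = dict(stream_results)
    for hour in find_missing_hours(expected, stream_results):
        recovered[hour] = run_batch_query(hour)  # recovery query over archived data
    return recovered


# Stand-in for archived data; a dict lookup plays the role of the batch query here.
archive = {"2014070800": 120, "2014070801": 80, "2014070802": 95}
stream_results = {"2014070800": 120, "2014070802": 95}  # hour 01 was lost

print(recover(sorted(archive), stream_results, archive.__getitem__))
# {'2014070800': 120, '2014070802': 95, '2014070801': 80}
```

This only works if the same query can be expressed on both sides, which is where sharing SQL between the stream and batch engines helps.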

Page 33: Batch processing and Stream processing by SQL

Hybrid processing: for latency-reduction + accuracy

Stream processing: for prompt reports (preliminary figures)

Batch processing: for fixed reports (final figures)
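
A small Python sketch of how prompt and fixed reports might be combined for display: the stream value is shown until the batch value for the same hour exists, then the batch value replaces it. `merge_reports` and the sample numbers are hypothetical and only illustrate the idea.

```python
from typing import Dict, Tuple


def merge_reports(prompt: Dict[str, int], fixed: Dict[str, int]) -> Dict[str, Tuple[str, int]]:
    """Per key, prefer the fixed (batch) value; fall back to the prompt (stream) value."""
    merged = {key: ("prompt", value) for key, value in prompt.items()}
    merged.update({key: ("fixed", value) for key, value in fixed.items()})
    return merged


prompt = {"00": 118, "01": 79, "02": 41}  # available within moments, may be slightly off
fixed = {"00": 120, "01": 80}             # available later, treated as final

print(merge_reports(prompt, fixed))
# {'00': ('fixed', 120), '01': ('fixed', 80), '02': ('prompt', 41)}
```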

Page 34: Batch processing and Stream processing by SQL

Hybrid stream processing: against complexity

Non-SQL stream processing: for simple, fixed, high-traffic events

SQL stream processing: for complex, fragile events
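
The split can be sketched in Python as a router in front of two paths: events with a fixed, flat layout go to a cheap counter, and everything else is handed to the SQL engine. The rule in `SIMPLE_TYPES`, the `route` function, and the sample events are all hypothetical; how events are actually divided between the two sides is not described on the slide.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical rule: event types with a fixed, flat layout take the simple path.
SIMPLE_TYPES = {"access"}


def route(event: Dict[str, Any], status_counts: Counter, sql_engine_queue: List[dict]) -> None:
    if event.get("type") in SIMPLE_TYPES:
        # Non-SQL path: one fixed aggregation, cheap enough for high traffic.
        status_counts[event["status"]] += 1
    else:
        # SQL path: nested or changing events, left to ad hoc queries.
        sql_engine_queue.append(event)


status_counts: Counter = Counter()
sql_engine_queue: List[dict] = []
for event in [
    {"type": "access", "status": 200},
    {"type": "access", "status": 500},
    {"type": "app", "error": {"class": "Timeout", "trace": ["a", "b"]}},
    {"type": "access", "status": 200},
]:
    route(event, status_counts, sql_engine_queue)

print(dict(status_counts))    # {200: 2, 500: 1}
print(len(sql_engine_queue))  # 1
```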

Page 35: Batch processing and Stream processing by SQL

Case study in LINE

Prompt-report & fixed-report

Norikra + Hive Hybrid

Error detection from application and access logs

Norikra + Fluentd Hybrid

Realtime aggregation for complex and simple (fixed) objects

Norikra + Fluentd Hybrid


Page 37: Batch processing and Stream processing by SQL

Hive: fixed-reports

SELECT
  yyyymmdd, hh, campaign_id, region, lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT
    yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice' AND yyyymmdd='20140708' AND hh='00'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
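
To make the aggregation concrete, here is a Python sketch that computes the same click and unique-user counts over a handful of in-memory rows. It is not how Hive executes the query: `fixed_report`, `make_row`, and the sample data are hypothetical, and JSON values keep their Python types here, whereas `get_json_object` returns strings.

```python
import json
from collections import defaultdict


def fixed_report(rows, service="myservice", yyyymmdd="20140708", hh="00"):
    """Count clicks and distinct members per (date, hour, campaign, region, lang)."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for row in rows:
        # Partition filter: service / yyyymmdd / hh
        if (row["service"], row["yyyymmdd"], row["hh"]) != (service, yyyymmdd, hh):
            continue
        log = json.loads(row["log"])
        if log.get("type") != "click":
            continue
        member = log.get("member", {})
        key = (row["yyyymmdd"], row["hh"], log.get("campaign", {}).get("id"),
               member.get("region"), member.get("lang"))
        clicks[key] += 1                   # COUNT(*)
        if member.get("id") is not None:   # COUNT(DISTINCT ...) skips NULLs
            members[key].add(member["id"])
    return {key: {"click": clicks[key], "uu": len(members[key])} for key in clicks}


def make_row(log, service="myservice", hh="00"):
    return {"service": service, "yyyymmdd": "20140708", "hh": hh, "log": json.dumps(log)}


click = {"type": "click", "campaign": {"id": "c1"},
         "member": {"id": "m1", "region": "JP", "lang": "ja"}}
rows = [
    make_row(click),
    make_row(click),                                                     # same member again
    make_row({**click, "member": {"id": "m2", "region": "JP", "lang": "ja"}}),
    make_row({**click, "type": "view"}),                                 # not a click
    make_row(click, service="other"),                                    # other service
]
report = fixed_report(rows)
print(report)  # {('20140708', '00', 'c1', 'JP', 'ja'): {'click': 3, 'uu': 2}}
```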

Page 38: Batch processing and Stream processing by SQL

Norikra: prompt-reports

SELECT
  campaign.id AS campaign_id,
  member.region AS region,
  member.lang AS lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang

Page 39: Batch processing and Stream processing by SQL

More queries, more simplicity and less latency.

Thanks!