Hadoop in adtech

Hadoop in adtech world. Yuta Imai, Solutions Engineer, Hortonworks. © Hortonworks Inc. 2011–2015. All Rights Reserved

Upload: yuta-imai

Post on 09-Jan-2017


TRANSCRIPT

Page 1: Hadoop in adtech

Hadoop in adtech world

Yuta Imai, Solutions Engineer, Hortonworks

© Hortonworks Inc. 2011–2015. All Rights Reserved

Page 2: Hadoop in adtech

What is Apache Hadoop?

Page 3: Hadoop in adtech

© Hortonworks Inc. 2011–2016. All Rights Reserved

Hadoop Ecosystem

[Diagram: project logos captioned by role; the diagram also carries a "runs on" label]
• ETL
• RDBMS Import/Export
• Distributed Storage & Processing Framework
• Secure NoSQL DB
• SQL on HBase
• NoSQL DB
• Workflow Management
• SQL
• Streaming Data Ingestion
• Cluster System Operations
• Secure Gateway
• Distributed Registry
• ETL
• Search & Indexing
• Even Faster Data Processing
• Data Management
• Machine Learning

Page 4: Hadoop in adtech

Hortonworks Data Platform (HDP)

Page 5: Hadoop in adtech

1st Gen Hadoop: Cost Effective Batch at Scale

HADOOP 1.0: Built for Web-Scale Batch Apps

Silos created for distinct use cases

[Diagram: separate silos, one per application]
• Single App: BATCH, on HDFS
• Single App: INTERACTIVE
• Single App: BATCH, on HDFS
• Single App: BATCH, on HDFS
• Single App: ONLINE

Page 6: Hadoop in adtech

Hadoop Beyond Batch with YARN

A shift from the old to the new…

HADOOP 1: Single Use System (Batch Apps), MapReduce as the Base
• API: Data Flow (Pig), SQL (Hive), Others
• Engine and System: MapReduce (cluster resource management & data processing)
• Storage: HDFS (redundant, reliable storage), nodes 1 … N

HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …), Apache YARN as a Base
• APIs: Data Flow (Pig), SQL (Hive), Other ISV
• Engine: Batch (MapReduce), Tez
• System: YARN (Data Operating System: resource management, etc.)
• Storage: HDFS (redundant, reliable storage), nodes 1 … N

Page 7: Hadoop in adtech

Architecture Enabled by YARN

A single set of data across the entire cluster with multiple access methods using “zones” for processing

[Diagram: processing zones laid over cluster nodes 1 … n]
• SQL (Hive): Interactive SQL Query for Analytics
• Pig: Script-based ETL. Algorithm executed in batch to rework data used by Hive and HBase consumers
• Stream Processing (Storm): Identify & act on real-time events
• NoSQL (HBase, Accumulo): Low-latency access serving up a web front end

• Maximize compute resources to lower TCO
• No standalone, silo’d clusters
• Simple management & operations

…all enabled by YARN

Page 8: Hadoop in adtech

Hadoop Workload Evolution

A shift from the old to the new…

HADOOP 1: Single Use System (Batch Apps)
• MapReduce
• HDFS, nodes 1 … N

HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
• DATA ACCESS: Multiple (Script, SQL, NoSQL, …), MR2, Others (ISV Engines)
• YARN
• HDFS (redundant, reliable storage), nodes 1 … N

HADOOP.Next: Multi Use Platform (Data & Beyond)
• DATA ACCESS: Multiple (Script, SQL, NoSQL, …), MR2, Others (ISV Engines)
• APPS: Docker (MySQL), Docker (Tomcat), Docker (Other)
• YARN
• HDFS (redundant, reliable storage), nodes 1 … N

Page 9: Hadoop in adtech

Hadoop Operations & Tools

Page 10: Hadoop in adtech

How Do You Operate a Hadoop Cluster?

Apache™ Ambari is a platform to provision, manage and monitor Hadoop clusters

Page 11: Hadoop in adtech

Ambari Core Features and Extensibility

Ambari provides core services for operations, development and extension points for both

• Install & Configure
  – Core Features: Install Wizard & Web
  – Extensibility Features: Stacks, Blueprints & REST APIs
• Operate, Manage & Administer
  – Core Features: Web, Operator Views, Metrics & Alerts
  – Extensibility Features: Views Framework & REST APIs
• Develop
  – Core Features: User Views
  – Extensibility Features: Views Framework
• Optimize & Tune
  – Core Features: User Views
  – Extensibility Features: Views Framework

Roles shown: Cluster Admin, Developer, Data Architect
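The slide lists REST APIs among Ambari's extensibility features. As a rough sketch of what a client call can look like, the snippet below builds (but does not send) a request that lists clusters. It assumes an Ambari server on its default port 8080 with HTTP basic authentication; the host name and credentials are placeholders, not values from the talk.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build a GET request for Ambari's cluster listing endpoint.

    host, user and password are hypothetical placeholders.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying calls; sending it on GET is harmless.
    request.add_header("X-Requested-By", "ambari")
    return request


request = build_cluster_list_request("ambari.example.com", "admin", "secret")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call against a real server; the sketch stops short of that so it can run anywhere.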

Page 12: Hadoop in adtech

New user interface enables fast & easy SQL definition and execution.

Page 13: Hadoop in adtech

New User Views for DevOps

• Capacity Scheduler View: Browse and manage YARN queues
• Tez View: View information related to Tez jobs that are executing on the cluster

Page 14: Hadoop in adtech

New User Views for Development

• Pig View: Author and execute Pig Scripts.
• Hive View: Author, execute and debug Hive queries.
• Files View: Browse HDFS file system.

Page 15: Hadoop in adtech

Apache Zeppelin

• Web-based notebook for data engineers, data analysts and data scientists
• Brings interactive data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark
• Modern data science studio
  – Scala with Spark
  – Python with Spark
  – SparkSQL
  – Apache Hive, and more.

Page 16: Hadoop in adtech

Hadoop use cases in adtech world

Page 17: Hadoop in adtech

Most Hadoop use cases are Hive

• For example, Hive was often used to build access reports for web services, and the architecture below was very common.
• Queries often took a fair amount of time, so they were usually run as scheduled jobs.

[Diagram: three Web servers each send logs to Hadoop]

Page 18: Hadoop in adtech

Most Hadoop use cases are Hive

[Diagram: three Web servers each send logs to Hadoop, followed by MySQL and a Report UI]

Hadoop and MapReduce were the tools used to run heavy processing over large volumes of data.
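The access-report job described on this slide boils down to a grouped count over log lines. The snippet below is only an illustration of that aggregation in plain Python, not Hive or MapReduce code; the log layout (timestamp, client address, path separated by spaces) and the function name are assumptions made for the example.

```python
from collections import Counter


def daily_page_views(log_lines):
    """Count views per (date, path), the equivalent of GROUP BY date, path."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines instead of failing the whole job
        timestamp, _client, path = parts[0], parts[1], parts[2]
        date = timestamp.split("T")[0]
        counts[(date, path)] += 1
    return counts


# Hypothetical sample input in the assumed layout.
sample_logs = [
    "2016-05-24T10:00:01 10.0.0.1 /index.html",
    "2016-05-24T10:00:05 10.0.0.2 /index.html",
    "2016-05-24T11:30:00 10.0.0.1 /ads/click",
    "2016-05-25T09:15:00 10.0.0.3 /index.html",
    "broken line",
]
report = daily_page_views(sample_logs)
for (date, path), views in sorted(report.items()):
    print(date, path, views)
```

In the architecture on the slide, the result of such a job would be loaded into MySQL so the Report UI can read it quickly.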

Page 19: Hadoop in adtech

Attempts to speed up SQL on big data

Hive (on MapReduce) was not fast enough for interactive queries.
• Presto
• Impala
• Drill
• Shark (now Spark SQL)

Page 20: Hadoop in adtech

Most Hadoop use cases are Hive

• Setups that combine Hive with Presto, MySQL (as a data mart) and similar tools are becoming common

[Diagram: three Web servers each send logs to Hadoop, which feeds the Report UI]

Page 21: Hadoop in adtech

SQL on big data: the arrival of cloud services

• Amazon Redshift
• Google BigQuery

Page 22: Hadoop in adtech

Of course, Hive itself is getting faster too

Stinger Initiative: make Hive more than 100x faster

• Up to Hive 1.2.1 (already available on HDP!)
  – Tez
  – Cost Based Optimizer (CBO)
  – ORC File format
  – Vectorization
• Hive 2.0
  – LLAP

Sub-second: aiming for responses of under one second on short queries

Page 23: Hadoop in adtech

Making Hive faster

[Diagram: three Web servers each send logs to Hadoop, which feeds the Report UI]

• Hive can now handle interactive queries directly

Page 24: Hadoop in adtech

The Hadoop ecosystem is now used in many places

[Diagram: three Web servers each send logs to Hadoop (HDFS), which feeds the Report UI]

• Reports
• Long-term storage of all logs
• ETL and assorted batch processing

Page 25: Hadoop in adtech

The Hadoop ecosystem is now used in many places

[Diagram: three Web servers each send logs to Hadoop (HDFS), which feeds the Report UI; an Ads server and a delivery DB are added]

• Generating models for bidding and optimization

Page 26: Hadoop in adtech

The Hadoop ecosystem is now used in many places

[Diagram: three Web servers and an Ads server send logs to Hadoop (HDFS), which feeds the Report UI]

• Real-time log collection
• Real-time tracking

Page 27: Hadoop in adtech

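The Hadoop ecosystem is now used in many places

[Diagram: three Web servers, an Ads server and a delivery DB around Hadoop (HDFS), which feeds the Report UI]

• Reports
• Generating models for bidding and optimization
• Real-time tracking
• Long-term storage of all logs
• Real-time log collection
• ETL and assorted batch processing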

Page 28: Hadoop in adtech

The Hadoop ecosystem is now used in many places

[Diagram: three Web servers, an Ads server and a delivery DB around Hadoop (HDFS), which feeds the Report UI]

• Reports
• Generating models for bidding and optimization
• Real-time tracking
• Long-term storage of all logs
• Real-time log collection
• ETL and assorted batch processing

[Diagram: the HDP stack that covers these workloads]
• GOVERNANCE: Load data and manage according to policy
• BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez and Slider sit beneath several of these engines
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System), nodes 1 … N
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• OPERATIONS: Deploy and effectively manage the platform
  – Provision, Manage & Monitor: Ambari, Zookeeper
  – Scheduling: Oozie

Page 29: Hadoop in adtech

Key highlights in recent Hadoop evolution

Page 30: Hadoop in adtech

Recent Hadoop evolution

• LLAP
• HCatalog Stream Mutation API
• Cloudbreak

Page 31: Hadoop in adtech

Recent Hadoop evolution

• Hive
  – LLAP
  – ACID, HCatalog Stream Mutation API
• Cloudbreak

Page 32: Hadoop in adtech

Apache Hive: Fast Facts

• Most Queries Per Hour: 100,000 Queries Per Hour
• Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
• Largest Cluster: 4,500+ Nodes (Yahoo)

Page 33: Hadoop in adtech

SQL evolution on Hadoop

[Chart: capabilities plotted along Speed and Feature axes]
• Capabilities: Batch SQL, Interactive SQL, Sub-Second SQL, OLAP / Cube, ACID / MERGE
• Hive: Hive 0.x (MapReduce), Hive 1.2- (Tez, Vectorize, ORC, CBO), Hive 2.0 (LLAP)
• Other engines shown: Presto, Impala, Drill, Spark SQL, HAWQ (MPP), Kylin, Druid
• Commercial: Kyvos Insights, AtScale

Page 34: Hadoop in adtech

Hive 2 with LLAP: Architecture Overview

[Diagram]
• ODBC/JDBC SQL Queries arrive at HiveServer2 (Query Endpoint)
• Query Coordinators: three Coordinators shown
• YARN Cluster: four LLAP Daemons, each with Query Executors
• In-Memory Cache (Shared Across All Users)
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems

Page 35: Hadoop in adtech

Hive 2 with LLAP: Architecture Overview

[Diagram: the LLAP architecture from the previous slide, repeated]

While its architecture is close to MPP, the design picks the best parts of several approaches:
• It has a cache layer
• It uses YARN's scaling capabilities
• Queries that do not need low latency can still be processed in ordinary Tez containers
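To make the cache layer concrete, here is a toy read-through cache placed in front of slow storage and shared by every query. It only illustrates the idea of a cache shared across users; it is not how LLAP is implemented, and the class and function names are invented for the example. There is no eviction, which a real cache would need.

```python
class DeepStorage:
    """Stand-in for HDFS or S3 that counts how often a block is really read."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


class SharedCache:
    """Read-through cache used by all queries, whoever submits them."""

    def __init__(self, storage):
        self.storage = storage
        self.cached = {}

    def get(self, block_id):
        if block_id not in self.cached:
            # Cache miss: fetch from deep storage once and keep the block in memory.
            self.cached[block_id] = self.storage.read(block_id)
        return self.cached[block_id]


def run_query(cache, block_ids):
    """A trivial query: sum every value in the requested blocks."""
    return sum(sum(cache.get(block_id)) for block_id in block_ids)


storage = DeepStorage({"b1": [1, 2, 3], "b2": [4, 5]})
cache = SharedCache(storage)
first = run_query(cache, ["b1", "b2"])  # cold: both blocks come from storage
second = run_query(cache, ["b1"])       # warm: served from memory
print(first, second, storage.reads)
```

The second query never touches deep storage, which is the effect that matters when the data lives in a remote store such as S3.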

Page 36: Hadoop in adtech

Hive 2 with LLAP: 25+x Performance Boost

[Chart: per-query Hive 1 / Tez Time (s) and Hive 2 / LLAP Time (s), where lower is better, with Speedup (x Factor) on a second axis]

Hive 2 with LLAP averages 26x faster than Hive 1

Page 37: Hadoop in adtech

Hive ACID Production-Ready with HDP 2.5

Notable Improvements

• Tested at multi-TB scale using TPC-H benchmark.
  – Reliably ingest 400GB+ per day within a partition.
  – 10TB+ raw data in a single partition.
  – Simultaneous ingest, delete and query.
• 70+ stabilization improvements.
• Supported:
  – SQL INSERT, UPDATE, DELETE.
  – Streaming API.
• Future: SQL MERGE under development (HIVE-10924).

[Chart: Query Time versus Data Size, 16/05/24 to 16/06/01; series: Runtime for All Queries (s), Total Compressed Data]

[Chart: Times for Inserts and Deletes, 16/05/23 to 16/06/01; series: time_insert_lineitem, time_insert_orders, time_delete_lineitem, time_delete_orders]

Page 38: Hadoop in adtech

Hive ACID Production-Ready with HDP 2.5

[Bullets and charts repeated from the previous slide]

A pain point of analytics and aggregation databases has been that data had to be loaded in batches. Being able to do streaming inserts is a big advantage.
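One way to picture how INSERT, UPDATE and DELETE can coexist with append-only storage is a base snapshot plus an ordered list of change records that readers merge on the fly. The sketch below is a deliberately simplified model of that base-plus-delta idea; the record layout and function name are assumptions for illustration and do not mirror Hive's file format.

```python
def read_snapshot(base_rows, deltas):
    """Merge a base snapshot with ordered change records into the current view.

    base_rows maps row_id -> row; deltas is a list of (op, row_id, row) tuples
    in commit order. Both shapes are hypothetical.
    """
    table = dict(base_rows)  # copy so the base snapshot stays untouched
    for op, row_id, row in deltas:
        if op in ("insert", "update"):
            table[row_id] = row
        elif op == "delete":
            table.pop(row_id, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


base = {
    1: {"campaign": "a", "clicks": 10},
    2: {"campaign": "b", "clicks": 7},
}
deltas = [
    ("insert", 3, {"campaign": "c", "clicks": 1}),  # e.g. a streamed-in row
    ("update", 1, {"campaign": "a", "clicks": 12}),
    ("delete", 2, None),
]
current = read_snapshot(base, deltas)
print(current)
```

Because new rows arrive as small change records rather than a rewritten table, inserts can stream in continuously while queries keep reading a consistent merged view.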

Page 39: Hadoop in adtech

HCatalog Stream Mutation API

[Diagram: a Table divided into three Buckets, each backed by ORC files stored on HDFS]
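The diagram shows a table split into buckets. A common way to decide which bucket a record belongs to is to hash a key column and take the remainder by the bucket count. The sketch below shows that routing step; CRC32 is used only as a stable stand-in hash (Hive uses its own hash function), and the field names are made up.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a string key to a bucket index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def route(records, num_buckets, key_field):
    """Group records by bucket so each group can be written to its own files."""
    buckets = {index: [] for index in range(num_buckets)}
    for record in records:
        buckets[bucket_for(record[key_field], num_buckets)].append(record)
    return buckets


# Hypothetical impression events keyed by user.
events = [
    {"user_id": "u1", "event": "impression"},
    {"user_id": "u2", "event": "click"},
    {"user_id": "u1", "event": "click"},
    {"user_id": "u3", "event": "impression"},
]
routed = route(events, num_buckets=3, key_field="user_id")
for index, rows in routed.items():
    print(index, [row["user_id"] for row in rows])
```

All records for the same key land in the same bucket, so a later update or delete for that key only has to touch one bucket.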

Page 40: Hadoop in adtech

Recent Hadoop evolution

• Hive
  – LLAP
  – ACID, HCatalog Stream Mutation API
• Cloudbreak

Page 41: Hadoop in adtech

Cloudbreak

Makes deploying HDP to the cloud easy

1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!

Example Ambari Blueprints: IoT Apps, BI / Analytics, Data Science, Dev / Test
• BI / Analytics (Hive)
• IoT Apps (Storm, HBase, Hive)
• Dev / Test (all HDP services)
• Data Science (Spark)
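For readers who have not seen one, an Ambari blueprint is a JSON document naming a stack and the components placed on each host group. The sketch below builds a small one as a Python dict and serializes it. The blueprint name, the two host groups and the component selection are illustrative assumptions and far from a complete, deployable cluster definition.

```python
import json


def build_blueprint(name, stack_version, worker_count):
    """Return a minimal Ambari-style blueprint as a plain dict (illustrative only)."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",
                "cardinality": "1",
                "components": [
                    {"name": "NAMENODE"},
                    {"name": "RESOURCEMANAGER"},
                    {"name": "HIVE_SERVER"},
                ],
            },
            {
                "name": "worker",
                "cardinality": str(worker_count),
                "components": [
                    {"name": "DATANODE"},
                    {"name": "NODEMANAGER"},
                ],
            },
        ],
    }


blueprint = build_blueprint("bi-analytics", "2.5", worker_count=3)
blueprint_json = json.dumps(blueprint, indent=2)
print(blueprint_json)
```

Step 1 on the slide corresponds to choosing a document like this; steps 2 and 3 are where Cloudbreak provisions cloud instances and applies the blueprint to them.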

Page 42: Hadoop in adtech

Recent Hadoop evolution: in summary…

• Hive
  – LLAP
  – ACID, HCatalog Stream Mutation API
• Cloudbreak

Page 43: Hadoop in adtech

Recent Hadoop evolution: toward coexisting well with the cloud

[Diagram: a cluster with three Cache nodes]
• Real-time data collection
• On-demand cluster deployment inside and outside the cloud
• Low-latency query processing while making use of cloud storage