storm demo talk - denver apr 2015
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop Big Data for Business
Shane Kumpf & Mac Moore, Solutions Engineers, Hortonworks, April 2015
Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved © Hortonworks Inc. 2012 Professional Services
Agenda
§ Introduction & about Hortonworks HDP
§ Overview of logistics industry scenario
§ Overview of streaming architecture on HDP
§ Streaming Demo #1
§ Integrating Predictive Analytics in streaming scenarios
§ Streaming Demo with Predictive additions
§ Q & A
Preface: Enabling Technologies
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would have been too cost-prohibitive, or simply impossible, beforehand.
• Where foundation tech like Li-Ion batteries, retina displays, & tiny HD cameras (from smartphones) have enabled electric cars, quadcopters, VR displays, & more…
• Hadoop has similarly led to breakthroughs in big data capability, and enables new real-time advanced analytic applications.
Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Chart: business value, laggards vs. industry leaders. Traditional data: ERP, CRM, SCM. New data: clickstream, geolocation, web data, Internet of Things, docs, emails, server logs. Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020.]
Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.
Customer Momentum
• 330+ customers (as of year-end 2014)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers, operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners
Customer Partnerships matter
Driving our innovation through Apache Software Foundation Projects
Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           34           27
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           10           n/a
TOTAL            161          108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF
• Expertise: Uniquely capable to solve the most complex issues & ensure success with latest features
• Connection: Provide customers & partners direct input into the community roadmap
• Partnership: We partner with customers with subscription offering. Our success is predicated on yours.
[Chart values: 27 (unlabeled); Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23; Yahoo: 10]
Technology Partnerships matter
Hortonworks relationship columns: Named Partner, Certified Solution, Resells, Joint Engr
Microsoft    ✓ ✓ ✓ ✓
HP           ✓ ✓ ✓ ✓
SAS          ✓ ✓ ✓
SAP          ✓ ✓ ✓ ✓
IBM          ✓ ✓ ✓
Pivotal      ✓ ✓ ✓
Redhat       ✓ ✓ ✓
Teradata     ✓ ✓ ✓ ✓
Informatica  ✓ ✓ ✓
Oracle       ✓ ✓
It is not just about packaging and certifying software…
Our joint engineering with our partners drives open source standards for Apache Hadoop
HDP is Apache Hadoop
HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile enough to handle any applications and datasets, no matter the size or type
[Diagram: Modern Data Architecture]
Sources: existing systems (ERP, CRM, SCM); clickstream; web & social; geolocation; sensor & machine; server logs; unstructured
Data system: HDFS (Hadoop Distributed File System) and YARN: Data Operating System. Workloads and systems shown: Interactive, Real-Time, Batch, Partner ISV, MPP, EDW
Analytics: data marts, applications, business analytics, visualization & dashboards
Real World Use Case: Trucking Company
Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a given route; an event could be:
§ 'Normal' events: starting / stopping of the vehicle
§ 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time
Route? Truck? Driver? Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
Trucking Company’s YARN-enabled Architecture
Truck Sensors
Inbound Messaging (Kafka)
Stream Processing (Storm)
Real-time Serving (HBase)
Alerts & Events (ActiveMQ)
Real-Time User Interface
SQL: Interactive Query (Hive on Tez)
Many Workloads: YARN
Distributed Storage: HDFS
One cluster with consistent security, governance & operations
What is Kafka? (Apache Kafka)
§ High throughput distributed messaging system
§ Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
§ Kafka @LinkedIn:
  § 800 billion messages per day
  § 175 terabytes of data written per day
  § 650 terabytes of data read per day
  § Over 13 million messages/2.75GB of data per second
[Diagram: three producers, the Kafka cluster, three consumers]
Kafka: Anatomy of a Topic (Apache Kafka)
[Diagram: a topic split into Partition 0, Partition 1, and Partition 2. Each partition is an ordered log of messages numbered from 0 (old) upward, to as high as 12 (new). Writes append at the new end.]
§ Partitioning allows topics to scale beyond a single machine/node
§ Topics can also be replicated, for high availability.
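The partition picture above can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: `ToyTopic`, `partition_for`, and the use of CRC32 as the hash are all assumptions made for illustration. It shows only the idea that a key picks a partition and each partition is an append-only log with increasing offsets.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition (CRC32 is an illustrative stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyTopic:
    """In-memory model of a topic: one append-only log per partition."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str):
        """Append to the key's partition and return (partition, offset)."""
        partition = partition_for(key, len(self.partitions))
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1


topic = ToyTopic(num_partitions=3)
for n in range(4):
    # All events for one truck share a key, so they land in one partition, in order.
    topic.append("truck-17", f"event-{n}")
```

Keying by truck keeps each truck's events ordered within one partition, while different trucks can spread across partitions and therefore across machines.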
Apache Storm
• Distributed, real-time, fault-tolerant stream processing platform.
• Provides processing guarantees.
• Key concepts include:
  • Tuples
  • Streams
  • Spouts
  • Bolts
  • Topology
Tuples and Streams
• What is a Tuple?
  – Fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
  – An unbounded sequence of tuples.
  – The core abstraction in Storm, and what you “process” in Storm
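A rough Python analogy for these two concepts, not Storm code: a tuple is modeled as a `NamedTuple` and a stream as a generator that never ends on its own. The field names and the generated values are hypothetical.

```python
import itertools
from typing import Iterator, NamedTuple


class TruckEvent(NamedTuple):
    """A tuple: named fields, each of which can hold any type."""
    driver_id: int
    truck_id: int
    event_type: str


def truck_event_stream() -> Iterator[TruckEvent]:
    """An unbounded stream: the generator never ends on its own."""
    for n in itertools.count():
        event_type = "Overspeed" if n % 5 == 4 else "Normal"
        yield TruckEvent(driver_id=10 + n % 3, truck_id=100 + n % 3, event_type=event_type)


# Consumers take as many tuples as they need; here, the first five.
first_five = list(itertools.islice(truck_event_stream(), 5))
```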
Spouts
• What is a Spout?
  – Generates, or is a source of, Streams
  – E.g.: JMS, Twitter, Log, Kafka Spout
  – Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
  – Processes any number of input streams and produces output streams
  – Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
  – Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
  1. HBaseBolt: persisting and counting in HBase
  2. HDFSBolt: persisting into HDFS as Avro files using Flume
  3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
Topology
• What is a Topology?
  – A network of spouts and bolts wired together into a workflow
Truck-Event-Processor Topology
[Diagram: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt, and WebSocket Bolt connected by streams]
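To make the wiring concrete, here is a minimal single-process sketch in Python. It is not the Storm API and not the demo's code: `kafka_spout`, `CountingBolt`, `run_topology`, and the `driver_id|event_type` message format are assumptions, and an in-memory counter stands in for HBase. It shows a spout feeding tuples through a counting bolt and a monitoring bolt that alerts once a driver's violation count exceeds a threshold.

```python
from collections import Counter


def kafka_spout(raw_messages):
    """Stand-in spout: parses 'driver_id|event_type' strings into tuples (dicts)."""
    for message in raw_messages:
        driver_id, event_type = message.split("|")
        yield {"driver_id": driver_id, "event_type": event_type}


class CountingBolt:
    """Stand-in for the HBase bolt: counts violation events per driver in memory."""

    def __init__(self):
        self.violations = Counter()

    def process(self, event):
        if event["event_type"] != "Normal":
            self.violations[event["driver_id"]] += 1


class MonitoringBolt:
    """Raises one alert per driver whose violation count exceeds the threshold."""

    def __init__(self, counts, threshold):
        self.counts = counts
        self.threshold = threshold
        self.alerts = []

    def process(self, event):
        driver_id = event["driver_id"]
        if self.counts[driver_id] > self.threshold and driver_id not in self.alerts:
            self.alerts.append(driver_id)


def run_topology(raw_messages, bolts):
    """Wire the spout to the bolts: every tuple visits each bolt in order."""
    for event in kafka_spout(raw_messages):
        for bolt in bolts:
            bolt.process(event)


counter = CountingBolt()
monitor = MonitoringBolt(counter.violations, threshold=2)
run_topology(
    ["d1|Overspeed", "d1|Normal", "d1|Unsafe tail distance", "d2|Overspeed", "d1|Overspeed"],
    [counter, monitor],
)
```

In a real topology each spout and bolt runs as many parallel instances across the cluster; the sequential loop here only illustrates the data flow.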
Key Constructs in Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
  – Updateable records
  – Versioned records
  – Distributed across a cluster of machines
  – Low latency
  – Caching
• Popular use cases:
  – User profiles and session state
  – Object store
  – Sensor apps
Data Assignment
[Diagram: HBase table] Keys within HBase are divided among different RegionServers
Data Access
• Get
  – Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey
• Put
  – Inserts a new version of a cell.
• Scan
  – Reads the whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key
• Delete
  – Actually a version of put (adds a new version with a deletion marker)
• SQL via Apache Phoenix
  – Unique capability in the NoSQL market
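The four operations can be illustrated with a toy in-memory table in Python. This is not the HBase client API: `ToyTable` and its method signatures are hypothetical, and a counter stands in for timestamps. The sketch assumes the scan range includes the start key and excludes the end key. It shows that a put adds a version rather than overwriting, and that a delete is itself a put of a marker.

```python
import itertools


class ToyTable:
    """In-memory stand-in for one HBase table with versioned cells."""

    _TOMBSTONE = object()  # deletion marker stored as just another version

    def __init__(self):
        self._cells = {}                 # (rowkey, column) -> [(version, value), ...]
        self._clock = itertools.count(1)

    def put(self, rowkey, column, value):
        """Insert a new version of a cell; older versions are kept."""
        self._cells.setdefault((rowkey, column), []).append((next(self._clock), value))

    def delete(self, rowkey, column):
        """A delete is a put of a deletion marker."""
        self.put(rowkey, column, self._TOMBSTONE)

    def get(self, rowkey, column):
        """Return the newest value, or None if the cell is missing or deleted."""
        versions = self._cells.get((rowkey, column))
        if not versions or versions[-1][1] is self._TOMBSTONE:
            return None
        return versions[-1][1]

    def scan(self, start_key, stop_key):
        """Yield live cells with start_key <= rowkey < stop_key, in key order."""
        for rowkey, column in sorted(self._cells):
            if start_key <= rowkey < stop_key:
                value = self.get(rowkey, column)
                if value is not None:
                    yield rowkey, column, value


table = ToyTable()
table.put("driver-10", "violations", 1)
table.put("driver-10", "violations", 2)   # second version; the first is still stored
table.put("driver-11", "violations", 5)
```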
[Timeline labels: 2006, 2009, October 23, 2013. Stages shown: Hadoop w/ MapReduce; MR-279: YARN; Hadoop 2 & YARN]
Hadoop w/ MapReduce: HDFS (Hadoop Distributed File System) with MapReduce, largely batch processing. Silo’d clusters, largely batch system, difficult to integrate.
Hadoop 2 & YARN based architecture: HDFS (Hadoop Distributed File System) with YARN: Data Operating System, running interactive, real-time, and batch workloads.
Architected & led development of YARN to enable the Modern Data Architecture
Benefits of YARN as the Data Operating System
• The container-based model allows for running nearly any workload.
  – Enables the centralized architecture.
  – No longer is MapReduce the only data processing engine.
  – Docker containers managed by YARN. Yes Please!
• Decouples resource scheduling from application lifecycle.
  – Improved scalability and fault tolerance
• Dynamically allocated resources, resulting in HUGE utilization gains
  – Versus static allocation of “slots” in Hadoop 1.0
Yahoo has over 30,000 nodes running YARN across over 365PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
They also have estimated a 60% to 150% improvement on node usage per day since moving to YARN.
Apache HDFS: Hadoop Distributed File System
• Very large scale distributed file system
  • 10K nodes, tens of millions of files and PBs of data
  • Supports large files
• Designed to run on commodity hardware, assumes hardware failures
  • Files are replicated to handle hardware failure
  • Detects failures and recovers from them automatically
• Optimized for large scale processing
  • Data locations are exposed so that the computations can move to where data resides
• Data coherency
  • Write once and read many times access pattern
• Files are broken up into chunks called ‘blocks’
  • Blocks are distributed over nodes
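The block and replica idea can be sketched with simple arithmetic in Python. The 128 MB block size, replication factor of 3, node names, and round-robin placement are assumptions for illustration. Real HDFS values are configurable and its placement policy is more involved.

```python
def split_into_blocks(file_size: int, block_size: int) -> list:
    """Return the length of each block; only the last block may be short."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, nodes: list, replication: int) -> list:
    """Round-robin placement: a simplification, not HDFS's actual placement policy."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)   # two full blocks plus a 44 MB remainder
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
```

Because each block lives on several nodes, losing one node leaves other copies available, and a scheduler can run a task on any node that already holds the block.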
Streaming Demo - High Level Architecture
[Diagram components: Truck Streaming Data from trucks T(1), T(2), … T(N); Inbound Messaging (Kafka) with the Truck Events topic; Storm Stream Processing on YARN with the Kafka Spout, HBase Bolt, HDFS Bolt, and Monitoring Bolt; HBase with the Dangerous Events table and Truck Events; ActiveMQ; Web App; Distributed Storage: HDFS]
CDO’s vision: Build a Predictive Business, not a Reactive one
CDO’s Requirements
§ Offline predictions
  § Identify investments that will increase safety and reduce company’s liabilities
§ Real-time predictions
  § Anticipate driver violations before they happen and take precautionary actions
Data Scientist’s Response
§ Need to explore data & form a hypothesis
§ Verify trends against TBs of events data via machine learning
§ Generate predictive models with Spark MLlib on HDP
§ Plug models into the Storm topology to predict driver violations in real-time
♬ I’ve been waiting for this moment all my life ♬
Analyzing Raw Events – dangerous drivers
Analyzing Raw Events – dangerous routes
Analyzing Raw Events – violations by location
Enriching truck events for analysis with Pig
[Diagram: Pig pipeline on HDFS with HCatalog (metadata). Inputs: raw truck events, raw weather data (weather data sets), and payroll data from HR & payroll DBs. Steps: load raw truck events; clean & filter (cleaned events); transform (transformed events); join with HR & weather data (enriched events); store. Enriched events feed Tableau.]
Analyzing Enriched Events – noncertified and fatigued drivers more dangerous
Analyzing Enriched Events – top 3 dangerous routes seem to be driven by fatigued drivers
Analyzing Enriched Events – foggy weather leads to violations
Analyzing Enriched Events – but top 3 safest routes are also foggy
Building the Predictive Model on HDP
1. Explore a small subset of events in Tableau to identify predictive features and make a hypothesis. E.g. hypothesis: “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model; we will use classification algorithms as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib; many ML libs expect training data in a certain format
4. Train a logistic regression model in Spark on YARN, with above events as training input, and iterate to fine tune the generated model
5. Integrate Spark MLlib model in a Storm bolt to predict violations in real time
Integrate Predictive Analytics in Stream Processing
[Diagram components: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm) with the Prediction Bolt; Real-time Serving (HBase); Interactive Query (Hive on Tez); Machine Learning (Spark); millions of enriched truck events; YARN; HDFS]
• Train Spark ML model with millions of truck events
• Plug Spark model into Storm bolt
Streaming Demo - Updated Architecture
[Diagram components: Truck Streaming Data from trucks T(1), T(2), … T(N); Inbound Messaging (Kafka) with the Truck Events topic; Storm Stream Processing on YARN with the Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt, and Prediction Bolt; HBase with the PayRoll table and Truck Events; ActiveMQ; Web App; Distributed Storage: HDFS]
• Enrich event
• Predict violation in real time & alert via MQ
• Render real-time predictions on UI
Transforming training data for Spark MLlib
Enriched Events Data
Event Type   Is Driver Certified?   Wage Plan   Hours Driven   Miles Driven   Longitude   Latitude   Weather Foggy   Weather Rainy   Weather Windy
Normal       Yes                    Hourly      45             2721           -91.3       38.14      No              No              No
Overspeed    No                     Miles       72             4152           -94.23      37.09      Yes             Yes             No
…            …                      …           …              …              …           …          …               …               …
Spark MLlib Training Data
Label   Is Driver Certified?   Wage Plan   Hours Driven   Miles Driven   Weather Foggy   Weather Rainy   Weather Windy
0       1                      1           0.45           0.2721         0               0               0
1       0                      0           0.72           0.4152         1               1               0
…       …                      …           …              …              …               …               …
• Normal events labeled as 0 and violation events as 1
• Feature scaling applied to hours and miles to improve algorithm performance
• Features with binary values denoted as 0 and 1
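The transformation in the two tables can be sketched in Python as follows. The dictionary keys and function names are hypothetical, and the scale factors (100 for hours, 10,000 for miles) are inferred from the two example rows rather than stated in the slides. The `to_libsvm` helper writes the sparse `label index:value` text format that MLlib can load. Whether the demo used that format is an assumption.

```python
HOURS_SCALE = 100.0      # inferred from the example rows: 45 -> 0.45
MILES_SCALE = 10000.0    # inferred from the example rows: 2721 -> 0.2721


def encode_event(event: dict) -> list:
    """Return [label, certified, wage_plan, hours, miles, foggy, rainy, windy]."""
    yes_no = {"Yes": 1, "No": 0}
    return [
        0 if event["event_type"] == "Normal" else 1,   # label: violation = 1
        yes_no[event["certified"]],
        1 if event["wage_plan"] == "Hourly" else 0,
        event["hours"] / HOURS_SCALE,
        event["miles"] / MILES_SCALE,
        yes_no[event["foggy"]],
        yes_no[event["rainy"]],
        yes_no[event["windy"]],
    ]


def to_libsvm(row: list) -> str:
    """Format a row as 'label index:value ...' with 1-based indices, skipping zeros."""
    label, features = row[0], row[1:]
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


overspeed = {"event_type": "Overspeed", "certified": "No", "wage_plan": "Miles",
             "hours": 72, "miles": 4152, "foggy": "Yes", "rainy": "Yes", "windy": "No"}
encoded = encode_event(overspeed)
```

Longitude and latitude are dropped because the training table on the slide does not include them.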
Running Spark ML on YARN
1. Run the spark-submit script to launch a Spark job on YARN (/user/root/truck_training is the training data location on HDFS):
spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 truckml.jar --algorithm LR --regType L2 --regParam 1.0 /user/root/truck_training --numIterations 100
2. Monitor progress of the Spark job in the YARN Resource Manager UI
Interpreting Spark Logistic Regression Results
Precision: 87.5%
Recall: 88%
Top three predictors of violations
1. Foggy Weather
2. Rainy Weather
3. Driver Certification
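For readers unfamiliar with the two metrics, a short Python sketch of how they are computed. The counts below are hypothetical, chosen only so that they reproduce the reported percentages. The slides do not give the underlying confusion matrix.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Of the events predicted as violations, the fraction that really were."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Of the real violations, the fraction the model caught."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts, not from the demo.
TRUE_POS, FALSE_POS, FALSE_NEG = 154, 22, 21
model_precision = precision(TRUE_POS, FALSE_POS)
model_recall = recall(TRUE_POS, FALSE_NEG)
```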
Integrating Spark model in Storm
Storm Prediction Bolt
§ Initialize Spark model
§ Parse truck event
§ Enrich event with HBase data
§ Predict violation with model
§ Send alert if violation predicted
[Diagram components: Kafka Spout; Storm Prediction Bolt; Real-time Serving (HBase); ActiveMQ; Ops Center; LOB Dashboards]
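The five steps of the prediction bolt can be sketched in plain Python. This is not Storm or Spark code: the class, the `driver_id,foggy,rainy,windy` event format, the dictionary standing in for the HBase lookup, and the weights are all made up for illustration. Only the logistic scoring formula itself is standard.

```python
import math


class PredictionBolt:
    """Scores each truck event with a logistic regression model and records alerts."""

    def __init__(self, weights, intercept, driver_table, threshold=0.5):
        self.weights = weights            # model initialized once, when the bolt starts
        self.intercept = intercept
        self.driver_table = driver_table  # dict standing in for the HBase lookup
        self.threshold = threshold
        self.alerts = []

    def process(self, raw_event: str) -> float:
        # Parse: hypothetical 'driver_id,foggy,rainy,windy' format with 0/1 weather flags.
        driver_id, foggy, rainy, windy = raw_event.split(",")
        # Enrich: certification, wage plan, scaled hours and miles for this driver.
        certified, hourly, hours, miles = self.driver_table[driver_id]
        features = [certified, hourly, hours, miles, int(foggy), int(rainy), int(windy)]
        # Predict: logistic function of the weighted feature sum.
        z = self.intercept + sum(w * x for w, x in zip(self.weights, features))
        probability = 1.0 / (1.0 + math.exp(-z))
        # Alert when the predicted violation probability reaches the threshold.
        if probability >= self.threshold:
            self.alerts.append((driver_id, probability))
        return probability


drivers = {"10": (1, 1, 0.45, 0.2721), "11": (0, 0, 0.72, 0.4152)}
bolt = PredictionBolt(
    weights=[-1.0, 0.0, 0.5, 0.5, 2.0, 1.5, 0.2],  # made-up weights, not the trained model
    intercept=-1.5,
    driver_table=drivers,
)
bolt.process("10,0,0,0")   # certified driver, clear weather
bolt.process("11,1,1,0")   # uncertified driver, fog and rain
```

In the demo the alert would be sent to ActiveMQ rather than appended to a list.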
Value of large scale ML on HDP
§ Accelerate time to market/value
  § Test out multiple ML algorithms against TBs of training data in reasonable time frames
§ Confirm hypothesis against TBs of training data with confidence
  § We confirmed that fog does impact safety and wage plans do not, whereas BI tools indicated otherwise
§ Easily integrate predictive models in data driven apps
  § Run predictive models in Storm or any other app in your enterprise
§ Run all of the above in a multi-tenant YARN cluster
  § Large scale ML on YARN respects other tenants in an HDP cluster
Recommendations to CDO
§ Investment recommendations, in order of priority
  1. Invest in visibility sensors and auto braking systems to deal with foggy conditions
  2. Invest in slip resistant tires to fight rainy conditions
  3. Invest in certifying drivers to reduce violation probability
§ Power of real-time predictions
  § 40% reduction in violation rates by predicting high risk situations in real-time and sending immediate alerts to drivers