The Analytics Frontier of the Hadoop Eco-System

Ted Willke, Senior Principal Engineer and GM

Upload: insidehpc

Post on 28-Nov-2014


DESCRIPTION

In this video from the ISC Big Data'14 Conference, Ted Willke from Intel presents: The Analytics Frontier of the Hadoop Eco-System. "The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications. What’s next beyond Spark? Where is big data analytics processing headed? How will data scientists program these systems? In this talk, we will explore the current analytics frontier, the popular debates, and discuss some potentially clever additions. We will also share the emergent data science applications and collaborative university research that inform our thinking." Learn more: http://www.isc-events.com/bigdata14/schedule.html and http://www.intel.com/content/www/us/en/software/intel-graph-solutions.html Watch the video presentation: https://www.youtube.com/watch?v=qlfx495Ekw0

TRANSCRIPT

Page 1: The Analytics Frontier of the Hadoop Eco-System

The Analytics Frontier of the Hadoop Eco-system

Ted Willke Senior Principal Engineer and GM

Page 2: The Analytics Frontier of the Hadoop Eco-System

• Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining

• Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model

• YARN opened door for in-memory iterative processing with Apache Spark, with its own libraries and others being ported

Today | Hadoop Analytics

Page 3: The Analytics Frontier of the Hadoop Eco-System

• Variety - Expansion of data primitives in commercial use

• Speed - Data processing models evolving (batch → streaming)

• Complexity – Monolithic analytics → analytics pipelines

• Intelligence – Prescriptive ML Applying ML to ML itself

• Ease of Use – Gap between skills and needs growing

Trends | Hadoop Eco-System Analytics

Page 4: The Analytics Frontier of the Hadoop Eco-System

Life Sciences: Personalized medicine, drug repurposing predictions, integration of heterogeneous data

Education: Personalized instruction, outcomes measurement and intervention

Network Security: Data fusion, threat assessment and identification

Retail: Inventory management, product display management, demand forecasting

Trends | Areas of Application (to name a few)

Page 5: The Analytics Frontier of the Hadoop Eco-System

Variety

Page 6: The Analytics Frontier of the Hadoop Eco-System

Variety | Primitives Usage Patterns

Model: Key-Value, Document, Graph, SQL, Column

Access: Off-line (Queue), Async (Bus), Sync (I/O)

Implementation: API (Remote), LIB (Local)

Page 7: The Analytics Frontier of the Hadoop Eco-System

• When the problem is an information network

• When a graph is a natural way of expressing the algorithm

• When you want to study specific relationships

• When you want faster machine learning or solvers on sparse data

shortest path

central

influence

sub networks

triangle count

Variety | Graphical Model
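
Two of the graph measures named on this slide, shortest path and triangle count, can be sketched in a few lines. This is a minimal single-machine illustration on a made-up five-vertex graph; the function names and the edge list are assumptions for the example, not part of the talk.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source not in adj or target not in adj:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def triangle_count(adj):
    """Count each triangle once by requiring u < v < w on its vertices."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbours w > v close a triangle u < v < w.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


# Hypothetical toy graph: one triangle (a, b, c) with a tail c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
graph = build_graph(edges)
```

At cluster scale the same questions are answered by graph engines rather than by an in-memory dictionary; the sketch only shows what the measures compute.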

Page 8: The Analytics Frontier of the Hadoop Eco-System

High

Program Importance (Centrality)

Low

Graph of channel viewing behavior

Current popular surfing patterns

SH002463130000 EP005544723744

Changes in surfing behavior may predict customer churn.

Variety | Graph Statistics

Page 9: The Analytics Frontier of the Hadoop Eco-System

Preference and Similarity Recommendations

User

Movie

1.7MM Nodes 23.9MM Edges

similar cast

prefers

similar topic

userId: A0A22A5

title: The Godfather genre: Crime drama cast: [M. Brando, Al Pacino]

title: Scarface genre: Crime drama cast: [Al Pacino, M. Pfeiffer]

title: The Departed genre: Crime drama cast: [L. DiCaprio, M. Damon]

weight=11.8

weight=0.67

weight=0.03

weight=14.98

Variety | Graph Search

Page 10: The Analytics Frontier of the Hadoop Eco-System


URL Ground-Truth Data

IP/Domain Reputations

420MM Records

74.5MM Nodes 185MM Edges

URL

Domain

IP Address

Calculation of priors

LBP Messaging 84.231.82.93

86.39.155.137

forum.vsichko.com

hermansonskok.se

euskzzbz.nonetheups.com

keesenbep.spaces.live.com

Variety | Graphical Machine Learning

Page 11: The Analytics Frontier of the Hadoop Eco-System

Variety | Loopy Belief Propagation on the (semantic) web

Reputations

Neutral

Good

Bad

Suspect

Page 12: The Analytics Frontier of the Hadoop Eco-System

Variety | Unification with Apache Spark

Image Source: Databricks

• In-memory structures (RDDs) support both table and graph abstractions

• Batch processing and Spark streaming

Spark: RDDs, Transformations, and Actions

Spark Streaming (real-time): DStreams (streams of RDDs)

Spark SQL: SchemaRDDs

MLlib (machine learning): RDD-based matrices

GraphX (graph processing/machine learning): RDD-based graphs

Page 13: The Analytics Frontier of the Hadoop Eco-System

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Variety | Unification within the In-Memory Database (IMDB)

• Index data structure for graph traversal

• Prototyped in SAP HANA distributed columnar IMDB

• Lays foundation for complex graph query and algorithms

Page 14: The Analytics Frontier of the Hadoop Eco-System

Variety | Graph Traversal

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 15: The Analytics Frontier of the Hadoop Eco-System

Variety | Graph Indexing

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 16: The Analytics Frontier of the Hadoop Eco-System

Variety | Graph Traversal Results

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Page 17: The Analytics Frontier of the Hadoop Eco-System

Speed

Page 18: The Analytics Frontier of the Hadoop Eco-System

Cloud Infrastructure

UI

Data Platform

Analytics Platform

Datacenter Network Gateway Thing

Services

Speed | Hadoop Meets The Internet of Things

Page 19: The Analytics Frontier of the Hadoop Eco-System

Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014

Data Stream

Feature Processing

Model Updates

Learning

Distributed Messaging System

(e.g., Kafka)

Speed | Stream Processing Pipeline

Page 20: The Analytics Frontier of the Hadoop Eco-System

• Data replay (e.g., a bug is found or application improved)

• Getting faster and more efficient than “fast batch”

• Time-evolving models and computation

Speed | Challenges

Page 21: The Analytics Frontier of the Hadoop Eco-System

Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014

Lambda

• Implement transform logic twice

• Federate information at query time

• Retains input data unchanged

Kappa

• Retain full replay window

• 2nd instance can re-process

• Query against latest table

The thinking continues to evolve....

Speed | Cluster-Scale Stream Processing
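
The Kappa-style bullets (retain the full replay window, let a second instance re-process, query the latest table) can be illustrated with a toy in-memory log. This is a sketch under stated assumptions: the event format, the two transform functions, and `run_job` are hypothetical, and a real deployment would replay from a retained log in a messaging system rather than from a Python list.

```python
def transform_v1(event):
    """Original transform logic."""
    return event["value"]


def transform_v2(event):
    """Improved logic after a bug fix: negative readings count as zero."""
    return max(event["value"], 0)


def run_job(log, transform):
    """Replay the retained log from the start and build a fresh output table."""
    table = {}
    for event in log:
        table[event["key"]] = table.get(event["key"], 0) + transform(event)
    return table


# The retained input log is never modified, so it can be replayed at will.
log = [
    {"key": "sensor-1", "value": 5},
    {"key": "sensor-2", "value": -3},
    {"key": "sensor-1", "value": 2},
]

serving_table = run_job(log, transform_v1)    # current instance
candidate_table = run_job(log, transform_v2)  # 2nd instance re-processes the log
serving_table = candidate_table               # queries now hit the latest table
```

The transform logic exists once per version and runs through the same streaming path, which is the contrast with implementing it twice in a Lambda-style design.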

Page 22: The Analytics Frontier of the Hadoop Eco-System

Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014

Apply DStreams to built-in:

• Machine Learning

• Graph Processing

Speed | Spark Streaming

Page 23: The Analytics Frontier of the Hadoop Eco-System

Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014

• Mini-batch +/- windowing

• Analytics can be run on any of the resultant RDDs

• No provisions for merging RDDs

Speed | Spark (Discretized) Streaming

(Mini-) batch Streaming (Mini-) batch Analytics
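
Mini-batching with optional windowing can be sketched without Spark. The example below is a plain-Python illustration of the idea only and does not use the Spark Streaming API; `mini_batches`, `sliding_windows`, and the sample events are assumptions made for the example.

```python
def mini_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // batch_interval, []).append(value)
    if not batches:
        return []
    last = max(batches)
    # Intervals with no events still produce an (empty) batch.
    return [batches.get(i, []) for i in range(last + 1)]


def sliding_windows(batches, window_len, slide):
    """Yield the concatenation of `window_len` batches, advancing by `slide`."""
    for end in range(window_len, len(batches) + 1, slide):
        window = []
        for batch in batches[end - window_len:end]:
            window.extend(batch)
        yield window


# Hypothetical stream of (timestamp, value) pairs, timestamps starting at 0.
events = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8)]
batches = mini_batches(events, batch_interval=2)
window_sums = [sum(w) for w in sliding_windows(batches, window_len=2, slide=1)]
```

Each window is analyzed on its own, which mirrors the point that analytics run per resulting batch with no provision for merging results across batches.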

Page 24: The Analytics Frontier of the Hadoop Eco-System

Image Source: GraphX project

• Graph processing engine on Spark

• Supports Pregel-style vertex programming

• View same data as either graphs or collections

Speed | GraphX API for Spark

Page 25: The Analytics Frontier of the Hadoop Eco-System

• Current Spark streaming provides mini-batch streaming

• No concept of data (model) merging

• GraphX is currently designed for static graphs:

1. Merge table data prior to graph pipeline

2. Re-generate entire (accumulated) graph

3. Re-run machine learning at each window

Speed | Spark Streaming for GraphX?

Straightforward, but wastes computation and time. Can we do better?

Page 26: The Analytics Frontier of the Hadoop Eco-System

• Merge information directly into data model used by algorithms

• Static algorithms -> Online algorithms

• Incremental re-computation triggered by changes in data values or data structure

• Possible with many machine learning algos (PageRank as example)

• Evolve IM data stores to maximize performance and freshness

• Better partitioning algorithms → reduced data replication

• Dynamic indexing → fast retrieval

Speed | Online Version of GraphX

Page 27: The Analytics Frontier of the Hadoop Eco-System

Static PageRank (delta method)

Online PageRank

Speed | Online PageRank

Good for algos with abelian accumulators (commutative, associative, with inverse)
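
Why an inverse matters can be shown with the per-vertex sum that PageRank accumulates. The sketch below is an illustration of the accumulator idea only, not the implementation behind the talk: when an edge arrives, the source vertex's old contributions are retracted (the inverse operation) and its new ones are added, so only the touched accumulators change. The function names and the three-vertex graph are assumptions, and a full online PageRank would go on to propagate the resulting rank changes until convergence.

```python
RESET = 0.15  # reset probability, matching the value quoted in the results slide


def contributions_from_scratch(out_edges, rank):
    """Recompute every vertex's incoming sum by scanning all edges."""
    acc = {v: 0.0 for v in rank}
    for u, targets in out_edges.items():
        for v in targets:
            acc[v] += rank[u] / len(targets)
    return acc


def add_edge_incremental(out_edges, rank, acc, u, v):
    """Patch only the accumulators touched by the new edge u -> v."""
    targets = out_edges.setdefault(u, set())
    if v in targets:
        return
    # Retract u's old per-edge contributions (inverse of the earlier additions).
    for t in targets:
        acc[t] -= rank[u] / len(targets)
    targets.add(v)
    # Add u's new per-edge contributions, now split over one more edge.
    for t in targets:
        acc[t] += rank[u] / len(targets)


def rank_from_acc(acc):
    """One PageRank update step from the accumulated incoming sums."""
    return {v: RESET + (1 - RESET) * s for v, s in acc.items()}


# Hypothetical three-vertex cycle with uniform starting ranks.
out_edges = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}
acc = contributions_from_scratch(out_edges, rank)

add_edge_incremental(out_edges, rank, acc, "a", "c")
expected = contributions_from_scratch(out_edges, rank)
```

Because addition is commutative and associative, the patches can be applied in any order; because it has an inverse, stale contributions can be removed without rescanning the graph. An accumulator such as max has no inverse and would need a recomputation instead.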

Page 28: The Analytics Frontier of the Hadoop Eco-System

[Figure: Convergence Rate (normalized execution time) vs. Throughput (edges/second), 50K to 1M; series: naive, incremental]

[Figure: Communication Overhead (normalized messages sent) vs. Throughput (edges/second), 50K to 1M; series: naive, incremental]

• Algorithm: PageRank

• Reset probability: 0.15

• Convergence Threshold: 0.001

• 1 Master + 3 workers

• Distributed Messaging System: Kafka

• Spark 1.1.0 + our graph streaming

Speed | (Really) Early Results for Online PageRank

Page 29: The Analytics Frontier of the Hadoop Eco-System

Complexity

Page 30: The Analytics Frontier of the Hadoop Eco-System

Complexity | Challenges

• Feature Engineering for Data Science

• Monolithic Analytics → Complex Pipeline Analytics

Page 31: The Analytics Frontier of the Hadoop Eco-System

Complexity | Directed Acyclic Graphs of Actions

• Common in Data Science “feature engineering”

• Developed iteratively

• Becomes a new tool in the toolbox

[Diagram: directed acyclic graph of actions with nodes A, A, B, C]
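
A pipeline expressed as a directed acyclic graph of actions can be run by visiting the steps in topological order. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later); the step names A, B, and C echo the diagram labels, while the functions attached to them and `run_pipeline` itself are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies, data):
    """Run each step after its prerequisites.

    `dependencies` maps a step name to the set of steps it depends on.
    Each step receives the raw data plus its prerequisites' results,
    passed in sorted order of prerequisite name.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        inputs = [results[d] for d in sorted(dependencies.get(name, ()))]
        results[name] = steps[name](data, *inputs)
    return order, results


# Hypothetical feature-engineering actions.
steps = {
    "A": lambda data: data.split(),                # tokenize
    "B": lambda data, a: [len(w) for w in a],      # token lengths
    "C": lambda data, a, b: dict(zip(a, b)),       # combine into features
}
dependencies = {"A": set(), "B": {"A"}, "C": {"A", "B"}}

order, results = run_pipeline(steps, dependencies, "big data analytics")
```

Once a DAG like this is stable it can be wrapped as a single reusable step, which is one reading of "becomes a new tool in the toolbox".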

Page 32: The Analytics Frontier of the Hadoop Eco-System

Source: ISTC-Pervasive Computing

Discriminative structures come at multiple scales and varying deformations

Complexity | Hierarchical Matching Pursuit for Image Classification

• Feature learning

• Multiple layers to learn

• Multipath sparse coding

Page 33: The Analytics Frontier of the Hadoop Eco-System

Source: ISTC-Pervasive Computing

• Robustness

– Local deformations such as translation, rotation and scaling

– Lighting condition changes

– Viewpoint and pose changes

– Large intra-class variations

• Hierarchy

– Sparse data: The total number of possible image patches grows exponentially with their sizes

– Shared structure: Large patches could share similar or even same small patches

Complexity | Robust Hierarchical Representations?

Page 34: The Analytics Frontier of the Hadoop Eco-System

Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013

Complexity | Object Recognition on Caltech 256 Benchmark

#Training Images | 15 | 30 | 45 | 60
Local NBB [1] | 33.5 | 40.1 | - | -
LLC [2] | 34.4 | 41.2 | 45.3 | 47.7
CRBM [3] | 35.1 | 42.1 | 45.7 | 47.9
LASERC [4] | 35.2 | 43.6 | - | -
LP-beta [5] | - | 45.8 | - | -
Our Work | 41.1 | 48.7 | 52.8 | 56.2

[1] S. McCann and D. Lowe, CVPR 12

[2] J. Wang et al, CVPR 10

[3] K. Sohn et al, ICCV 11

[4] K. Nguyen et al, ECCV 12

[5] P. Gehler and S. Nowozin, ICCV 09

Much better than the state of the art (especially when given more data)

Page 35: The Analytics Frontier of the Hadoop Eco-System

Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)

Distributed Deep Learning Library

Spark

Hadoop

IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC

Complexity | Deep Learning Library for Spark

• Open source

• Spark MLlib contribution

• Optimized for IA

Complete POC in 2015

Page 36: The Analytics Frontier of the Hadoop Eco-System

Intelligence

Page 37: The Analytics Frontier of the Hadoop Eco-System

• Selecting the right data to process

• Selecting the right features to engineer

• Selecting the right algorithm to run

Intelligence | Challenges

Page 38: The Analytics Frontier of the Hadoop Eco-System

Image Source: University of Nebraska-Lincoln

Intelligence | Ensemble Learning (Wisdom of the Crowd)

Trade computational power for automated experimentation

• Tackles the data and algorithm selection problem

• Diversification methods vary

– Bagging

– Boosting

• Combining techniques vary

– Majority vote on label

– Bucket of models

• N bagged predictors: N times the computation
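
Bagging with a majority vote can be sketched with very simple base learners. The example below trains threshold rules ("decision stumps") on bootstrap samples of a made-up one-dimensional dataset; the stump learner, the dataset, and the choice of 25 models are assumptions for illustration, and the cost is visibly N training runs for N predictors.

```python
import random
from collections import Counter


def train_stump(sample):
    """Pick the threshold t with the fewest errors for: predict 1 if x >= t."""
    # float("inf") is the "always predict 0" rule.
    candidates = sorted({x for x, _ in sample}) + [float("inf")]

    def errors(t):
        return sum((x >= t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def bagged_stumps(data, n_models, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote on the label across all stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical dataset: label is 1 exactly when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
ensemble = bagged_stumps(data, n_models=25)
```

Individual stumps disagree near the class boundary because each sees a different resample; the vote smooths that out, at 25 times the training cost of a single stump.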

Page 39: The Analytics Frontier of the Hadoop Eco-System

Intelligence | Beyond Ensemble Learning

• Downsides of ensemble learning include the number of:

– Tunable parameters

– Selection criteria

• Companies claim that non-parametric methods that require no selection of criteria are in development

For now, it’s the Wisdom of the Crowd. Stay tuned!

Page 40: The Analytics Frontier of the Hadoop Eco-System

Ease of Use

Page 41: The Analytics Frontier of the Hadoop Eco-System

Stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize

• Skills shortage at intersection of systems engineering and data analysis

• Painful data ingestion and preparation

• Tools that are not designed with loopbacks in mind

• Pipeline state not easy to manage, especially for collaboration

• Composing pipeline is DIY

Ease of Use | Data Science Workflow

Page 42: The Analytics Frontier of the Hadoop Eco-System

Congratulations! You are a

data scientist!

Page 43: The Analytics Frontier of the Hadoop Eco-System


Decomposing the “data scientist”

Source: 2013 Report from Accenture Institute for High Performance

Page 44: The Analytics Frontier of the Hadoop Eco-System

Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html, accessed on 9/30/2014

Ease of Use | Programming Languages

WordCount: The “Hello World” for Big Data In Java MapReduce
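
For contrast with the Java MapReduce listing this slide refers to, the same computation can be sketched in a few lines of Python. This is a single-process illustration of the map and reduce phases, not Hadoop code; the function names and sample lines are assumptions.

```python
from collections import Counter
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def reduce_phase(pairs):
    """Sum the counts emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Hypothetical input lines.
lines = ["Hello Hadoop", "Hello Spark", "Goodbye Hadoop"]
word_counts = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))
```

The logic is the same as in the Java version; what differs is how much framework code surrounds it, which is the ease-of-use gap the slide points at.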

Page 45: The Analytics Frontier of the Hadoop Eco-System

Python

R

Dataflow

GUI

...

Datacenter / Cloud Network Client

“Data Science”

API

Connect

Manage

Secure

Analyze distributed and parallel

Manage Secure

Connect

Analyze local

Query

Big Data Java/Scala/C++ Computational Frameworks

Big Data Algorithms

Cluster Workload Mgmt

Cluster Storage

Machine Learning & Statistics

Data Wrangling Analyst Skills

The Other

Skills

Ease of Use | Making Big Data Familiar

Page 46: The Analytics Frontier of the Hadoop Eco-System

• One consistent API for:

– ETL & feature engineering

– Including Spark and whatever comes next

– Graph construction, databases, analytics, query

– Same API for Titan, Neo4j, etc.

– Same API for Giraph, GraphX, GraphLab, etc.

– Machine learning & statistical analytics

• Programming language integration

• Extensibility at the core

Ease of Use | API Functionality

Page 47: The Analytics Frontier of the Hadoop Eco-System

POST https://site.com/joe/graphs/29/transforms

{
  "operation": "ml.cgd",
  "arguments": [{
    "edge_properties": ["rating"],
    "output_property_prefix": "cgd_",
    "vertex_type": "vertex_type",
    "edge_type": "splits",
    "max_supersteps": 20,
    "feature_dimension": 3,
    "convergence_threshold": 0,
    "cgd_lambda": 0.65,
    "learning_output_interval": 1,
    "bias_on": true,
    "num_iters": 3
  }]
}

Ease of Use | Run CGD (REST Call)
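
A client could assemble this call with the Python standard library. The sketch below only builds the request object and does not send it; `build_cgd_request` is a hypothetical helper, the JSON content type is an assumption, and the URL and argument values are the placeholders shown on the slide.

```python
import json
import urllib.request


def build_cgd_request(base_url, graph_id, **overrides):
    """Build (but do not send) the POST request that starts a CGD transform."""
    arguments = {
        "edge_properties": ["rating"],
        "output_property_prefix": "cgd_",
        "vertex_type": "vertex_type",
        "edge_type": "splits",
        "max_supersteps": 20,
        "feature_dimension": 3,
        "convergence_threshold": 0,
        "cgd_lambda": 0.65,
        "learning_output_interval": 1,
        "bias_on": True,
        "num_iters": 3,
    }
    arguments.update(overrides)  # let the caller tune individual parameters
    body = json.dumps({"operation": "ml.cgd", "arguments": [arguments]})
    return urllib.request.Request(
        url=f"{base_url}/graphs/{graph_id}/transforms",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


request = build_cgd_request("https://site.com/joe", 29, max_supersteps=10)
```

Passing `request` to `urllib.request.urlopen` would issue the call against a real service; it is left out here because the endpoint is only a placeholder.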

Page 48: The Analytics Frontier of the Hadoop Eco-System

201 Created
{
  "operation": "ml.cgd",
  "arguments": (same as request),
  "id": 2,
  "created": "2014-01-31 10:51:05.1234",
  "depends_on": [{
    "link": {"method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1"},
    "type": "graphbuilder",
    "started": "2014-01-31 10:51:02.8899",
    "eta": null,
    "status": "pending"
  }],
  "links": [
    {"rel": "self", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2"},
    {"rel": "intel:idpat-progress", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    {"rel": "intel:idpat-cancel", "method": "DELETE", "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
  ]
}

Ease of Use | Run CGD (REST Response)
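
The response carries hypermedia links, so a client can look up what to call next by relation name instead of building URLs itself. The sketch below parses a trimmed copy of the response body; the helper names are hypothetical, the `arguments` field is left out, and the sample text is modelled on the slide rather than captured from a live service.

```python
import json

# Trimmed, illustrative response body modelled on the slide.
response_text = """
{
  "operation": "ml.cgd",
  "id": 2,
  "depends_on": [{"type": "graphbuilder", "status": "pending"}],
  "links": [
    {"rel": "self", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2"},
    {"rel": "intel:idpat-progress", "method": "GET",
     "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    {"rel": "intel:idpat-cancel", "method": "DELETE",
     "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
  ]
}
"""


def find_link(response, rel):
    """Return the first link with the given relation name, or None."""
    for link in response.get("links", []):
        if link.get("rel") == rel:
            return link
    return None


def pending_dependencies(response):
    """List the types of upstream transforms still marked as pending."""
    return [
        dep["type"]
        for dep in response.get("depends_on", [])
        if dep.get("status") == "pending"
    ]


response = json.loads(response_text)
progress = find_link(response, "intel:idpat-progress")
waiting_on = pending_dependencies(response)
```

A client would typically poll the progress link until the dependency list is empty, or issue the DELETE from the cancel link to stop the transform.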

Page 49: The Analytics Frontier of the Hadoop Eco-System

FILESYSTEMS AND NOSQL STORAGE

HW PLATFORM

APACHE HADOOP APACHE SPARK

DATA WRANGLING

MACHINE LEARNING AND STATISTICS

Graphical Algorithms

Classical Algorithms

Graph Construction Tools

Useful String Manipulation

Useful Math Operators

“DATA SCIENCE” API

Intel Analytics Toolkit

Ease of Use | Delivering It

Unified UI’s

across the workflow

Easier feature & model creation

End-to-end graph

pipeline

Fully scalable throughout

Multiple data

primitives

Optimized for IA

Cloud & On-Prem

Python

Libraries

3rd Party

GUIs/SDKs

Viz

Tools

Future

Libraries BI

Connectors

Query Interfaces

...

Page 50: The Analytics Frontier of the Hadoop Eco-System

Algorithm | Category | Applications/Use Cases
Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
Label Propagation | Structured Prediction | Personalized recommendations
Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
Connected Components | Graph Analytics | Network manipulation, image analysis
Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
Structure Attribute | Clustering | Network analysis, consumer seg
K-Truss | Clustering | Social network analysis
KNN* | Clustering | Recommenders
Logistic Regression* | Classification | Fraud detection
Random Forest* | Classification | Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
Association Rule Mining | Data Mining | Market basket analysis, recommenders
Frequent Pattern Mining* | Data Mining | Pattern Recognition

Approach label: Graph

Ease of Use | A Full Spectrum of Analytics

Page 51: The Analytics Frontier of the Hadoop Eco-System

Real Time Database

BQL – BigDAWG Query Language &

Compiler

Analytics Libraries

Hardware Platforms

Applications, Visualization, Languages

“Narrow waist”

provides portability

Historical / Analytics Databases

Spill Stream

Ease of Use | Future Vision – BigDAWG

Page 52: The Analytics Frontier of the Hadoop Eco-System

Ease of Use | Future Vision – BigDAWG

Real Time DBMSs

BQL – BigDAWG Query Language &

Compiler

Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching

Languages, e.g, Julia, R, MLbase, GraphLab

SciDB

Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages

TupleWare

Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon

Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT

TileDB S-Store

“Narrow waist”

provides portability

MyriaX

Historical / Analytics DBMSs Spill

Stream

Page 53: The Analytics Frontier of the Hadoop Eco-System

Ease of Use | BigDAWG Deliverables ‘15-’16

• Complete prototype “big data” stack and reference implementation

• Battle-tested on multiple use cases

• Standard federation language (BQL)

• Next-generation interface for analytics

• Next-generation stream processing system

Stay Tuned!

http://istc-bigdata.org/

Page 54: The Analytics Frontier of the Hadoop Eco-System

1. Big Data Visualization, especially graph*

2. Big Data DB that supports relational and graph equally*

3. A better workflow manager (Like Oozie for Hadoop, Spark, etc.)*

4. UI partners (R, Julia, etc.)

5. Better portable machine learning models (like PMML) that also capture feature engineering (not just algos)

6. Cluster monitoring (GUI, etc.) that works across many big data tools

7. Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting

8. Cluster auto-configuration tools

• * - Open source STRONGLY preferred

Technology Wish List

Page 55: The Analytics Frontier of the Hadoop Eco-System

• Intel Analytics Toolkit Beta program (now-January ’15) Have a POC, particularly graph? [email protected]

• GRADES 2015, Melbourne Australia, May 31, 2015 Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/

Call to Action

Page 56: The Analytics Frontier of the Hadoop Eco-System