The Analytics Frontier of the Hadoop Eco-system

Ted Willke, Senior Principal Engineer and GM

• Scalable commodity processing established with Hadoop MapReduce, with good libraries for machine learning and data mining

• Twitter libraries like Scalding improve upon MapReduce, providing a more generalized dataflow model

• YARN opened the door for in-memory iterative processing with Apache Spark, with its own libraries and others being ported

Today | Hadoop Analytics

• Variety – Expansion of data primitives in commercial use

• Speed – Data processing models evolving (batch to streaming)

• Complexity – Monolithic analytics to analytics pipelines

• Intelligence – Prescriptive ML: applying ML to ML itself

• Ease of Use – Gap between skills and needs growing

Trends | Hadoop Eco-System Analytics

Life Sciences: Personalized medicine, drug repurposing predictions, integration of heterogeneous data

Education: Personalized instruction, outcomes measurement and intervention

Network Security: Data fusion, threat assessment and identification

Retail: Inventory management, product display management, demand forecasting

Trends | Areas of Application (to name a few)

Variety

Variety | Primitives Usage Patterns

Model: Key-Value, Document, Graph, SQL, Column

Access: Off-line (Queue), Async (Bus), Sync (I/O)

Implementation: API (Remote), LIB (Local)

• When the problem is an information network

• When a graph is a natural way of expressing the algorithm

• When you want to study specific relationships

• When you want faster machine learning or solvers on sparse data

[Figure labels: shortest path, central, influence, sub-networks, triangle count]

Variety | Graphical Model
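The graph measures named on this slide can be illustrated on a toy graph. The following is a minimal sketch using only the Python standard library; the function names and the four-vertex example are assumptions made for illustration, not part of any toolkit mentioned in the talk.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_length(adj, src, dst):
    """Breadth-first search; returns hop count, or None if dst is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def triangle_count(adj):
    """Count each triangle once by requiring u < v < w."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


def degree_centrality(adj):
    """Fraction of the other vertices each vertex is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


# Hypothetical toy graph: one triangle (a, b, c) plus a pendant vertex d.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(shortest_path_length(adj, "a", "d"))
print(triangle_count(adj))
print(degree_centrality(adj))
```

Degree centrality is only one of several centrality measures; it is used here because it is the simplest to state.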

Program Importance (Centrality): High to Low

Graph of channel viewing behavior

Current popular surfing patterns

SH002463130000 EP005544723744

Changes in surfing behavior may predict customer churn.

Variety | Graph Statistics

Preference and Similarity Recommendations

User

Movie

1.7MM Nodes 23.9MM Edges

similar cast

prefers

similar topic

userId: A0A22A5

title: The Godfather genre: Crime drama cast: [M. Brando, Al Pacino]

title: Scarface genre: Crime drama cast: [Al Pacino, M. Pfeiffer]

title: The Departed genre: Crime drama cast: [L. DiCaprio, M. Damon]

weight=11.8

weight=0.67

weight=0.03

weight=14.98

Variety | Graph Search

URL Ground-Truth Data

IP/Domain Reputations

420MM Records

74.5MM Nodes 185MM Edges

URL

Domain

IP Address

Calculation of priors

LBP Messaging 84.231.82.93

86.39.155.137

forum.vsichko.com

hermansonskok.se

euskzzbz.nonetheups.com

keesenbep.spaces.live.com

Variety | Graphical Machine Learning

Variety | Loopy Belief Propagation on the (semantic) web

Reputations

Neutral

Good

Bad

Suspect
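To make the idea of propagating reputations from priors concrete, here is a minimal sketch of sum-product loopy belief propagation on a pairwise model with two states (good, bad). It is an illustration only: the `coupling` parameter, the synchronous update schedule, and the three-vertex example are assumptions, not the configuration used for the results on this slide.

```python
def loopy_bp(priors, edges, coupling=0.8, iters=50):
    """Sum-product belief propagation with synchronous message updates.

    priors: {vertex: [p_good, p_bad]} (the "calculation of priors" step)
    edges: undirected (u, v) pairs
    coupling: assumed probability that two linked vertices share a label
    """
    psi = [[coupling, 1 - coupling], [1 - coupling, coupling]]
    nbrs = {v: set() for v in priors}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # One message per directed edge, initialised to uniform.
    msgs = {(u, v): [1.0, 1.0] for u in nbrs for v in nbrs[u]}
    for _ in range(iters):
        new_msgs = {}
        for (u, v) in msgs:
            out = []
            for xv in (0, 1):
                total = 0.0
                for xu in (0, 1):
                    incoming = 1.0
                    for w in nbrs[u]:
                        if w != v:
                            incoming *= msgs[(w, u)][xu]
                    total += priors[u][xu] * psi[xu][xv] * incoming
                out.append(total)
            z = sum(out)
            new_msgs[(u, v)] = [o / z for o in out]
        msgs = new_msgs
    beliefs = {}
    for v in nbrs:
        b = list(priors[v])
        for w in nbrs[v]:
            b = [b[x] * msgs[(w, v)][x] for x in (0, 1)]
        z = sum(b)
        beliefs[v] = [x / z for x in b]
    return beliefs


# Hypothetical example: a URL, its domain, and its IP address form a loop.
priors = {"url": [0.5, 0.5], "domain": [0.9, 0.1], "ip": [0.3, 0.7]}
edges = [("url", "domain"), ("domain", "ip"), ("ip", "url")]
print(loopy_bp(priors, edges))
```

On graphs without cycles this procedure yields exact marginals; on graphs with cycles, such as the triangle above, the beliefs are approximations.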

Variety | Unification with Apache Spark

Image Source: Databricks

• In-memory structures (RDDs) support both table and graph abstractions

• Batch processing and Spark streaming

Spark: RDDs, Transformations, and Actions

Spark Streaming (real-time): DStreams, streams of RDDs

Spark SQL: SchemaRDDs

MLlib (machine learning): RDD-based matrices

GraphX (graph processing/machine learning): RDD-based graphs

Source: Marcus Paradies, GRAph Data-management & ExperienceS Workshop (GRADES 2014)

Variety | Unification within the In-Memory Database (IMDB)

• Index data structure for graph traversal

• Prototyped in SAP HANA distributed columnar IMDB

• Lays foundation for complex graph query and algorithms

Variety | Graph Traversal


Variety | Graph Indexing


Variety | Graph Traversal Results


Cloud Infrastructure

UI

Data Platform

Analytics Platform

Datacenter Network Gateway Thing

Services

Speed | Hadoop Meets The Internet of Things

Source: http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html, accessed on 9/26/2014

Data Stream

Feature Processing

Model Updates

Learning

Distributed Messaging System

(e.g., Kafka)

Speed | Stream Processing Pipeline

• Data replay (e.g., a bug is found or application improved)

• Getting faster and more efficient than “fast batch”

• Time-evolving models and computation

Speed | Challenges

Source: Jay Kreps, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, accessed on 9/26/2014

Lambda

• Implement transform logic twice

• Federate information at query time

• Retains input data unchanged

Kappa

• Retain full replay window

• 2nd instance can re-process

• Query against latest table

The thinking continues to evolve....

Speed | Cluster-Scale Stream Processing


Apply DStreams to built-in:

• Machine Learning

• Graph Processing

Speed | Spark Streaming


• Mini-batch +/- windowing

• Analytics can be run on any of the resultant RDDs

• No provisions for merging RDDs

Speed | Spark (Discretized) Streaming
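Mini-batching with windowing can be sketched without Spark. The example below is a plain-Python illustration of the idea, not the Spark Streaming API: `discretize` and `windowed` are hypothetical helper names, and the one-second batch interval is an assumption.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive mini-batches."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_seconds)].append(value)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Empty intervals still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(first, last + 1)]


def windowed(batches, window_len, slide):
    """Concatenate the last window_len batches, advancing by slide batches."""
    windows = []
    for end in range(window_len, len(batches) + 1, slide):
        windows.append([v for b in batches[end - window_len:end] for v in b])
    return windows


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (3.0, "e")]
batches = discretize(events, batch_seconds=1)
print(batches)
print(windowed(batches, window_len=2, slide=1))
```

Each window is an independent collection, which mirrors the point above: analytics can run on any resulting batch or window, but nothing merges results across them.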

(Mini-) batch Streaming (Mini-) batch Analytics

Image Source: GraphX project

• Graph processing engine on Spark

• Supports Pregel-style vertex programming

• View same data as either graphs or collections

Speed | GraphX API for Spark

• Current Spark streaming provides mini-batch streaming

• No concept of data (model) merging

• GraphX is currently designed for static graphs:

1. Merge table data prior to graph pipeline

2. Re-generate entire (accumulated) graph

3. Re-run machine learning at each window

Speed | Spark Streaming for GraphX?

Straightforward, but wastes computation and time. Can we do better?

• Merge information directly into data model used by algorithms

• Static algorithms -> Online algorithms

• Incremental re-computation triggered by changes in data values or data structure

• Possible with many machine learning algos (PageRank as example)

• Evolve IM data stores to maximize performance and freshness

• Better partitioning algorithms -> reduced data replication

• Dynamic indexing -> fast retrieval

Speed | Online Version of GraphX
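One simple way to picture incremental re-computation is to warm-start PageRank from the previous window's ranks instead of restarting from scratch. The sketch below is a stand-in for illustration, not the graph-streaming implementation described in the talk. The update rule (reset probability plus damped incoming rank), the initial rank of 1.0, and the toy edge lists are assumptions; the reset probability 0.15 and threshold 0.001 match the values listed in the early-results setup.

```python
def pagerank(edges, init=None, reset=0.15, tol=0.001, max_iter=1000):
    """Iterate PageRank until the largest per-vertex change drops below tol.

    edges: directed (src, dst) pairs
    init: optional {vertex: rank} to warm-start from (e.g. the previous window)
    Returns (ranks, iterations_used).
    """
    vertices = {v for edge in edges for v in edge}
    if not vertices:
        return {}, 0
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    # Vertices unseen in the previous window start at 1.0 (an assumption).
    ranks = {v: (init or {}).get(v, 1.0) for v in vertices}
    for iteration in range(1, max_iter + 1):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_deg[src]
        new_ranks = {v: reset + (1 - reset) * incoming[v] for v in vertices}
        delta = max(abs(new_ranks[v] - ranks[v]) for v in vertices)
        ranks = new_ranks
        if delta < tol:
            return ranks, iteration
    return ranks, max_iter


window1 = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
window2 = window1 + [("c", "d"), ("d", "a")]  # edges arriving in the next window

ranks1, _ = pagerank(window1)
naive_ranks, naive_iters = pagerank(window2)                    # re-run from scratch
incr_ranks, incr_iters = pagerank(window2, init=ranks1)         # warm start
print(naive_iters, incr_iters)
```

Both runs converge to the same fixed point; the warm start only changes where the iteration begins, which is why it can save work when a new window changes a small part of the graph.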

Static PageRank (delta method)

Online PageRank

Speed | Online PageRank

Good for algos with abelian accumulators (commutative, associative, with inverse)

[Figure: Convergence Rate. y-axis: Convergence Rate (Normalized Execution Time), 0.0 to 1.2; x-axis: Throughput (Edges/Second), 50K to 1M; series: naive, incremental]

• Algorithm: PageRank

• Reset probability: 0.15

• Convergence Threshold: 0.001

• 1 Master + 3 workers

• Distributed Messaging System: Kafka

• Spark 1.1.0 + our graph streaming

[Figure: Communication Overhead. y-axis: Normalized Messages Sent, 0% to 120%; x-axis: Throughput (Edges/Second), 50K to 1M; series: naive, incremental]

Speed | (Really) Early Results for Online PageRank

Complexity

Complexity | Challenges

• Feature Engineering for Data Science

• Monolithic Analytics -> Complex Pipeline Analytics

Complexity | Directed Acyclic Graphs of Actions

• Common in Data Science “feature engineering”

• Developed iteratively

• Becomes a new tool in the toolbox

[Figure: DAG of actions with nodes labeled A, A, B, and C]
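A pipeline expressed as a directed acyclic graph of actions can be run by visiting the actions in topological order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the three actions A, B, and C and their dependencies are hypothetical stand-ins for feature-engineering steps.

```python
from graphlib import TopologicalSorter


def run_pipeline(actions, depends_on, data):
    """Run every action after the actions it depends on.

    actions: {name: callable(data, *upstream_results)}
    depends_on: {name: [names of upstream actions]}
    Returns {name: result}.
    """
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = [results[dep] for dep in depends_on.get(name, ())]
        results[name] = actions[name](data, *upstream)
    return results


# Hypothetical actions: A tokenises, B builds a vocabulary, C combines both.
actions = {
    "A": lambda text: text.lower().split(),
    "B": lambda text, tokens: len(set(tokens)),
    "C": lambda text, tokens, vocab_size: len(tokens) / vocab_size,
}
depends_on = {"A": [], "B": ["A"], "C": ["A", "B"]}

print(run_pipeline(actions, depends_on, "to be or not to be"))
```

Once such a graph works, it can be wrapped as a single callable and reused as a node in a larger graph, which is one reading of "a new tool in the toolbox".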

Source: ISTC-Pervasive Computing

Discriminative structures come at multiple scales and varying deformations

Complexity | Hierarchical Matching Pursuit for Image Classification

• Feature learning

• Multiple layers to learn

• Multipath sparse coding

Source: ISTC-Pervasive Computing

• Robustness

– Local deformations such as translation, rotation and scaling

– Lighting condition changes

– Viewpoint and pose changes

– Large intra-class variations

• Hierarchy

– Sparse data: The total number of possible image patches grows exponentially with their sizes

– Shared structure: Large patches could share similar or even the same small patches

Complexity | Robust Hierarchical Representations?

Source: Bo, Ren, & Fox, “Multipath Sparse Coding Using Hierarchical Pursuit,” IEEE CVPR 2013

Complexity | Object Recognition on Caltech 256 Benchmark

#Training Images    15      30      45      60
Local NBB [1]       33.5    40.1    -       -
LLC [2]             34.4    41.2    45.3    47.7
CRBM [3]            35.1    42.1    45.7    47.9
LASERC [4]          35.2    43.6    -       -
LP-beta [5]         -       45.8    -       -
Our Work            41.1    48.7    52.8    56.2

[1] S. McCann and D. Lowe, CVPR 12

[2] J. Wang et al., CVPR 10

[3] K. Sohn et al., ICCV 11

[4] K. Nguyen et al., ECCV 12

[5] P. Gehler and S. Nowozin, ICCV 09

Much better than the state of the art (especially when given more data)

Source: Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)

Distributed Deep Learning Library

Spark

Hadoop

IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC IA/MIC

Complexity | Deep Learning Library for Spark

• Open source

• Spark MLlib contribution

• Optimized for IA

Complete POC in 2015

Intelligence

• Selecting the right data to process

• Selecting the right features to engineer

• Selecting the right algorithm to run

Intelligence | Challenges

Image Source: University of Nebraska-Lincoln

Intelligence | Ensemble Learning (Wisdom of the Crowd)

Trade computational power for automated experimentation

• Tackles the data and algorithm selection problem

• Diversification methods vary

– Bagging

– Boosting

• Combining techniques vary

– Majority vote on label

– Bucket of models

• N bagged predictors -> N times the computation
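Bagging with a majority vote can be sketched in a few lines. The example below is an illustration under stated assumptions: the base learner is a one-feature decision stump, the data set is synthetic, and all function names are hypothetical. It also makes the cost visible, since training N predictors repeats the training work N times.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a threshold classifier on (x, label) pairs with labels 0 or 1."""
    best = None
    for threshold, _ in sample:
        for above in (0, 1):
            # Predict `above` when x > threshold, the other label otherwise.
            errors = sum(
                (above if x > threshold else 1 - above) != y for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x > threshold else 1 - above


def bagged_predictors(data, n, seed=0):
    """Train n stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n)]


def majority_vote(models, x):
    """Combine the predictors by taking the most common predicted label."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Synthetic data: label is 1 when x >= 5.
data = [(x, 1 if x >= 5 else 0) for x in range(10)]
models = bagged_predictors(data, n=25)
print(majority_vote(models, 0), majority_vote(models, 9))
```

Boosting and bucket-of-models selection differ in how the individual predictors are trained and combined; only bagging with a label vote is shown here.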

Intelligence | Beyond Ensemble Learning

• Downsides of ensemble learning include the number of:

– Tunable parameters

– Selection criteria

• Companies claim that non-parametric methods that require no selection of criteria are in development

For now, it’s the Wisdom of the Crowd. Stay tuned!

Ease of Use

[Figure: data science workflow. Stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Visualize. Additional label: Learn]

Skills shortage at intersection of systems engineering and data analysis

Painful data ingestion and preparation

Tools that are not designed with loopbacks in mind

Pipeline state not easy to manage, especially for collaboration

Composing pipeline is DIY

Ease of Use | Data Science Workflow

Congratulations! You are a data scientist!

Decomposing the “data scientist”

Source: 2013 Report from Accenture Institute for High Performance

Source: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1_source.html, accessed on 9/30/2014

Ease of Use | Programming Languages

WordCount: The “Hello World” for Big Data In Java MapReduce
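For contrast with the Java MapReduce version referenced on this slide, the same map and reduce steps fit in a few lines of a scripting language. This is a single-process Python sketch of the logic only, with hypothetical function names; it does not run on a cluster.

```python
from collections import Counter
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def reduce_phase(pairs):
    """Sum the emitted counts per word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    return reduce_phase(chain.from_iterable(map_phase(line) for line in lines))


print(word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]))
```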

Python

R

Dataflow

GUI

...

Datacenter / Cloud Network Client

“Data Science”

API

Connect

Manage

Secure

Analyze distributed and parallel

Manage Secure

Connect

Analyze local

Query

Big Data Java/Scala/C++ Computational Frameworks

Big Data Algorithms

Cluster Workload Mgmt

Cluster Storage

Machine Learning & Statistics

Data Wrangling Analyst Skills

The Other

Skills

Ease of Use | Making Big Data Familiar

• One consistent API for:

– ETL & feature engineering

– Including Spark and whatever comes next

– Graph construction, databases, analytics, query

– Same API for Titan, Neo4j, etc.

– Same API for Giraph, GraphX, GraphLab, etc.

– Machine learning & statistical analytics

• Programming language integration

• Extensibility at the core

Ease of Use | API Functionality

POST https://site.com/joe/graphs/29/transforms
{
  "operation": "ml.cgd",
  "arguments": [{
    "edge_properties": ["rating"],
    "output_property_prefix": "cgd_",
    "vertex_type": "vertex_type",
    "edge_type": "splits",
    "max_supersteps": 20,
    "feature_dimension": 3,
    "convergence_threshold": 0,
    "cgd_lambda": 0.65,
    "learning_output_interval": 1,
    "bias_on": true,
    "num_iters": 3
  }]
}

Ease of Use | Run CGD (REST Call)

201 Created
{
  "operation": "ml.cgd",
  "arguments": (same as request),
  "id": 2,
  "created": "2014-01-31 10:51:05.1234",
  "depends_on": [{
    "link": {"method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/1"},
    "type": "graphbuilder",
    "started": "2014-01-31 10:51:02.8899",
    "eta": null,
    "status": "pending"
  }],
  "links": [
    {"rel": "self", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2"},
    {"rel": "intel:idpat-progress", "method": "GET", "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    {"rel": "intel:idpat-cancel", "method": "DELETE", "uri": "https://site.com/user/joe/graphs/29/transforms/2"}
  ]
}

Ease of Use | Run CGD (REST Response)
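A client for this style of API builds a JSON request and then follows the links returned in the response. The sketch below uses only the Python standard library and sends nothing over the network. The helper names are hypothetical, the base URL is the placeholder from the slide, and the response dictionary is a trimmed copy of the example response, not output from a live service.

```python
import json
from urllib.request import Request


def build_transform_request(base_url, graph_id, operation, arguments):
    """Prepare (but do not send) a POST that asks for a graph transform."""
    body = json.dumps({"operation": operation, "arguments": [arguments]})
    return Request(
        f"{base_url}/graphs/{graph_id}/transforms",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_link(response, rel):
    """Return (method, uri) for the link with the given rel, or None."""
    for link in response.get("links", []):
        if link.get("rel") == rel:
            return link["method"], link["uri"]
    return None


request = build_transform_request(
    "https://site.com/joe", 29, "ml.cgd",
    {"edge_properties": ["rating"], "max_supersteps": 20, "cgd_lambda": 0.65},
)
print(request.get_method(), request.full_url)

# Trimmed copy of the example response, used to show link following.
response = {
    "id": 2,
    "links": [
        {"rel": "self", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2"},
        {"rel": "intel:idpat-progress", "method": "GET",
         "uri": "https://site.com/v1/graphs/29/transforms/2/progress"},
    ],
}
print(find_link(response, "intel:idpat-progress"))
```

Following the `rel` values rather than hard-coding URLs lets the client poll progress or cancel a job without knowing the server's URL layout in advance.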

FILESYSTEMS AND NOSQL STORAGE

HW PLATFORM

APACHE HADOOP APACHE SPARK

DATA WRANGLING

MACHINE LEARNING AND STATISTICS

Graphical Algorithms

Classical Algorithms

Graph Construction Tools

Useful String Manipulation

Useful Math Operators

“DATA SCIENCE” API

Intel Analytics Toolkit

Ease of Use | Delivering It

Unified UIs across the workflow

Easier feature & model creation

End-to-end graph pipeline

Fully scalable throughout

Multiple data primitives

Optimized for IA

Cloud & On-Prem

Python Libraries

3rd Party GUIs/SDKs

Viz Tools

Future Libraries

BI Connectors

Query Interfaces

...

[Approach column label: Graph]

Algorithm                                     Category                  Applications/Use Cases
Loopy Belief Propagation (LBP)                Structured Prediction     Personalized recs, image de-noising
Label Propagation                             Structured Prediction     Personalized recommendations
Alternating Least Squares (ALS)               Collaborative Filtering   Recommenders
Conjugate Gradient Descent (CGD)              Collaborative Filtering   Recommenders
Connected Components                          Graph Analytics           Network manipulation, image analysis
Latent Dirichlet Allocation (LDA)             Topic Modeling            Document Clustering
Structure Attribute                           Clustering                Network analysis, consumer seg
K-Truss                                       Clustering                Social network analysis
KNN*                                          Clustering                Recommenders
Logistic Regression*                          Classification            Fraud detection
Random Forest*                                Classification            Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson)  Non-linear Curve Fitting  Forecasting, pricing, market mix models
Association Rule Mining                       Data Mining               Market basket analysis, recommenders
Frequent Pattern Mining*                      Data Mining               Pattern Recognition

Ease of Use | A Full Spectrum of Analytics

Real Time Database

BQL – BigDAWG Query Language &

Compiler

Analytics Libraries

Hardware Platforms

Applications, Visualization, Languages

“Narrow waist”

provides portability

Historical / Analytics Databases

Spill Stream

Ease of Use | Future Vision – BigDAWG

Ease of Use | Future Vision – BigDAWG

Real Time DBMSs

BQL – BigDAWG Query Language &

Compiler

Visualization & Presentation, e.g., ScalaR, imMens, TweetMap, Prefetching

Languages, e.g., Julia, R, MLbase, GraphLab

SciDB

Analytics, e.g., ScaLAPACK, ML algos, plsh, other analytics packages

TupleWare

Hardware Platforms, e.g., NVM simulator, Xeon Phi, Xeon

Applications, e.g., medical data, astronomy, Twitter, urban sensing, IoT

TileDB S-Store

“Narrow waist”

provides portability

MyriaX

Historical / Analytics DBMSs Spill

Stream

Ease of Use | BigDAWG Deliverables ‘15-’16

• Complete prototype “big data” stack and reference implementation

• Battle-tested on multiple use cases

• Standard federation language (BQL)

• Next-generation interface for analytics

• Next-generation stream processing system

Stay Tuned!

http://istc-bigdata.org/

1. Big Data Visualization, especially graph*

2. Big Data DB that supports relational and graph equally*

3. A better workflow manager (Like Oozie for Hadoop, Spark, etc.)*

4. UI partners (R, Julia, etc.)

5. Better portable machine learning models (like PMML) that also capture feature engineering (not just algos)

6. Cluster monitoring (GUI, etc.) that works across many big data tools

7. Distributed debugger (for Spark clusters, etc.) for profiling and troubleshooting

8. Cluster auto-configuration tools

* Open source STRONGLY preferred

Technology Wish List

• Intel Analytics Toolkit Beta program (now through January ’15). Have a POC, particularly graph? [email protected]

• GRADES 2015, Melbourne, Australia, May 31, 2015. Papers due March 15. HTTP://EVENT.CWI.NL/GRADES2015/

Call to Action