ibm analytics the evolution of big data platforms and data

20
©2016 IBM Corporation IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 June 13, 2016 Brandon MacKenzie

Upload: others

Post on 13-Jan-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

©2016 IBM Corporation

IBM Analytics

The Evolution of Big Data Platforms and Data Science

ECC Conference 2016

June 13, 2016

Brandon MacKenzie

© 2016 IBM Corporation2

Hello, I’m Brandon MacKenzie. I work at IBM.

Data Science - Offering Management & Worldwide Technical Sales

Advisory Member - Spark Technology Center Advisory Council

Co-author - IBM Apache Spark Professional Certification Program

MSc - University of Edinburgh

© 2016 IBM Corporation3

Understand

problem and

domain

Ingest

data

Explore

and

understand

data

Transform:

clean

Transform:

shape

Create and

build model

Evaluate

Communicate

results

Deliver and

deploy model

Data volume, variety, and velocity change – but the goal of the Data Science

remains the same

© 2016 IBM Corporation4

1

2

SQL: IBM SystemR 1970s

Execution Plan

SQL Query:

SELECT…

FROM…

WHERE…

Optimizer

Employee Skills

Hardware

Data

Execution Plan

ML Call:

SVM.train()

Optimizer

Declarative Machine Learning: IBM SystemML 2000s

Employee Skills

Hardware

Data

And then there was a movement in open source…

Apache Spark, finally, a chance to push declarative data science mainstream!

Apache Spark…“One of the most important open source projects in a decade”

World's largest investment in Apache Spark

Opened Spark Technology Center

3500 IBM Researchers and Developers to Spark

• Significantly faster than MapReduce• APIs for Python, R, Java, Scala• Combine SQL, streaming, complex analytics• Runs everywhere

© 2016 IBM Corporation9

BigInsights

(HDFS)

Cloudant

(DBaaS)

dashDB

(Analytics)

Swift

(Object

Storage)

SQDB

(Manage

d DB2)

Data Sources

IBM Cloud Public Cloud Cloud Apps On-Premises

Execute SQL

Statements

Streaming

Analytics via

Micro-batch

M.L. and

Statistical

Algorithms

Distributed

Graph Processing

Framework

General compute engine

Basic I/O functions

Task dispatching

Scheduling

Spark Core

Spark SQLSpark

StreamingMLlib

Machine LearningGraph

Apache Spark – Blends Multiple Data Types, Sources, & Workloads

© 2016 IBM Corporation10

Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing

involves lots of (slow) disk I/O

Motivation for Apache Spark

HDFS

Read

HDFS

Write

HDFS

Read

HDFS

WriteInput ResultCPU

Iteration 1

Memory CPU

Iteration 2

Memory

© 2016 IBM Corporation11

Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing

involves lots of (slow) disk I/O

Solution: Keep more data in-memory with a new distributed execution engine

HDFS

ReadInput CPU

Iteration 1

Memory CPU

Iteration 2

Memory

10–100x faster than

network & disk

HDFS

Read

HDFS

Write

HDFS

Read

HDFS

WriteInput ResultCPU

Iteration 1

Memory CPU

Iteration 2

Memory

Zero Read/Write

Disk Bottleneck

Motivation for Apache Spark

Chain Job Output

into New Job Input

© 2016 IBM Corporation12

Comparing MapReduce and Spark

Spark is a unified platform for data integration

Contrast with the many distinct distributions & tools for Hadoop

Spark follows lazy evaluation of execution graphs

Optimizes jobs, reduces wait states, and allows easier pipelining of tasks

Spark lowers the resource overhead for starting or shuffling jobs

Less expensive than MapReduce

Spark Hadoop MapReduce

StorageNo built-in storage

(in-memory)On-disk only

Operations Map, Reduce, Join, Sample, etc. Map and Reduce

Execution Model Batch, interactive, streaming Batch only

Programming

EnvironmentsPython, Scala, Java, and R Java only

© 2016 IBM Corporation13

Infusing Analytics Applications with Apache Spark

Performant Architecture

Productive Workflows

Leverages Existing Investments

Spark is a unified framework for building analytics applications

Continually Improving

Project RedRockhttps://github.com/SparkTC/redrock

http://www.spark.tc/project-redrock-design-data/

IBM donates SystemML to open source to strengthen Spark machine learning

IBM Research

Apache SystemML

systemml.apache.org

© 2016 IBM Corporation15

Optimizer

Cost Models for Heterogeneous Architectures

Improving Scala

Automating Model Selection via Hyperparameter Heuristics

© 2016 IBM Corporation16

Example of how GPU Acceleration can be added to SystemML Optimizer

Language

Compiler

Instructions

Runtime

In-Memory

Single Node Cluster

Memory

Estimator

RecompilerPiggybacking

Operator

Selector

Algebraic

Rewriter

GPU MemoryManager

In-memory Instructions

Spark Job Instructions

GPU Instructions

Cost Model

cuBLAS cuSPARSE Custom

GPU Kernels

Java Native Interface

CUDA4J

GPU Memory Manager

Java Application (SystemML)

© 2016 IBM Corporation17

Understand

problem and

domain

Ingest

data

Explore

and

understand

data

Transform:

clean

Transform:

shape

Create and

build model

Evaluate

Communicate

results

Deliver and

deploy model

All of this innovation in Big Data Platforms will continue to flow into OPTIMIZERs so

that Data Scientists can focus on being Data Scientists

So, where can I Learn More?

Apache SystemMLsystemml.apache.org

1

2

3 IBM Data Science Experiencedatascience.ibm.com

Spark Technology Centerspark.tc

© 2016 IBM Corporation20