Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich


Upload: romeo-kienzler

Post on 28-Nov-2014

DESCRIPTION

Presentation held on Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

TRANSCRIPT

Page 1: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

© 2013 IBM Corporation

The Data Scientist's Workplace of the Future - Data Science Connect, 22nd of July, 2014

Romeo Kienzler

IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH)

Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg

Page 2: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is DataScience?

Source: Statoo.com http://slidesha.re/1kmNiX0

Page 3: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

DataScience at present

● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)

Page 4: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is BIG data?

Page 5: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is BIG data?

Page 6: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is BIG data?

Big Data

Hadoop

Page 7: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is BIG data?

Business Intelligence

Data Warehouse

Page 8: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigData == Hadoop?

Hadoop BigData

Hadoop

Page 9: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

What is beyond “Data Warehouse”?

Data Lake

Data Warehouse

Page 10: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

First “BigData” UseCase?

● Google Index
  ● 40 × 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-Memory vs. disk

Page 11: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Map-Reduce → Hadoop → BigInsights

Page 12: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigData Analytics – Predictive Analytics

"sometimes it's not who has the best algorithm that wins; it's who has the most data."

(C) Google Inc.

The Unreasonable Effectiveness of Data¹

¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

No Sampling => Work with full dataset => No p-Value/z-Scores anymore

Page 13: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Aggregated Bandwidth between CPU, Main Memory and Hard Drive

1 TB (at 10 GByte/s)

- 1 Node - 100 sec

- 10 Nodes - 10 sec

- 100 Nodes - 1 sec

- 1000 Nodes - 100 msec
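
The figures above follow from dividing the data volume by the aggregate bandwidth. A minimal sketch, assuming every node contributes 10 GByte/s, decimal units (1 TB = 10^12 bytes) and perfectly linear scaling; the function name is illustrative only:

```python
def scan_seconds(data_bytes: float, nodes: int,
                 per_node_bandwidth: float = 10e9) -> float:
    """Time to read data_bytes once when each node scans an equal share in parallel."""
    return data_bytes / (nodes * per_node_bandwidth)


ONE_TB = 1e12  # decimal terabyte, as assumed above

for nodes in (1, 10, 100, 1000):
    print(f"{nodes:>4} nodes: {scan_seconds(ONE_TB, nodes):g} s")
```

Real clusters rarely scale this cleanly; skewed data placement and network shuffles eat into the ideal numbers.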

Page 14: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Fault Tolerance / Commodity Hardware

AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,

3TB SEAGATE Barracuda 7200.14

< CHF 500

100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD

MTBF ~ 365 d > 1,5 d
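
A rough back-of-the-envelope check of the cluster sizing and failure rate, as a sketch only. It assumes a budget of CHF 100 K at CHF 500 per node, 2 cores / 8 GB RAM / 3 TB disk per node as listed above, and independent node failures at a constant rate, so that the expected time to the first failure in the cluster is the node MTBF divided by the node count. All names are illustrative.

```python
BUDGET_CHF = 100_000
NODE_PRICE_CHF = 500          # upper bound taken from the slide
CORES, RAM_GB, DISK_TB = 2, 8, 3

nodes = BUDGET_CHF // NODE_PRICE_CHF
total_cores = nodes * CORES
total_ram_tb = nodes * RAM_GB / 1000
raw_disk_tb = nodes * DISK_TB  # raw capacity, before any replication


def cluster_mtbf_days(node_mtbf_days: float, node_count: int) -> float:
    """Expected days until the first node fails, assuming independent failures."""
    return node_mtbf_days / node_count


print(nodes, total_cores, total_ram_tb, raw_disk_tb)
print(cluster_mtbf_days(365, nodes))
```

With these assumptions a failure is expected somewhere in the cluster every couple of days, the same order of magnitude as the figure on the slide, which is why fault tolerance has to be built into the software layer. The raw disk total is larger than the slide's 200 TB; the slide does not say whether its figure accounts for replication.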

Source: http://www.cloudcomputingpatterns.org/Watchdog

Page 15: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload

Page 16: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

of

Page 17: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

of

CPU Cores

Page 18: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

of

CPU Cores Storage

Page 19: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

of

CPU Cores Storage Memory

Page 20: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

“Elastic” Scale-Out

linear

Source: http://www.cloudcomputingpatterns.org/Elastic_Platform

Page 21: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

How do Databases Scale-Out?

Shared Disk Architectures

Page 22: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

How do Databases Scale-Out?

Shared Nothing Architectures

Page 23: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Hadoop?

Shared Nothing Architecture?

Shared Disk Architecture?

http://bluemix.net/

6 Node Hadoop Cluster 4 Free

Page 24: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Data Science on Hadoop

SQL (42%)

R (33%)

Python (26%)

Excel (25%)

Java, Ruby, C++ (17%)

SPSS, SAS (9%)

Data Science Hadoop

Page 25: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

SQL on Hadoop

● IBM BigSQL (ANSI 92 compliant)
● HIVE, Presto
● Cloudera Impala
● Lingual
● Shark
● ...

SQL Hadoop

Page 26: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Two types of SQL Engines

● Type I
  ● Compiler and Optimizer SQL->MapReduce
● Type II
  ● Brings own distributed execution engine on Data Nodes
  ● Brings own Task Scheduler
● The Hadoop SQL Ecosystem is evolving very fast
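
To make the Type I idea concrete, here is a toy, single-process sketch of how a query such as `select count(hour), hour from trace group by hour order by hour` (used later in the demo) could be expressed as a map, a shuffle and a reduce step. It is a plain-Python illustration under that assumption, not the code any particular engine generates; all function names are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, int]


def map_phase(rows: Iterable[Row]) -> Iterator[Tuple[int, int]]:
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row["hour"], 1


def shuffle(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Shuffle: bring all values with the same key together."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: int, values: List[int]) -> Tuple[int, int]:
    """Reduce: aggregate the values of one group (COUNT)."""
    return key, sum(values)


def count_by_hour(rows: Iterable[Row]) -> List[Tuple[int, int]]:
    grouped = shuffle(map_phase(rows))
    # Sorting the keys plays the role of ORDER BY hour.
    return [reduce_phase(key, grouped[key]) for key in sorted(grouped)]


print(count_by_hour([{"hour": 1}, {"hour": 0}, {"hour": 1}]))
```

A Type II engine answers the same query without going through this map/shuffle/reduce job structure, using its own execution engine on the data nodes.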

Page 27: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Hive

● Runs on top of MapReduce
● → Type I

Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg

Page 28: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Lingual

● ANSI SQL Layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I

Page 29: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Limits of MapReduce

● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
  ● Only sequential data access
  ● Only tuple-wise data access
  ● Map-Side joins have sort and size constraints
  ● Reduce-Side joins require secondary sorting of values
  ● …
● ...
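
The reduce-side join mentioned above can be sketched in a few lines. This is a single-process simulation under simplifying assumptions (rows are dictionaries, the whole shuffle fits in memory, an inner equi-join on one key); the function name and the tagging scheme are illustrative, not taken from Hadoop's API.

```python
from collections import defaultdict
from itertools import product
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def reduce_side_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    # Map: tag every row with its source table (0 = left, 1 = right).
    tagged: List[Tuple[Any, Tuple[int, Row]]] = (
        [(row[key], (0, row)) for row in left]
        + [(row[key], (1, row)) for row in right]
    )

    # Shuffle: group all tagged rows by join key.
    groups: Dict[Any, List[Tuple[int, Row]]] = defaultdict(list)
    for join_key, value in tagged:
        groups[join_key].append(value)

    joined: List[Row] = []
    for join_key in sorted(groups):
        # Secondary sort: order values by tag so left rows arrive first.
        values = sorted(groups[join_key], key=lambda tagged_row: tagged_row[0])
        left_rows = [row for tag, row in values if tag == 0]
        right_rows = [row for tag, row in values if tag == 1]
        # Reduce: pair every left row with every right row of this key.
        joined.extend({**lrow, **rrow} for lrow, rrow in product(left_rows, right_rows))
    return joined


employees = [{"hour": 1, "a": "x"}, {"hour": 2, "a": "y"}]
clients = [{"hour": 1, "b": "p"}, {"hour": 1, "b": "q"}]
print(reduce_side_join(employees, clients, "hour"))
```

On a real cluster every record crosses the network during the shuffle, which is one reason such joins are slow.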

Page 30: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Impala (Type II)

http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png

Page 31: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Presto (Type II)

https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

Page 32: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Spark / Shark (Type II)

Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png

Page 33: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 (Type II)

Like in Spark, MapReduce has been kicked out :)

(No JobTracker, no TaskTracker, but HDFS/GPFS remains)

Page 34: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Architecture

Putting the story together…

Big SQL shares a common SQL dialect with DB2

Big SQL shares the same client drivers with DB2

Page 35: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Performance

Query rewrites
● Exhaustive query rewrite capabilities
● Leverages additional metadata such as constraints and nullability

Optimization
● Statistics and heuristic driven query optimization
● Query optimizer based upon decades of IBM RDBMS experience

Tools and metrics
● Highly detailed explain plans and query diagnostic tools
● Extensive number of available performance metrics

SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
  PERIOD.PERKEY=DAILY_SALES.PERKEY AND
  PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND
  STORE.STOREKEY=DAILY_SALES.STOREKEY AND
  CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND
  STORE_NUMBER='03' AND
  CATEGORY=72
GROUP BY ITEM_DESC

Query transformation: dozens of query transformations

Access plan generation: hundreds or thousands of access plan options

[Figure: four alternative access plans for the query above, each joining Store, Product, Daily Sales and Period in a different order with NLJOIN, HSJOIN and ZZJOIN operators]

Page 36: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Performance

You are substantially faster if you don't use MapReduce

IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

Page 37: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Query Federation

[Figure: a Head Node running Big SQL and four Compute Nodes, each hosting a Task Tracker, a Data Node and BigSQL]

Page 38: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
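
The row counts for the middle and large datasets match a linear extrapolation from the small one. A quick check, assuming decimal units (1 GB = 10^9 bytes) and a constant average row size; the helper name is illustrative:

```python
SMALL_BYTES = 32 * 10**9      # 32 GB
SMALL_ROWS = 650_000_000      # ~650.000.000 rows


def extrapolated_rows(data_bytes: int) -> int:
    """Scale the small dataset's row count linearly with the data volume."""
    return SMALL_ROWS * data_bytes // SMALL_BYTES


print(extrapolated_rows(3 * 10**12))    # 3 TB
print(extrapolated_rows(7 * 10**14))    # 0.7 PB
print(SMALL_BYTES / SMALL_ROWS)         # average bytes per row
```

The average row works out to roughly 49 bytes, consistent with six short delimited fields per line.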

Page 39: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

Page 40: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

Page 41: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

Page 42: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
|          |
+----------+
| 11416740 |
+----------+
1 row in results(first row: 39.78s; total: 39.78s)

Page 43: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

select count(hour), hour from trace group by hour order by hour

30 rows in results(first row: 37.98s; total: 37.99s)

Page 44: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V1.0 – Demo (small)

[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour;
+--------+
|        |
+--------+
| 477340 |
+--------+
1 row in results(first row: 32.24s; total: 32.25s)

Page 45: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Demo (small)

CREATE HADOOP TABLE trace3 (
  hour int, employeeid int,
  departmentid int, clientid int,
  date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;

Page 46: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Demo (small)

[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1        |
+----------+
| 12014733 |
+----------+
1 row in results(first row: 2.94s; total: 2.95s)

Page 47: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Demo (small)

[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1      |
+--------+
| 504360 |
+--------+
1 row in results(first row: 0.79s; total: 0.80s)

Page 48: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSQL V3.0 – Demo (small)

[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour;

29 rows in results(first row: 1.88s; total: 1.89s)
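
Putting the total times of the V1.0 and V3.0 runs side by side gives a rough idea of the speedup. The timings are copied from the demo output in this deck; note that the two versions ran against different tables with slightly different row counts, so the ratios below are indicative only.

```python
# Total query times in seconds: (BigSQL V1.0, BigSQL V3.0)
timings = {
    "count(*)": (39.78, 2.95),
    "group by hour": (37.99, 1.89),
    "inner join on hour": (32.25, 0.80),
}

speedups = {query: v1 / v3 for query, (v1, v3) in timings.items()}

for query, factor in speedups.items():
    print(f"{query:<20} ~{factor:.0f}x faster")
```

The ratios come out at roughly 13x, 20x and 40x respectively.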

Page 49: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

R on Hadoop

● IBM BigR (based on the SystemML research project from IBM Research Almaden)
● RHadoop
● RHIPE
● ...

“R” Hadoop

Page 50: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Page 51: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Goal: Find column mean

Problems:
• Column vector cannot fit into memory

You have to partition and parallelize
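
A minimal sketch of the partition-and-parallelize idea for the column mean: each partition reports only its sum and its count, and the partial results are then combined. The partitions are plain in-memory lists here for illustration; on a cluster each one would live on a different node. Function names are illustrative.

```python
from typing import Iterable, List, Sequence, Tuple


def partial_sum_count(partition: Sequence[float]) -> Tuple[float, int]:
    """Computed independently per partition (the parallel part)."""
    return sum(partition), len(partition)


def combine_mean(partials: Iterable[Tuple[float, int]]) -> float:
    """Merge the small per-partition results into the global mean."""
    total, count = 0.0, 0
    for part_sum, part_count in partials:
        total += part_sum
        count += part_count
    return total / count


partitions: List[List[float]] = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(combine_mean(partial_sum_count(p) for p in partitions))
```

Averaging the per-partition means directly would be wrong whenever partitions differ in size, which is why the counts travel with the sums.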

Page 52: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

● Sampling
  ● Full dataset > RAM
  ● Example: use 1% vs 100% of dataset
  ● Precision loss from skewed/sparse data
● Numerical Stability
  ● Limitation from finite precision in computing
  ● Algorithms must be carefully implemented
  ● Instability causes errors to cascade throughout your analysis

Catastrophic Cancellation Error: 6.375 – 5.625

6.375 rounds to 6.0
5.625 rounds to 6.0

True value: 0.75
Computed: 0
Relative Error: 1.0
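
The cancellation example can be reproduced directly. The sketch below mimics a machine that keeps only whole numbers by rounding both operands before subtracting; real floating-point hardware loses digits much further to the right, but the effect is the same in kind.

```python
def relative_error(true_value: float, computed: float) -> float:
    """Size of the error relative to the true value."""
    return abs(true_value - computed) / abs(true_value)


a, b = 6.375, 5.625

true_diff = a - b                       # both values are exact in binary
computed_diff = round(a) - round(b)     # operands rounded before subtracting

print(true_diff, computed_diff, relative_error(true_diff, computed_diff))
```

Both operands carry a small rounding error on their own, yet the difference loses all of its significant digits.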

Page 53: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Data in Hadoop

You

R User

Data in distributed memory

Page 54: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Data in Hadoop: Can run R on a single node

R User

Data in distributed memory

You

Page 55: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR (based on SystemML)

SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute

● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size)
  ● for arbitrary complex ML programs
● Key Technical Innovations
  ● CP & MR Runtime: Single machine & MR operations, integrated runtime
  ● Caching: Reuse and eviction of in-memory objects
  ● Cost Model: Accurate time and worst-case memory estimates
  ● Optimizer: Cost-based runtime plan generation
  ● Dyn. Recompiler: Re-optimization for initial unknowns

[Figure: Hybrid Plans. Run time plotted against data size across the CP, CP/MR and MR regions: high performance computing for small data sizes, scalable computing for large data sizes, gradually exploiting MR parallelism]
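
The core decision behind a hybrid plan can be illustrated with a toy rule: estimate the worst-case memory an operation needs and run it on a single machine (CP) only if that fits the memory budget, otherwise fall back to the cluster (MR). This is a deliberately simplified sketch, not SystemML's actual cost model or optimizer; the dense 8-bytes-per-cell estimate and all names are assumptions made for the example.

```python
def estimate_dense_bytes(rows: int, cols: int, bytes_per_cell: int = 8) -> int:
    """Worst-case size of a dense matrix of doubles (assumed estimate)."""
    return rows * cols * bytes_per_cell


def choose_plan(rows: int, cols: int, memory_budget_bytes: int) -> str:
    """Pick 'CP' (single machine) if the estimate fits the budget, else 'MR'."""
    if estimate_dense_bytes(rows, cols) <= memory_budget_bytes:
        return "CP"
    return "MR"


BUDGET = 2 * 10**9  # hypothetical 2 GB JVM budget

print(choose_plan(10_000, 1_000, BUDGET))       # small matrix
print(choose_plan(650_000_000, 6, BUDGET))      # demo-sized table
```

A real optimizer makes this choice per operator and revisits it at run time once initially unknown sizes become known.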

Page 56: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Architecture

[Figure: R Clients with IBM R Packages (1) connect to the SystemML Statistics Engine (2), which sits on the Data Sources and offers Embedded R Execution with IBM R Packages (3). Either pull data (summaries) to the R client or push R functions right on the data.]

Page 57: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Big R Data Structures: Proxy to entire dataset

data <- bigr.frame(…)

Appears and acts like all of the data is on your laptop

You

Page 58: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Demo (small)

● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)

Page 59: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Demo (small)

library(bigr)

bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")

is.bigr.connected()

tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)

h <- bigr.histogram.stats(tbr$V1, nbins=24)

Page 60: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Demo (small)

  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...

Page 61: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Demo (small)

Page 62: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigR Demo (small)

jpeg('hist.jpg')
# This command runs on 32 GB / ~650.000.000 rows in HDFS
bigr.histogram(tbr$V1, nbins=24)
dev.off()

Page 63: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

SPSS on Hadoop

Page 64: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

SPSS on Hadoop

Page 65: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSheets Demo (small)

● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)

Page 66: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSheets Demo (small)

Page 67: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSheets Demo (small)

This command runs on 32 GB / ~650.000.000 rows in HDFS

Page 68: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigSheets Demo (small)

Page 69: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Text Extraction (SystemT, AQL)

Page 70: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Text Extraction (SystemT, AQL)

Page 71: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

If this is not enough? → BigData AppStore

Page 72: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

BigData AppStore, Eclipse Tooling

● Write your apps in
  ● Java (MapReduce)
  ● PigLatin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● HDFS files
    ● BigSQL tables
    ● BigSheets collections

Page 73: Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

Questions?

http://www.ibm.com/software/data/bigdata/

Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps