Presentation held at Data Science Connect, July 22nd 2014 @ IBM Innovation Center Zurich
© 2013 IBM Corporation1
The Data Scientists Workplace of the Future - Data Science Connect 22nd of July, 2014
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0
DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)
What is BIG data?
Big Data
Hadoop
Business Intelligence
Data Warehouse
BigData == Hadoop?
Hadoop BigData
Hadoop
What is beyond “Data Warehouse”?
Data Lake
Data Warehouse
First “BigData” Use Case?
● Google Index
  ● 40 × 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-Memory vs. disk
Map-Reduce → Hadoop → BigInsights
BigData Analytics – Predictive Analytics
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
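The figures above follow from dividing the data volume by the aggregate bandwidth. A minimal sketch, assuming each node contributes 10 GByte/s and that bandwidth adds up linearly with no coordination overhead (the function and constant names are illustrative, not from the talk):

```python
TB = 10**12  # bytes, decimal terabyte
GB = 10**9   # bytes, decimal gigabyte

def scan_seconds(data_bytes, nodes, bytes_per_sec_per_node):
    """Time to read the whole dataset once if every node scans its
    own share in parallel (idealized, perfectly linear scale-out)."""
    return data_bytes / (nodes * bytes_per_sec_per_node)

if __name__ == "__main__":
    for nodes in (1, 10, 100, 1000):
        print(nodes, "nodes:", scan_seconds(TB, nodes, 10 * GB), "sec")
```

Real clusters fall short of this bound because of data skew, network transfer and scheduling overhead.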
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
MTBF ~ 365 d > 1,5 d
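The sizing above can be retraced with simple arithmetic. This is a rough sketch under assumptions not stated on the slide: node failures are independent, the cluster sees one failure every MTBF / N days, and the budget is spent entirely on nodes at CHF 500 each. All names are illustrative.

```python
def cluster_sizing(budget_chf, node_price_chf, cores_per_node, ram_gb_per_node):
    """Number of nodes a budget buys, plus aggregate cores and RAM."""
    nodes = budget_chf // node_price_chf
    return {
        "nodes": nodes,
        "cores": nodes * cores_per_node,
        "ram_tb": nodes * ram_gb_per_node / 1000,
    }

def days_between_failures(node_mtbf_days, nodes):
    """Expected time between node failures somewhere in the cluster,
    assuming independent failures (a simplification)."""
    return node_mtbf_days / nodes

if __name__ == "__main__":
    sizing = cluster_sizing(100_000, 500, cores_per_node=2, ram_gb_per_node=8)
    print(sizing)
    print(days_between_failures(365, sizing["nodes"]), "days")
```

With 200 nodes this gives about 1.8 days between failures, which fits the slide if "> 1,5 d" is read as a lower bound. At that rate a failed node is routine, which is why fault tolerance has to be built into the software.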
Source: http://www.cloudcomputingpatterns.org/Watchdog
“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
“Elastic” Scale-Out
of CPU Cores, Storage, Memory
“Elastic” Scale-Out
linear
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
How do Databases Scale-Out?
Shared Disk Architectures
How do Databases Scale-Out?
Shared Nothing Architectures
Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?
6 Node Hadoop Cluster 4 Free: http://bluemix.net/
Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop
SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● HIVE, Presto
● Cloudera Impala
● Lingual
● Shark
● ...
SQL Hadoop
Two types of SQL Engines
● Type I
  ● Compiler and Optimizer SQL->MapReduce
● Type II
  ● Brings own distributed execution engine on Data Nodes
  ● Brings own Task Scheduler
● The Hadoop SQL Ecosystem is evolving very fast
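To make the Type I idea concrete, here is a toy sketch of how a query such as `select count(hour), hour from trace group by hour order by hour` could be expressed as map, shuffle and reduce steps. It runs in a single Python process and is not how Hive or any other engine is implemented; the function names are illustrative.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row["hour"], 1

def shuffle(pairs):
    """Shuffle: bring all values with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values of each key (here: COUNT)."""
    return {key: sum(values) for key, values in groups.items()}

def run_query(rows):
    """GROUP BY hour, COUNT, ORDER BY hour."""
    counts = reduce_phase(shuffle(map_phase(rows)))
    return sorted(counts.items())

if __name__ == "__main__":
    trace = [{"hour": 1}, {"hour": 2}, {"hour": 1}]
    print(run_query(trace))
```

A Type II engine would evaluate the same query with its own distributed operators instead of generating MapReduce jobs.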
Hive
● Runs on top of MapReduce
● → Type I
Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
Lingual
● ANSI SQL Layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I
Limits of MapReduce
● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
  ● Only sequential data access
  ● Only tuple-wise data access
  ● Map-Side joins have sort and size constraints
  ● Reduce-Side joins require secondary sorting of values
  ● …
● ...
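The following toy sketch shows why joins are awkward in this model. In a reduce-side join every row is tagged with its source table, all rows are shuffled by join key, and the reducer pairs the two sides. It runs in one Python process with illustrative names and buffers both sides per key in memory. A real Hadoop job would use a secondary sort so that one side arrives first and only that side needs buffering.

```python
from collections import defaultdict
from itertools import chain, product

def map_tagged(rows, tag, key):
    """Map: emit (join key, (source tag, row)) for every input row."""
    for row in rows:
        yield row[key], (tag, row)

def reduce_side_join(left_rows, right_rows, key):
    """Inner equi-join of two row lists on `key`."""
    # Shuffle: group the tagged rows of both inputs by join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    tagged = chain(map_tagged(left_rows, "L", key),
                   map_tagged(right_rows, "R", key))
    for join_key, (tag, row) in tagged:
        groups[join_key][tag].append(row)
    # Reduce: per key, pair every left row with every right row.
    joined = []
    for join_key in sorted(groups):
        sides = groups[join_key]
        joined.extend(product(sides["L"], sides["R"]))
    return joined

if __name__ == "__main__":
    t3 = [{"hour": 1, "a": "x"}, {"hour": 2, "a": "y"}]
    t4 = [{"hour": 1, "b": "p"}, {"hour": 1, "b": "q"}, {"hour": 3, "b": "r"}]
    print(len(reduce_side_join(t3, t4, "hour")))
```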
Impala (Type II)
http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
Presto (Type II)
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
Spark / Shark (Type II)
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
BigSQL V3.0 (Type II)
Like in Spark, MapReduce has been kicked out :) (No JobTracker, no TaskTracker, but HDFS/GPFS remains)
BigSQL V3.0 – Architecture
Putting the story together…
● Big SQL shares a common SQL dialect with DB2
● Big SQL shares the same client drivers with DB2
BigSQL V3.0 – Performance
Query rewrites
● Exhaustive query rewrite capabilities
● Leverages additional metadata such as constraints and nullability
Optimization
● Statistics and heuristic driven query optimization
● Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
● Highly detailed explain plans and query diagnostic tools
● Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
PERIOD.PERKEY=DAILY_SALES.PERKEY AND
PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND
STORE.STOREKEY=DAILY_SALES.STOREKEY AND
CALENDAR_DATE BETWEEN
'01/01/2012' AND '04/28/2012' AND
STORE_NUMBER='03' AND
CATEGORY=72
GROUP BY ITEM_DESC
Query transformation
● Dozens of query transformations
Access plan generation
● Hundreds or thousands of access plan options
[Figure: alternative access plans joining Store, Product, Daily Sales and Period, built from NLJOIN, HSJOIN and ZZJOIN operators]
BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
BigSQL V3.0 – Query Federation
[Figure: one Head Node running Big SQL and four Compute Nodes, each running Task Tracker, Data Node and BigSQL]
BigSQL V1.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
BigSQL V1.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
| |
+----------+
| 11416740 |
+----------+
1 row in results(first row: 39.78s; total: 39.78s)
BigSQL V1.0 – Demo (small)
select count(hour), hour from trace group by hour order by hour
30 rows in results(first row: 37.98s; total: 37.99s)
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour;
+--------+
| |
+--------+
| 477340 |
+--------+
1 row in results(first row: 32.24s; total: 32.25s)
BigSQL V3.0 – Demo (small)
CREATE HADOOP TABLE trace3 (
  hour int, employeeid int,
  departmentid int, clientid int,
  date varchar(30), timestamp varchar(30))
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1 |
+----------+
| 12014733 |
+----------+
1 row in results(first row: 2.94s; total: 2.95s)
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1 |
+--------+
| 504360 |
+--------+
1 row in results(first row: 0.79s; total: 0.80s)
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour;
29 rows in results(first row: 1.88s; total: 1.89s)
R on Hadoop
● IBM BigR (based on the SystemML research project at IBM Research Almaden)
● RHadoop
● RHIPE
● ...
“R” Hadoop
Goal: Find column mean
Problems:
● Column vector cannot fit into memory
You have to partition and parallelize
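A minimal sketch of what "partition and parallelize" means for the column mean: each partition reports only its partial sum and row count, and these small summaries are combined at the end. The partitions are processed one after another here. On a cluster each would be handled by a different node. All names are illustrative.

```python
def partition(values, n_parts):
    """Split a column into n_parts chunks (round-robin)."""
    return [values[i::n_parts] for i in range(n_parts)]

def partial_stats(chunk):
    """Per-partition work: return (sum, count) of the chunk."""
    return sum(chunk), len(chunk)

def distributed_mean(chunks):
    """Combine the per-partition summaries into the global mean."""
    stats = [partial_stats(chunk) for chunk in chunks]
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count

if __name__ == "__main__":
    column = list(range(1, 101))
    print(distributed_mean(partition(column, 7)))
```

Averaging the per-partition means directly would be wrong when partitions differ in size, which is why sum and count are carried separately.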
● Sampling
  ● Full dataset > RAM
  ● Example: use 1% vs 100% of dataset
  ● Precision loss from skewed/sparse data
● Numerical Stability
  ● Limitation from finite precision in computing
  ● Algorithms must be carefully implemented
  ● Instability causes errors to cascade throughout your analysis
Catastrophic Cancellation Error: 6.375 – 5.625
  6.375 rounds to 6.0
  5.625 rounds to 6.0
  True value: 0.75
  Computed: 0
  Relative Error: 1.0
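The cancellation example can be reproduced with Python's `decimal` module by restricting the working precision to one significant digit. This is an illustrative sketch, not code from the talk. The helper name is made up.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def cancellation_demo(a, b, digits):
    """Subtract b from a exactly and with only `digits` significant
    digits; return (true value, computed value, relative error)."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    a_exact, b_exact = Decimal(a), Decimal(b)
    true_value = a_exact - b_exact
    # Round both operands to the limited precision, then subtract.
    a_rounded = ctx.plus(a_exact)
    b_rounded = ctx.plus(b_exact)
    computed = ctx.subtract(a_rounded, b_rounded)
    relative_error = abs(true_value - computed) / abs(true_value)
    return true_value, computed, relative_error

if __name__ == "__main__":
    print(cancellation_demo("6.375", "5.625", digits=1))
```

Both operands round to 6, so the computed difference is 0 and the relative error is 1.0, even though each rounded input is individually off by less than 7%.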
Data in Hadoop
You
R User
Data in distributed memory
Data in Hadoop: Can run R on a single node
R User
Data in distributed memory
You
BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute
● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size)
  ● for arbitrary complex ML programs
● Key Technical Innovations
  ● CP & MR Runtime: Single machine & MR operations, integrated runtime
  ● Caching: Reuse and eviction of in-memory objects
  ● Cost Model: Accurate time and worst-case memory estimates
  ● Optimizer: Cost-based runtime plan generation
  ● Dyn. Recompiler: Re-optimization for initial unknowns
[Figure: Hybrid Plans. Run time vs. data size across the regions CP, CP/MR and MR. Gradually exploit MR parallelism: high performance computing for small data sizes, scalable computing for large data sizes.]
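The idea behind hybrid plans can be illustrated with a deliberately simplified decision rule: estimate the worst-case memory of an operation and run it on a single machine (CP) only if the estimate fits the memory budget. This sketch is an assumption for illustration and does not reflect SystemML's actual cost model or optimizer. The names and the dense 8-bytes-per-cell estimate are hypothetical.

```python
def dense_matrix_bytes(rows, cols, bytes_per_cell=8):
    """Worst-case memory estimate for a dense matrix of doubles."""
    return rows * cols * bytes_per_cell

def choose_plan(rows, cols, memory_budget_bytes):
    """Pick single-machine (CP) execution if the worst-case estimate
    fits into the memory budget, otherwise cluster (MR) execution."""
    estimate = dense_matrix_bytes(rows, cols)
    return "CP" if estimate <= memory_budget_bytes else "MR"

if __name__ == "__main__":
    budget = 2**30  # assume a 1 GiB JVM budget
    print(choose_plan(1_000, 1_000, budget))
    print(choose_plan(10_000_000, 1_000, budget))
```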
BigR Architecture
[Figure: BigR architecture. Components: R Clients (IBM R Packages), SystemML Statistics Engine, Embedded R Execution (IBM R Packages), Data Sources. Callouts: pull data (summaries) to R client; or, push R functions right on the data.]
Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your laptop
You
BigR Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
BigR Demo (small)
library(bigr)
bigr.connect(host="bigdata",
             port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()
tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)
h <- bigr.histogram.stats(tbr$V1, nbins=24)
BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...
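A histogram like this can be computed without moving the raw data: each partition counts its own values per bin and only the small count vectors are merged. The sketch below is an illustration in plain Python, not the implementation behind `bigr.histogram.stats`. It assumes the centroids are bin midpoints, and the range 1 to 29 is inferred from the centroid spacing in the output above rather than stated in the talk.

```python
def bin_index(x, lo, hi, nbins):
    """Index of the equal-width bin that x falls into (hi is inclusive)."""
    if x == hi:
        return nbins - 1
    return int((x - lo) / (hi - lo) * nbins)

def partial_histogram(chunk, lo, hi, nbins):
    """Per-partition work: count the values of one chunk per bin."""
    counts = [0] * nbins
    for x in chunk:
        counts[bin_index(x, lo, hi, nbins)] += 1
    return counts

def merge(histograms):
    """Combine per-partition count vectors by element-wise addition."""
    return [sum(column) for column in zip(*histograms)]

def centroids(lo, hi, nbins):
    """Midpoint of every bin."""
    width = (hi - lo) / nbins
    return [lo + (i + 0.5) * width for i in range(nbins)]

if __name__ == "__main__":
    chunks = [[1, 2, 29], [1.5, 15]]
    parts = [partial_histogram(c, 1, 29, 24) for c in chunks]
    print(merge(parts))
    print(centroids(1, 29, 24)[:3])
```

The global minimum and maximum must be known before binning, so in practice this takes a first pass over the data or a fixed range.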
BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()
SPSS on Hadoop
BigSheets Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
BigSheets Demo (small)
This command runs on 32 GB / ~650.000.000 rows in HDFS
Text Extraction (SystemT, AQL)
If this is not enough? → BigData AppStore
BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● PigLatin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● hdfs files
    ● BigSQL tables
    ● BigSheets collections
Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps