dataiku big data paris - the rise of the hadoop ecosystem
DESCRIPTION
Snapshot of the Hadoop ecosystem at the beginning of 2014, with the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm.
TRANSCRIPT
![Page 1: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/1.jpg)
The Rise of the
Hadoop Ecosystem
![Page 2: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/2.jpg)
Florian Douetteau, CEO Dataiku
DATAIKU
DATA PREPARATION
MODELING STATISTICS
VISUALIZATION
ALL-IN-ONE
DATA SCIENCE STUDIO
![Page 3: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/3.jpg)
TOPICS FOR TODAY
DRIVERS FOR THE NEW “REAL-TIME“ HADOOP ECOSYSTEM
KEY TOOLS AND FRAMEWORKS TO BE AWARE OF
![Page 4: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/4.jpg)
DRIVER 1: BACK TO THE BASICS
RAM - CPU - DISK
![Page 5: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/5.jpg)
Cost per GB, 2000 vs 2013
Memory: $1000 / GB (2000), $6 / GB (2013), cost divided by 150
Disk: $10 / GB (2000), $0.06 / GB (2013), cost divided by 250
MAPREDUCE
times
HACKREDUCE
times
A PERSISTENT MEMORY PROBLEM
![Page 6: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/6.jpg)
DATA IS BIGGER
![Page 7: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/7.jpg)
IS USEFUL DATA BIGGER ?
WHOLE DATA
REFINED DATA
![Page 8: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/8.jpg)
GOLD
NEEDLE IN HAYSTACK ?
![Page 9: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/9.jpg)
OIL
REFINE BEFORE USE
![Page 10: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/10.jpg)
HOW BIG IS BIG DATA ?
Web Site
– $1B revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10 TB RAW DATA
1 TB REFINED DATA
![Page 11: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/11.jpg)
1 TERABYTE
FITS IN MEMORY
1TB
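As a rough check of what "fits in memory" costs, the sketch below applies the 2013 prices per GB quoted earlier in the talk ($6 for memory, $0.06 for disk) to 1 TB of refined data. The 1024 GB per TB convention and the function name are assumptions made for illustration; the slides do not state them.

```python
# Back-of-the-envelope cost of holding 1 TB of refined data.
# Prices per GB are the 2013 figures quoted earlier in the talk;
# the 1024 GB-per-TB convention is an assumption.
GB_PER_TB = 1024
MEMORY_USD_PER_GB = 6.00
DISK_USD_PER_GB = 0.06


def storage_cost_usd(terabytes: float, usd_per_gb: float) -> float:
    """Return the hardware cost in USD for the given capacity."""
    return terabytes * GB_PER_TB * usd_per_gb


memory_cost = storage_cost_usd(1, MEMORY_USD_PER_GB)
disk_cost = storage_cost_usd(1, DISK_USD_PER_GB)
print(f"1 TB in memory: ${memory_cost:,.0f}")
print(f"1 TB on disk:   ${disk_cost:,.2f}")
```

Under these assumptions, keeping the refined data set in RAM costs a few thousand dollars of hardware, which is the point the slide is making.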
![Page 12: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/12.jpg)
DRIVER 2 : ECOSYSTEM GROWS
• 1st Circle: OPEN SOURCE - YAHOO - IBM - LINKEDIN - FACEBOOK
• 2nd Circle: STANFORD - BERKELEY - STARTUPS
![Page 13: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/13.jpg)
STARTUPS
$1B per year invested in Big Data TECH
[Chart: startup funding by year, 2009 to 2013; company labels were lost in extraction. Amounts shown: 64m$, 6.75m$, 14m$, 2m$, 40m$, 20m$, 20.5m$, 19m$, 4m$, 100m$, 1.8m$, 17m$, 11m$, 7.75m$, 1.7m$, 223m$, 301m$]
![Page 14: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/14.jpg)
HAVE YOU SEEN THE MOVIE ?
dooop
![Page 15: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/15.jpg)
ALL-IN-ONE SOLUTION
HDFS
MAP REDUCE
HADOOP =
1. Safe large storage (HDFS)
2. Distributed computation paradigm (Map Reduce)
3. Resilient long jobs
4. Disk-CPU locality-aware resource allocation
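To make the "distributed computation paradigm" concrete, here is a minimal single-process sketch of the three Map Reduce phases (map, shuffle, reduce) applied to word counting. It is plain Python, not Hadoop's API; the function names are illustrative, and a real cluster would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

Each phase only sees one line or one key at a time, which is what allows the work to be split across machines.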
![Page 16: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/16.jpg)
LOVELY TANGLED TOGETHER
![Page 17: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/17.jpg)
INTRODUCING YARN
![Page 18: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/18.jpg)
HDFS
YARN
map reduce
provider1
Other cluster
provider…
THE NEW ECOSYSTEM
![Page 19: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/19.jpg)
FASTER FASTER FASTER
REALLY FASTER ?
![Page 20: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/20.jpg)
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
![Page 21: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/21.jpg)
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
![Page 22: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/22.jpg)
DEVELOPER CAN WAIT
![Page 23: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/23.jpg)
BUSINESS WON’T WAIT
![Page 24: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/24.jpg)
REAL-TIME QUERIES
Not all queries are born equal
![Page 25: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/25.jpg)
RT QUERIES > IMPALA
MPP-database-like performance for Hadoop
- Created in 2012 by Cloudera
- 100x the performance of Hive (for certain queries)
![Page 26: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/26.jpg)
RT QUERIES > DRILL
Extensible architecture for SQL querying
• Started in 2013
• Apache Incubated Project
- Lucidworks
- MapR
- ElasticSearch
- …
• Alpha Status
• Open architecture for supporting SQL-like queries to various data sources:
- Cassandra
- MongoDB
- HDFS
- HBase
Apache DRILL
![Page 27: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/27.jpg)
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
![Page 28: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/28.jpg)
REAL-TIME UPDATES
![Page 29: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/29.jpg)
UPDATES > Recommender System
Update the model once per week using the whole history
Apply the model for each user using the very last events
Real-Time Navigation
Real-Time Recommendation
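The split between a slow batch update and a fast per-user application can be sketched as follows. This is a toy co-occurrence recommender in plain Python, chosen only for illustration: the slides do not say which model is used, and the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_weekly(history):
    """Batch step: count how often two items appear in the same user history."""
    co_counts = defaultdict(Counter)
    for items in history.values():
        for a, b in permutations(set(items), 2):
            co_counts[a][b] += 1
    return co_counts


def recommend(model, last_events, top_n=2):
    """Real-time step: score candidates from the user's most recent items only."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what was just seen
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical whole history, processed once per week.
history = {
    "u1": ["book", "lamp", "desk"],
    "u2": ["book", "lamp"],
    "u3": ["lamp", "chair"],
}
model = train_weekly(history)

# Applied on every page view, using only the very last events.
print(recommend(model, ["book"]))
```

The expensive pass over the full history runs rarely, while the per-request step is a few dictionary lookups, which is why it can keep up with real-time navigation.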
![Page 30: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/30.jpg)
UPDATES > STORM
STORM: Reliable Distributed Real-Time Computations
- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)
- Runs computations in Java (native) or Python, Ruby, Perl …
- Guarantees that events are processed
- Distributes workload
![Page 31: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/31.jpg)
UPDATES > SUMMINGBIRD
Write Map-Reduce-like programs and execute them in either:
• Batch
• Real-Time
• Hybrid Batch / Real-Time
• Open Sourced By Twitter in 2013
• Built on top of Storm (and Cascading)
• Program in Scala
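One property that makes a hybrid batch / real-time mode workable is that the aggregation can be merged: if partial results combine associatively, a batch view and a real-time view add up to the same answer as a single pass over all events. The sketch below illustrates that idea in plain Python with counters. It is not Summingbird's API (Summingbird programs are written in Scala); the function names and event data are hypothetical.

```python
from collections import Counter


def count_events(events):
    """The job logic, written once: count events per key."""
    return Counter(events)


def merge(batch_view, realtime_view):
    """Counts merge by addition, which is associative and commutative."""
    return batch_view + realtime_view


batch_events = ["home", "cart", "home"]   # already processed by the batch job
recent_events = ["home", "checkout"]      # arrived since the last batch run

hybrid = merge(count_events(batch_events), count_events(recent_events))
single_pass = count_events(batch_events + recent_events)
print(hybrid == single_pass)
```

The same `count_events` logic serves both paths; only where and when it runs differs.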
![Page 32: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/32.jpg)
REAL-TIME
REAL-TIME QUERIES
REAL-TIME UPDATES
FAST MACHINE LEARNING
![Page 33: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/33.jpg)
FAST LEARNING DRIVE
GOOD PUPILS ITERATE
![Page 34: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/34.jpg)
ITERATION FOR MACHINE LEARNING
……..
……..
Stochastic Gradient Descent : ITERATE
K-Means : ITERATE
PageRank: ITERATE
……..
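To show what "ITERATE" means in practice, here is a minimal one-dimensional K-Means in plain Python. Every pass re-reads the whole data set, so an engine that reloads its input from disk on each pass pays that cost at every iteration, while an in-memory engine does not. The function name, sample points and starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Repeat assign-then-update until the centers stop moving."""
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(points, [0.0, 5.0])
print(centers)
```

Stochastic gradient descent and PageRank follow the same shape: a loop whose body scans the data and refines a small piece of state.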
![Page 35: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/35.jpg)
LEARNING > GRAPHLAB
“Graph” Analytics in Memory
• Created at Carnegie Mellon in 2009
• Generic Graph Traversal framework
• Packaged Machine Learning
- Recommender Systems
- Graph Analytics
- Clustering
• Easy Python Integration
![Page 36: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/36.jpg)
LEARNING > H2O
In-Memory Distributed Prediction Engine
Machine Learning
- Classification
- Regression
- Clustering
- Easy R/Python integration
![Page 37: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/37.jpg)
ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache
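The operations listed above can be illustrated with a small lazy pipeline in plain Python. `LazyDataset` is a hypothetical stand-in written for this sketch, not Spark's API: it runs in one process and only mimics the idea that transformations are recorded first and computed when a result is requested, with `cache` keeping the computed rows in memory for reuse.

```python
from functools import reduce as fold


class LazyDataset:
    """Hypothetical single-process stand-in for a distributed dataset."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows on demand
        self._use_cache = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def _rows(self):
        if self._cached is not None:
            return self._cached
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows   # keep results in memory for later actions
        return rows

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._rows() if predicate(x)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._rows()))

    def cache(self):
        self._use_cache = True
        return self

    def reduce(self, fn):
        return fold(fn, self._rows())


evens_squared = (
    LazyDataset.from_list(range(1, 7))
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .cache()
)
total = evens_squared.reduce(lambda a, b: a + b)  # computes and caches the rows
largest = evens_squared.reduce(max)               # reuses the cached rows
print(total, largest)
```

Nothing is computed until the first `reduce`; the second `reduce` reads the cached rows instead of replaying the filter and map steps, which is the behavior that benefits iterative workloads.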
![Page 38: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/38.jpg)
SPARK AND ITS ECOSYSTEM
SPARK
SHARK: Real-Time Queries
STREAMING: Real-Time Updates
MLBASE: In-Memory Learning
![Page 39: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/39.jpg)
THE WHOLE PICTURE
HDFS
YARN
map reduce
SPARK
GRAPHLAB
H2O
STREAMING
MLBASE
SHARK
PIG
HIVE
CASCADING
STORM
DRILL
other storage
IMPALA
![Page 40: Dataiku big data paris - the rise of the hadoop ecosystem](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f84a6b4c905d25b8b4be5/html5/thumbnails/40.jpg)
THANK YOU !
dataiku.com
DATAIKU STAND A4
DEMO
DATA SCIENCE STUDIO
Questions now
or later