Hadoop and Machine Learning
DESCRIPTION
Slides for the talk by the Cloudera Data Science team on the state of machine learning and Hadoop at NIPS 2011.
TRANSCRIPT
![Page 1: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/1.jpg)
Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
![Page 2: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/2.jpg)
High Availability for Data Scientists
Copyright 2011 Cloudera Inc. All rights reserved
NIPS
![Page 3: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/3.jpg)
Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia
![Page 4: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/4.jpg)
Industrial Machine Learning
![Page 5: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/5.jpg)
Delta One: Model Evaluation
• ML Systems Are One Piece of a Complex System
  • Well-defined objective functions are the exception
    • Multiple, often conflicting goals
    • Weights are fuzzy and shift with business priorities
    • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
  • Examples
    • Computational advertising
    • Friend recommendations on social networks
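To make the Pareto optimization bullet concrete, here is a minimal sketch of picking the non-dominated models when several goals conflict. The metric names and numbers are hypothetical and are not from the talk; the sketch assumes higher is better for every metric.

```python
from typing import Dict, List


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


# Hypothetical offline results for three candidate models.
candidates = {
    "model_a": {"click_rate": 0.031, "revenue": 1.20},
    "model_b": {"click_rate": 0.029, "revenue": 1.35},
    "model_c": {"click_rate": 0.028, "revenue": 1.10},
}
front = pareto_front(candidates)
```

Here `model_c` is worse than `model_a` on both metrics and drops out, while `model_a` and `model_b` both stay on the front. Choosing between them is the business decision the slide describes, and it does not require fixing the weights in advance.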
![Page 6: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/6.jpg)
Delta Two: Systems Precede Algorithms
• Greenfield Projects Hardly Ever Happen
  • (and don’t usually launch)
• Industrial Computational Infrastructure
  • General-purpose
  • Cheap
  • Shared
• Constraints Drive Innovation
  • Vowpal Wabbit Hashing Trick
  • SETI @ Google
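As an illustration of the hashing trick mentioned above, here is a minimal sketch of feature hashing. It is not Vowpal Wabbit's implementation: `hash_features`, the `num_bits` parameter, and the use of `zlib.crc32` as the hash are assumptions made for this example.

```python
import zlib
from typing import Dict, Iterable


def hash_features(tokens: Iterable[str], num_bits: int = 18) -> Dict[int, float]:
    """Map string features into a fixed-size sparse vector with 2**num_bits slots."""
    num_buckets = 1 << num_bits
    vector: Dict[int, float] = {}
    for token in tokens:
        # crc32 is a stand-in hash chosen for this sketch; any stable hash works.
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        # Colliding features share a slot and their counts add up.
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


example = hash_features(["user=42", "ad=shoes", "user=42"], num_bits=10)
```

The memory footprint is bounded by `num_bits` regardless of how many distinct feature strings appear, which is the kind of constraint-driven design the slide points to. No feature dictionary has to be built or shipped to the serving side.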
![Page 7: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/7.jpg)
Delta Three: Workflow
Practice Over Theory Blog
![Page 8: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/8.jpg)
Delta Three: Workflow
• Optimize the Overall Process
  • Model fitting is a small piece of the overall flow time
  • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
  • Common Feature Extraction Logic
  • Servable Models
• Validation as Sanity Checking
  • Deploy to a small subset of real data and evaluate
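One way to read "Common Feature Extraction Logic" is that offline training and online scoring should call the same function. The sketch below assumes a hypothetical event schema (`query`, `price`, `device`, `label`) and a simple linear scorer; none of these names come from the talk.

```python
import math
from typing import Any, Dict, List, Tuple


def extract_features(event: Dict[str, Any]) -> Dict[str, float]:
    """Single source of truth for features, used by both training and serving."""
    query = str(event.get("query", ""))
    return {
        "query_length": float(len(query.split())),
        "log_price": math.log1p(float(event.get("price", 0.0))),
        "is_mobile": 1.0 if event.get("device") == "mobile" else 0.0,
    }


def build_training_rows(events: List[Dict[str, Any]]) -> List[Tuple[Dict[str, float], int]]:
    """Offline path: turn logged events into (features, label) pairs."""
    return [(extract_features(e), int(e["label"])) for e in events]


def score(event: Dict[str, Any], weights: Dict[str, float]) -> float:
    """Online path: score one live event with a linear model."""
    features = extract_features(event)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


event = {"query": "red shoes", "price": 0.0, "device": "mobile", "label": 1}
rows = build_training_rows([event])
live_score = score(event, {"query_length": 0.5, "is_mobile": 1.0})
```

Because both paths go through `extract_features`, a feature cannot silently mean one thing in the training data and another in production.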
![Page 9: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/9.jpg)
Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia
![Page 10: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/10.jpg)
Hadoop: It’s Where The Data Is
![Page 11: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/11.jpg)
Hadoop Platform: Substrate
• Commodity servers
  • Open Compute
• Open source operating system
  • Linux
• Open source configuration management
  • Puppet
  • Chef
• Coordination service
  • ZooKeeper
![Page 12: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/12.jpg)
Hadoop Platform: Storage
• Distributed schema-less storage
  • HDFS
  • Ceph
• Append-only storage formats and metadata
  • Avro
  • RCFile
  • HCatalog
• Mutable key-value storage and metadata
  • HBase
![Page 13: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/13.jpg)
Hadoop Platform: Integration
• Tool Access
  • FUSE
  • JDBC
  • ODBC
• Data Ingestion
  • Flume
  • Sqoop
![Page 14: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/14.jpg)
ML and Hadoop: The State of the World
![Page 15: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/15.jpg)
Computation: Plain Old MapReduce
• Great for:
  • Data Preparation
  • Feature Engineering
  • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
  • Recommendation Systems
  • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
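The feature engineering use case can be sketched as a map step, a shuffle, and a reduce step. The example below simulates that flow in a single Python process rather than on Hadoop; the tab-separated log format and the per-user click-through-rate feature are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_event(line: str) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Map: parse 'user_id<TAB>clicked' and emit (user_id, (impressions, clicks))."""
    user_id, clicked = line.strip().split("\t")
    yield user_id, (1, int(clicked))


def reduce_user(user_id: str, values: List[Tuple[int, int]]) -> Tuple[str, float]:
    """Reduce: aggregate one user's events into a click-through-rate feature."""
    impressions = sum(v[0] for v in values)
    clicks = sum(v[1] for v in values)
    return user_id, clicks / impressions


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    groups: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        for key, value in map_event(line):
            groups[key].append(value)
    return dict(reduce_user(key, values) for key, values in groups.items())


log = ["u1\t1", "u1\t0", "u2\t0", "u1\t1"]
ctr = run_job(log)
```

Each user's feature depends only on that user's records, so the work partitions cleanly by key. That is why this kind of aggregation fits MapReduce well, while an online learner that updates shared weights after every example does not.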
![Page 16: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/16.jpg)
Tools for Data Preparation/Feature Engineering
• Languages/Environments
  • PigLatin
  • HiveQL
  • Need to deal with mismatch between offline/online feature generation
• Java/Scala APIs
  • Crunch (Cloudera)
  • Scoobi (NICTA)
  • Cascading (Concurrent)
  • Jaql (IBM)
![Page 17: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/17.jpg)
Apache Mahout
• The starting place for MapReduce-based machine learning algorithms
  • Not machine-learning-in-a-box
  • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
  • Recommendations
  • Clustering
  • Classification
  • Frequent Itemset Mining
![Page 18: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/18.jpg)
Apache Mahout (cont.)
• Best Library: Taste Recommender
  • Oldest project, most widely-deployed in production
  • SVD implementation is particularly active
• Good Libraries: Online SGD
  • Does not use MapReduce
  • Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
  • “Secret sauce” effect
  • Delta between Mahout and the cutting edge in ML
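For readers unfamiliar with online SGD, here is a minimal sketch of logistic regression trained one example at a time. It is not Mahout's or Vowpal Wabbit's code; the function names, the fixed learning rate, and the toy data stream are assumptions for this example.

```python
import math
from typing import Dict, Iterable, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic(
    stream: Iterable[Tuple[Dict[str, float], int]], learning_rate: float = 0.1
) -> Dict[str, float]:
    """Update sparse weights after every example; nothing is held in memory but the weights."""
    weights: Dict[str, float] = {}
    for features, label in stream:
        z = sum(weights.get(name, 0.0) * value for name, value in features.items())
        error = sigmoid(z) - label  # gradient of log loss with respect to z
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights


# Toy stream: positive x means label 1, negative x means label 0.
stream = [({"bias": 1.0, "x": 1.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)] * 200
weights = sgd_logistic(stream)
```

Every update depends on the weights left by the previous example, so the loop is inherently sequential. That is consistent with the slide's note that this library does not use MapReduce.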
![Page 19: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/19.jpg)
More Machine Learning Interfaces for Hadoop
• Based on MapReduce
  • SystemML (IBM)
  • AllReduce (Vowpal Wabbit)
• No MapReduce
  • Spark
• R-Based Systems (Augment MapReduce with R)
  • Segue
  • RHIPE
  • RHadoop
  • Ricardo (IBM)
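The AllReduce pattern listed above can be summarized as: every worker contributes a local vector, and every worker receives the same combined result. The sketch below simulates that contract in one process. It is not Vowpal Wabbit's networked implementation, and the function names are hypothetical.

```python
from typing import List


def all_reduce_sum(local_vectors: List[List[float]]) -> List[List[float]]:
    """Sum the workers' vectors element-wise and hand every worker a copy of the total."""
    total = [sum(column) for column in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def averaged_gradient(local_gradients: List[List[float]]) -> List[List[float]]:
    """Each worker ends up with the mean gradient across all workers."""
    num_workers = len(local_gradients)
    reduced = all_reduce_sum(local_gradients)
    return [[g / num_workers for g in vector] for vector in reduced]


# Two simulated workers, each holding a gradient computed on its own data shard.
local_gradients = [[1.0, 2.0], [3.0, 4.0]]
summed = all_reduce_sum(local_gradients)
averaged = averaged_gradient(local_gradients)
```

After the call, all workers hold identical values and can apply the same weight update without a central parameter server.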
![Page 20: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/20.jpg)
ML and Hadoop: Where Things are Headed
![Page 21: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/21.jpg)
MRv2 and YARN
• Eliminates JobTracker bottleneck
  • Separate Resource Manager/Scheduler
  • Individual jobs have their own task masters
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
  • MPI (Hamster; MAPREDUCE-2911)
  • Native BSP (Giraph)
  • Spark
  • AllReduce, GraphLab
![Page 22: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/22.jpg)
Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia
![Page 23: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/23.jpg)
Machine Learning on Multivariate Time Series
• 1e5 writes/sec
• Positive events are relatively rare
• Feature extraction challenge
• May not be clear what the right time horizon is
• Tight SLAs
• Very high stakes
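When the right time horizon is unclear, one common approach is to compute the same trailing-window features over several horizons and let model selection decide. The sketch below assumes events arrive as `(timestamp_seconds, value)` pairs sorted by time; the function name and feature names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple


def window_features(
    events: Iterable[Tuple[float, float]], horizons: List[float]
) -> List[Dict[str, float]]:
    """For each event, emit count and mean of values inside each trailing horizon."""
    longest = max(horizons)
    buffer: Deque[Tuple[float, float]] = deque()
    rows: List[Dict[str, float]] = []
    for timestamp, value in events:
        buffer.append((timestamp, value))
        # Drop events that fall outside the longest horizon.
        while buffer and buffer[0][0] <= timestamp - longest:
            buffer.popleft()
        row: Dict[str, float] = {}
        for horizon in horizons:
            # The current event is always included, so the list is never empty.
            values = [v for t, v in buffer if t > timestamp - horizon]
            row[f"count_{horizon:g}s"] = float(len(values))
            row[f"mean_{horizon:g}s"] = sum(values) / len(values)
        rows.append(row)
    return rows


events = [(0.0, 1.0), (5.0, 3.0), (20.0, 5.0)]
rows = window_features(events, horizons=[10.0, 60.0])
```

This per-event rescan of the buffer is for clarity only. At the write rates on the slide, a production version would need incremental aggregates rather than recomputing each window.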
![Page 24: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/24.jpg)
An Academic Language For Feature Engineering
• Feature extraction/selection is as important as model fitting
  • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and dispersed across multiple fields
  • NIPS 2003
  • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these problems across disciplines
![Page 25: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/25.jpg)
A Broader Ontology For Model Selection
• Practical factors that enter into the “best” choice of model…
  • Data arrival rate
  • Data volume
  • Scoring latency
  • Model refresh time
  • Robustness/reliability
• …in addition to the standard predictive power/simplicity tradeoffs
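One simple way to combine these practical factors with predictive power is to treat the operational limits as hard constraints and maximize accuracy among the models that satisfy them. The sketch below is an assumption about how that could look; the `Candidate` fields, thresholds, and numbers are hypothetical and cover only two of the listed factors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    auc: float                 # offline predictive power
    scoring_latency_ms: float  # time to score one request
    refresh_minutes: float     # time to retrain and redeploy


def select_model(
    candidates: List[Candidate], max_latency_ms: float, max_refresh_minutes: float
) -> Optional[Candidate]:
    """Keep candidates that meet the operational limits, then pick the most accurate."""
    feasible = [
        c
        for c in candidates
        if c.scoring_latency_ms <= max_latency_ms
        and c.refresh_minutes <= max_refresh_minutes
    ]
    return max(feasible, key=lambda c: c.auc, default=None)


candidates = [
    Candidate("boosted_trees", auc=0.82, scoring_latency_ms=40.0, refresh_minutes=720.0),
    Candidate("linear", auc=0.78, scoring_latency_ms=2.0, refresh_minutes=15.0),
]
tight = select_model(candidates, max_latency_ms=10.0, max_refresh_minutes=60.0)
loose = select_model(candidates, max_latency_ms=100.0, max_refresh_minutes=1000.0)
```

Under the tight limits the less accurate linear model wins, which is the slide's point: the "best" model depends on the serving constraints as well as on predictive power.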
![Page 26: Hadoop and Machine Learning](https://reader035.vdocuments.mx/reader035/viewer/2022062617/54c6bbd34a7959c35a8b457b/html5/thumbnails/26.jpg)
Questions?
Want A Job?
@josh_wills