Machine Learning and Apache Mahout: An Introduction
DESCRIPTION
An introductory presentation on Machine Learning and Apache Mahout. I presented it at the first meetup of the BigData Meetup - Pune Chapter (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).
TRANSCRIPT
![Page 1: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/1.jpg)
Varad Meru
Software Development Engineer
Orzota, Inc.
about.me/vrdmr
Machine Learning
and Apache Mahout
© Varad Meru, 2013
![Page 2: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/2.jpg)
Who Am I
• Orzota, Inc.
  • Making BigData Easy
  • Designing a cloud-based platform for ETL, Analytics
• Past Work Experience
  • Persistent Systems Ltd.
  • Recommendation Engines and User Behavior Analytics
• Area of Interest
  • Machine Learning
  • Distributed Systems
  • Recommendation Engines
![Page 3: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/3.jpg)
Outline
• Introduction
• Machine Learning
  • Introduction and History
  • Types of Learning Algorithms
  • Applications
  • What’s New
• Apache Mahout
  • History
  • Architecture
  • Applications and Examples
• Conclusion
![Page 4: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/4.jpg)
Machine Learning
Rise of the Machine-Era
![Page 5: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/5.jpg)
Introduction
• Term coined by Arthur Samuel: "Field of study that gives computers the ability to learn without being explicitly programmed".
• Branch of Artificial Intelligence and Statistics
• Focuses on prediction based on known properties
• Used as a sub-process in Data Mining. Data Mining focuses on discovering new, unknown properties.
• "Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience"
![Page 6: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/6.jpg)
Learning Algorithms
• Supervised Learning
  • Labelled input data.
  • Creating classifiers to predict unseen inputs.
• Unsupervised Learning
  • Unlabelled input data.
  • Creating a function to predict the relation and output.
• Semi-Supervised Learning
  • Combines Supervised and Unsupervised Learning methodology.
• Reinforcement Learning
  • Reward-Punishment based agent.
![Page 7: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/7.jpg)
Supervised Learning
Introduction
• Learn from the Data
• Data is already labelled
  • Expert, crowd-sourced or case-based labelling of data.
• Applications
  • Handwriting Recognition
  • Spam Detection
  • Information Retrieval
  • Personalisation based on ranks
  • Speech Recognition
![Page 8: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/8.jpg)
Supervised Learning
Algorithms
• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-layer Perceptrons
• Neural Networks
• SVM and Kernel estimation
![Page 9: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/9.jpg)
Supervised Learning
Example: Naive Bayes Classifier
President Obama’s Speech’s Word Map
![Page 10: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/10.jpg)
Supervised Learning
Example: Naive Bayes Classifier
A Spam Document’s Word Map
![Page 11: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/11.jpg)
Supervised Learning
Example: Naive Bayes Classifier
Running a test on the Classifier
“Order a trial Adobe chicken daily EAB-List new summer savings, welcome!” → Classifier → SpamBin
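The classification step on this slide can be illustrated with a small multinomial Naive Bayes classifier using Laplace smoothing. This is only a sketch: the four training sentences, the `spam`/`ham` labels and the function names are made up for illustration and are not taken from the slides or from Mahout.

```python
import math
import re
from collections import Counter

# Hypothetical toy training set of (label, text) pairs; not from the slides.
TRAINING = [
    ("spam", "order a trial now and get summer savings"),
    ("spam", "new savings daily order today"),
    ("ham", "the president gave a speech on the economy"),
    ("ham", "a new speech on jobs and the economy"),
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocabulary = set()
    for label, text in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
message = "Order a trial Adobe chicken daily EAB-List new summer savings, welcome!"
print(classify(message, model))  # prints "spam" for this toy training set
```

Words such as "order", "trial" and "savings" occur only in the toy spam documents, so the test message from the slide scores higher under the spam label.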
![Page 12: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/12.jpg)
Unsupervised Learning
Introduction
• Finding hidden structure in data
• Unlabelled Data
• Subject-matter experts (SMEs) are needed after processing to verify, validate and use the output
• Used in exploratory analysis rather than predictive analytics
• Applications
  • Pattern Recognition
  • Groupings based on a distance measure
    • Group of People, Objects, ...
![Page 13: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/13.jpg)
Unsupervised Learning
Algorithms
• Clustering
  • k-Means, MinHash, Hierarchical Clustering
• Hidden Markov Models
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)
![Page 14: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/14.jpg)
Unsupervised Learning
Example: K-Means
Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
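As a rough illustration of what k-means does, the sketch below implements the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The sample points, the initial centroids and the function names are assumptions chosen for the example; the slide's original figure is not reproduced here.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Update step: keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The initial centroids are passed in explicitly so the run is deterministic; in practice they are often chosen at random or by a seeding heuristic.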
![Page 15: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/15.jpg)
Learning Problem
Cat and Dog Problem
• Humans can easily classify which is a cat and which is a dog.
• But how can a computer do that?
• Some attempts used Clustering Mechanisms to solve it: Co-occurrence Clustering, Deep Learning
![Page 16: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/16.jpg)
Apache Mahout
Scalable Machine Learning Library
![Page 17: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/17.jpg)
History and Etymology
• Inspired by “MapReduce for Machine Learning on Multicore”, Ng et al.
• Written in Java. Apache License.
• Founders
  • Mahout: Isabel Drost, Grant Ingersoll, Karl Wettin
  • Taste: Sean Owen
• Mahout: Keeper/Driver of Elephants.
• Current Release: 0.8 (stable)
![Page 18: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/18.jpg)
Need
• BigData
  • Ever-growing data. Yesterday’s methods to process tomorrow’s data
  • Cheap Storage
• Scalable from Ground Up
  • Should be built on top of any existing Distributed Systems framework
  • Should contain distributed versions of ML algorithms
| Size | Classification | Tools |
|---|---|---|
| Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ... |
| KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ... |
| MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael |
| GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout |
![Page 19: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/19.jpg)
Mahout Modules
• Applications
• Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM (Frequent Pattern Mining), Dimension Reduction
• Utilities: Lucene/Vectorizer
• Math: Vectors/Matrices/SVD
• Collections (Primitives)
• Hadoop
![Page 20: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/20.jpg)
Recommender Systems
![Page 21: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/21.jpg)
Recommender Systems
Introduction
• Types of Recommender Systems
  • Content Based Recommendations
  • Collaborative Filtering Recommendations
    • User-User Recommendations
    • Item-Item Recommendations
  • Dimensionality Reduction (SVD) Recommendations
• Applications
  • Products you would like to buy
  • People you might want to connect with
  • Potential Life-Partners
  • Recommending Songs you might like
  • ...
![Page 22: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/22.jpg)
Recommender Systems
Collaborative Filtering in Action
• Assuming people have seen at least one movie. Cold Start?
• 1: seen
• 0: not seen
![Page 23: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/23.jpg)
Collaborative Filtering in Action
Tanimoto Coefficient

$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$

• $N_A$: Number of Customers who bought A
• $N_B$: Number of Customers who bought B
• $N_C$: Number of Customers who bought A and B
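Reading the Tanimoto coefficient as N_C / (N_A + N_B − N_C), it can be computed directly from the sets of customers who bought each item. The customer IDs and the function name below are hypothetical, chosen only for the worked example.

```python
def tanimoto(customers_a, customers_b):
    """N_C / (N_A + N_B - N_C) for two sets of customer IDs."""
    n_a, n_b = len(customers_a), len(customers_b)
    n_c = len(customers_a & customers_b)  # customers who bought both items
    union = n_a + n_b - n_c
    return n_c / union if union else 0.0

# Hypothetical purchase data: who bought item A and who bought item B.
bought_a = {"u1", "u2", "u3", "u4"}
bought_b = {"u3", "u4", "u5"}
print(tanimoto(bought_a, bought_b))  # 2 / (4 + 3 - 2) = 0.4
```

The value is 1 when exactly the same customers bought both items and 0 when no customer bought both.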
![Page 24: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/24.jpg)
Collaborative Filtering in Action
Cosine Coefficient

$$C(a, b) = \frac{N_C}{\sqrt{N_A \cdot N_B}}$$

• $N_A$: Number of Customers who bought A
• $N_B$: Number of Customers who bought B
• $N_C$: Number of Customers who bought A and B
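For binary "bought / did not buy" data, the cosine coefficient is commonly written as N_C / sqrt(N_A × N_B). The sketch below assumes that form; the customer IDs and the function name are hypothetical.

```python
import math

def cosine_coefficient(customers_a, customers_b):
    """N_C / sqrt(N_A * N_B) for two sets of customer IDs."""
    n_a, n_b = len(customers_a), len(customers_b)
    if n_a == 0 or n_b == 0:
        return 0.0  # no purchases for one item: treat similarity as zero
    n_c = len(customers_a & customers_b)  # customers who bought both items
    return n_c / math.sqrt(n_a * n_b)

# Hypothetical purchase data: four buyers each, two in common.
buyers_a = {"u1", "u2", "u3", "u4"}
buyers_b = {"u3", "u4", "u5", "u6"}
print(cosine_coefficient(buyers_a, buyers_b))  # 2 / sqrt(4 * 4) = 0.5
```

Like the Tanimoto coefficient, the result lies between 0 and 1 for this kind of data, but the two measures penalise unequal item popularity differently.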
![Page 25: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/25.jpg)
Apache Mahout
Recommender System Architecture
• Two Modes
  • Stand-alone, non-distributed (“Taste”)
  • Scalable, distributed algorithmic version for Collaborative Filtering
• Top-level Packages
  • Data Model
  • User Similarity
  • Item Similarity
  • User Neighbourhood
  • Recommender
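How the pieces named on this slide might fit together in a user-based recommender can be sketched in a few lines of Python. This is not Mahout's Java API: the function names, the sample data and the choice of Tanimoto similarity are all assumptions made for illustration.

```python
# Data model: user -> set of items the user has seen (hypothetical data).
DATA_MODEL = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}

def user_similarity(model, u, v):
    """Tanimoto similarity between the item sets of two users."""
    common = len(model[u] & model[v])
    union = len(model[u] | model[v])
    return common / union if union else 0.0

def user_neighbourhood(model, user, size):
    """The `size` users most similar to `user`, skipping zero similarity."""
    scored = [(user_similarity(model, user, other), other)
              for other in model if other != user]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, reverse=True)[:size]

def recommend(model, user, neighbourhood_size=2, how_many=3):
    """Rank unseen items by the summed similarity of neighbours who saw them."""
    scores = {}
    for similarity, neighbour in user_neighbourhood(model, user, neighbourhood_size):
        for item in model[neighbour] - model[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:how_many]]

print(recommend(DATA_MODEL, "alice"))  # ['m4', 'm5']
```

Each function stands in for one component: the dictionary is the data model, `user_similarity` the similarity measure, `user_neighbourhood` the neighbourhood selection, and `recommend` the recommender that combines them. A user with no overlapping neighbours, such as `dave` here, gets an empty list, which is the cold-start situation.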
![Page 26: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/26.jpg)
Naive Bayes Classifier
“Order a trial Adobe chicken daily EAB-List new summer savings, welcome!” → Classifier
![Page 27: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/27.jpg)
Naive Bayes Classifier
Naive Bayes is a fairly involved process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training
  • Read the Features
  • Calculate per-Document Statistics
  • Normalize across Categories
  • Calculate the normalizing factor of each label
• Testing
  • Classification (fifth job, explicitly invoked)
![Page 28: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/28.jpg)
K-Means Clustering
Iterations
![Page 29: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/29.jpg)
K-Means Clustering
MapReduce Version
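The slide's diagram is not available in this transcript. One common way to phrase a single k-means iteration as a map step and a reduce step is sketched below in plain Python; it simulates the shuffle in memory and is not Mahout's implementation. All function names and sample values are assumptions.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for every input point."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(point, centroids[i]))
        yield nearest, point

def shuffle(pairs):
    """Group mapper output by key, as the framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, centroids):
    """Reducer: the new centroid for each key is the mean of its points."""
    updated = list(centroids)  # centroids with no points stay where they are
    for key, members in groups.items():
        updated[key] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def iterate(points, centroids):
    """One full iteration: map, shuffle, reduce."""
    return reduce_phase(shuffle(map_phase(points, centroids)), centroids)

# Hypothetical data: two groups of two points each.
sample = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(iterate(sample, [(0, 0), (10, 10)]))  # [(0.0, 1.0), (10.0, 11.0)]
```

In a distributed setting, a driver program would typically run one such job per iteration, feeding the centroids produced by one job into the next until they stop changing.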
![Page 30: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/30.jpg)
Summary
• Machine Learning
  • Learning Algorithms
  • Varied Applications
• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source
![Page 31: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/31.jpg)
More Info
1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl, RecSys 2012.
2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson, Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
3. http://mahout.apache.org/ - Apache Mahout Project Page
4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.
![Page 32: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/32.jpg)
Questions?
![Page 33: Machine Learning and Apache Mahout : An Introduction](https://reader033.vdocuments.mx/reader033/viewer/2022052303/540dea9b8d7f728d7e8b4b5b/html5/thumbnails/33.jpg)
Thank You
Go BigData!!!
© Varad Meru, 2014