Mahout Introduction: BarCamp DC
An introduction to Apache Mahout presented at Apache BarCamp DC, May 19, 2012. A brief introduction to the examples and links to more resources for further exploration.
Learning with Mahout
About me
Drew Farris
▪ Committer to Apache Mahout since 2/2010
▪ ..not as active in the past year
Author: Taming Text
My Company: (and BarCamp DC Sponsor)
What is Mahout?
Mahout (as in hoot) or Mahout (as in trout)?
A scalable machine learning library
What is Mahout?
A scalable machine learning library
▪ ‘large’ data sets
▪ Often Hadoop ..but sometimes not
▪ Recommendation Mining
▪ Clustering
▪ Classification
▪ Association Mining
A reasonable linear algebra library
A reasonable library of collections
Other Stuff
Mahout
Getting Started
Check out & build the code
▪ git clone git://git.apache.org/mahout.git
▪ mvn install -DskipTests=true
▪ The tests take a looong time to run, not needed for initial build
Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
Mahout
Getting Started
Check out & build the code
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
▪ Grant’s IBM developerWorks Article: http://ibm.co/LUbptg (Nov 2011)
▪ Others @ http://bit.ly/IZ6PqE (wiki)
Mailing Lists
▪ [email protected] (http://bit.ly/L1GSHB)
▪ [email protected] (http://bit.ly/JPeNoE)
Books!
▪ Mahout in Action: http://bit.ly/IWMvaz
▪ Taming Text: http://bit.ly/KkODZV
Mahout Examples
Kicking the Tires in examples/bin
▪ classify-20newsgroups.sh
▪ cluster-reuters.sh
▪ cluster-syntheticcontrol.sh
▪ asf-email-examples.sh
Mahout Examples
Kicking the Tires in examples/bin
classify-20newsgroups.sh
▪ Premise: Classify News Stories
▪ Algorithm: sgd
▪ Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
Mahout Examples
Kicking the Tires in examples/bin
cluster-reuters.sh
▪ Premise: Group Related News Stories
▪ Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
Mahout Examples
Kicking the Tires in examples/bin
cluster-syntheticcontrol.sh
▪ Premise: Cluster time series data
▪ normal, cyclic, increasing, decreasing, upward, downward shift
▪ Algorithms: canopy, kmeans, fuzzykmeans, dirichlet, meanshift
See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
Mahout Examples
Kicking the Tires in examples/bin
asf-email-examples.sh
▪ Recommendation (user based)
▪ Clustering (kmeans, dirichlet, minhash)
▪ Classification (naïve bayes, sgd)
Learning Outline
General Outline:
Data Transformation
▪ From Native format to…
▪ ..Sequence Files; Typed Key, Value pairs
▪ ..Labeled Vectors
Model Training
Model Evaluation
Lather, Rinse, Repeat
Production
Lather, Rinse, Repeat
Text to Sparse Vectors
mahout seq2sparse
▪ Tokenize Documents
▪ Count Words
▪ Make Partial/Merge Vectors
▪ TFIDF
▪ Make Partial/Merge TFIDF Vectors
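The tokenize, count, and TFIDF steps above can be sketched in a few lines of plain Python. This is an in-memory illustration only, not Mahout's implementation: the function names tokenize and tfidf_vectors are made up for this sketch, and the weight is the textbook tf * log(N / df), which may differ from the exact formula seq2sparse applies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} dictionary."""
    # Term counts per document (the "Count Words" step).
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Textbook weighting (an assumption): tf * log(N / df).
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


# Hypothetical three-document corpus.
docs = [
    "Mahout builds sparse vectors",
    "Mahout clusters sparse data",
    "Hadoop stores data",
]
for vec in tfidf_vectors(docs):
    print(sorted(vec.items()))
```

Terms that occur in every document get weight zero under this formula, which is why only the non-zero entries need to be kept in a sparse vector.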
Tips
View Sequence Files with:
▪ mahout seqdumper -i /path/to/sequence/file
Check out shortcuts in:
▪ src/conf/driver.classes.props
Run classes with:
▪ mahout org.apache.mahout.SomeCoolNewFeature …
Standalone vs. Distributed
▪ Standalone mode is default
▪ Set HADOOP_CONF_DIR to use Hadoop
▪ MAHOUT_LOCAL will force standalone
Example: Recommendation
asf-email-examples.sh (recommendation)
Premise: Recommend Interesting Threads
User based recommendation
Boolean preferences based on thread contribution
Implies boolean similarity measure: tanimoto, log-likelihood
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
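To make "boolean preferences" and the Tanimoto measure concrete, here is a minimal sketch. The user and thread names are hypothetical, and the helper names boolean_preferences and tanimoto are illustrative rather than Mahout APIs. A user either contributed to a thread or did not, so each user reduces to a set of threads, and the Tanimoto coefficient is the size of the intersection divided by the size of the union.

```python
def boolean_preferences(contributions):
    """Turn (user, thread) pairs into {user: set of threads}."""
    prefs = {}
    for user, thread in contributions:
        prefs.setdefault(user, set()).add(thread)
    return prefs


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical contributions; posting twice to a thread still counts once.
contributions = [
    ("alice", "t1"), ("alice", "t2"), ("alice", "t2"),
    ("bob", "t2"), ("bob", "t3"),
    ("carol", "t4"),
]
prefs = boolean_preferences(contributions)
print(tanimoto(prefs["alice"], prefs["bob"]))    # one shared thread out of three
print(tanimoto(prefs["alice"], prefs["carol"]))  # no shared threads
```

Because the preferences carry no rating values, measures that need magnitudes (such as Pearson correlation) have nothing to work with, which is why set-overlap measures are the natural fit here.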
Recommendation Process
Recommendation Steps
▪ Convert Mail to Sequence Files
▪ Convert Sequence Files to Preferences
▪ Prepare Preference Matrix
▪ Row Similarity Job
▪ Recommender Job
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
Example: Classification
asf-email-examples.sh (classification)
Premise: Predict project mailing lists for incoming messages
Data labeled based on the mailing list it arrived on
Hold back a random 20% of data for testing, the rest for training.
Algorithms: Naïve Bayes (Standard, Complementary), SGD
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
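The random 20% hold-back can be sketched as follows. This is an in-memory illustration under assumed names (split_train_test, the list-N labels, and the fixed seed are all hypothetical); the example script performs the split on sequence files rather than Python lists.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold back a fraction for testing."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (label, text) pairs; the label stands in for a mailing list name.
examples = [("list-%d" % (i % 3), "message %d" % i) for i in range(10)]
train, test = split_train_test(examples)
print(len(train), len(test))
```

Holding the test set back before training matters: evaluating on messages the model has already seen would overstate its accuracy.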
Classification Process
Classification Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Modify Sequence File Labels
▪ Split into Training and Test Sets
▪ Train the Model
▪ Test the Model
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
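For the "Test the Model" step, a rough sketch of what evaluation boils down to: compare the predicted label with the actual label for each held-back message, then summarize. The labels below are hypothetical, and accuracy and confusion_matrix are illustrative helper names, not Mahout classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of test examples whose predicted label matches the actual one."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for five held-back messages.
actual = ["user", "user", "dev", "dev", "dev"]
predicted = ["user", "dev", "dev", "dev", "user"]

print(accuracy(actual, predicted))
for (a, p), n in sorted(confusion_matrix(actual, predicted).items()):
    print("actual=%s predicted=%s count=%d" % (a, p, n))
```

The confusion matrix is usually more informative than the single accuracy number, since it shows which mailing lists get mistaken for which.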
Example: Clustering
asf-email-examples.sh (clustering)
Premise: Grouping Messages by Subject
Same Prep as Classification
Different Algorithms: (kmeans, dirichlet, minhash)
12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334)
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
Clustering Process
Clustering Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Run Clustering (iterate)
▪ Dump Results
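The "Run Clustering (iterate)" step is easiest to picture with plain k-means: assign every point to its nearest centroid, recompute the centroids, and repeat until nothing moves. The sketch below is a single-machine illustration with made-up 2-D points and hand-picked starting centroids; it is not Mahout's distributed implementation, and the function names are assumptions for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, initial_centroids, max_iterations=10):
    """Assign points to the nearest centroid, recompute, repeat until stable."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged, stop iterating
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two well-separated groups, one seed point taken from each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, [points[0], points[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each pass over the points corresponds to one iteration of the clustering job, which is why long runs on large mail archives are dominated by the number of iterations needed to converge.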
Where to now?
Insert Bar Camp Style Discussion Here
Resources
Mahout in Action
▪ Owen, Anil, Dunning and Friedman
▪ http://bit.ly/IWMvaz
Taming Text
▪ Ingersoll, Morton and Farris
▪ http://bit.ly/KkODZV