mahout classification presentation

Post on 27-Jan-2015

DESCRIPTION

These slides were presented in class on April 7th, 2014.

TRANSCRIPT

Classification on Mahout

Naoki Nakatani

San Jose State University

CS185C Spring 2014

Agenda

● Classification Overview
● Mahout Overview
  ○ Classification on Mahout
● Case Study with Demo
  ○ Problem Description
  ○ Working Environment
  ○ Data Preparation
  ○ ML Model Generation

Classification?
● Classifying examples into a given set of categories
● Supervised learning
  ○ Prepare data
  ○ Build classifier (train & test)
  ○ Apply classifier to new data

http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg

Mahout?
● Scalable machine learning library = Can handle Big Data
● Runs on Hadoop, with data stored in HDFS
● Classification, Clustering, Collaborative Filtering, etc.

http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png

Classification on Mahout?
● Classification: classifying examples into a given set of categories
● Mahout: scalable machine learning library that can handle big data
● Classification on Mahout: classifying big data into a given set of categories

Case Study & Demo

Given a question with a title and body, can we automatically generate tags for it?

Where can I find the LaTeX3 manual?

Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web.

Does anyone have a link?

Documentation

latex3

expl3

Dataset

File:
● TrainSmall.tsv

Fields:
● id, title, body, tags

Characteristics:
● Each question contains only one tag
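A minimal sketch of reading a file with this layout, assuming tab-separated columns in the order id, title, body, tags and no header row. The sample rows and the `read_questions` helper are made up for illustration and are not taken from TrainSmall.tsv.

```python
import csv
import io
from collections import Counter

def read_questions(handle):
    """Yield one dict per row of a TSV stream with columns id, title, body, tags."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows instead of guessing at their layout
        qid, title, body, tag = row
        yield {"id": qid, "title": title, "body": body, "tag": tag.strip()}

# Hypothetical sample data standing in for TrainSmall.tsv.
sample = io.StringIO(
    "1\tWhere can I find the LaTeX3 manual?\tDoes anyone have a link?\tlatex3\n"
    "2\tHow do I merge two dicts?\tI have two dictionaries.\tpython\n"
    "3\tWhat is a list comprehension?\tShort syntax question.\tpython\n"
)
questions = list(read_questions(sample))

# Count how many questions carry each tag (each question has exactly one).
tag_counts = Counter(q["tag"] for q in questions)
print(tag_counts.most_common())
```

Counting tags this way is also a quick check on how balanced the categories are before training.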


Working Environment

● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.

Prerequisite (Where are you?)
● You have the input TSV file at result > output-topfivetags.
● You are in the “result” directory in Terminal.
● The “hadoop” and “mahout” commands are working.

Prepare Data
1. Convert the TSV file to Hadoop sequence file format.

Specify the tag as the category. (Run TSVToSeq.java)

The output-tsvtoseq folder and the chunk-0 file are created.
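TSVToSeq.java itself is not shown in the slides, so the following is only a sketch of the record layout such a converter might produce. It assumes each sequence file record uses a key of the form `/<tag>/<id>` (a convention often used with Mahout's Naive Bayes tools, where the category is read from the key) and a value made of the title followed by the body. The function names are hypothetical.

```python
def to_key_value(qid, title, body, tag):
    """Build a (key, value) pair for one question; the tag becomes the category in the key."""
    key = "/{}/{}".format(tag, qid)        # assumed key layout: /<tag>/<id>
    value = "{} {}".format(title, body)    # text that will later be vectorized
    return key, value

def label_from_key(key):
    """Recover the category from a key of the form /<tag>/<id>."""
    return key.split("/")[1]

key, value = to_key_value(
    "42",
    "Where can I find the LaTeX3 manual?",
    "Does anyone have a link?",
    "latex3",
)
print(key)
print(label_from_key(key))
```

Putting the category in the key keeps the text value free of label information, so the vectorizer never sees the answer it is supposed to predict.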

Prepare Data
1. Make a directory in HDFS and upload chunk-0 (the sequence file) to it.

hadoop fs -mkdir <directory>

hadoop fs -put <source> <destination>

Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)

mahout seq2sparse -i <input directory> -o <output directory>
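To make the idea of turning questions into vectors concrete, here is a small sketch of TF-IDF weighting over a term dictionary. It assumes the simple variant tf * log(N / df), which is one common formulation and not necessarily the exact formula or tokenizer that seq2sparse applies. The documents and helper names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return (dictionary, vectors); each vector maps term index -> tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    dictionary = {t: i for i, t in enumerate(terms)}
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({dictionary[t]: c * math.log(n / df[t]) for t, c in tf.items()})
    return dictionary, vectors

docs = [
    "Where can I find the LaTeX3 manual",
    "Where is the Python manual",
    "LaTeX3 syntax",
]
dictionary, vectors = tfidf_vectors(docs)
print(len(dictionary), vectors[2])
```

Each vector is sparse: only terms that occur in the question get an entry, which is what keeps large text collections manageable.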

Prepare Data
3. Split data into
a. Train set: to train model
b. Test set: to test model

mahout split \
  -i <input directory> \
  --trainingOutput <output dir to train> \
  --testOutput <output dir to test> \
  --randomSelectionPct <integer> \
  --overwrite \
  --sequenceFiles \
  -xm sequential
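The effect of a random percentage split can be sketched as follows, assuming the percentage is the share of items held out for testing (my reading of --randomSelectionPct). The `split_train_test` helper and the seed are illustrative and not part of Mahout.

```python
import random

def split_train_test(items, test_pct, seed=0):
    """Randomly assign roughly test_pct percent of items to the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for item in items:
        # Each item independently lands in the test set with probability test_pct / 100.
        (test if rng.random() * 100 < test_pct else train).append(item)
    return train, test

train, test = split_train_test(list(range(1000)), test_pct=20, seed=42)
print(len(train), len(test))
```

Because each item is drawn independently, the test set size is only approximately the requested percentage, and no item appears in both sets.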

Build Classifier
1. Choose algorithm to use for classification

Available algorithms:
○ Naive Bayes
  ■ trainnb, testnb
  ■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
  ■ baumwelch, hmmpredict
  ■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
  ■ trainlogistic, testlogistic
  ■ org.apache.mahout.classifier.sgd
○ Random Forest
  ■ ?
  ■ ?

2. Train & test model using train set

Should yield high accuracy

Build Classifier (Naive Bayes)

mahout trainnb \
  -i <dir to train vectors> \
  -el \
  -li <dir to put label index> \
  -o <dir to put model> \
  -ow \
  -c

mahout testnb \
  -i <dir to train vectors> \
  -m <dir to model> \
  -l <dir to label index> \
  -ow \
  -o <output dir> \
  -c

Build Classifier (Naive Bayes)
3. Test model using test set

Check if the accuracy is satisfactory
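What training and testing a Naive Bayes text classifier involves can be sketched in a few lines. This is a plain multinomial Naive Bayes with additive smoothing, written for illustration; it is not Mahout's implementation, and in particular it is not the complementary variant selected by the -c flag. The tiny token lists are made-up data.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Train multinomial Naive Bayes on (label, tokens) pairs with additive smoothing."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(label_counts.values())
    model = {"vocab": vocab, "alpha": alpha, "prior": {}, "terms": term_counts, "totals": {}}
    for label, count in label_counts.items():
        model["prior"][label] = math.log(count / total)
        model["totals"][label] = sum(term_counts[label].values())
    return model

def predict_nb(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    v, alpha = len(model["vocab"]), model["alpha"]
    best_label, best_score = None, float("-inf")
    for label, prior in model["prior"].items():
        score = prior
        for t in tokens:
            if t not in model["vocab"]:
                continue  # ignore terms never seen in training
            count = model["terms"][label][t]
            score += math.log((count + alpha) / (model["totals"][label] + alpha * v))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for label, tokens in examples if predict_nb(model, tokens) == label)
    return correct / len(examples)

train_set = [
    ("latex3", ["latex3", "manual", "pdf"]),
    ("latex3", ["expl3", "syntax", "latex3"]),
    ("python", ["python", "dict", "merge"]),
    ("python", ["python", "list", "syntax"]),
]
test_set = [("latex3", ["latex3", "pdf"]), ("python", ["merge", "python", "list"])]

model = train_nb(train_set)
print("train accuracy:", accuracy(model, train_set))
print("test accuracy:", accuracy(model, test_set))
```

Accuracy on the train set mostly shows that the model fit the data at all; accuracy on the held-out test set is the number that indicates how it may behave on new questions.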

Apply Classifier

What do you have at this point?
● model
● label index

You can start classifying new data! (Check this example)
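The role of the label index can be illustrated with a short sketch. It assumes the label index is a mapping from each category name to an integer, and that the model produces one score per integer index; the scores and helper names below are hypothetical.

```python
def build_label_index(labels):
    """Assign a stable integer index to each distinct label."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def best_label(scores, label_index):
    """Map the highest-scoring index back to its label."""
    index_to_label = {i: label for label, i in label_index.items()}
    best = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_label[best]

label_index = build_label_index(["latex3", "python", "documentation", "python"])
scores = [-12.5, -3.1, -9.8]  # hypothetical per-label scores from a model
print(label_index)
print(best_label(scores, label_index))
```

The model alone only yields numbers per index, which is why both the model and the label index are needed to turn a prediction back into a tag.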

Model

Label Index

Happy Machine Learning!
