mahout part2
Part two of a presentation about the Mahout system. It is based on http://my.safaribooksonline.com/9781935182689/
Mahout in Action, Part 2
Yasmine M. Gaber, 4 April 2013
Agenda
Part 2: Clustering
Part 3: Classification
Clustering
An algorithm
A notion of both similarity and dissimilarity
A stopping condition
Measuring the similarity of items
Euclidean Distance
Creating the input
Preprocess the data
Use that data to create vectors
Save the vectors in SequenceFile format as input for the algorithm
Using Mahout clustering
The SequenceFile containing the input vectors.
The SequenceFile containing the initial cluster centers.
The similarity measure to be used.
The convergenceThreshold.
The number of iterations to be done.
The Vector implementation used in the input files.
Distance measures
Euclidean distance measure
Squared Euclidean distance measure
Manhattan distance measure
Cosine distance measure
Tanimoto distance measure
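The five measures listed above can be sketched in a few lines of plain Python. This is only an illustration of the usual textbook definitions, not Mahout's Java implementation; the function names are hypothetical, and the cosine and Tanimoto versions assume non-zero vectors of equal length.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same ordering as Euclidean distance, without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b) / (|a|^2 + |b|^2 - a.b); assumes the vectors are not both zero.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Cosine distance ignores vector length and looks only at direction, which is why it is a common choice for text vectors of differing document length.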
Playing Around
Representing data
Representing text documents as vectors
Vector Space Model (VSM)
TF-IDF
N-gram collocations
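To make the three ideas above concrete, here is a minimal Python sketch of TF-IDF weighting and n-gram generation. It uses the common textbook weight tf * log(N / df) and whitespace tokenization, both assumptions; the exact formula and analyzer used by Mahout may differ, and Mahout's collocation step additionally filters n-grams, which this sketch does not do. All names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    # All runs of n consecutive tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(documents):
    # Each document becomes a sparse vector: {term: weight}.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

docs = ["bank raises rates", "bank cuts rates", "oil market falls"]
vecs = tfidf_vectors(docs)
```

A term that occurs in every document gets weight 0 under this formula, so very common words contribute nothing to the vector.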
Generating vectors from documents
$ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
Improving quality of vectors using normalization
P-norm
$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
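P-norm normalization divides every component of a vector by the vector's p-norm, so that long and short documents become comparable. A minimal Python sketch follows; the function names are hypothetical, and p = 2 is used as the default here to mirror the 2-norm.

```python
def p_norm(vector, p):
    # (sum |x|^p)^(1/p); p = 1 is the Manhattan norm, p = 2 the Euclidean norm.
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)

def normalize(vector, p=2):
    # Scale the vector so that its p-norm is 1; a zero vector is returned unchanged.
    norm = p_norm(vector, p)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]
```

For example, `normalize([3, 4], 2)` scales the vector by 1/5, and `normalize([1, 3], 1)` scales it by 1/4.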
Clustering Categories
Exclusive clustering
Overlapping clustering
Hierarchical clustering
Probabilistic clustering
Clustering Approaches
Fixed number of centers
Bottom-up approach
Top-down approach
Clustering algorithms
K-means clustering
Fuzzy k-means clustering
Dirichlet clustering
k-means clustering algorithm
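The k-means loop can be sketched in plain Python as below. This is an in-memory illustration of the general algorithm (assign each point to its nearest center, move each center to the mean of its points, stop when centers barely move or the iteration limit is hit), not Mahout's MapReduce implementation. The names `kmeans`, `max_iterations`, and `convergence_delta` are hypothetical, and at least one iteration is assumed.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, max_iterations=20, convergence_delta=0.01):
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for members, old in zip(clusters, centers):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        # Stopping condition: the largest center movement is small enough.
        moved = max(math.sqrt(squared_distance(o, n))
                    for o, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return centers, clusters
```

The three ingredients named at the start of the clustering section are all visible here: the algorithm (the loop), the similarity notion (`squared_distance`), and the stopping condition (`convergence_delta` or `max_iterations`).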
Running k-means clustering
$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
$ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl
$ bin/mahout clusterdump -dt sequencefile -d reuters-vectors/dictionary.file-* -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
Fuzzy k-means clustering
Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set.
Also known as the fuzzy c-means algorithm.
Running fuzzy k-means clustering
$ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
Fuzziness factor
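The fuzziness factor m controls how strongly a point is shared between clusters. A minimal Python sketch of the standard fuzzy c-means membership formula, u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)), is shown below. The function name is hypothetical, Euclidean distance is an assumption, m must be greater than 1, and `math.dist` requires Python 3.8 or later.

```python
import math

def memberships(point, centers, m=2.0):
    # Degree of membership of one point in each cluster; the degrees sum to 1.
    distances = [math.dist(point, c) for c in centers]
    # A point sitting exactly on one or more centers belongs fully to them.
    if any(d == 0 for d in distances):
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
            for d_i in distances]
```

As m approaches 1 the memberships approach the hard 0/1 assignment of k-means; larger values of m spread each point more evenly across clusters.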
Dirichlet clustering
A model-based clustering algorithm
Running Dirichlet clustering
$ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
Evaluating and improving clustering quality
Inspecting clustering output
Evaluating the quality of clustering
Improving clustering quality
Inspecting clustering output
$ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10
Top Terms:
said => 11.60126582278481
bank => 5.943037974683544
dollar => 4.89873417721519
market => 4.405063291139241
us => 4.2594936708860756
banks => 3.3164556962025316
pct => 3.069620253164557
he => 2.740506329113924
rates => 2.7151898734177213
rate => 2.7025316455696204
Analyzing clustering output
Distance measure and feature selection
Inter-cluster and intra-cluster distances
Mixed and overlapping clusters
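One simple way to put numbers on inter-cluster and intra-cluster distances is sketched below in Python. It assumes Euclidean distance, measures intra-cluster distance as the mean distance of members to their centroid, and measures inter-cluster distance as the mean distance between centroid pairs; other definitions are possible, and all names are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def intra_cluster_distance(points):
    # Mean distance from each member to the cluster centroid (smaller is tighter).
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def inter_cluster_distance(clusters):
    # Mean distance between all pairs of centroids (larger is better separated).
    centers = [centroid(c) for c in clusters]
    pairs = list(combinations(centers, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```

A good clustering tends to show small intra-cluster distances relative to the inter-cluster distance.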
Improving clustering quality
Improving document vector generation
Writing a custom distance measure
Real-world applications of clustering
Clustering like-minded people on Twitter
Suggesting tags for an artist on Last.fm using clustering
Creating a related-posts feature for a website
Classification
Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses.
Applications of classification, e.g. spam filtering
Why use Mahout for classification?
How classification works
Training versus test versus production
Predictor variables versus target variable
Records, fields, and values
Types of values for predictor variables
Continuous
Categorical
Word-like
Text-like
Classification workflow
Training the model
Evaluating the model
Using the model in production
Stage 1: training the classification model
Stage 2: evaluating the classification model
Stage 3: using the model in production
Stage 1: training the classification model
Define Categories for the Target Variable
Collect Historical Data
Define Predictor Variables
Select a Learning Algorithm to Train the Model
Use Learning Algorithm to Train the Model
Extracting features to build a Mahout classifier
Preprocessing raw data into classifiable data
Converting classifiable data into vectors
Use one Vector cell per word, category, or continuous value
Represent Vectors implicitly as bags of words
Use feature hashing
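Feature hashing maps each word to a vector cell by hashing it, so no dictionary is needed and the vector size is fixed up front. The Python sketch below illustrates the idea only; it is not Mahout's encoder API, the choice of MD5 and the default of 16 buckets are arbitrary assumptions, and the names are hypothetical.

```python
import hashlib

def hash_index(feature, num_buckets):
    # Stable hash of the feature name, reduced to a cell index.
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def hashed_vector(tokens, num_buckets=16):
    # Fixed-size vector; colliding words simply add into the same cell.
    vec = [0.0] * num_buckets
    for token in tokens:
        vec[hash_index(token, num_buckets)] += 1.0
    return vec
```

The trade-off is that distinct words can collide in one cell; a larger `num_buckets` makes collisions rarer at the cost of a longer vector.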
Classifying the 20 newsgroups data set
Choosing an algorithm
The classifier evaluation API
Percent correct
Confusion matrix
Entropy matrix
AUC
Log likelihood
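Two of the metrics above, percent correct and the confusion matrix, are easy to compute by hand. The Python sketch below shows the general idea and is not Mahout's evaluation API; the function names and the spam/ham labels are hypothetical.

```python
from collections import Counter

def percent_correct(actual, predicted):
    # Share of records whose predicted label equals the actual label.
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * hits / len(actual)

def confusion_matrix(actual, predicted):
    # Rows are actual labels, columns are predicted labels, both in sorted order.
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

actual = ["spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham"]
labels, matrix = confusion_matrix(actual, predicted)
```

Off-diagonal cells show which categories get confused with each other, which percent correct alone cannot reveal.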
When classifiers go bad
Target leaks
Broken feature extraction
Tuning the problem
Remove Fluff Variables
Add New Variables, Interactions, and Derived Values
Tuning the classifier
Try Alternative Algorithms
Tune the Learning Algorithm
Thank You
Contact at:
Email: [email protected]
Twitter: Twitter.com/yasmine_mohamed