text classification powered by apache mahout and lucene

Text classification with Apache Mahout and Lucene

Isabel Drost-Fromm

Software Engineer at Nokia Maps*

Member of the Apache Software Foundation

Co-Founder of Berlin Buzzwords and Berlin Apache Hadoop GetTogether

Co-founder of Apache Mahout

*We are hiring, talk to me or mail careers@here.com

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

… provide your own success story online.

Classification?

January 8, 2008 by Pink Sherbet Photography, http://www.flickr.com/photos/pinksherbet/2177961471/

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Image by jasondevilla, http://www.flickr.com/photos/jasondv/91960897/

How a linear classifier sees data

Image by ZapTheDingbat (Light meter), http://www.flickr.com/photos/zapthedingbat/3028168415

Instance*

(sometimes also called example, item, or in databases a row)

Feature*

(sometimes also called attribute, signal, predictor, co-variate, or column in databases)

Label*

(sometimes also called class, target variable)

Image taken in Lisbon/ Portugal.

Image by jasondevilla, http://www.flickr.com/photos/jasondv/91960897/

● Remove noise.

● Convert text to vectors.

Text consists of terms and phrases.

Encoding issues?

Chinese? Japanese?

“New York” vs. new York?

“go” vs. “going” vs. “went” vs. “gone”?

“go” vs. “Go”?

Terms? Tokens? Wait!
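The questions above are all about how raw text becomes terms. As a minimal sketch of the simplest possible answer (the function name `tokenize` and the case-folding flag are assumptions made for illustration, not part of any library discussed here), a regex tokenizer with optional lowercasing might look like this. It deliberately does not solve stemming ("go" vs. "went"), phrases such as "New York", or Chinese/Japanese segmentation; those are the kinds of problems an analyzer chain is meant to handle.

```python
import re

def tokenize(text, lowercase=True):
    """Split text into word tokens; optionally fold case so 'Go' == 'go'."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenize("Go to New York"))                   # case folded
print(tokenize("Go to New York", lowercase=False))  # case preserved
```

With case folding, "Go" and "go" collapse into one term, but "New York" is still split into two unrelated tokens.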

Now we have terms – how to turn them into vectors?

Sunny weather

High performance computing

If we looked at two phrases only:

Binary bag of words

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Entry in vector is one, if word occurs in text.

● Problem:

– How to know all possible terms in unknown text?

$b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$
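A minimal sketch of the binary bag-of-words encoding, using the two example phrases from the earlier slide. The helper names `build_vocabulary` and `binary_vector` are hypothetical, and whitespace splitting stands in for real tokenization. The sketch also makes the stated problem visible: terms that were not in the vocabulary when it was built are silently dropped.

```python
def build_vocabulary(docs):
    """Assign each distinct term in the known documents one dimension."""
    vocab = {}
    for doc in docs:
        for term in doc.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def binary_vector(doc, vocab):
    """Entry i is 1 if term i occurs in doc, else 0; unknown terms are ignored."""
    vec = [0] * len(vocab)
    for term in doc.split():
        if term in vocab:
            vec[vocab[term]] = 1
    return vec

docs = ["sunny weather", "high performance computing"]
vocab = build_vocabulary(docs)
print(binary_vector("sunny sunny computing", vocab))  # → [1, 0, 0, 0, 1]
print(binary_vector("rainy day", vocab))              # → [0, 0, 0, 0, 0]
```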

Term Frequency

● Entry in vector equal to the word's frequency.

● Problem:

– Common words dominate vectors.

$b_{i,j} = n_{i,j}$
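As a small illustration of term-frequency vectors (the function name `tf_vector` and the fixed four-word vocabulary are assumptions for this sketch), counting occurrences shows how a common word such as "the" ends up with the largest entry:

```python
from collections import Counter

def tf_vector(doc, vocab):
    """Entry i is the number of times vocab[i] occurs in doc."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vocab = ["the", "weather", "is", "sunny"]
print(tf_vector("the weather is the weather", vocab))  # → [2, 2, 1, 0]
print(tf_vector("the sun and the sea and the sky", vocab))
```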

TF with stop wording

● Filter stopwords.

● Entry in vector equal to the word's frequency.

● Problem:

– Common and uncommon words with same weight.

$b_{i,j} = n_{i,j}$
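A sketch of term frequency with stop-word filtering. The stop-word set here is a tiny made-up list for illustration only, and `tf_without_stopwords` is a hypothetical name. After filtering, every remaining term still counts the same per occurrence, whether it is rare or common across the collection.

```python
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "and", "of"}

def tf_without_stopwords(doc):
    """Count term occurrences after lowercasing and dropping stop words."""
    return Counter(t for t in doc.lower().split() if t not in STOPWORDS)

print(tf_without_stopwords("The weather is the weather"))
```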

TF-IDF

● Filter stopwords.

● Entry in vector equal to the weighted frequency.

● Problem:

– Long texts get larger values.

$b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$
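The weighting above can be sketched directly from the formula: term count times the log of the number of documents divided by the number of documents containing the term. The function name `tfidf` and the three sample documents are assumptions for illustration, and no length normalization is applied, so the "long texts get larger values" problem remains.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, weight = n_ij * log(|D| / df_i)."""
    tokenized = [doc.split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # each document counts once per term
    n_docs = len(docs)
    return [
        {term: n * math.log(n_docs / df[term]) for term, n in Counter(tokens).items()}
        for tokens in tokenized
    ]

docs = ["sunny weather", "sunny sunny beach", "high performance computing"]
weights = tfidf(docs)
print(weights[0])  # 'weather' (1 of 3 docs) outweighs 'sunny' (2 of 3 docs)
```

A term that occurs in every document gets weight zero, since log(1) = 0.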

Hashed feature vectors

● Each word in texts = hashed to one dimension.

● Entry in vector set to one, if word hashed to it.
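A minimal sketch of hashed feature vectors, assuming CRC32 as the hash and 16 dimensions (both arbitrary choices for this example; `hashed_vector` is a hypothetical helper, not a Mahout class). No vocabulary is needed, so previously unseen words still land in some dimension; the price is that two different words can collide in the same slot.

```python
import zlib

def hashed_vector(doc, num_dims=16):
    """Hash each word to one of num_dims slots and set that slot to one."""
    vec = [0] * num_dims
    for term in doc.split():
        index = zlib.crc32(term.encode("utf-8")) % num_dims
        vec[index] = 1
    return vec

print(hashed_vector("sunny weather"))
print(hashed_vector("words never seen during training"))
```

`zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would make vectors irreproducible between runs.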

How a linear classifier sees data

[Diagram: HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream + x → FeatureVectorEncoder → Vector → OnlineLearner → Model]
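The pipeline above ends in an online learner that updates a model one vector at a time. As a rough, library-free illustration of that idea (this is not Mahout's or Lucene's API; `encode`, `ToyOnlineLearner`, the learning rate, and the feature-space size are all assumptions made for this sketch), logistic regression trained by stochastic gradient descent on hashed token vectors could look like this:

```python
import math
import zlib

NUM_DIMS = 1024  # assumed size of the hashed feature space

def encode(text):
    """Lowercase, split on whitespace, and hash tokens into a sparse vector."""
    vec = {}
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % NUM_DIMS] = 1.0
    return vec

class ToyOnlineLearner:
    """Logistic regression updated with one SGD step per training example."""

    def __init__(self, num_dims, learning_rate=0.5):
        self.weights = [0.0] * num_dims
        self.bias = 0.0
        self.learning_rate = learning_rate

    def score(self, vec):
        z = self.bias + sum(self.weights[i] * v for i, v in vec.items())
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, label, vec):
        error = label - self.score(vec)  # gradient of the log-likelihood
        self.bias += self.learning_rate * error
        for i, v in vec.items():
            self.weights[i] += self.learning_rate * error * v

examples = [(1, "sunny weather today"), (0, "high performance computing"),
            (1, "sunny beach weather"), (0, "computing cluster performance")]
model = ToyOnlineLearner(NUM_DIMS)
for _ in range(20):  # several passes over the tiny training set
    for label, text in examples:
        model.train(label, encode(text))

print(model.score(encode("sunny weather")))          # close to 1
print(model.score(encode("performance computing")))  # close to 0
```

Because each update touches only the features present in one example, the model never needs the whole data set in memory.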

Image by ZapTheDingbat (Light meter), http://www.flickr.com/photos/zapthedingbat/3028168415

● Did I use the best model parameters?

● How well will my model perform in the wild?

Tune model parameters,

Experiment with tokenization,

Experiment with vector encoding

Compute expected performance

Performance

● Use same data for training and testing.

● Problem:

– Highly optimistic.

– Model generalization unknown.

Performance

● Use just a fraction for training.

● Set some data aside for testing.

● Problems:

– Pessimistic predictor: Not all data used for training.

– Result may depend on which data was set aside.
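A hold-out split as described above can be sketched in a few lines. The name `holdout_split`, the 20% test fraction, and the fixed seed are assumptions for this example. Changing the seed changes which items are set aside, which is exactly why the measured result may depend on the split.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and set a fraction aside for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(10))
print(len(train), len(test))  # → 8 2
```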

Performance

● Partition your data into n fractions.

● Each fraction set aside for testing in turn.

● Problem:

– Still a pessimistic predictor.
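The partitioning scheme above is commonly called n-fold cross-validation. A minimal sketch, assuming the data has already been shuffled (the generator name `cross_validation_folds` is hypothetical):

```python
def cross_validation_folds(data, n_folds):
    """Yield (train, test) pairs; every item is in exactly one test fold."""
    data = list(data)
    for k in range(n_folds):
        test = data[k::n_folds]  # every n-th item, starting at offset k
        train = [x for i, x in enumerate(data) if i % n_folds != k]
        yield train, test

for train, test in cross_validation_folds(range(6), 3):
    print("train:", train, "test:", test)
```

Averaging the score over all folds reduces the dependence on any single split, although each model is still trained on less than the full data set.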

Performance

● Set some data aside for tuning and testing.

● Problems:

– Parameters manually tuned to testing data.

Performance

● Set some data aside for tuning.

● Set another set of data aside for testing.

● Problems:

– Pretty pessimistic as not all data is used.

– May depend on which data was set aside.
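A three-way split keeps the data used for tuning separate from the data used for the final test. The sketch below is one way to do it; `three_way_split`, the 20% fractions, and the seed are assumptions for illustration.

```python
import random

def three_way_split(data, tune_fraction=0.2, test_fraction=0.2, seed=7):
    """Return disjoint (train, tune, test) lists from a shuffled copy of data."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_tune = int(len(shuffled) * tune_fraction)
    n_test = int(len(shuffled) * test_fraction)
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]
    train = shuffled[n_tune + n_test:]
    return train, tune, test

train, tune, test = three_way_split(range(10))
print(len(train), len(tune), len(test))  # → 6 2 2
```

Parameters are chosen by looking only at the tuning set; the test set is touched once, at the very end.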

Performance Measures

                              Correct prediction: negative    Correct prediction: positive

Model prediction: positive    false positive                  true positive

Model prediction: negative    true negative                   false negative

Accuracy

$ACC = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$

● Problems:

– What if class distribution is skewed?
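A small worked example of the skew problem, with made-up numbers: if only 5 of 100 instances are positive, a model that always predicts "negative" reaches 95% accuracy without finding a single positive. The helper name `accuracy` is an assumption for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1] * 5 + [0] * 95    # 5% positives, 95% negatives
always_negative = [0] * 100    # a "model" that ignores its input
print(accuracy(y_true, always_negative))  # → 0.95
```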

Precision/ Recall

$Precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$

$Recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$

● Problem:

– Depends on decision threshold.
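To make the threshold dependence concrete, here is a sketch that computes precision and recall from model scores at two different thresholds. The function name `precision_recall`, the six scores, and the labels are invented for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Predict positive when score >= threshold, then compute both measures."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for threshold in (0.75, 0.35):
    print(threshold, precision_recall(y_true, scores, threshold))
```

At the strict threshold 0.75 precision is 1.0 and recall is 2/3; at the lenient threshold 0.35 precision drops to 0.75 while recall rises to 1.0.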

ROC Curves

[Figure, built up over several slides: ROC curve with the false orange rate on the x-axis and the true orange rate on the y-axis.]

AUC – area under ROC

[Figure: same axes as the ROC curve.]
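A ROC curve can be traced by sweeping the decision threshold from the highest score to the lowest and recording the false and true positive rates at each step; AUC is the area under that curve. The sketch below assumes all scores are distinct (ties would need to be grouped) and uses invented data; `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points for all thresholds."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auc(roc_points(y_true, scores)))  # 8/9: 8 of 9 positive/negative pairs are ranked correctly
```

Unlike precision and recall, the resulting number does not depend on picking one particular threshold.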

Photo taken by fras1977, http://www.flickr.com/photos/fras/4992313333/

Image by Medienmagazin pro, http://www.flickr.com/photos/medienmagazinpro/6266643422

http://www.flickr.com/photos/generated/943078008/

Math libs/ Mahout collections

Apache Hadoop-ready

Recommendations/Collaborative filtering

Classification/Logistic Regression/ SGD

Sequence learning/HMM

kNN and matrix factorization based collaborative filtering

Classification/Naïve Bayes, random forest

Frequent item sets/(P)FPGrowth

Co-Location search

Clustering/ Mean shift, k-Means, Canopy, Dirichlet Process,

Image by pareeerica, http://www.flickr.com/photos/pareeerica/3711741298/

Libraries to have a look at: Vowpal Wabbit, Mallet, LibSvm, LibLinear, Libfm, Incanter, GraphLab, Scikits learn

Get your hands dirty: http://kaggle.com

https://cwiki.apache.org/confluence/display/MAHOUT/Collections

Where to get more information: “Mahout in Action” - Manning, “Taming Text” - Manning, “Machine Learning” - Andrew Ng

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks

https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading

Frameworks worth mentioning: Apache Mahout, Apache Giraph, Matlab/ Octave, R, Shogun, Weka, RapidI, MyMediaLite

Where to meet these people: RecSys, ICML, NIPS, ECML, KDD, WSDM, PKDD, JMLR, ApacheCon, Berlin Buzzwords, O'Reilly Strata

Get started today with the right tools.

January 8, 2008 by dreizehn28, http://www.flickr.com/photos/1328/2176949559

Discuss ideas and problems online.

November 16, 2005 [phil h], http://www.flickr.com/photos/hi-phi/64055296

Discuss ideas and problems in person.

Images taken at Berlin Buzzwords 2011/12/13 by Philipp Kaden. See you there end of May 2014.

Become a committer yourself

http://BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.

Online – user/dev@mahout.apache.org, java-user@lucene.apache.org, dev@lucene.apache.org

Interest in solving hard problems.

Being part of a lively community.

Engineering best practices.

Bug reports, patches, features.

Documentation, code, examples.

Image by: Patrick McEvoy

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
