apache mahout - isabel-drost.de · functional languages support map/reduce. 2004 - mapreduce:...

Apache Mahout Large Scale Machine Learning Speaker: Isabel Drost


Post on 25-Jun-2020


TRANSCRIPT

Page 1: Apache Mahout - isabel-drost.de · Functional languages support map/reduce. 2004 - MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. 2004

Apache Mahout

Large Scale Machine Learning

Speaker: Isabel Drost


Agenda

● Motivation.

● What is machine learning?

● Introduction to Mahout.


January 3, 2006 by Matt Callow

http://www.flickr.com/photos/blackcustard/81680010


News aggregation

Today: Read newspapers, blogs, Twitter, RSS feeds.

Wish: Aggregate sources and track emerging topics.

September 10, 2008 by Alex Barth

http://www.flickr.com/photos/a-barth/2846621384


March 7, 2008 by extranoise

http://www.flickr.com/photos/extranoise/2317950586/


Go to cinema

Today: IMDb, zitty, movie review pages, Twitter, blogs, ask friends.

Wish: Reviews, sentiment detection, recommendations.

March 22, 2008 by Crystian Cruz

http://www.flickr.com/photos/crystiancruz/2353895708


Machine learning – what's that?


Archimedes taking a Warm Bath

Image by John Leech, from: The Comic History of Rome by Gilbert Abbott à Beckett. Bradbury, Evans & Co, London, 1850s.


Archimedes' model of nature


June 25, 2008 by chase-me

http://www.flickr.com/photos/sasy/2609508999


An SVM's model of nature


From data to model.


Gather data


January 8, 2008 by Pink Sherbet Photography

http://www.flickr.com/photos/pinksherbet/2177961471/


Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results


December 31, 2005 by birdfarm

http://www.flickr.com/photos/birdfarm/80052248/


If we looked at two words only:

[Figure with the labels “E-Bay” and “Password”]


Reality

● There are a few more words in mails.

● Use all relevant features/signals available.

– Words.

– Header fields.

– Characteristics of attachments.

– …

● Usually a pipeline of feature extractors.

● UIMA: Apache project focussed on that task.
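The idea of a feature extractor pipeline can be illustrated with a minimal sketch. This is plain Python, not UIMA or Mahout code; the mail layout (a dict with `headers`, `body`, and `attachments`) and all function names are assumptions made up for the example.

```python
import re
from collections import Counter

def word_features(mail):
    """Count lower-cased words in the mail body."""
    words = re.findall(r"[a-z0-9'-]+", mail["body"].lower())
    return Counter("word=" + w for w in words)

def header_features(mail):
    """One feature per header field that is present."""
    return Counter("header=" + name.lower() for name in mail["headers"])

def attachment_features(mail):
    """Describe attachments by count and file extension."""
    feats = Counter()
    feats["attachments"] = len(mail["attachments"])
    for name in mail["attachments"]:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else "none"
        feats["attachment_ext=" + ext] += 1
    return feats

# The pipeline: every extractor contributes to one shared feature vector.
EXTRACTORS = [word_features, header_features, attachment_features]

def extract(mail):
    feats = Counter()
    for extractor in EXTRACTORS:
        feats.update(extractor(mail))
    return feats

# Hypothetical example mail, invented for illustration.
mail = {
    "headers": {"From": "sender@example.com", "Subject": "Account notice"},
    "body": "Please confirm your E-Bay password. Password expires today!",
    "attachments": ["form.exe"],
}
features = extract(mail)
```

Adding a new signal then only means appending another function to `EXTRACTORS`.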


Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results


Training the model

● No single best algorithm for all tasks.

● No single best parameter setting per algorithm.

● Evaluate constantly.
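One way to read "evaluate constantly" is to score every candidate parameter setting on held-out data and keep the best one. The sketch below is an illustration only: the threshold classifier, the parameter values, and the tiny data set are all hypothetical, and it is not Mahout code.

```python
def accuracy(predict, examples):
    """Fraction of held-out examples the classifier labels correctly."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)

def make_threshold_classifier(threshold):
    # Toy model: a mail is spam once its suspicious word count reaches the threshold.
    return lambda suspicious_words: "spam" if suspicious_words >= threshold else "ham"

# Hypothetical held-out data: (suspicious word count, true label).
held_out = [(0, "ham"), (1, "ham"), (2, "spam"), (3, "spam"), (1, "ham"), (4, "spam")]

# Evaluate each parameter setting on the same held-out data.
scores = {t: accuracy(make_threshold_classifier(t), held_out) for t in (1, 2, 3)}
best_threshold = max(scores, key=scores.get)
```

The same loop works for comparing different algorithms, as long as they are all scored on data they were not trained on.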


Gather data → Extract signals → Algorithm choice → Parameters → Train model → Apply model → Use results

Problem: “Nature changes”


Challenges.


Challenges

● Amount of data grows exponentially.

– User-generated content on the web.

– Sensor data.

– Customer logs.

● Index and search the data.

● Build models and generalize from raw data.

● How do non-Googlers deal with that?


Once upon a time

● A Java library.

● Index with easy-to-use API.


Once upon a time

● A library alone is not enough.


Once upon a time

● Lucene: Umbrella project for search at Apache.


Once upon a time

● Nutch needed a way to scale to the web.


● Functional languages support map/reduce.

● 2004 - MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.

● 2004 - Initial versions of DFS and Map-Reduce by Doug Cutting & Mike Cafarella.

● December 2005 - Nutch ported to framework, 20 nodes.

● January 2006 - Doug Cutting joins Yahoo!

● February 2006 - Apache Hadoop project hived off.

● March 2006 - Formation of the Yahoo! Hadoop team.

● April 2007 - Research clusters - 2 clusters of 1000 nodes.
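The map/reduce idea from functional languages that this timeline starts with can be sketched on a single machine. The code below is plain Python, not the Hadoop API; the function names and the two sample documents are made up for illustration. A real framework would run the map and reduce calls in parallel on many nodes and perform the grouping step itself.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input documents.
documents = ["Hadoop runs MapReduce", "Mahout runs on Hadoop"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each `map_phase` call looks at one document only and each `reduce_phase` call at one key only, the work can be split across machines without shared state.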


Once upon a time

And many more inside and outside Apache.


Where does Mahout fit in?

● Amount of data to process is growing.

● Idea: Scale and go parallel.


Where does Mahout fit in?

?


What does Mahout have to offer?


Discover groups of items

● Group items by similarity.

● Examples:

– Group news articles by topic.

– Find developers with similar interests.

– Discovery of groups of related search results.


Discover groups of similar items

● Canopy.

● k-Means.

● Fuzzy k-Means.

● Dirichlet based.

● Others upcoming.
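To make the k-Means entry concrete, here is a minimal single-machine sketch of the algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. It is plain Python, not Mahout's implementation; the function names, the fixed starting centroids, and the sample points are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Run a fixed number of k-Means iterations (iterations must be >= 1)."""
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical 2D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

The assignment step is independent per point, which is what makes the algorithm a natural fit for a map/reduce style of parallelism.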


Discover groups of similar items

● Example: Synthetic Control

– http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

– Example Job: <MAHOUT_HOME>/examples

– Outputs clusters

● Download the distribution.

● Run the example.

● Have a closer look at the examples.


Assign items to defined categories.

● Given pre-defined categories, assign items to them.

● Examples:

– Spam mail classification.

– Discovery of images depicting humans.


Assign items to defined categories.

● Naïve Bayes.

● Complementary Naïve Bayes.

● Winnow/Perceptron.

● Others upcoming.
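As an illustration of the first entry, the sketch below shows a small multinomial Naïve Bayes text classifier with add-one smoothing. It is plain Python and not Mahout's implementation; the function names and the four training mails are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(words, model):
    """Return the label with the highest log-probability for the given words."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for w in words:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("cheap password reset offer".split(), "spam"),
    ("confirm your password now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(training)
prediction = classify("password offer".split(), model)
```

Training only counts words per category, so the counting can be distributed and the partial counts summed afterwards.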


Assign items to defined categories.

● Examples based on “standard” datasets:

● 20 Newsgroups

– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

● Wikipedia

– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample


Evolutionary algorithms

● Traveling Salesman

– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

● Classification rule discovery

– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery


Collaborative filtering

● Recommend items to users.

● Examples:

– Find movies I might want to watch.

– Find books related to the book I am buying.
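A minimal sketch of user-based collaborative filtering shows the principle behind such recommendations: find users with similar ratings and suggest what they liked. This is plain Python and not the Taste API; the ratings, user names, and function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {movie: rating from 1 to 5}.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Brazil": 5, "Dune": 4},
    "carol": {"Casablanca": 5, "Eraserhead": 4, "Alien": 1},
}

def cosine(a, b):
    """Cosine similarity between two users' rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

suggestions = recommend("alice", ratings)
```

Here bob's taste is closest to alice's, so the item only bob rated ends up ranked first.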


Recommendation mining.

● Mahout with more Taste.

● Mature Java library.

● Java-based, web service / HTTP bindings.

● Batch mode based on EC2 and Hadoop.


What next?

● More algorithms.

● More examples.


What next?

● 2nd Summer of Code.

● Four mentors.

● Three students.

● Two returning students.

Robin Anil: Online Classification and Frequent Pattern Mining using Map-Reduce.

David Hall: Distributed Latent Dirichlet Allocation.

AbdelHakim: Implement parallel Random/Regression Forest.


Why go for Apache Mahout?


Jumpstart your project with proven code.

January 8, 2008 by dreizehn28

http://www.flickr.com/photos/1328/2176949559


Discuss ideas and problems online.

November 16, 2005 [phil h]

http://www.flickr.com/photos/hi-phi/64055296


Become part of the community.


Release: 0.1

Big thanks to those who made this possible!

October 22, 2008 by e_calamar

http://www.flickr.com/photos/e_calamar/2964991182/


[email protected]

[email protected]

Interest in machine learning.

Interesting problems.

Hadoop proficiency.

Bug reports, patches, features.

Documentation, code, examples.

July 9, 2006 by trackrecord

http://www.flickr.com/photos/trackrecord/185514449


Ahem – I do not own a big cluster...


● Mahout runs on top of Amazon EMR.

● Run Mahout on your Hadoop cluster on EC2.

● Committers do get free credits for EC2 ;)

● Set up your own Hadoop cluster.


June 25th, 2009: Hadoop* Get Together in Berlin

● Torsten Curdt: “Data Legacy - the challenges of an evolving data warehouse.”

● Christoph M. Friedrich: “SCAIView - Lucene for Life Science Knowledge Discovery”

● Uri Boness, Bram Smeets: “Solr in production.”

newthinking store

Tucholskystr. 48

September 29th, 2009: Hadoop* Get Together in Berlin featuring a talk on UIMA by Thilo Götz.

* UIMA, HBase, Lucene, Solr, Katta, Mahout, CouchDB, Pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.
