Foursquare - ML Presentation

Post on 26-Mar-2015

TRANSCRIPT

<p>Big Data @ foursquare: Infrastructure, Analytics, Prediction, and Beyond </p>
<p>3/22/2011 Machine Learning Meetup </p>
<p>Justin Moore - @injust, Matthew Rathbone - @rathboma </p>
<p>Overview: What is foursquare; Analytics and Data; Machine Learning, Recommendations </p>
<p>What is Foursquare? A location-based startup: an application that helps you explore your city and discover new places. Visit places, check in, earn rewards, stay connected with your friends. Game elements: single-player, multi-player. </p>
<p>What is Foursquare? (cont.) 7M+ users, 15M+ venues, 500M+ check-ins. Large reach (every country, North Pole, Space, Everest). Native app for almost every smartphone; also available on SMS, web, mobile web. </p>
<p>Explore: our new social recommendation engine. Real-time suggestions based on your social graph. 
</p>
<p>Data Model: Users, Check-ins, Venues, Shouts, Tips/To-dos </p>
<p>Analytics @ Foursquare. I'm going to talk about: why production DBs are bad for analytics; what we do to make it better (hint: Hadoop); our custom dashboard; usage examples; thoughts about the Hadoop/Hive experience. </p>
<p>Our Data: Problems using the Production Databases </p>
<p>Our data: So we turn to our friends. Our reporting / analytics / data mining stack is thanks to open source software. </p>
<p>Our data: What we do instead. Log Files </p>
<p>About Hadoop and Hive. Hadoop: distributed data processing framework (map-reduce), written in Java. Hive: SQL layer on top of Hadoop. Lets us do select count(1) from checkins instead of having to write our own map-reduce Java classes. 
Image from ibm.com </p>
<p>About Hive: Create/Drop/Insert/Select etc. Table joins. Aggregation functions. Date functions. URL parsing functions. Cool n-gram functions. Just now getting database drivers for popular languages (Java). </p>
<p>About Hive: Select * from x; Select count(1) from x; Select sum(x.price) from x; Select a, sum(price) from x group by a; Select a from x where datediff('2011-01-01', d) = 0; Drop table x; </p>
<p>Hadoop vs Hive. #mapper: $stdin.each do |line| date, country, id = line.split; puts date + "," + country end #reducer: counts = Hash.new(0); $stdin.each do |line| counts[line] += 1 end; puts counts </p>
<p>VS: SELECT created_date, country, count(1) FROM checkins GROUP BY created_date, country </p>
<p>Our Hadoop Infrastructure. We use clusters generated through Amazon's Elastic MapReduce. That means we store all of our data in flat files in Amazon S3 (which keeps things simple). We export data from both MongoDB and HTTP proxy log files. We manage everything using a custom Ruby on Rails dashboard. rake cluster:start[30] =&gt; starts a 30-node cluster, just like that. </p>
<p>Our Dashboard. Define and schedule reports through it. Allows ad-hoc access for (internal) users. Controls data imports into S3 from Mongo/log files. Provides an intermediate DB layer for rollup data caching (experimental atm). Allows you to do a bunch of cool stuff with queries after they've run. </p>
<p>Example: Importing Data </p>
<p>Justin Moore - @injust, Matthew 
Rathbone - @rathboma </p>
<p>Example: Query Walkthrough. Find top 20 venues in Switzerland </p>
<table>
<tr><th>venuename</th><th>city</th><th>total</th></tr>
<tr><td>zurich airport (zrh)</td><td>kloten</td><td>3746</td></tr>
<tr><td>geneva-cointrin airport (gva)</td><td>grand-saconnex</td><td>3012</td></tr>
<tr><td>zurich hauptbahnhof</td><td>zurich</td><td>1780</td></tr>
<tr><td>sony ericsson football hotspot</td><td>basel</td><td>773</td></tr>
<tr><td>basel bahnhof sbb</td><td>basel</td><td>761</td></tr>
<tr><td>gare de cornavin</td><td>geneva</td><td>760</td></tr>
<tr><td>bern hauptbahnhof</td><td>bern</td><td>736</td></tr>
<tr><td>gare de lausanne</td><td>lausanne</td><td>672</td></tr>
<tr><td>apple store</td><td>zurich</td><td>670</td></tr>
<tr><td>bahnhof luzern</td><td>luzern</td><td>477</td></tr>
<tr><td>terminal e</td><td>kloten</td><td>458</td></tr>
<tr><td>bellevueplatz</td><td>zurich</td><td>457</td></tr>
<tr><td>terminal a</td><td>kloten</td><td>455</td></tr>
<tr><td>bahnhof oerlikon</td><td>zurich</td><td>453</td></tr>
<tr><td>bahnhof stadelhofen</td><td>zurich</td><td>444</td></tr>
<tr><td>sihlcity</td><td>zurich</td><td>400</td></tr>
<tr><td>zurich flughafen bahnhof</td><td>zurich</td><td>400</td></tr>
<tr><td>bahnhof olten</td><td>olten</td><td>391</td></tr>
<tr><td>bahnhof winterthur</td><td>winterthur</td><td>379</td></tr>
<tr><td>bahnhof hardbrücke</td><td>zurich</td><td>369</td></tr>
</table>
<p>Walkthrough: Start the query </p>
<p>Walkthrough: Get the results in email </p>
<p>Walkthrough: Top Venues </p>
<p>Walkthrough. If we want to schedule something to run daily/weekly/monthly we can do that too. Reports are represented as ActiveRecord models. </p>
<p>Walkthrough: Reports feed our dashboards </p>
<p>Walkthrough: queries allow data exploration </p>
<p>Stats on the Stats Stack. 25-machine clusters. Reports on check-in data (joining venues and/or users) usually take 5-15 minutes to run. Reports on log data usually take 10-20 minutes to run. We run 10-30 reports a day. Most data goes into a Google 
spreadsheet for people to look at. </p>
<p>Thoughts on Amazon's EMR. The API has very low rate limits. Everything is an HTTP GET request (even creating a cluster). The Ruby library/client is unusable as a client library (we shell out to it in order to capture the resulting JSON). </p>
<p>Thoughts on Hive. Generally good. Sometimes it will act crazy. Partitioning data is harder than it looks. The JSON SerDe makes all sorts of weird stuff happen when you're joining tables. Always join LAST! </p>
<p>Working With Hive. OK: SELECT v.venuename, count(*) FROM checkins c JOIN venues v ON c.venueid = v.id GROUP BY v.venuename </p>
<p>BETTER: SELECT v.venuename, c.total FROM (SELECT venueid, count(1) AS total FROM checkins GROUP BY venueid) c JOIN venues v ON c.venueid = v.id </p>
<p>Our Data: End. Hadoop + Hive &gt; Mongo + Scripts. Simple Ruby dashboard == super useful. Lots of data == fun charts. QUESTIONS? </p>
<p>foursquare 3.0: Explore </p>
<p>Engineering an Online Recommendation System </p>
<p>Engineering cont. Goals: Here and now. No new signals. Use all of our textual data. 100ms per query. </p>
<p>Engineering cont. 
Pain points: Geo indexes, compound geo indexes. Limiting queries in minimally impactful ways. Cached datastores (building rollup collections). Geo indexes. </p>
<p>Computing a Similarity Matrix. Analyzing similarity functions is OK on a single machine. 10M+ venues = 100 trillion element sparse matrix. Compute without visiting every element. Parallelize across machines. </p>
<p>Compute Similarity Matrix, cont. Leverage Mahout's library of similarity functions, easy to extend. Job system controls execution of sequential dependent M-R tasks. Hadoop: easily scalable to large commodity machine clusters; Elastic MapReduce makes increasing cluster size trivial. </p>
<p>Compute Similarity Matrix, cont. A series of jobs, each doing a Map-Reduce: 1. Convert input flat file dumped from Hive to binary sparse vector representation. 2. Compute pairwise co-occurrences. 3. Compute column-based weights (column normalization), retrieve all vectors with co-occurrences. 4. Compute pairwise similarities, store in sparse matrix format. 5. Flatten sparse matrix to text format that we can load into DB. </p>
<p>The Value of Why. Show people which friends visited, which places are co-visited (not the same as similar?). 
Lowers the bar for precision. Mix with the social, story-telling aspects of the product. Collaborative filtering allows for easy description. Allows users to choose for themselves among recs. Increase propensity to check in (sales pitch for the venue). </p>
<p>Case Study: Defining Interesting. Need to show ranked venues for cold-start. Various influencing factors in what makes a place interesting: number of users checked in; average visits per user; tips left, to-dos done; how people check in (broadcast to T/FB, off-the-grid?); trending direction (more popular lately?). Measuring raw popularity poses problems: places open just for lunch, smaller dining rooms, longer meal times; been in system longer vs. opened recently; differences between categories (coffee shops != burger joints). </p>
<p>Defining Interesting cont. [Figure: Visits Per User (y-axis, 0 to 7) plotted against Unique Users (x-axis), with labels "Local Favorite" and "Must See".] </p>
<p>Future Directions. Still a big unknown; collect user feedback to drive development. Scale beyond just co-occurrences, improve prediction in new territory. Planning mode (beyond the here and now). Joint recommendations (where do I go with this set of friends?). </p>
<p>Help us get there: foursquare is hiring. www.foursquare.com/jobs Justin Moore, @injust, justin@foursquare.com. Matthew Rathbone, @rathboma, matthew@foursquare.com </p>
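<p>The five similarity-matrix jobs listed in the "Compute Similarity Matrix" slides can be illustrated on a single machine. What follows is a minimal Ruby sketch, not the speakers' Mahout/Hadoop code: the sample rows, the method names, and the choice of cosine similarity over binary visit vectors are all assumptions made for illustration. It counts co-occurrences only for venue pairs that share at least one visitor, which is one way to avoid visiting the zero entries of the sparse matrix. </p>

```ruby
require "set"

# Hypothetical flat-file rows, "user_id,venue_id", one per check-in.
CHECKINS = <<~ROWS
  u1,cafe
  u1,bar
  u2,cafe
  u2,bar
  u2,gym
  u3,cafe
  u3,gym
  u1,cafe
ROWS

# Step 1: parse rows into a sparse representation (user => set of venues).
def venues_by_user(text)
  text.each_line.with_object(Hash.new { |h, k| h[k] = Set.new }) do |line, acc|
    user, venue = line.strip.split(",")
    acc[user] << venue if user && venue
  end
end

# Step 2: count co-occurrences only for venue pairs that share a user,
# so pairs with no common visitor are never touched.
def cooccurrences(by_user)
  by_user.each_value.with_object(Hash.new(0)) do |venues, counts|
    venues.to_a.sort.combination(2) { |pair| counts[pair] += 1 }
  end
end

# Step 3: per-venue weight, here the number of distinct visitors.
def visitor_counts(by_user)
  by_user.each_value.with_object(Hash.new(0)) do |venues, counts|
    venues.each { |venue| counts[venue] += 1 }
  end
end

# Step 4: cosine similarity for binary visit vectors (an assumed choice).
def similarities(cooc, visitors)
  cooc.to_h do |(a, b), shared|
    [[a, b], shared / Math.sqrt(visitors[a] * visitors[b])]
  end
end

# Step 5: flatten to tab-separated text that a database could load.
def flatten_rows(sims)
  sims.sort.map { |(a, b), score| format("%s\t%s\t%.4f", a, b, score) }
end

by_user = venues_by_user(CHECKINS)
SIMILARITY_ROWS = flatten_rows(
  similarities(cooccurrences(by_user), visitor_counts(by_user))
)
puts SIMILARITY_ROWS
```

<p>For the sample rows this prints three tab-separated lines, one per venue pair that shares a visitor. In a Map-Reduce setting each method would correspond roughly to one job, with the in-memory hashes replaced by keyed intermediate files. </p>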