# Boston HUG

Posted on 19-Nov-2014

DESCRIPTION

A talk on why and how machine learning works with Hadoop, with recent developments for real-time operation.

TRANSCRIPT

1. Machine Learning with Hadoop
2. Agenda: Why Big Data? Why now? What can you do with big data? How does it work?
3. Slow Motion Explosion
4. Why Now? Moore's law has applied for a long time. Why is Hadoop/Big Data exploding now? Why not 10 years ago? Why not 20?
5-6. Size Matters, but: if it were just availability of data, then existing big companies would adopt big data technology first. They didn't.
7-8. Or Maybe Cost: if it were just a net positive value, then finance companies should adopt first because they have higher opportunity value per byte. They didn't.
9-10. Backwards Adoption: under almost any threshold argument, startups would not adopt big data technology first. They did.
11-12. Everywhere at Once? Something very strange is happening. Big data is being applied at many different scales, at many value scales, by large companies and small. Why?
13. Analytics Scaling Laws: analytics scaling is all about the 80-20 rule: big gains for little initial effort, then rapidly diminishing returns. The key to net value is how costs scale. Old school: exponential scaling. Big data: linear scaling with a low constant. Cost/performance has changed radically IF you can use many commodity boxes.
14. (Chart: tiers of analytics insight, from "We knew that" and "We should have known that" through "We didn't know that!" up to "You're kidding, people do that?")
15. (Chart: value vs. scale, from "anybody with eyes" and "intern with a spreadsheet" through "in-house analytics" and "industry-wide data consortium" up to "NSA, non-proliferation".)
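The scaling-law argument above can be sketched numerically. The curve shapes and constants below are illustrative assumptions, not figures from the talk: value follows a diminishing-returns (80-20 style) curve, while cost scales either exponentially (old school) or linearly with a low constant (commodity clusters).

```python
import math

def value(scale):
    # Diminishing-returns value curve; assumed shape for illustration.
    return 1 - math.exp(-scale / 300)

def cost_exponential(scale):
    # "Old school" exponential cost scaling; assumed constants.
    return 0.01 * (math.exp(scale / 150) - 1)

def cost_linear(scale):
    # Commodity-cluster linear cost scaling with a low constant; assumed rate.
    return 1e-4 * scale

def best_scale(cost):
    # The net value optimum: the scale that maximizes value minus cost.
    return max(range(0, 2001, 10), key=lambda s: value(s) - cost(s))

print(best_scale(cost_exponential))  # peaks well before maximum effort
print(best_scale(cost_linear))       # optimum moves far to the right
```

With exponential costs the optimum sits at a small fraction of the maximum scale; switching the cost curve to linear pushes the optimum far to the right, which is the tipping-point effect the slides describe.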
16. (Chart: value vs. scale.) The net value optimum has a sharp peak well before maximum effort.
17. But scaling laws are changing, in both slope and shape.
18-19. (Charts: value vs. scale.) More than just a little: they are changing a LOT!
20-21. (Charts: value vs. scale as the cost curve changes.)
22. (Chart: value vs. scale.) A tipping point is reached and things change radically. Initially, linear cost scaling actually makes things worse.
23. Prerequisites for Tipping: to reach the tipping point, algorithms must scale out horizontally, on commodity hardware that can and will fail. Data practice must change: denormalized is the new black, flexible data dictionaries are the rule, and structured data becomes rare.
24-25. So that is why, and why now. What can you do with it? And how?
26-27. Agenda: Mahout outline: recommendations, clustering, classification (supervised on-line learning, feature hashing); hybrid parallel/sequential systems; real-time learning.
28-30. Classification in Detail: the Naive Bayes family and Decision Forests (Hadoop-based training), and Logistic Regression (aka SGD) with fast on-line (sequential) training. Now with MORE topping!
31. How it Works: we are given features, often binary values in a vector. The algorithm learns weights, and the weighted sum of feature times weight is the key. Each weight is a single real value.
32. An Example
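The weighted-sum idea from "How it Works" can be sketched as a toy logistic regression trained by stochastic gradient descent on binary feature vectors. This is an illustration of the mechanism, not Mahout's SGD implementation; the data and learning rate are made up.

```python
import math

def sgd_logistic(examples, dim, epochs=50, rate=0.5):
    """Learn one real-valued weight per feature slot by SGD.

    examples: list of (binary feature vector, target in {0, 1}).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum
            p = 1.0 / (1.0 + math.exp(-score))            # logistic link
            for i, xi in enumerate(x):
                w[i] += rate * (y - p) * xi               # gradient step
    return w

# Toy data: feature 0 signals class 1, feature 1 signals class 0.
data = [([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 0, 0], 1), ([0, 1, 0], 0)]
w = sgd_logistic(data, dim=3)
score = sum(wi * xi for wi, xi in zip(w, [1, 0, 1]))
print(score > 0)  # → True
```

After training, a positive weighted sum classifies an example into class 1, a negative one into class 0; that single pass over the weights is all the model needs at prediction time.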
33. Features: (Two example emails, shown as feature sources.) A scam: "From: Dr. Paul May Acquah. Re: Proposal for over-invoice Contract Benevolence. Dear Sir, Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor..." And a genuine note: "From: George. Date: Thu, May 20, 2010 at 10:51 AM. Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?"
34. But text and words aren't suitable features. We need a numerical vector, so we use binary vectors with lots of slots.
35. Feature Encoding
36. Hashed Encoding
37. Feature Collisions
38-40. Training Data: (Pipeline: raw data is joined, combined, and transformed into training examples with target values; parsing produces tokens, encoding produces training vectors, and the vectors feed the training algorithm.)
41. Full Scale Training: (Diagram: input, feature extraction, and down-sampling run in map-reduce; a data join feeds sequential SGD learning; side-data now arrives via NFS.)
42. Hybrid Model Development: (Diagram: on the big-data cluster, logs are grouped by user and user sessions are counted into transaction patterns, yielding training data on a shared file system; legacy modeling merges this with account info and fits the model with PROC LOGISTIC.)
43. Enter the Pig Vector: Pig UDFs for vector encoding and model training:

    define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(10, x+y+1, x:numeric, y:numeric, z:numeric);

    vectors = foreach docs generate newsgroup, encodeVector(*) as v;
    grouped = group vectors all;
    model = foreach grouped generate 1 as key, train(vectors) as model;

44. Real-time Developments: Storm + Hadoop + MapR. Real-time with Storm, long-term with Hadoop, state checkpoints with MapR. Add the Bayesian Bandit for on-line learning.
45. Aggregate Splicing: Storm handles the present; Hadoop handles the past.
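The hashed encoding and feature collisions from the encoding slides can be sketched as follows. The slot count, the two probes per token, and the use of md5 are illustrative assumptions, not the behavior of Mahout's encoders; probing each token more than once is one way to soften the damage a collision does.

```python
import hashlib

def hashed_encode(tokens, num_slots=20, probes=2):
    """Hash each token into a fixed-size vector of slot counts."""
    v = [0] * num_slots
    for token in tokens:
        for probe in range(probes):
            # Derive a stable slot index from the token and probe number.
            h = hashlib.md5(f"{token}:{probe}".encode()).hexdigest()
            v[int(h, 16) % num_slots] += 1
    return v

print(hashed_encode(["dear", "sir", "transfer", "sum"]))
```

The payoff is that no dictionary is needed: any token, seen before or not, maps deterministically into the same fixed-width vector, which is what makes the encoding usable in a streaming or map-reduce setting.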
46. Mobile Network Monitor: (Diagram: transaction data flows from geo-dispersed ingest servers into batch aggregation and HBase, with a retro-analysis interface and a real-time dashboard and alerts.)
47. A Quick Diversion: you see a coin. What is the probability of heads? Could it be larger or smaller than that? I flip the coin and, while it is in the air, ask again. I catch the coin and ask again. I look at the coin (and you don't) and ask again. Why does the answer change? And did it ever have a single value?
48. A First Conclusion: probability as expressed by humans is subjective and depends on information and experience.
49. A Second Conclusion: a single number is a bad way to express uncertain knowledge. A distribution of values might be better.
50-52. (Charts of distributions: "I Dunno", "5 and 5", "2 and 10".)
53. Bayesian Bandit: compute distributions based on data. Sample p1 and p2 from these distributions. Put a coin in bandit 1 if p1 > p2; else, put the coin in bandit 2.
54. The Basic Idea: we can encode a distribution by sampling. Sampling allows unification of exploration and exploitation, and can be extended to more general response models.
55. Deployment with Storm/MapR: (Diagram: a targeting engine consults an online model selector over RPC; impression, click, and conversion logs feed a conversion detector and online model training; all state is managed transactionally in the MapR file system, with a conversion dashboard on top.)
56. Service Architecture: (Diagram: the same targeting and training components run under MapR pluggable service management, with Storm and Hadoop, on MapR lockless storage services.)
57. Find Out More. Me: tdunning@mapr.com, ted.dunning@gmail.com, tdunning@apache.com. MapR: http://www.mapr.com. Mahout: http://mahout.apache.org. Code: https://github.com/tdunning
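The coin-in-bandit loop described above is Thompson sampling with Beta posteriors. A toy sketch, where the true conversion rates and the 2,000-pull simulation are assumptions for illustration (in a real deployment the rates are unknown):

```python
import random

def bayesian_bandit(true_rates, pulls=2000):
    """Thompson sampling over Beta posteriors, one per bandit arm.

    Start each arm at Beta(1, 1), sample a rate from each posterior,
    play the arm whose sample is largest, then update that posterior.
    """
    wins = [1] * len(true_rates)    # Beta alpha parameter per arm
    losses = [1] * len(true_rates)  # Beta beta parameter per arm
    for _ in range(pulls):
        samples = [random.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = samples.index(max(samples))          # put the coin here
        if random.random() < true_rates[arm]:      # simulated response
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

random.seed(1)
wins, losses = bayesian_bandit([0.05, 0.15])
print(wins, losses)  # the better arm accumulates most of the pulls
```

Because each arm is played with probability equal to the chance it is currently the best, the sampling itself balances exploration and exploitation, which is the unification the slides point to.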