ENAR Short Course


TRANSCRIPT

1. Statistical Computing for Big Data. Deepak Agarwal, LinkedIn Applied Relevance Science, [email protected]. ENAR 2014, Baltimore, USA.

2. Structure of This Tutorial
Part I: Introduction to Map-Reduce and the Hadoop System
  Overview of Distributed Computing
  Introduction to Map-Reduce
  Introduction to the Hadoop System
  The Pig Language
  A Deep Dive of Hadoop Map-Reduce
Part II: Examples of Statistical Computing for Big Data
  Bag of Little Bootstraps
  Large Scale Logistic Regression
  Parallel Matrix Factorization
Part III: The Future of Cloud Computing

3. Structure of This Tutorial (the outline above is repeated)

4. Big Data Becoming Ubiquitous
Bioinformatics, astronomy, the Internet, telecommunications, climatology.

5. Big Data: Some Size Estimates
1000 human genomes: >100 TB of data (1000 Genomes Project).
Sloan Digital Sky Survey: 200 GB of data per night (>140 TB aggregated).
Facebook: a billion monthly active users.
LinkedIn: 225M members worldwide.
Twitter: 500 million tweets a day.
Over 6 billion mobile phones in the world generating data every day.

6. Big Data: Paradigm Shift
Classical statistics: generalize using small data. The paradigm shift with big data: we now have an almost infinite supply of data. Easy statistics, then? Just appeal to asymptotic theory, so the issue is mostly computational? Not quite: more data comes with more heterogeneity, and we need to change our statistical thinking to adapt. Classical statistics is still invaluable for thinking about big-data analytics.

7. Some Statistical Challenges
Exploratory data analysis (EDA) and visualization: retrospective (on terabytes) and, increasingly, real time (streaming computations every few minutes/hours).
Statistical modeling: scale (a computational challenge); the curse of dimensionality (millions of predictors, heterogeneity); temporal and spatial correlations.

8. Statistical Challenges (continued)
Experiments: testing new methods and hypotheses through randomized experiments; adaptive experiments.
Forecasting: planning, advertising.
Many more that I am not fully versed in.

9. Defining Big Data
How do you know you have a big-data problem? Is it only the number of terabytes? What about dimensionality, structured vs. unstructured data, or the computations required? There is no clear definition, and points of view differ. A working definition: you have big data when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC.

10. Distributed Computing for Big Data
Distributed computing is an invaluable tool for scaling computations to big data. Some distributed computing models: multi-threading; graphics processing units (GPU); Message Passing Interface (MPI); Map-Reduce.

11. Evaluating a Method for a Problem
Scalability: process X GB in Y hours.
Ease of use for a statistician.
Reliability (fault tolerance), especially in an industrial environment.
Cost: hardware and the cost of maintenance.
Suitability for the computations required, e.g., iterative versus one-pass.
Resource sharing.

12. Multithreading
Multiple threads take advantage of multiple CPUs and shared memory. Threads can execute independently and concurrently. Can only handle gigabytes of data. Reliable.
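To make the multithreading model concrete, here is a minimal sketch using base R's parallel package; the data, chunk count, and core count are illustrative assumptions, not from the slides.

```r
# Minimal sketch of multi-core, shared-machine parallelism in R.
# mclapply() forks worker processes (on Windows, mc.cores must be 1).
library(parallel)

x <- runif(1e6)                                 # illustrative data
chunks <- split(x, cut(seq_along(x), 4))        # partition into 4 chunks
partial <- mclapply(chunks, sum, mc.cores = 4)  # sum each chunk on its own core
total <- Reduce(`+`, partial)                   # gather the partial sums
stopifnot(all.equal(total, sum(x)))
```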
13. Graphics Processing Units (GPU)
Number of cores: a CPU has on the order of 10; a GPU has smaller cores, on the order of 1000. Can be >100x faster than a CPU when computationally intensive tasks are off-loaded to the GPU in parallel. Good for certain computationally intensive tasks, but can only handle gigabytes of data. Not trivial to use: efficient use requires a good understanding of the low-level architecture, although things are changing and GPUs are getting more user-friendly.

14. Message Passing Interface (MPI)
A language-independent communication protocol among processes (e.g., computers). Most suitable for the master/slave model. Can handle terabytes of data and is good for iterative processing, but fault tolerance is low.

15. Map-Reduce (Dean & Ghemawat, 2004)
Data flows from mappers to reducers to output: the computation is split into a Map (scatter) stage and a Reduce (gather) stage. Easy to use: the user needs to implement only two functions, a mapper and a reducer. Easily handles terabytes of data, with very good fault tolerance (failed tasks automatically get restarted).

16. Comparison of Distributed Computing Methods

                         Multithreading  GPU                            MPI        Map-Reduce
Scalability (data size)  Gigabytes       Gigabytes                      Terabytes  Terabytes
Fault tolerance          High            High                           Low        High
Maintenance cost         Low             Medium                         Medium     Medium-High
Iterative processes      Cheap           Cheap                          Cheap      Usually expensive
Resource sharing         Hard            Hard                           Easy       Easy
Easy to implement?       Easy            Needs low-level GPU knowledge  Easy       Easy

17. Example Problem
Tabulating word counts in a corpus of documents; similar to the table function in R.

18. Word Count Through Map-Reduce
Two documents: "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop". Mapper 1 and Mapper 2 each process one document; Reducer 1 handles words from A-G, Reducer 2 handles words from H-Z. Final counts: Bye = 1, Goodbye = 1 (Reducer 1); Hadoop = 2, Hello = 2, World = 2 (Reducer 2).

19. Key Ideas about Map-Reduce (diagram)
Big data is split into Partition 1, ..., Partition N; Mapper 1, ..., Mapper N feed Reducer 1, ..., Reducer M, which write Output 1, ..., Output M.

20. Key Ideas about Map-Reduce
Data are split into partitions and stored on many different machines on disk (distributed storage). Mappers process data chunks independently and emit <key, value> pairs. Data with the same key are sent to the same reducer; one reducer can receive multiple keys. Every reducer sorts its data by key. For each key, the reducer processes the values corresponding to that key with the customized reducer function and writes the output.

21. Compute the Mean for Each Group

ID  Group No.  Score
1   1          0.5
2   3          1.0
3   1          0.8
4   2          0.7
5   2          1.5
6   3          1.2
7   1          0.8
8   2          0.9
9   4          1.3

22. Key Ideas about Map-Reduce (applied to the example)
Mappers process data chunks independently and emit <key, value> pairs: for each row, key = group number and value = score. Data with the same key are sent to the same reducer, and one reducer can receive multiple keys; with, e.g., 2 reducers, Reducer 1 receives data with keys 1 and 2, and Reducer 2 receives data with keys 3 and 4. Every reducer sorts its data by key; e.g., Reducer 1 holds <1, [0.5, 0.8, 0.8]> and <2, [0.7, 1.5, 0.9]>. For each key, the reducer applies the customized reducer function and outputs the result; e.g., Reducer 1 outputs <1, 0.7> and <2, 1.03>.

23. Key Ideas about Map-Reduce (continued)
The same pipeline again, annotated to show what you need to implement: the mapper and the reducer.
24. Pseudo Code (in R)
Mapper:
  Input: Data
  for (row in Data) {
    groupNo = row$groupNo;
    score = row$score;
    Output(c(groupNo, score));
  }
Reducer:
  Input: Key (groupNo), Value (a list of scores that belong to the Key)
  count = 0; sum = 0;
  for (v in Value) {
    sum = sum + v;
    count = count + 1;
  }
  Output(c(Key, sum/count));

25. Exercise 1
Problem: average height per {Grade, Gender}. What should be the mapper output key? What should be the mapper output value? What are the reducer inputs? What are the reducer outputs? Write the mapper and reducer for this.

Student ID  Grade  Gender  Height (cm)
1           3      M       120
2           2      F       115
3           2      M       116

26. Solution: Average height per {Grade, Gender}
Mapper output key: {Grade, Gender}. Mapper output value: Height. Reducer input: key = {Grade, Gender}, value = list of heights. Reducer output: {Grade, Gender, mean(Heights)}.

27. Exercise 2
Problem: number of students per {Grade, Gender}. Same questions, same table as Exercise 1.

28. Solution: Number of students per {Grade, Gender}
Mapper output key: {Grade, Gender}. Mapper output value: 1. Reducer input: key = {Grade, Gender}, value = list of 1s. Reducer output: {Grade, Gender, sum(value list)}, or equivalently {Grade, Gender, length(value list)}.
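The pseudocode in slide 24 leaves the shuffle/sort stage implicit. Here is a self-contained R simulation of the whole job on the slide-21 data; the use of split() to emulate the framework's shuffle is our own illustration, not Hadoop's API.

```r
# Illustrative R simulation of the map-reduce group-mean job of slide 24.
data <- data.frame(groupNo = c(1, 3, 1, 2, 2, 3, 1, 2, 4),
                   score   = c(0.5, 1.0, 0.8, 0.7, 1.5, 1.2, 0.8, 0.9, 1.3))

# Map: emit one <key, value> pair per row.
pairs <- Map(function(g, s) list(key = g, value = s),
             data$groupNo, data$score)

# Shuffle/sort: group values by key (done by the framework in real Hadoop).
keys   <- sapply(pairs, `[[`, "key")
groups <- split(sapply(pairs, `[[`, "value"), keys)

# Reduce: average the value list for each key.
result <- sapply(groups, mean)
print(result)   # 1: 0.7, 2: 1.033, 3: 1.1, 4: 1.3
```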
29. More on Map-Reduce
Depends on a distributed file system; typically the mappers run on the data storage nodes. Map/Reduce tasks automatically get restarted when they fail (good fault tolerance). Map and Reduce I/O are all on disk, and data transmission from mappers to reducers goes through disk copies. Iterative processing through Map-Reduce: each iteration becomes a map-reduce job, which can be expensive since map-reduce overhead is high.

30. The Apache Hadoop System
Open-source software for reliable, scalable, distributed computing, and the most popular distributed computing system in the world. Key modules: the Hadoop Distributed File System (HDFS); Hadoop YARN (job scheduling and cluster resource management); Hadoop MapReduce.

31. Major Tools on Hadoop
Pig: a high-level language for Map-Reduce computation. Hive: a SQL-like query language for data querying via Map-Reduce. HBase: a distributed and scalable database on Hadoop that allows random, real-time read/write access to big data (Voldemort is similar to HBase). Mahout: a scalable machine learning library.

32. Hadoop Installation
Setting up Hadoop on your desktop/laptop: http://hadoop.apache.org/docs/stable/single_node_setup.html. Setting up Hadoop on a cluster of machines: http://hadoop.apache.org/docs/stable/cluster_setup.html.

33. Hadoop Distributed File System (HDFS)
Master/slave architecture. NameNode: a single master node that controls which data block is stored where. DataNodes: slave nodes that store data and do read/write operations. Clients (gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs. Big data is split into equal-sized blocks, and each block can be stored on different DataNodes. Disk-failure tolerance: data are replicated multiple times.

34. Load the Data into Pig
A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID: int, groupNo: int, score: float);
The path of the data on HDFS goes after LOAD. USING PigStorage() means the data are tab-delimited (it can be omitted); if the data are delimited by another character, e.g. space, use USING PigStorage(' '). The data schema is defined after AS. Variable types: int, long, float, double, chararray, ...

35. Structure of This Tutorial (outline repeated; now entering Part II: Examples of Statistical Computing for Big Data: Bag of Little Bootstraps; Large Scale Logistic Regression; Parallel Matrix Factorization)

36. Bag of Little Bootstraps (Kleiner et al., 2012)

37. Bootstrap (Efron, 1979)
A resampling-based method to obtain the statistical distribution of sample estimators. Why are we interested? Resampling is embarrassingly parallelizable. For example, for the standard deviation of the mean θ̂ of N samples: for i = 1 to r, randomly sample with replacement N times from the original sample to form bootstrap data set i, and compute the mean θ̂*_i of the i-th bootstrap data set; then estimate Sd(θ̂) by Sd(θ̂*_1, ..., θ̂*_r). Here r is usually a large number, e.g., 200.

38. Bootstrap for Big Data
We can have r nodes running in parallel, each sampling one bootstrap data set. However, N can be very large: the data may not fit into memory, and collecting N samples with replacement on each node can be computationally expensive.

39. M out of N Bootstrap (Bickel et al., 1997)
Obtain Sd_M(θ̂) by sampling M samples with replacement for each bootstrap, where M = ...
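As a concrete single-machine version of the resampling loop in slide 37, here is a minimal R sketch; N, r, and the choice of statistic are illustrative. In the parallel setting of slide 38, each replicate would run on its own node.

```r
# Bootstrap estimate of Sd(theta_hat) for the sample mean (slide 37's recipe).
set.seed(1)
N <- 10000
x <- rexp(N)                 # the original sample (illustrative)
r <- 200                     # number of bootstrap replicates
theta_star <- replicate(r, mean(sample(x, N, replace = TRUE)))
sd(theta_star)               # estimate of Sd(mean); close to sd(x)/sqrt(N)
```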
[Slides 40-66 are not included in this transcript. A surviving fragment from the large-scale logistic regression experiments reads: ">= 10 occurrences => 2.2M covariates; training data: 8,407,752 samples; test data: 510,302 samples."]

67. Avg Training Log-likelihood vs. Number of Iterations (figure)

68. Test AUC vs. Number of Iterations (figure)

69. Better Convergence Can Be Achieved by Better Initialization
Use results from the naive method to initialize the parameters. Adaptively change the step size for each iteration based on the convergence status of the consensus.

70. Recommender Problems for Web Applications (section title)

71. Agenda
Topic of interest: recommender problems for dynamic, time-sensitive applications (content optimization, online advertising, movie recommendation, shopping, ...). Introduction; offline components: regression, collaborative filtering (CF); online components + initialization: time series, online/incremental methods, explore/exploit (bandits); evaluation methods + multi-objective; challenges.

72. Three Components We Will Focus On
Defining the problem: formulate objectives whose optimization achieves some long-term goals for the recommender system, e.g., how to serve content to optimize audience reach and engagement, or to optimize some combination of engagement and revenue.
Modeling (to estimate some critical inputs): predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions, e.g., click rates, average time spent on page, etc.; the feedback could also be explicit, like ratings.
Experimentation: create experiments to collect data proactively to improve models; this helps in converging to the best choice(s) cheaply and rapidly. Explore/exploit (continuous experimentation); DOE (testing hypotheses while avoiding the bias inherent in observational data).

73. Modern Recommendation Systems
Goal: serve the right item to a user in a given context to optimize long-term business objectives. A scientific discipline that involves: large-scale machine learning and statistics; offline models (capture global and stable characteristics); online models (incorporate dynamic components); explore/exploit (active and adaptive experimentation); multi-objective optimization (click rates (CTR), engagement, advertising revenue, diversity, etc.); inferring user interest (constructing user profiles); natural language processing to understand content (topics, aboutness, entities, follow-ups, breaking news, ...).

74. Some Examples from Content Optimization
Simple version: I have a content module on my page; the content inventory is obtained from a third-party source and further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve the overall click rate (CTR) on this module.
More advanced: I got an X% lift in CTR, but I have additional information on other downstream utilities (e.g., advertising revenue). Can I increase downstream utility without losing too many clicks?
Highly advanced: there are multiple modules running on my webpage. How do I perform a simultaneous optimization?

75. Front-Page Example
Recommend applications; recommend search queries; recommend news articles; recommend packages (image, title, summary, links to other pages). Pick 4 out of a pool of K, with K = 20-40 and dynamic. The module routes traffic to other pages.

76. Problems in This Example
Optimize CTR on multiple modules (Today Module, Trending Now, Personal Assistant, News). Simple solution: treat modules as independent and optimize separately; this may not be the best when there are strong correlations. For any single module: optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.

77. Online Advertising
Advertisers submit ads and bids to an ad network; when a user visits a publisher page, an auction selects the best ad(s) to recommend. An ML/statistical model estimates response rates (click, conversion, ad-view), and the system selects argmax f(bid, response rates). Examples: Yahoo!, Google, MSN, ad exchanges (RightMedia, DoubleClick, ...).

78. LinkedIn Today: Content Module
Objective: serve content to maximize engagement metrics like CTR (or weighted CTR).

79. LinkedIn Ads
Match ads to users visiting LinkedIn.

80. Right Media Ad Exchange: Unified Marketplace
Match ads to page views on publisher sites. A publisher has an ad impression to sell and runs an auction; bidders (e.g., AdSense, Ad.com) submit bids of $0.50, $0.60, $0.75 via a network (which becomes a $0.45 bid after revenue sharing), and $0.65; the $0.65 bid wins.

81. Recommender Problems in General
Given an item inventory (articles, web pages, ads, ...) and a context (query, page, ...), use an automated algorithm to select item(s) to show the user; get feedback (click, time spent, ...); refine the models; repeat (a large number of times); optimize the metric(s) of interest (total clicks, total revenue, ...). Example applications: search (web, vertical), online advertising, content, ...

82. Important Factors
Items: articles, ads, modules, movies, users, updates, etc. Context: query keywords, pages, mobile, social media, etc. Metric to optimize: e.g., relevance score, CTR, revenue, engagement; currently most applications are single-objective, but they could be multi-objective optimizations (maximize X subject to Y, Z, ...). Properties of the item pool: size (e.g., all web pages vs. 40 stories), quality of the pool (e.g., anything vs. editorially selected), lifetime (e.g., mostly old items vs. mostly new items).
83. Factors Affecting the Solution (continued)
Properties of the context: pull (specified by an explicit, user-driven query, e.g., keywords or a form) vs. push (specified by implicit context, e.g., a page, a user, a session); most applications are somewhere on the continuum between pull and push. Properties of the feedback on the matches made: types and semantics of feedback (e.g., click, vote); latency (e.g., available in 5 minutes vs. 1 day); volume (e.g., 100K per day vs. 300M per day). Constraints specifying legitimate matches: e.g., business rules, diversity rules, editorial voice. Multiple objectives. Available metadata (e.g., link graph, various user/item attributes).

84. Predicting User-Item Interactions (e.g., CTR)
Myth: "We have so much data on the web; if we can only process it, the problem is solved." In reality, the number of things to learn increases with sample size, and the rate of increase is not slow; the dynamic nature of these systems makes things worse, since we want to learn things quickly and react fast. Data is sparse in web recommender problems: we lack enough data to learn all we want to learn as quickly as we would like. Several power laws interact with each other, e.g., the user-visit power law and the items-served power law (bivariate Zipf: Owen & Dyer, 2011).

85. Can Machine Learning Help?
Fortunately, there are group behaviors that generalize to individuals, and they are relatively stable; e.g., users in San Francisco tend to read more baseball news. The key issue is estimating such groups: a coarse group is more stable but does not generalize that well; a granular group is less stable, with few individuals. Getting a good grouping structure means hitting the sweet spot. Another big advantage on the web: we can intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choice(s). We do not need to learn all user-item interactions, only those that are good.

86. Predicting User-Item Interaction Rates
Offline models (capture stable characteristics at coarse resolutions): logistic regression, boosting, ...; feature construction from content (IR, clustering, taxonomy, entities, ...) and user profiles (clicks, views, social, community, ...). Near-online models (finer-resolution corrections): item- and user-level, with quick updates. Explore/exploit (adaptive sampling): helps rapid convergence to the best choices. The offline models initialize the online ones.

87. Post-Click: An Example in Content Optimization
The recommender serves editorial content on the front page; clicks on front-page links influence the downstream supply distribution, feeding the ad server and display advertising (revenue) as well as downstream engagement (time spent).

88. Serving Content on the Front Page: Click Shaping
What do we want to optimize? Current approach: maximize clicks (equivalently, maximize downstream supply from the front page). But consider the following: Article 1 has CTR = 5% and utility per click = 5; Article 2 has CTR = 4.9% and utility per click = 10. Per 100 visits, promoting Article 2 loses 0.1 clicks but gains 4.9*10 - 5*5 = 24 utils. Doing this over a large number of visits loses some clicks but obtains significant gains in utility, e.g., lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc.).

89. High-Level Picture
An http request arrives; the server consults the item recommendation system, which performs thousands of computations in sub-seconds; statistical models are updated in batch mode, e.g., once every 30 minutes; the user interacts, e.g., clicks or does nothing.

90. High-Level Overview: Item Recommendation System
Inputs: user info and an item index (id, metadata). ML/statistical models score items: P(click), P(share), semantic-relevance score, ... Rank items: sort by score (CTR, bid*CTR, ...), combine scores using multi-objective optimization, threshold on some scores, ... User-item interaction data are batch-processed, and activity and profile data are updated in batch; pre-filtering (spam, editorial, ...) and feature extraction (NLP, clustering, ...) feed the index.
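A toy version of the score-then-rank step in slide 90; the items, bids, click probabilities, and the bid*CTR ranking rule are illustrative assumptions, not a deployed system.

```r
# Toy score-and-rank step (cf. slide 90): rank ad candidates by bid * P(click).
set.seed(1)
items <- data.frame(id     = 1:5,
                    bid    = round(runif(5, 0.10, 1.00), 2),  # advertiser bids
                    pClick = round(runif(5, 0.01, 0.05), 3))  # model's P(click)
items$score <- items$bid * items$pClick   # expected revenue per impression
items[order(-items$score), ]              # serve from the top of this ranking
```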
91. ML/Statistical Models for Scoring
(Chart placing applications by the number of items scored by ML, from about 100 to 100M, against traffic volume and item lifetime, from a few hours to several days: Yahoo! Front Page, LinkedIn Today, Right Media Ad Exchange, LinkedIn Ads.)

92. Summary of Deployments
Yahoo! front page Today Module (2008-2011): 300% improvement in click-through rates. Similar algorithms delivered via a self-serve platform and adopted by several Yahoo! properties (2011): significant improvement in engagement across the Yahoo! network. Fully deployed on the LinkedIn Today module (2012): significant improvement in click-through rates (numbers not revealed for confidentiality reasons). Yahoo! RightMedia exchange (2012): fully deployed algorithms to estimate response rates (CTR, conversion rates); significant improvement in revenue (numbers not revealed for confidentiality reasons). LinkedIn self-serve ads (2012-2013): fully deployed. LinkedIn news feed (2013-2014): fully deployed. Several others in progress.

93. Broad Themes
Curse of dimensionality: a large number of observations (rows) and a large number of potential features (columns); use domain knowledge and machine learning to reduce the effective dimension (constraints on parameters reduce degrees of freedom); examples as we move along. We often assume our job is to analyze big data, but we often have control over what data to collect through clever experimentation, which can fundamentally change solutions. Think of computation and models together for big data. Optimization: what we are trying to optimize is often complex, and models must work in harmony with the optimization; Pareto optimality with competing objectives.

94. Statistical Problem
Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest. Examples of utility functions: click rates (CTR); share rates (CTR * P[share | click]); revenue per page view = CTR * bid (more complex due to the second-price auction). CTR is a fundamental measure that opens the door to a more principled approach to ranking items: converge rapidly to the maximum-utility items via a sequential decision-making process (explore/exploit).

95. LinkedIn Today, Yahoo! Today Module: Choose Items to Maximize CTR
The algorithm selects item j from a set of candidates for user i with user features (e.g., industry, behavioral features, demographic features, ...), and observes the response y_ij (click or not) for visit (i, j). Which item should we select: the item with the highest predicted CTR (exploit), or an item for which we need data to predict its CTR (explore)? This is an explore/exploit problem.

96. The Explore/Exploit Problem (to maximize CTR)
Problem definition: pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items. Easy? Just pick the items having the highest click-through rates (CTRs). But the system is highly dynamic: items come and go with short lifetimes, and the CTR of each item may change over time. How much traffic should be allocated to exploring new items to achieve optimal performance? Too little yields unreliable CTR estimates due to starvation; too much leaves little traffic to exploit the high-CTR items.

97. Y! Front Page Application
Simplification: maximize CTR on the first slot (F1). Item pool: editorially selected for high quality and brand image; few articles in the pool, but the pool is dynamic.

98. CTR Curves of Items on LinkedIn Today
(Figure: CTR over time for several items.)
99. Impact of Repeat Item Views on a Given User
The same user is shown an item multiple times (despite not clicking); CTR drops with repeat views. (Figure.)

100. Simple Algorithm to Estimate the Most Popular Item with a Small but Dynamic Item Pool
Simple explore/exploit scheme: ε% explore: with a small probability (e.g., 5%), choose an item at random from the pool; (100 - ε)% exploit: with large probability (e.g., 95%), choose the item with the highest estimated CTR. Temporal smoothing: item CTRs change over time, so give more weight to recent data when estimating item CTRs (Kalman filter, moving average). Discount the item score with repeat views: CTR(item) for a given user drops with repeat views by some discount factor (estimated from data). Segmented most popular: perform separate most-popular recommendation for each user segment.

101. Time-Series Model: Kalman Filter
Dynamic Gamma-Poisson: the click rate evolves over time in a multiplicative fashion. The estimated click-rate distribution at time t+1 has a prior mean and prior variance derived from the time-t posterior (derivation on slides 165-166); high-CTR items are more adaptive.

102. More Economical Exploration? Better Bandit Solutions
Consider a two-armed problem with unknown payoff probabilities p1 > p2. The gambler has 1000 plays; what is the best way to experiment to maximize the total expected reward? This is the multi-armed bandit problem, which has been studied for a long time. Optimal solution: play the arm that has the maximum potential of being good ("optimism in the face of uncertainty").

103. Item Recommendation: Bandits?
Two items: Item 1 has CTR = 2/100; Item 2 has CTR = 250/10000. Greedy: show Item 2 to everyone. Not a good idea: Item 1's CTR estimate is noisy, and the item could potentially be better. Invest in Item 1 for better overall performance on average. Exploit what is known to be good; explore what is potentially good. (Figure: posterior CTR densities of the two items.)
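A minimal simulation of the ε-greedy scheme from slide 100, using slide 103's two items as the pool; the true CTRs and ε are illustrative assumptions.

```r
# epsilon-greedy most-popular scheme (slide 100) on slide 103's two items.
set.seed(1)
p_true <- c(0.03, 0.025)                     # unknown true CTRs (illustrative)
clicks <- c(2, 250); views <- c(100, 10000)  # starting counts from slide 103
eps <- 0.05
for (t in 1:10000) {
  explore <- runif(1) < eps
  i <- if (explore) sample(2, 1) else which.max(clicks / views)
  views[i]  <- views[i] + 1
  clicks[i] <- clicks[i] + rbinom(1, 1, p_true[i])
}
round(clicks / views, 4)   # exploration lets the better item surface over time
```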
104. Next Few Hours
Most popular recommendation and personalized recommendation. Offline models: collaborative filtering (cold-start problem). Online models: time-series models; incremental CF, online regression. Intelligent initialization: prior estimation; prior estimation with dimension reduction. Explore/exploit: multi-armed bandits; bandits with covariates.

105. Offline Components: Collaborative Filtering in Cold-Start Situations (section title)

106. Problem
The algorithm selects item j, with item features x_j (keywords, content categories, ...), for user i, with user features x_i (demographics, browse history, search history, ...), and observes the response y_ij for visit (i, j) (an explicit rating, or implicit click/no-click). Predict the unobserved entries based on the features and the observed entries.

107. Model Choices
Feature-based (or content-based) approach: use features to predict the response (regression, Bayes nets, mixture models, ...). Limitation: needs predictive features; bias is often high, and it does not capture signals at granular levels. Collaborative filtering (CF, aka memory-based): make recommendations based on past user-item interactions (user-user, item-item, matrix factorization, ...; see [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD'08 tutorial], etc.). Better performance for old users and old items, but does not naturally handle new users and new items (cold start).

108. Collaborative Filtering (Memory-Based Methods)
User-user similarity; item-item similarities; incorporating both. Estimating similarities: Pearson's correlation; optimization-based (Koren et al.).

109. How to Deal with the Cold-Start Problem
Heuristic-based approaches: linear combination of regression and CF models; Filterbot: add user features as pseudo-users and do collaborative filtering. Hybrid approaches: use content-based methods to fill in entries, then use CF. Matrix factorization: good performance on Netflix (Koren, 2009). Model-based approaches: bilinear random-effects models (probabilistic matrix factorization), good on Netflix data [Salakhutdinov & Mnih]; add feature-based regression to matrix factorization (Agarwal and Chen, 2009); add topic discovery (from textual items) to matrix factorization (Agarwal and Chen, 2009; Chun and Blei, 2011).

110. Per-Item Regression Models
When tracking users by cookies, the distribution of visit patterns can get extremely skewed: the majority of cookies have 1-2 visits. Per-item models (regressions) based on user covariates are attractive in such cases.

111. Several Per-Item Regressions: Multi-Task Learning
Low dimension (5-10); B estimated from retrospective data (Agarwal, Chen and Elango, KDD 2010). Captures affinity to old items.

112. Per-User, Per-Item Models via a Bilinear Random-Effects Model (section title)

113. Motivation
Data measuring k-way interactions are pervasive; consider k = 2 for all our discussions, e.g., user-movie, user-content, user-publisher-ads, ... Power laws on both user and item degrees. Classical technique: approximate the matrix through a singular value decomposition (SVD) after adjusting for marginal effects (user popularity, movie popularity, ...). This does not work: the matrix is highly incomplete, leading to severe over-fitting. Key issue: regularization of the eigenvectors (factors) to avoid overfitting.

114. Early Work on Complete Matrices
Tukey's 1-df model (1956): rank-1 approximation of a small, nearly complete matrix. Criss-cross regression (Gabriel, 1978). Incomplete matrices: psychometrics (1-factor models only; small data sets; 1960s). Modern-day recommender problems: highly incomplete, large, noisy.

115. Latent Factor Models
(Figure: users and items embedded along latent "newsy" and "sporty" axes; affinity = u'v in the factor representation, affinity = s'z in the topic representation.)

116. Factorization: Brief Overview
Latent user factors (α_i, u_i = (u_i1, ..., u_in)) and latent movie factors (β_j, v_j = (v_j1, ..., v_jm)), with interaction E(y_ij) = α_i + β_j + u_i'v_j. With (Nn + Mm) parameters, the key technical issue is that the model will overfit for moderate values of n, m; regularization is needed.

117. Latent Factor Models: Different Aspects
Matrix factorization: factors in Euclidean space or factors on the simplex. Incorporating features and ratings simultaneously. Online updates.

118. Maximum Margin Matrix Factorization (MMMF)
Complete the matrix by minimizing a loss (hinge, squared error) on the observed entries subject to constraints on the trace norm (Srebro, Rennie, Jaakkola, NIPS 2004); convex semi-definite programming (expensive, not scalable). Fast MMMF (Rennie & Srebro, ICML 2005): constrain the Frobenius norms of the left and right eigenvector matrices; not convex, but scalable. Other variation: ensemble MMMF (DeCoste, ICML 2005): ensembles of partially trained MMMF (some improvements).

119. Matrix Factorization for Netflix Prize Data
Minimize the objective function
  Σ_{ij in obs} (r_ij - u_i'v_j)² + λ (Σ_i ||u_i||² + Σ_j ||v_j||²).
Simon Funk: stochastic gradient descent. Koren et al. (KDD 2007): alternating least squares; they moved to SGD later in the competition.

120. Probabilistic Matrix Factorization (Salakhutdinov & Mnih, NIPS)
  r_ij ~ N(u_i'v_j, σ²), u_i ~ MVN(0, a_u I), v_j ~ MVN(0, a_v I).
Optimization is through iterated conditional modes. Other variations: constraining the mean through a sigmoid; using who-rated-whom information. Combining with Boltzmann machines also improved performance.
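A minimal SGD sketch for the objective in slide 119, in the spirit of Simon Funk's approach; all sizes, learning rate, and regularization strength are illustrative assumptions.

```r
# SGD matrix factorization for sum (r_ij - u_i'v_j)^2 + lambda(||u||^2 + ||v||^2).
set.seed(1)
n_u <- 20; n_i <- 15; k <- 3
truth <- matrix(rnorm(n_u * k), n_u) %*% t(matrix(rnorm(n_i * k), n_i))
obs <- which(matrix(runif(n_u * n_i) < 0.4, n_u), arr.ind = TRUE)  # ~40% observed
r <- truth[obs] + rnorm(nrow(obs), sd = 0.1)

U <- matrix(rnorm(n_u * k, sd = 0.1), n_u)   # latent user factors
V <- matrix(rnorm(n_i * k, sd = 0.1), n_i)   # latent item factors
eta <- 0.02; lambda <- 0.01                  # step size, regularization
for (epoch in 1:50) {
  for (t in sample(nrow(obs))) {             # one pass in random order
    i <- obs[t, 1]; j <- obs[t, 2]
    e <- r[t] - sum(U[i, ] * V[j, ])         # residual for this rating
    u_old <- U[i, ]
    U[i, ] <- U[i, ] + eta * (e * V[j, ] - lambda * U[i, ])
    V[j, ] <- V[j, ] + eta * (e * u_old   - lambda * V[j, ])
  }
}
mean((r - rowSums(U[obs[, 1], ] * V[obs[, 2], ]))^2)   # training MSE
```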
121. Bayesian Probabilistic Matrix Factorization (Salakhutdinov and Mnih, ICML 2008)
Fully Bayesian treatment using an MCMC approach; significant improvement. The interpretation as a fully Bayesian hierarchical model shows why: failing to incorporate uncertainty leads to bias in the estimates. Variance components such as a_u are sampled rather than fixed. The posterior is multi-modal, and MCMC helps in converging to a better mode; MCEM is also more resistant to over-fitting.

122. Non-Parametric Bayesian Matrix Completion (Zhou et al., SAM, 2010)
Specify the rank probabilistically (automatic rank selection):
  y_ij ~ N(Σ_{k=1}^r z_k u_ik v_jk, σ²), z_k ~ Ber(π_k), π_k ~ Beta(a/r, b(r-1)/r),
so that E(#factors) = ra / (a + b(r-1)).

123. How to Incorporate Features: Dealing with Both Warm Start and Cold Start
Models to predict ratings for new pairs. Warm start: the (user, movie) pair is present in the training data with a large sample size. Cold start: at least one of (user, movie) is new or has a small sample size. This is a rough definition; warm start/cold start is a continuum. Challenges: a highly incomplete (user, movie) matrix; heavy-tailed degree distributions for users/movies (a large fraction of ratings comes from a small fraction of users/movies); handling both warm start and cold start effectively in the presence of predictive features.

124. Possible Approaches
Large-scale regression based on covariates: does not provide good estimates for heavy users/movies, and a large number of predictors is needed to estimate interactions. Collaborative filtering (neighborhood-based or factorization): good for warm start, with cold start dealt with separately. Preferred: a single model that handles both cold start and warm start: heavy users/movies get user/movie-specific models; light users/movies fall back on the regression model, with a smooth fallback mechanism for good performance.

125. Add Feature-Based Regression into Matrix Factorization
RLFM: Regression-based Latent Factor Model (section title).

126. Regression-Based Factorization Model (RLFM)
Main idea: flexible priors; predict the factors through regressions. Seamlessly handles both cold start and warm start. The state equation is modified to incorporate covariates.

127. RLFM: Model
Rating y_ij that user i gives item j:
  y_ij ~ N(μ_ij, σ²) (Gaussian model);
  y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings);
  y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts);
with t(μ_ij) = x_ij'b + α_i + β_j + u_i'v_j.
Bias of user i: α_i = g0'x_i + ε_i^α, ε_i^α ~ N(0, σ_α²).
Popularity of item j: β_j = d0'x_j + ε_j^β, ε_j^β ~ N(0, σ_β²).
Factors of user i: u_i = G x_i + ε_i^u, ε_i^u ~ MVN(0, σ_u² I).
Factors of item j: v_j = D x_j + ε_j^v, ε_j^v ~ MVN(0, σ_v² I).
Other classes of regression models could also be used.

128. Graphical Representation of the Model (figure)

129. Advantages of RLFM
Better regularization of the factors: covariates shrink estimates towards a better centroid. Cold start: fall back on the regression model (FeatureOnly).

130. RLFM: Illustration of Shrinkage
Plot of the first factor value for each user (fitted using Yahoo! FP data). (Figure.)
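Slide 127's model can be read generatively; here is a minimal R simulation of the Gaussian case under assumed dimensions and variances (all values illustrative; the x_ij'b term is omitted for simplicity).

```r
# Simulate from the RLFM prior of slide 127 (Gaussian case; sizes are made up).
set.seed(1)
N <- 100; M <- 50; p <- 3; k <- 2           # users, items, feature dim, factors
xi <- matrix(rnorm(N * p), N); xj <- matrix(rnorm(M * p), M)  # features
g0 <- rnorm(p); d0 <- rnorm(p)
G  <- matrix(rnorm(k * p, sd = 0.5), k); D <- matrix(rnorm(k * p, sd = 0.5), k)

alpha <- xi %*% g0 + rnorm(N, sd = 0.3)     # user bias: regression + noise
beta  <- xj %*% d0 + rnorm(M, sd = 0.3)     # item popularity
U <- t(G %*% t(xi)) + matrix(rnorm(N * k, sd = 0.3), N)   # u_i = G x_i + noise
V <- t(D %*% t(xj)) + matrix(rnorm(M * k, sd = 0.3), M)   # v_j = D x_j + noise

i <- 7; j <- 3                               # one (user, item) pair
mu <- alpha[i] + beta[j] + sum(U[i, ] * V[j, ])
y  <- rnorm(1, mu, 1)                        # observed rating
```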
131. Model Fitting: EM for Our Class of Models (section title)

132. The Parameters for RLFM
Latent parameters: Δ = ({α_i}, {β_j}, {u_i}, {v_j}). Hyper-parameters: Θ = (b, G, D, A_u = a_u I, A_v = a_v I, ...).

133. Computing the Mode
The posterior mode of the latent factors is found by minimizing the regularized objective (the negative log-posterior).

134. The EM Algorithm (outline figure)

135. Computing the E-Step
Often hard to compute in closed form. Stochastic EM (Markov chain EM; MCEM): compute the expectation by drawing samples from the posterior; effective for multi-modal posteriors but more expensive. Iterated conditional modes (ICM): faster, but gives biased hyper-parameter estimates.

136. Monte Carlo E-Step
Through a vanilla Gibbs sampler (the conditionals are in closed form). The other conditionals are also Gaussian and closed form. The conditionals of users (movies) are sampled simultaneously. Use a small number of samples in early iterations and larger numbers in later iterations.

137. M-Step (Why MCEM Is Better than ICM)
Update G, optimize, and update A_u = a_u I. ICM ignores factor variability and underestimates it: the factors get over-shrunk, and the posterior is not explored well.

138. Experiment 1: Better Regularization
MovieLens-100K, average RMSE using pre-specified splits; ZeroMean, RLFM and FeatureOnly (no cold-start issues). Covariates: users: age, gender, zipcode (first digit only); movies: genres.

139. Experiment 2: Better Handling of Cold Start
MovieLens-1M and EachMovie; training-test split based on timestamp; same covariates as in Experiment 1.

140. Experiment 4: Predicting Click Rate on Articles
Goal: predict the click rate on articles for a user in the F1 position. Article lifetimes are short, so dynamic updates are important. User covariates: age, gender, geo, browse behavior. Article covariates: content category, keywords. 2M ratings, 30K users, 4.5K articles.

141. Results on Y! FP Data (figure)

142. Some Other Related Approaches
Stern, Herbrich and Graepel (WWW 2009): similar to RLFM, with a different parametrization; expectation propagation is used to fit the models. Porteous, Asuncion and Welling (AAAI 2011): a non-parametric approach using a Dirichlet process. Agarwal, Zhang and Mazumdar (Annals of Applied Statistics, 2011): regression plus per-user random effects regularized through a graphical lasso.

143. Add Topic Discovery into Matrix Factorization
fLDA: Matrix Factorization through Latent Dirichlet Allocation (section title).

144. fLDA: Introduction
Model the rating y_ij that user i gives item j as the user's affinity to the topics that the item has: y_ij = ... + Σ_k s_ik z̄_jk, where s_ik is user i's affinity to topic k, and z̄_jk is Pr(item j has topic k), estimated by averaging the LDA topic assignments of the words in item j. Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data. fLDA can be thought of as a multi-task learning version of the supervised LDA model [Blei07] for cold-start recommendation. Old items: the z̄_jk are item latent factors learnt from data with the LDA prior; new items: the z̄_jk are predicted based on the bag of words in the item.

145. LDA Topic Modeling (1)
LDA is effective for unsupervised topic discovery [Blei03]; it models the generating process of a corpus of items (articles). For each topic k, draw a word distribution Φ_k = [Φ_k1, ..., Φ_kW] ~ Dir(η). For each item j, draw a topic distribution θ_j = [θ_j1, ..., θ_jK] ~ Dir(λ). For each word, say the n-th word, in item j: draw a topic z_jn for that word from θ_j, and draw a word w_jn from Φ_k with topic k = z_jn.

146. LDA Topic Modeling (2)
Model training: estimate the prior parameters and the posterior topic-word distribution Φ based on a training corpus of items; EM plus Gibbs sampling is a popular method. Inference for new items: compute the item topic distribution based on the prior parameters and the Φ estimated in the training phase. Supervised LDA [Blei07]: predict a target value for each item based on supervised LDA topics, y_j = Σ_k s_k z̄_jk (one global regression), versus fLDA's y_ij = ... + Σ_k s_ik z̄_jk (one regression per user, with the same set of topics across the different regressions).
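The generative process in slide 145 is easy to write down; below is a minimal R sketch with made-up corpus sizes and symmetric Dirichlet parameters (a Dirichlet sampler is implemented inline via gamma draws, so only base R is needed).

```r
# Generate a tiny corpus from the LDA model of slide 145 (all sizes illustrative).
set.seed(1)
K <- 3; W <- 50; n_items <- 5; n_words <- 20
rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

Phi   <- t(replicate(K, rdirichlet(rep(0.1, W))))        # topic-word dists
theta <- t(replicate(n_items, rdirichlet(rep(0.5, K))))  # item-topic dists

docs <- lapply(1:n_items, function(j) {
  z <- sample(K, n_words, replace = TRUE, prob = theta[j, ])  # per-word topics
  w <- sapply(z, function(k) sample(W, 1, prob = Phi[k, ]))   # words
  list(topics = z, words = w)
})
zbar <- sapply(docs, function(d) tabulate(d$topics, K) / n_words)
zbar   # empirical topic proportions z_bar_jk, the quantity fLDA regresses on
```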
147. fLDA: Model
Rating y_ij that user i gives item j:
  y_ij ~ N(μ_ij, σ²) (Gaussian model);
  y_ij ~ Bernoulli(μ_ij) (logistic model, for binary ratings);
  y_ij ~ Poisson(μ_ij N_ij) (Poisson model, for counts);
with t(μ_ij) = x_ij'b + α_i + β_j + Σ_k s_ik z̄_jk.
Bias of user i: α_i = g0'x_i + ε_i^α, ε_i^α ~ N(0, σ_α²).
Popularity of item j: β_j = d0'x_j + ε_j^β, ε_j^β ~ N(0, σ_β²).
Topic affinity of user i: s_i = H x_i + ε_i^s, ε_i^s ~ MVN(0, σ_s² I).
Pr(item j has topic k): z̄_jk = Σ_n 1(z_jn = k) / (#words in item j), where z_jn is the LDA topic of the n-th word in item j. Observed words: w_jn ~ LDA(λ, η, z_jn), where w_jn is the n-th word in item j.

148. Model Fitting
Given: features X = {x_i, x_j, x_ij}, observed ratings y = {y_ij}, and words w = {w_jn}. Estimate: parameters Θ = [b, g0, d0, H, σ², a_α, a_β, A_s, λ, η] (regression weights and prior parameters) and latent factors Δ = {α_i, β_j, s_i} and z = {z_jn} (user factors, item factors, and per-word topic assignments). Empirical Bayes approach: the maximum likelihood estimate of the parameters,
  Θ̂ = argmax_Θ Pr[y, w | Θ] = argmax_Θ ∫ Pr[y, w, Δ, z | Θ] dΔ dz,
and the posterior distribution of the factors, Pr[Δ, z | y, w, Θ̂].

149. The EM Algorithm
Iterate through the E and M steps until convergence. Let Θ̂^(n) be the current estimate. E-step: compute f(Θ) = E_{Δ, z | y, w, Θ̂^(n)}[log Pr(y, w, Δ, z | Θ)]; the expectation is not in closed form, so we draw Gibbs samples and compute the Monte Carlo mean. M-step: find Θ̂^(n+1) = argmax_Θ f(Θ); this consists of solving a number of regression and optimization problems.

150. Supervised Topic Assignment
The Gibbs conditional for z_jn, the topic of the n-th word in item j:
  Pr(z_jn = k | Rest) ∝ (Z_jk^(-jn) + λ) * (Z_k,w_jn^(-jn) + η) / (Z_k^(-jn) + Wη) * Π_{i rated j} f(y_ij | z_jn = k).
The first part is the same as unsupervised LDA; the product term is the likelihood of the ratings observed from users who rated item j when z_jn is set to topic k, i.e., the probability of observing the y_ij given the model.

151. fLDA: Experimental Results (Movie)
Task: predict the rating that a user would give a movie. Training/test split: sort observations by time; first 75% training data, last 25% test data. Item warm-start scenario: only 2% new items in the test data.

Model         Test RMSE
RLFM          0.9363
fLDA          0.9381
Factor-Only   0.9422
FilterBot     0.9517
unsup-LDA     0.9520
MostPopular   0.9726
Feature-Only  1.0906
Constant      1.1190

fLDA is as strong as the best method; it does not reduce performance in warm-start scenarios.

152. fLDA: Experimental Results (Yahoo! Buzz)
Task: predict whether a user would buzz up an article. Severe item cold start: all items in the test data are new. Data statistics: 1.2M observations, 4K users, 10K articles. fLDA significantly outperforms the other models.

153. Experimental Results: Buzzing Topics
Top terms (after stemming) and topic labels:
bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al: CIA interrogation.
mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain: Swine flu.
NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week: NFL games.
court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow: Gay marriage.
palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska: Sarah Palin.
idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama: American Idol.
economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month: Recession.
north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia: North Korea issues.
3/4 of the topics are interpretable; 1/2 are similar to unsupervised topics.
154. fLDA Summary
fLDA is a useful model for cold-start item recommendation. It also provides interpretable recommendations for users: users' preferences over interpretable LDA topics. Future directions: investigate the Gibbs sampling chains and the convergence properties of the EM algorithm; apply fLDA to other multi-task prediction problems; use fLDA as a tool to generate supervised features (topics) from text data.

155. Summary
Regularizing factors through covariates is effective. A regression-based factor model that regularizes better and deals with both cold start and warm start in a single, seamless framework looks attractive. The fitting method is scalable: Gibbs sampling for users and movies can be done in parallel, and the regressions in the M-step can be done with any off-the-shelf scalable linear regression routine. Distributed computing on Hadoop: fit multiple models and average across partitions (more later).

156. Online Components: Online Models, Intelligent Initialization, Explore/Exploit (section title)

157. Why Online Components?
Cold start: new items or new users come to the system. How do we obtain data for new items/users (explore/exploit)? Once data becomes available, how do we quickly update the model? Periodic rebuilds (e.g., daily) are expensive; continuous online updates (e.g., every minute) are cheap. Concept drift: item popularity, user interest, mood, and user-to-item affinity may change over time. How do we track the most recent behavior (down-weight old data), and how do we model temporal patterns for better prediction (which may not need to be online if the patterns are stationary)?

158. Big Picture
Most popular recommendation vs. personalized recommendation. Offline models: collaborative filtering (cold-start problem). Online models (real systems are dynamic): time-series models; incremental CF, online regression. Intelligent initialization (do not start cold): prior estimation; prior estimation with dimension reduction. Explore/exploit (actively acquire data): multi-armed bandits; bandits with covariates. Extension: segmented most popular recommendation.

159. Online Components for Most Popular Recommendation
Online models, intelligent initialization, and explore/exploit (section title).

160. Most Popular Recommendation: Outline
Most popular recommendation (no personalization; all users see the same thing): time-series models (online models); prior estimation (initialization); multi-armed bandits (explore/exploit). Sometimes hard to beat! Segmented most popular recommendation: create user segments/clusters based on user features and do most-popular recommendation within each segment.

161. Most Popular Recommendation
Problem definition: pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items. Easy? Just pick the items having the highest click-through rates (CTRs). But the system is highly dynamic: items come and go with short lifetimes, and the CTR of each item changes over time. How much traffic should be allocated to exploring new items to achieve optimal performance? Too little yields unreliable CTR estimates; too much leaves little traffic to exploit the high-CTR items.

162. CTR Curves for Two Days on Yahoo! Front Page
Each curve is the CTR of an item in the Today Module on www.yahoo.com over time; the traffic was obtained from a controlled randomized experiment (no confounding). Things to note: (a) short lifetimes, (b) temporal effects, (c) often breaking-news stories.
163. For Simplicity, Assume
Pick only one item for each user visit (multi-slot optimization later). No user segmentation, no personalization (discussion later). The pool of candidate items is predetermined and relatively small (~1000), e.g., selected by human editors or by a first-phase filtering method; ideally there should be a feedback loop (the large-item-pool problem comes later). Effects like user fatigue, diversity in recommendations, and multi-objective optimization are not considered (discussion later).

164. Online Models
How do we track the changing CTR of an item? Data: for each item, at time t, we observe the number of times the item was displayed (#views, n_t) and the number of clicks c_t on the item. Problem definition: given c_1, n_1, ..., c_t, n_t, predict the CTR (click-through rate) p_{t+1} at time t+1. Potential solutions: the observed CTR at t, c_t/n_t, is highly unstable (n_t is usually small); the cumulative CTR, (Σ_i c_i)/(Σ_i n_i), reacts to changes very slowly; the moving-window CTR, (Σ_{i in last K} c_i)/(Σ_{i in last K} n_i), is reasonable, but provides no estimate of Var[p_{t+1}] (useful for explore/exploit).

165. Online Models: Dynamic Gamma-Poisson
Model-based approach: show the item n_t times and receive c_t clicks, with p_t the CTR at time t. (c_t | n_t, p_t) ~ Poisson(n_t p_t); p_t = p_{t-1} ε_t, where ε_t ~ Gamma(mean = 1, var = η). Model parameters: p_1 ~ Gamma(mean = μ_0, var = σ_0²) is the offline CTR estimate; η specifies how dynamic/smooth the CTR is over time. The posterior distribution (p_{t+1} | c_1, n_1, ..., c_t, n_t) ~ Gamma(?, ?) is solved recursively (online update rule).

166. Online Models: Derivation
Estimated CTR distribution at time t: (p_t | c_1, n_1, ..., c_{t-1}, n_{t-1}) ~ Gamma(mean = μ_t, var = σ_t²); let γ_t = μ_t / σ_t² (the effective sample size). After observing (c_t, n_t), the conjugate update gives mean μ_{t|t} = (γ_t μ_t + c_t) / (γ_t + n_t) with effective sample size γ_{t|t} = γ_t + n_t. Evolving to time t+1, the multiplicative dynamics keep the mean but inflate the variance, shrinking the effective sample size; as a result, high-CTR items are more adaptive.

167. Tracking Behavior of the Gamma-Poisson Model
(Figure.) Low-click-rate articles get more temporal smoothing.

168. Intelligent Initialization: Prior Estimation
Prior CTR distribution: Gamma(mean = μ_0, var = σ_0²). N historical items: n_i = #views of item i in its first time interval; c_i = #clicks on item i in its first time interval. Model c_i ~ Poisson(n_i p_i) and p_i ~ Gamma(μ_0, σ_0²), so marginally c_i ~ NegBinomial(μ_0, σ_0², n_i). Obtain the maximum likelihood estimate (MLE) of (μ_0, σ_0²) by maximizing the negative-binomial log-likelihood over the N historical items. Better priors: cluster items and find the MLE for each cluster (Agarwal & Chen, 2011, SIGMOD).

169. Explore/Exploit: Problem Definition
Items 1, 2, ..., K receive x_1%, x_2%, ..., x_K% of page views. Determine (x_1, x_2, ..., x_K) based on the clicks and views observed before t in order to maximize the expected total number of clicks in the future.

170. Modeling the Uncertainty, NOT Just the Mean
Simplified setting: two items. We know the CTR q of item A (say, shown 1 million times); we are uncertain about the CTR p of item B (shown only 100 times), with probability density function f(p). If we only make a single decision, give 100% of page views to item A. If we make multiple decisions in the future, explore item B, since its CTR can potentially be higher. Potential of item B: ∫_{p > q} (p - q) f(p) dp.
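A minimal R sketch of the conjugate measurement update behind slides 165-166; the discount step and all constants here are assumptions for illustration, not the slides' exact evolution equation.

```r
# Online CTR tracking with a Gamma-Poisson model (cf. slides 165-166).
# State: Gamma posterior summarized by mean mu and effective sample size gam.
set.seed(1)
mu <- 0.02; gam <- 50        # prior mean CTR and effective sample size (assumed)
delta <- 0.9                 # illustrative discount for the evolution step
true_ctr <- 0.035
for (t in 1:8) {
  n <- 1000                  # views in interval t
  c <- rpois(1, n * true_ctr)               # clicks in interval t
  mu  <- (gam * mu + c) / (gam + n)         # conjugate Gamma-Poisson update
  gam <- delta * (gam + n)                  # discount: keeps the filter adaptive
  cat(sprintf("t=%d  estimated CTR = %.4f\n", t, mu))
}
```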
171. Multi-Armed Bandits: Introduction (1)
Bandit arms have unknown payoff probabilities p1, p2, p3, ... Pulling arm i yields a reward: reward = 1 with probability p_i (success), reward = 0 otherwise (failure). For now, we are attacking the problem of choosing the best article/arm for all users.

172. Multi-Armed Bandits: Introduction (2)
Goal: pull arms sequentially to maximize the total reward. A bandit scheme/policy is a sequential algorithm for playing arms (items). The regret of a scheme is its expected loss relative to the oracle optimal scheme that always plays the best arm, where "best" means the highest success probability; the best arm is not known unless you have an oracle. Regret is the price of exploration, and low regret implies quick convergence to the best arm.

173. Multi-Armed Bandits: Introduction (3)
Bayesian approach: seeks the Bayes-optimal solution to a Markov decision process (MDP), with assumptions about the probability distributions; representative work: the Gittins index and Whittle's index; very computationally intensive. Minimax approach: seeks a scheme that incurs bounded regret, with no or mild assumptions about the probability distributions; representative work: UCB by Lai and by Auer; usually computationally easy, but these schemes tend to explore too much in practice (probably because the bounds are based on worst-case analysis).

174. Multi-Armed Bandits: Markov Decision Process (1)
Select an arm now, at time t = 0, to maximize the expected total number of clicks over t = 0, ..., T. State at time t: Θ_t = (θ_1t, ..., θ_Kt), where θ_it is the state of arm i at time t (capturing all we know about arm i at t). Reward function R_i(Θ_t, Θ_{t+1}): the reward of pulling arm i, which brings the state from Θ_t to Θ_{t+1}. Transition probability: Pr[Θ_{t+1} | Θ_t, pulling arm i]. Policy π: a function that maps a state to an arm (action); π(Θ_t) returns the arm to pull. Value of policy π starting from the current state Θ_0 with horizon T:
  V_π(Θ_0, T) = E[R_{π(Θ_0)}(Θ_0, Θ_1)] + E[V_π(Θ_1, T-1)]
              = ∫ Pr[Θ_1 | Θ_0, π(Θ_0)] { R_{π(Θ_0)}(Θ_0, Θ_1) + V_π(Θ_1, T-1) } dΘ_1,
i.e., the immediate reward plus the value of the remaining T-1 time slots starting from state Θ_1.

175. Multi-Armed Bandits: MDP (2)
Optimal policy: argmax_π V_π(Θ_0, T). Things to notice: the value is defined recursively (actually T high-dimensional integrals); dynamic programming can be used to find the optimal policy, but even evaluating the value of a fixed policy can be very expensive. Bandit problem: the pull of one arm does not change the state of the other arms, and the set of arms does not change over time.

176. Multi-Armed Bandits: MDP (3)
Which arm should be pulled next? Not necessarily the one that looks best right now, since it might have had a few lucky successes; the answer looks like a function of the successes and failures of all arms. Consider a slightly different problem setting: an infinite time horizon, but with future rewards geometrically discounted:
  R_total = R(0) + γ R(1) + γ² R(2) + ... (0 < γ < 1) ...
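Tying slides 103 and 170 together, here is a short R computation of the "potential" integral for slide 103's two items, with assumed Beta posteriors (a uniform prior is our illustrative choice).

```r
# Slide 103's example: Item 1 (2/100) vs. Item 2 (250/10000), Beta posteriors.
q <- 250 / 10000                          # treat Item 2's CTR as known, q = 0.025
f <- function(p) dbeta(p, 1 + 2, 1 + 98)  # Item 1 posterior under a uniform prior
potential <- integrate(function(p) (p - q) * f(p), q, 1)$value  # slide 170
potential  # expected gain from exploring Item 1 in the event it beats q
```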