
Page 1:

Large Scale Machine Learning for Information Retrieval

Bo Long, Liang Zhang
LinkedIn Applied Relevance Science

CIKM 2013, San Francisco

Page 2:

Structure of This Tutorial

• Part I: Introduction
  – Challenges from Big Data
  – Introduction of Distributed Computing
  – Introduction of Map-Reduce
  – Web Recommender Systems: Our Motivating Problem
• Part II: Large Scale Logistic Regression
• Part III: Parallel Matrix Factorization
• Part IV: Bag of Little Bootstraps
• Part V: Future of Distributed Computing Infrastructure

Page 3:

Big Data becoming Ubiquitous

• Bioinformatics
• Astronomy
• Internet
• Telecommunications
• Climatology
• …

Page 4:

Big Data: Some size estimates

• 1000 human genomes: > 100TB of data (1000 Genomes Project)
• Sloan Digital Sky Survey: 200GB of data per night (>140TB aggregated)
• Facebook: a billion monthly active users
• LinkedIn: 238M members worldwide
• Twitter: 500 million tweets a day
• Over 6 billion mobile phones in the world generating data every day

Page 5:

Challenges: Machine Learning for Big Data

• Before:
  – A small/medium sized data set
  – Train models on a single machine
• Now:
  – Almost infinite supply of data
  – Does the problem become easier? (asymptotic theory)
  – Not quite
• More data comes with more heterogeneity
• Need to change our thinking in terms of
  – Models
  – Model fitting approaches
  – Data analysis
  – …

Page 6:

Some Challenges

• Exploratory Data Analysis (EDA), Visualization
  – Retrospective (on terabytes)
  – More real time (streaming computations every few minutes/hours)
• Machine Learning Models
  – Scale (computational challenge)
  – Curse of dimensionality
    • Millions of predictors, heterogeneity
  – Temporal and spatial correlations

Page 7:

Some Challenges, continued

• Experiments
  – To test new methods, test hypotheses from randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
• Many more that we are not yet fully well versed in

Page 8:

Introduction of Distributed Computing

Page 9:

Distributed Computing for Big Data

• Distributed computing is an invaluable tool for scaling computations to big data
• Some distributed computing models
  – Multi-threading
  – Graphics Processing Units (GPU)
  – Message Passing Interface (MPI)
  – Map-Reduce

Page 10:

Evaluating a method for a problem

• Scalability
  – Process X GB in Y hours
• Ease of use
• Reliability (fault tolerance)
• Cost
  – Hardware and cost of maintenance
• Good for the computations required?
  – E.g., iterative versus one-pass
• Resource sharing

Page 11:

Multithreading

• Multiple threads take advantage of multiple CPUs
• Shared memory
• Threads can execute independently and concurrently
• Can only handle gigabytes of data
• Reliable

Page 12:

Graphics Processing Units (GPU)

• Number of cores:
  – CPU: order of 10
  – GPU: smaller cores, order of 1000
• Can be >100x faster than a CPU
  – Parallel, computationally intensive tasks are off-loaded to the GPU
• Good for certain computationally intensive tasks
• Can only handle gigabytes of data
• Not trivial to use; requires a good understanding of the low-level architecture for efficient use

Page 13:

Message Passing Interface (MPI)

• Language-independent communication protocol among processes (e.g., computers)
• Most suitable for the master/slave model
• Can handle terabytes of data
• Good for iterative processing
• Fault tolerance is low

Page 14:

Map-Reduce (Dean & Ghemawat, 2004)

[Diagram: Data → Mappers → Reducers → Output]

• Computation is split into Map (scatter) and Reduce (gather) stages
• Easy to use:
  – The user only needs to implement two functions: Mapper and Reducer
• Easily handles terabytes of data
• Very good fault tolerance (failed tasks automatically get restarted)

Page 15:

Comparison of Distributed Computing Methods

                              | Multithreading | GPU        | MPI       | Map-Reduce
Scalability (data size)       | Gigabytes      | Gigabytes  | Terabytes | Terabytes
Fault Tolerance               | High           | High       | Low       | High
Maintenance Cost              | Low            | Medium     | Medium    | Medium-High
Iterative Process Complexity  | Cheap          | Cheap      | Cheap     | Usually expensive
Resource Sharing              | Hard           | Hard       | Easy      | Easy
Easy to Implement?            | Easy           | Needs understanding of low-level GPU architecture | Easy | Easy

Page 16:

Introduction of Map-Reduce

Page 17:

Example Problem

• Tabulating word counts in a corpus of documents
• Output: <word, count>

Page 18:

Word Count Through Map-Reduce

Input document 1: "Hello World Bye World"
Input document 2: "Hello Hadoop Goodbye Hadoop"

Mapper 1 (document 1) emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Mapper 2 (document 2) emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Reducer 1 (words from A-G) outputs: <Bye, 1> <Goodbye, 1>
Reducer 2 (words from H-Z) outputs: <Hello, 2> <World, 2> <Hadoop, 2>
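As a concrete illustration of the two functions the user implements, a minimal word-count mapper and reducer might look like the sketch below (written in the Hadoop Streaming style; the file layout and invocation are assumptions of this sketch, not part of the tutorial):

```python
#!/usr/bin/env python
# Illustrative word-count mapper/reducer in the Hadoop Streaming style.
import sys

def mapper(stream):
    # Emit <word, 1> for every word in every input line.
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream):
    # Input arrives sorted by key, so counts for one word are contiguous.
    current_word, count = None, 0
    for line in stream:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # Run as `wordcount.py map` for the map phase, `wordcount.py reduce` for the reduce phase.
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```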

Page 19:

Key Ideas about Map-Reduce

[Diagram]
Big Data → Partition 1, Partition 2, …, Partition N
Partitions → Mapper 1, Mapper 2, …, Mapper N
Mappers emit <Key, Value> pairs
<Key, Value> pairs → Reducer 1, Reducer 2, …, Reducer M
Reducers → Output 1, Output 2, …, Output M

Page 20:

Key Ideas about Map-Reduce

• Data are split into partitions and stored on many different machines on disk (distributed storage)
• Mappers process data chunks independently and emit <Key, Value> pairs
• Data with the same key are sent to the same reducer; one reducer can receive multiple keys
• Every reducer sorts its data by key
• For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

Page 21:

More on Map-Reduce

• Depends on distributed file systems
• Typically the mappers are the data storage nodes
• Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
• Map and Reduce I/O are all on disk
  – Data transmission from mappers to reducers is through disk copy
• Iterative processing through Map-Reduce
  – Each iteration becomes a Map-Reduce job
  – Can be expensive, since the Map-Reduce overhead is high

Page 22:

Web Recommender Systems: Our Motivating Problem

Page 23:

Web Recommender Systems at LinkedIn

Page 24:

Similar problem: Content recommendation on the Yahoo! front page

• Recommend content links (out of 30-40, editorially programmed)
• 4 slots exposed; F1 has maximum exposure
• Routes traffic to other Y! properties

[Diagram: Today module with slots F1, F2, F3, F4]

Page 25:

Web Recommender Systems

• User i visits, with user features (e.g., industry, behavioral features, demographic features, …)
• The algorithm selects item j from a set of candidates with item features (e.g., keywords, categories, …)
• Pair (i, j): response y_ij (click or not)

Which item should we recommend to the user?
→ The item with the highest expected gain of utility
• Sample utilities:
  – CTR
  – Revenue (e.g., CTR × bid)
  – Time spent
  – …

Page 26:

Challenges of Web Recommender Systems

• Large scale data
• Low-latency, real-time ranking
• Low positive rates (sparsity problem)
• Properties of the item set to rank per user
  – Dynamic: new items may get added at any time
  – Personalized:
    • Ads: advertisers specify the targeted users to show their ads to
    • Update stream in a social network: only items from connections
  – May have temporal patterns, e.g., novelty effect
• New users come to the site every second

Page 27:

A Naïve Model (Most-Popular)

• Use the empirical CTR of each item to rank
• Item j: CTR_j = #Clicks_j / #Views_j
• Advantage:
  – Quick update of CTR: e.g., every 5 minutes
• Problem:
  – Different users might like different items
  – Cold start and sparsity: many items may have very few clicks

Page 28:

Segmented Most Popular

         | Male                    | Female
Item 1   | CTR for Male and Item 1 | CTR for Female and Item 1
Item 2   | CTR for Male and Item 2 | CTR for Female and Item 2

• However: sparsity dramatically increases as the number of features grows
• Curse of dimensionality

Page 29:

Page 30:

Another Challenge: Data Collection

• Assume the most-popular model serves all traffic
• Item A: true CTR 5%, 500 clicks, 10,000 views
• Item B: true CTR 10%, 0 clicks, 10 views
• We end up always serving item A!
• Multi-armed bandit problem (Explore-Exploit):
  – A naïve approach: ε-greedy, i.e., use an x% bucket to serve random traffic (see the sketch below)
  – Better approaches later in the tutorial
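A minimal sketch of the ε-greedy serving rule mentioned above (the data structure and the value of ε are illustrative assumptions):

```python
# Epsilon-greedy: with probability epsilon serve a random item (exploration bucket),
# otherwise serve the item with the highest empirical CTR (exploitation).
import random

def choose_item(stats, epsilon=0.05):
    """stats: dict mapping item_id -> (clicks, views)."""
    if random.random() < epsilon:
        return random.choice(list(stats))
    return max(stats, key=lambda j: stats[j][0] / max(stats[j][1], 1))
```

Under this rule, item B in the example above would only be shown through the random bucket, which is exactly the inefficiency the later approaches try to reduce.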

Page 31:

Challenges from Data Analysis

• Example: how do we generate confidence intervals for a model performance metric like AUC?
• Single machine:
  – Bootstrap the test data 100 times
  – Compute std(AUC) from the AUCs of the 100 bootstrapped data sets (see the sketch below)
• How do we do it for large scale data?
  – Bootstrapping is non-trivial!
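For the single-machine case, the bootstrap is straightforward; a sketch using scikit-learn's roc_auc_score (the library choice and NumPy array inputs are assumptions of this sketch):

```python
# Bootstrap std(AUC) on a single machine: resample the test rows with replacement
# n_boot times and take the standard deviation of the resulting AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_std(y_true, y_score, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # assumes each resample contains both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.std(aucs)
```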

Page 32:

Large Scale Logistic Regression for Response Prediction in Web Recommender Systems

Page 33:

Web Recommender Systems at LinkedIn

Page 34:

Recap: Response Prediction in Web Recommender Systems

• User i visits, with user features (e.g., industry, behavioral features, demographic features, …)
• The algorithm selects item j from a set of candidates with item features (e.g., keywords, categories, …)
• Pair (i, j): response y_ij (click or not)

Which item should we recommend to the user?
→ The item with the highest expected gain of utility
• Sample utilities:
  – CTR
  – Revenue (e.g., CTR × bid)
  – Time spent
  – …

Page 35:

Possible Approaches

• Logistic Regression
• Naïve Bayes
• Support Vector Machines (SVM)
• Decision Trees
• Neural Networks
• …

Page 36:

Challenges of Response Prediction

• Abundance of features
  – Context: time, item placement, IP geo, other content on the page
  – Member: gender, age, location, industry, title, function
  – Item: topic, keywords, standardized features
• Need models that can handle sparse data and many features
  – Logistic regression, decision trees/GBDTs, etc.
• Has to perform at scale
• Logistic regression strikes a good balance
  – Well understood, industry proven
  – Can be scaled for both training and inference
  – Straightforward to reason about the model

Page 37:

Response Prediction for Ads

• Items to rank: ad campaigns created by advertisers
• Ranking function:
  – predicted CTR × campaign bid
• The system shows the top-ranked items to the user
• Charge advertisers if clicked
• Second-price auction

Page 38:

CTR Prediction Model for Ads

• Feature vectors
  – Member feature vector: x_i
  – Campaign feature vector: c_j
  – Context feature vector: z_t
• Model:

Page 39:

CTR Prediction Model for Ads

• Feature vectors
  – Member feature vector: x_i
  – Campaign feature vector: c_j
  – Context feature vector: z_t
• Model:
  – Cold-start component
  – Warm-start per-campaign component

Page 40:

CTR Prediction Model for Ads

• Feature vectors
  – Member feature vector: x_i
  – Campaign feature vector: c_j
  – Context feature vector: z_t
• Model:
  – Cold-start component
  – Warm-start per-campaign component
• Cold-start: …  Warm-start: …  Both can have L2 penalties. (A reconstructed form is given below.)
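The model formulas on this slide are images and did not survive extraction. A form consistent with the slide's labels, stated here as an assumption rather than the authors' exact equation, is:

$$\log\frac{p_{ijt}}{1 - p_{ijt}} \;=\; \underbrace{f_{\Theta_c}(x_i, c_j, z_t)}_{\text{cold-start component}} \;+\; \underbrace{x_i^{\top}\,\theta_{w,j}}_{\text{warm-start per-campaign component}},$$

where the cold-start part is a linear function of member, campaign, and context features (and possibly their interactions) with global coefficients $\Theta_c$, the warm-start part uses campaign-specific coefficients $\theta_{w,j}$, and both $\Theta_c$ and each $\theta_{w,j}$ carry Gaussian (L2) penalties.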

Page 41:

Model Fitting

• Single machine (well understood)
  – conjugate gradient
  – L-BFGS
  – trust region
  – …
• Model training with large scale data
  – Cold-start component Θc is more stable
    • Weekly/bi-weekly training is good enough
    • However: the difficulty comes from the need for large-scale logistic regression
  – Warm-start per-campaign model Θw is more dynamic
    • New items can get generated at any time
    • Big loss if opportunities are missed
    • Need to update the warm-start component as frequently as possible

Page 42:

Model Fitting

• Single machine (well understood)
  – conjugate gradient
  – L-BFGS
  – trust region
  – …
• Model training with large scale data
  – Cold-start component Θc is more stable
    • Weekly/bi-weekly training is good enough
    • However: the difficulty comes from the need for large-scale logistic regression  → Large Scale Logistic Regression
  – Warm-start per-campaign model Θw is more dynamic
    • New items can get generated at any time
    • Big loss if opportunities are missed
    • Need to update the warm-start component as frequently as possible  → Per-item logistic regression given Θc

Page 43:

Challenges of Large Scale Logistic Regression

• Logistic regression for small data sets has been studied and widely applied since the 1970s
• Unknown weights can be learned through Newton's method, conjugate gradient ascent, coordinate descent, etc.
• However, for web applications we usually see
  – Hundreds of millions of samples
  – Hundreds of thousands of features
• Q: How do we fit large-scale logistic regression in a scalable way?

Page 44:

Large Scale Logistic Regression

• Naïve:
  – Partition the data and run logistic regression on each partition
  – Take the mean of the learned coefficients (see the sketch below)
  – Problem: not guaranteed to converge to the single-machine model!
• Alternating Direction Method of Multipliers (ADMM), Boyd et al. 2011
  – Set up the constraint that each partition's coefficient vector equals a global consensus
  – Solve the optimization problem using Lagrange multipliers
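A minimal sketch of the naïve partition-and-average approach above, using scikit-learn for the per-partition fits (the library and hyperparameters are assumptions of this sketch):

```python
# Naive approach: fit an independent logistic regression on each partition
# and average the coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_partitioned_lr(partitions, C=1.0):
    """partitions: list of (X, y) blocks, e.g. one per Map-Reduce partition."""
    coefs, intercepts = [], []
    for X, y in partitions:
        model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X, y)
        coefs.append(model.coef_[0])
        intercepts.append(model.intercept_[0])
    return np.mean(coefs, axis=0), np.mean(intercepts)
```

Averaging per-partition coefficients is cheap and embarrassingly parallel but, as the slide notes, it need not match the single-machine solution; ADMM adds the consensus constraint to address this.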

Page 45:

Model Fitting for Large-Scale Ads Data

Page 46:

A Deep Dive into ADMM

Page 47:

Dual Ascent Method

• Consider an equality-constrained convex optimization problem (Boyd et al. 2011, §2.1):

  minimize  f(x)
  subject to  Ax = b

  with variable x ∈ R^n, A ∈ R^{m×n}, and f convex.

• Lagrangian for the problem:

  L(x, y) = f(x) + y^T (Ax - b)

  where y is the dual variable (Lagrange multiplier); the dual function is g(y) = inf_x L(x, y).

• Dual ascent: solve the dual problem (maximize g(y)) by gradient ascent, iterating

  x^{k+1} := argmin_x L(x, y^k)                   (x-minimization step)
  y^{k+1} := y^k + α^k (A x^{k+1} - b)            (dual variable / price update, step size α^k > 0)

• With an appropriate choice of α^k, the dual function increases at each step; the equality-constraint residual A x^{k+1} - b is the gradient of g at y^k.

Page 48:

Augmented Lagrangians

• Bring robustness to the dual ascent method
• Yield convergence without assumptions like strict convexity or finiteness of f
• Augmented Lagrangian (Boyd et al. 2011, §2.3):

  L_ρ(x, y) = f(x) + y^T (Ax - b) + (ρ/2) ||Ax - b||_2^2,   with penalty parameter ρ > 0

• Method of multipliers: dual ascent applied to the augmented problem, using ρ as the step size:

  x^{k+1} := argmin_x L_ρ(x, y^k)
  y^{k+1} := y^k + ρ (A x^{k+1} - b)

  It converges under far more general conditions than dual ascent, including when f takes the value +∞ or is not strictly convex.

• The value of ρ influences the convergence rate

Page 49:

Alternating Direction Method of Multipliers (ADMM)

• Problem (Boyd et al. 2011, §3.1):

  minimize  f(x) + g(z)
  subject to  Ax + Bz = c,   with f and g convex

• Augmented Lagrangian:

  L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (ρ/2) ||Ax + Bz - c||_2^2

• ADMM iterations:

  x^{k+1} := argmin_x L_ρ(x, z^k, y^k)
  z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)
  y^{k+1} := y^k + ρ (A x^{k+1} + B z^{k+1} - c)

• Unlike the method of multipliers, which minimizes over x and z jointly, ADMM updates x and z in an alternating (Gauss-Seidel) fashion; this is what allows decomposition when f or g is separable.

Page 50:

Large Scale Logistic Regression via ADMM

• Notation:
  – (A, b): the data, partitioned by rows into blocks (A_i, b_i), i = 1, …, N
  – x_i: coefficients for partition i
  – z: consensus coefficient vector
  – r(z): penalty component, such as ||z||_2^2
• Problem (consensus form; Boyd et al. 2011, §8.2):

  minimize  sum_{i=1}^{N} l_i(A_i x_i - b_i) + r(z)
  subject to  x_i - z = 0,  i = 1, …, N

  where l_i is the logistic loss on the i-th block of data.

• ADMM (global variable consensus form, with scaled dual variables u_i):

  x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i - b_i) + (ρ/2) ||x_i - z^k + u_i^k||_2^2 )     [per-partition regression]
  z^{k+1}   := argmin_z ( r(z) + (Nρ/2) ||z - x̄^{k+1} - ū^k||_2^2 )                      [consensus computation]
  u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}

• The x_i-updates run in parallel, one per data block; the z-update gathers the local results and averages them (componentwise, and usually analytically, when r is separable).

Page 51:

Large Scale Logistic Regression via ADMM

[Diagram, Iteration 1: BIG DATA is split into Partition 1, Partition 2, Partition 3, …, Partition K; a logistic regression is fit on each partition; the results feed a Consensus Computation step]

Page 52:

Large Scale Logistic Regression via ADMM

[Diagram, Iteration 1 (continued): the same pipeline, with the Consensus Computation result fed back to the per-partition logistic regressions]

Page 53:

Large Scale Logistic Regression via ADMM

[Diagram, Iteration 2: BIG DATA is split into Partition 1 through Partition K; logistic regressions are re-fit on each partition and feed another Consensus Computation step]

Page 54:

An Example Implementation

• ADMM for logistic regression model fitting with an L2 penalty
• Each iteration of ADMM is a Map-Reduce job (see the sketch below)
  – Mapper: partition the data into K partitions
  – Reducer: for each partition, use liblinear/glmnet to fit an L2-regularized logistic regression
  – Gateway: consensus computation from the results of all reducers; sends the consensus back to each reducer node
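A single-machine sketch of one such iteration, with plain Python functions standing in for the reducer (per-partition fit) and gateway (consensus) roles described above. The optimizer choice, ρ, λ, and the closed-form z-update for an L2 penalty are assumptions of this sketch, not the tutorial's actual implementation (which runs liblinear/glmnet inside Hadoop):

```python
# One ADMM iteration for L2-regularized consensus logistic regression.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def local_update(X, y, z, u, rho):
    """Reducer step: x_i = argmin logistic_loss + (rho/2)||x_i - z + u||^2, y in {0,1}."""
    def objective(w):
        s = X @ w
        loss = np.sum(np.logaddexp(0.0, s) - y * s)
        return loss + 0.5 * rho * np.sum((w - z + u) ** 2)
    def gradient(w):
        s = X @ w
        return X.T @ (expit(s) - y) + rho * (w - z + u)
    return minimize(objective, x0=z - u, jac=gradient, method="L-BFGS-B").x

def consensus_update(xs, us, rho, lam):
    """Gateway step: closed-form z-update for r(z) = (lam/2)||z||^2."""
    N = len(xs)
    v = np.mean([x + u for x, u in zip(xs, us)], axis=0)  # xbar + ubar
    return (N * rho / (lam + N * rho)) * v

def admm_iteration(partitions, z, us, rho=1.0, lam=1.0):
    """One Map-Reduce round: local fits (in parallel), consensus, then dual update."""
    xs = [local_update(X, y, z, u, rho) for (X, y), u in zip(partitions, us)]
    z_new = consensus_update(xs, us, rho, lam)
    us_new = [u + x - z_new for x, u in zip(xs, us)]
    return z_new, xs, us_new
```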

Page 55:

Better Convergence Can Be Achieved By

1. Better initialization: use the results of the naïve method to initialize the parameters
2. Adaptively changing the step size (ρ) at each iteration based on the convergence status of the consensus

• ADMM-M: ADMM with tuning 1
• ADMM-MA: ADMM with tunings 1+2

Page 56:

Evaluation of ADMM Convergence

Page 57:

KDD CUP 2010 Data

• Bridge to Algebra 2008-2009 data from https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp
• Binary response, 20M covariates
• Only keep covariates with >= 10 occurrences => 2.2M covariates
• Training data: 8,407,752 samples
• Test data: 510,302 samples

Page 58:

Average Training Log-likelihood vs. Number of Iterations

Page 59:

Test AUC vs. Number of Iterations

Page 60:

Offline Ads Data: Effect of ADMM Convergence Tunings

[Figure: average training log-likelihood (roughly -0.24 to -0.20) vs. number of iterations (0 to 20), comparing ADMM, ADMM-M, and ADMM-MA]

Page 61:

Explore-Exploit

Page 62:

Recap: Challenges from Data Collection

• Assume the most-popular model serves all traffic
• Item A: true CTR 5%, 500 clicks, 10,000 views
• Item B: true CTR 10%, 0 clicks, 10 views
• We end up always serving item A!
• Multi-armed bandit problem (Explore-Exploit):
  – A naïve approach: ε-greedy, i.e., use an x% bucket to serve random traffic
  – Q: Can we do better?
    • The number of items is very large and ad clicks are very sparse
    • ε-greedy will also incur revenue loss

Page 63:

Existing Approaches for Explore-Exploit

• ε-greedy
• UCB (Upper Confidence Bound)
• Gittins Index
• Thompson sampling

Page 64:

Thompson Sampling (Thompson 1933)

• Basic idea:
  – For every user request and item to rank, draw a sample from the posterior of Θc and Θw
  – Compute CTR using the sampled coefficients instead of the posterior mode
  – More exploration for items with high posterior variance due to small sample sizes
• Has been shown to perform very well for large-scale data (Chapelle and Li 2011)
• Our approach (see the sketch below)
  – Θc is learned from a lot of data, so its posterior variance can be approximated as 0
  – Θw is campaign-specific; draw a sample from its posterior, learned from per-item logistic regression
  – Laplace approximation for computing the posterior variance
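A sketch of the per-request scoring step implied by the approach above; the campaign record fields (cold_score, post_mean, post_cov, bid) are illustrative placeholders for the cold-start score and the Laplace-approximated posterior of Θw, not names from the tutorial:

```python
# Thompson sampling for per-campaign warm-start coefficients.
import numpy as np

def score_campaigns(x_user, campaigns, rng=np.random.default_rng()):
    """Return campaigns ranked by sampled CTR x bid."""
    scored = []
    for c in campaigns:
        # Cold-start part: treated as fixed (posterior variance approximated as 0).
        s_cold = c["cold_score"]
        # Warm-start part: draw one sample from the Laplace-approximated posterior.
        theta_w = rng.multivariate_normal(c["post_mean"], c["post_cov"])
        ctr = 1.0 / (1.0 + np.exp(-(s_cold + x_user @ theta_w)))
        scored.append((ctr * c["bid"], c["id"]))
    return sorted(scored, reverse=True)
```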

Page 65:

Experiments

Page 66:

Models Considered

• CONTROL: per-campaign CTR counting model
• COLD-ONLY: only the cold-start component
• LASER: our model (cold-start + warm-start)
• LASER-EE: our model with Explore-Exploit using Thompson sampling

Page 67:

Metrics

• Model metrics
  – Test log-likelihood
  – AUC/ROC
  – Observed/Expected ratio
• Business metrics (online A/B test)
  – CTR
  – CPM (revenue per impression)

Page 68:

Observed / Expected Ratio

• Observed: #Clicks in the data
• Expected: sum of the predicted CTR over all impressions
• Not a "standard" classifier metric, but in many ways more useful for this application
• What we usually see: Observed / Expected < 1
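In code the metric is just a ratio of two sums; a two-line sketch (per-impression arrays assumed):

```python
import numpy as np

def observed_expected_ratio(clicks, predicted_ctr):
    # clicks: 0/1 per impression; predicted_ctr: model prediction per impression.
    return np.sum(clicks) / np.sum(predicted_ctr)
```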

Page 69:

Observed / Expected Ratio

• Over-prediction is a serious problem with ads
  – Instance of the "winner's curse"
  – When choosing from among thousands of candidates, an item with a mistakenly over-estimated CTR may end up winning the auction
• Particularly helpful for examining per-segment data
  – E.g., by bid, number of impressions in training (warmness), geo, etc.
  – Allows us to see where the model might be giving too much weight to the wrong campaigns
• High correlation between the O/E ratio and online model performance

Page 70:

Offline Experiment

• 45 days of advertising event logs
• First 30 days for training data
• Last 15 days for test data
• After down-sampling, both the training and test data have hundreds of millions of events
• Thousands of features

Page 71:

Offline: ROC Curves

[Figure: ROC curves (True Positive Rate vs. False Positive Rate) with AUC values: CONTROL 0.672, COLD-ONLY 0.757, LASER 0.778]

Page 72:

Online A/B Test

• Three models
  – CONTROL (10%)
  – LASER (85%)
  – LASER-EE (5%)
• Segmented analysis
  – 8 segments by campaign warmness
    • Degree of warmness: the number of training samples available for the campaign
    • Segment #1: campaigns with almost no data in training
    • Segment #8: campaigns served most heavily in previous batches, so their CTR estimates can be quite accurate

Page 73:

Daily CTR Lift Over Control

[Figure: percentage of CTR lift over CONTROL for Day 1 through Day 7, for LASER and LASER-EE; y-axis values redacted (+%)]

Page 74:

Daily CPM Lift Over Control

[Figure: percentage of eCPM lift over CONTROL for Day 1 through Day 7, for LASER and LASER-EE; y-axis values redacted (+%)]

Page 75:

CTR Lift By Campaign Warmness Segments

[Figure: percentage of CTR lift by campaign warmness segment (1-8) for LASER and LASER-EE; y-axis values redacted (-% to +%)]

Page 76:

CPM Lift By Campaign Warmness Segments

[Figure: percentage of CPM lift by campaign warmness segment (1-8) for LASER and LASER-EE; y-axis values redacted (-% to +%)]

Page 77:

O/E Ratio By Campaign Warmness Segments

[Figure: Observed Clicks / Expected Clicks (0.5 to 1) by campaign warmness segment (1-8) for CONTROL, LASER, and LASER-EE]

Page 78:

Number of Campaigns Served: Improvement from E/E

Page 79:

Insights

• Overall performance:
  – LASER and LASER-EE are both much better than CONTROL
  – LASER and LASER-EE performance are very similar
• Segmented analysis by campaign warmness
  – Segment #1 (very cold)
    • LASER-EE much worse than LASER due to its exploration property
    • LASER much better than CONTROL due to cold-start features
  – Segments #3 - #5
    • LASER-EE significantly better than LASER
    • Winner's curse hit LASER
  – Segments #6 - #8 (very warm)
    • LASER-EE and LASER are equivalent
• Number of campaigns served
  – LASER-EE serves significantly more campaigns than LASER
  – Provides a healthier marketplace

Page 80:

Parallel Matrix Factorization

Page 81:

Recap: Properties of Web Recommender Systems

• One or multiple metrics to optimize
  – Click-Through Rate (CTR) (focus of this talk)
  – Revenue per impression
  – Time spent on the landing page
  – Ad conversion rate
  – …
• Imbalanced data (clicks are very sparse)
• Large scale data
• Cold-start
  – User features: age, gender, position, industry, …
  – Item features: category, keywords, creator features, …

Page 82:

Matrix Factorization Models

• Response matrix Y
  – y_ij = 1 if user i clicked item j
  – y_ij = 0 if user i viewed item j but did not click
• Y is a highly incomplete matrix
• Y ≈ U'V
  – u_i' is the i-th row of U, the latent profile of user i
  – v_j' is the j-th row of V, the latent profile of item j
• Widely used approach, e.g., in the Netflix Prize competition

Page 83:

Probabilistic Matrix Factorization

• [Model formula on slide; labeled components: global features, user effect, item effect, user factors, item factors]

Salakhutdinov and Mnih (2007)
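The formula on this slide is an image; a standard form consistent with the labeled components (given as an assumption, in the spirit of Salakhutdinov and Mnih 2007 and its regression-prior extensions) is:

$$y_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad \log\frac{p_{ij}}{1 - p_{ij}} \;=\; x_{ij}^{\top} b \;+\; \alpha_i \;+\; \beta_j \;+\; u_i^{\top} v_j,$$

with $x_{ij}$ the global features, $\alpha_i$ the user effect, $\beta_j$ the item effect, and $u_i$, $v_j$ the latent user and item factors.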

Page 84:

Flexible Regression Priors

• [Prior formulas on slide; effects and factors regress on user covariates and item covariates]
• g(·), h(·), G(·), H(·) can be any regression functions, e.g., simple linear regression
• Agarwal and Chen (KDD 2009); Zhang et al. (RecSys 2011)
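The prior formulas are likewise images; the regression-prior structure the text describes can be written (a reconstruction, with the exact notation assumed) as:

$$\alpha_i \sim N\big(g(x_i), \sigma_\alpha^2\big),\quad \beta_j \sim N\big(h(x_j), \sigma_\beta^2\big),\quad u_i \sim N\big(G(x_i), \sigma_u^2 I\big),\quad v_j \sim N\big(H(x_j), \sigma_v^2 I\big),$$

where $x_i$ are user covariates, $x_j$ are item covariates, and $g$, $h$, $G$, $H$ are arbitrary regression functions.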

Page 85:

Model Fitting Using MCEM (Single Machine)

• Monte Carlo EM (Booth and Hobert 1999)
• Let …
• Let …
• E Step:
  – Obtain N samples of the conditional posterior
• M Step: …
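The "Let …" lines above are formula images; generically, MCEM here alternates the following two steps (a reconstruction of the standard algorithm, with Δ denoting the latent factors (α, β, u, v) and Θ the prior/regression parameters):

$$\text{E-step:}\;\; \Delta^{(1)},\dots,\Delta^{(N)} \sim p\big(\Delta \mid Y, \Theta^{(t)}\big), \qquad \text{M-step:}\;\; \Theta^{(t+1)} = \arg\max_{\Theta} \frac{1}{N}\sum_{n=1}^{N} \log p\big(Y, \Delta^{(n)} \mid \Theta\big).$$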

Page 86:

Handling Binary Responses

• Gaussian responses: … have closed form
• Binary responses + logistic link: no longer closed form
• Variational approximation (VAR)
• Adaptive rejection sampling (ARS)


Variational Approximation (VAR)

•  Initially let ξij = 1  •  Before each E-step, create a pseudo Gaussian response for each binary observation

•  Run the E-step and M-step using the Gaussian pseudo responses

•  After the M-step, update ξij  (the update formula is lost in this transcript; a hedged reconstruction follows below)  –  Agarwal and Chen (KDD 2009)
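The pseudo-response construction did not survive extraction. The following is a hedged reconstruction using the standard Jaakkola-Jordan bound for the logistic likelihood; this is my assumption about the form the slide used (it matches the general recipe in Agarwal and Chen 2009), not a verbatim copy.

```latex
\lambda(\xi) = \frac{\tanh(\xi/2)}{4\xi}, \qquad
r_{ij} = \frac{y_{ij} - \tfrac12}{2\lambda(\xi_{ij})}, \qquad
\operatorname{Var}(r_{ij}) = \frac{1}{2\lambda(\xi_{ij})}
% After the M-step: \xi_{ij} \leftarrow \sqrt{E[s_{ij}^2]}, where s_{ij} is the linear predictor
% s_{ij} = x_{ij}'b + \alpha_i + \beta_j + u_i'v_j under the current posterior.
```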


Adaptive Rejection Sampling (ARS)

•  In each E-step, obtain exact conditional posterior samples of the random effects and factors (α, β, u, v)

•  Adaptively update upper and lower bounds on the density function to do rejection sampling


Adaptive Rejection Sampling (ARS)
•  Goal: obtain a sample x* from a log-concave target density function p(x)
•  Try to avoid evaluating p(x) as much as possible
•  Initialize the point set S with 3 initial points
•  Construct a lower bound lower(x) and an upper bound upper(x) using the current S
•  Let e(x) = exp(upper(x)), s(x) = exp(lower(x))
•  Let e1(x) be the density function obtained from e(x)
•  Repeat the following steps until a sample is obtained:
–  Draw x* ~ e1(x) and z ~ Unif(0,1)
–  If z <= s(x*)/e(x*), accept x* and break
–  If z <= p(x*)/e(x*), accept x* and break; otherwise reject x*
–  If x* is rejected, update e(x) and s(x) by constructing new lower and upper bounds
(a runnable sketch follows below)
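A hedged, self-contained sketch of this sampler. It assumes access to the log-density h(x) = log p(x) (up to a constant) and its derivative, builds the upper bound from tangent lines and the lower bound from chords at the points in S, and restricts the support to a bounded interval to keep the envelope arithmetic short. It illustrates the idea on a standard normal; it is not the implementation used in the papers cited here.

```python
import numpy as np

def ars_sample(h, hprime, lo, hi, x_init, n_samples, seed=0):
    """Simplified adaptive rejection sampler for a log-concave density on [lo, hi].
    h, hprime: log-density (up to a constant) and its derivative;
    x_init: a few starting abscissae in (lo, hi)."""
    rng = np.random.default_rng(seed)
    S, out = sorted(x_init), []
    while len(out) < n_samples:
        x = np.array(S)
        hx, dhx = h(x), hprime(x)
        # Breakpoints where adjacent tangent lines intersect; segment k uses the tangent at x[k]
        z = (hx[1:] - hx[:-1] + x[:-1] * dhx[:-1] - x[1:] * dhx[1:]) / (dhx[:-1] - dhx[1:])
        z = np.concatenate(([lo], z, [hi]))
        a, b, c = z[:-1], z[1:], dhx
        # Mass of exp(tangent_k) on [z_k, z_{k+1}]  (piecewise-exponential envelope e(x))
        mass = np.where(np.abs(c) > 1e-12,
                        np.exp(hx + c * (a - x)) * np.expm1(c * (b - a)) / np.where(c == 0, 1.0, c),
                        np.exp(hx) * (b - a))
        k = rng.choice(len(mass), p=mass / mass.sum())
        # Inverse-CDF draw within the chosen segment
        u = rng.uniform()
        t = a[k] + (np.log1p(u * np.expm1(c[k] * (b[k] - a[k]))) / c[k]
                    if abs(c[k]) > 1e-12 else u * (b[k] - a[k]))
        upper = hx[k] + c[k] * (t - x[k])              # log e(t): tangent bound
        if x[0] < t < x[-1]:                           # log s(t): chord between neighbours
            j = np.searchsorted(x, t) - 1
            w = (t - x[j]) / (x[j + 1] - x[j])
            lower = (1 - w) * hx[j] + w * hx[j + 1]
        else:
            lower = -np.inf
        zu = rng.uniform()
        if np.log(zu) <= lower - upper:                # squeeze test: accept without evaluating h
            out.append(t)
        elif np.log(zu) <= h(t) - upper:               # full test
            out.append(t)
        else:                                          # rejected: refine the envelope with t
            S = sorted(S + [t])
    return np.array(out)

# Example: standard normal restricted to [-8, 8]
draws = ars_sample(lambda v: -0.5 * v ** 2, lambda v: -v, -8.0, 8.0, [-1.5, 0.5, 2.0], 2000)
print(draws.mean(), draws.std())
```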


Single machine model fitting is good, but…

•  We are working with very large-scale data sets

•  Parallel matrix factorization methods using Map-Reduce have to be developed

 


•  Monte Carlo EM (Booth and Hobert 1999)  •  Let …  •  Let …  (see the reconstruction given with the single-machine slide above)  •  E-Step:

–  Obtain N samples from the conditional posterior

•  M-Step:

Model Fitting Using MCEM


Parallel Matrix Factorization  •  Partition the data into m partitions  •  For each partition, run the MCEM algorithm and get its parameter estimate

•  Average the m per-partition parameter estimates into a single estimate (the formula is lost in this transcript)

•  Ensemble runs: for k = 1, …, n  –  Repartition the data into m partitions with a new seed  –  Run an E-step-only job for each partition given the averaged parameter estimate

•  Average the user/item factors over all partitions and all k to obtain the final estimate


One Map-Reduce job

Parallel Matrix Factorization  •  Partition the data into m partitions  •  For each partition, run the MCEM algorithm and get its parameter estimate

•  Average the m per-partition parameter estimates into a single estimate (the formula is lost in this transcript)

•  Ensemble runs: for k = 1, …, n  –  Repartition the data into m partitions with a new seed  –  Run an E-step-only job for each partition given the averaged parameter estimate

•  Average the user/item factors over all partitions and all k to obtain the final estimate

Each ensemble run is a Map-Reduce job  (a toy sketch of the whole partition/average/ensemble pattern follows below)
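A toy, single-process sketch of the partition / fit / average / ensemble pattern described above. The per-partition "MCEM job" is replaced here by a few passes of stochastic gradient on the logistic matrix-factorization loss, purely as a stand-in; the names, sizes, and the local fitter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank, m, n_ensemble = 300, 80, 4, 4, 3

# Toy click events (user, item, y); in a real system these would live in HDFS
U_true, V_true = rng.normal(size=(n_users, rank)), rng.normal(size=(n_items, rank))
events = [(i, j, rng.binomial(1, 1 / (1 + np.exp(-U_true[i] @ V_true[j]))))
          for i in range(n_users) for j in rng.choice(n_items, 10, replace=False)]

def local_fit(part, U, V, epochs=3, lr=0.1, lam=0.05):
    """Stand-in for one per-partition fitting job (the paper runs MCEM here)."""
    U, V = U.copy(), V.copy()
    for _ in range(epochs):
        for i, j, y in part:
            p = 1 / (1 + np.exp(-U[i] @ V[j]))
            U[i], V[j] = (U[i] + lr * ((y - p) * V[j] - lam * U[i]),
                          V[j] + lr * ((y - p) * U[i] - lam * V[j]))
    return U, V

def partition(events, m, seed):
    """Randomly split the events into m partitions (one mapper input each)."""
    order = np.random.default_rng(seed).permutation(len(events))
    return [[events[e] for e in order[k::m]] for k in range(m)]

U0 = rng.normal(scale=0.01, size=(n_users, rank))
V0 = rng.normal(scale=0.01, size=(n_items, rank))

# One Map-Reduce job: fit every partition independently, then average ("divide and conquer")
fits = [local_fit(p, U0, V0) for p in partition(events, m, seed=1)]
U_hat = np.mean([f[0] for f in fits], axis=0)
V_hat = np.mean([f[1] for f in fits], axis=0)

# Ensemble runs: repartition with a new seed, rerun a lighter job from the averaged fit, average again
ens = [local_fit(p, U_hat, V_hat, epochs=1)
       for k in range(n_ensemble) for p in partition(events, m, seed=100 + k)]
U_final = np.mean([f[0] for f in ens], axis=0)
V_final = np.mean([f[1] for f in ens], axis=0)
print(U_final.shape, V_final.shape)
```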


Key  Points  

•  Partitioning is tricky!  –  By events? By items? By users?

•  Empirically, "divide and conquer" + averaging the per-partition estimates to obtain the final parameter estimate works well!

•  Ensemble runs: after the averaged estimate is obtained, we run n E-step-only jobs and take the average, each job using a different user-item mix.


Identifiability Issues (MCEM-ARSID)

•  The same log-likelihood can be achieved by  –  replacing g(·) with g(·) + r and h(·) with h(·) − r

•  Fix: center α, β, u to zero mean at every E-step

–  replacing u with −u and v with −v  •  Fix: constrain v to be positive

–  switching u.1, v.1 with u.2, v.2  •  ui ~ N(G(xi), I), vj ~ N(H(xj), λI)  •  Fix: constrain the diagonal entries λ1 >= λ2 >= …


MovieLens  1M  Data  

•  75% training and 25% test, split by time  •  Imbalanced data

–  User rating = 1: positive  –  User rating = 2, 3, 4, 5: negative  –  5% positive rate

•  Balanced data  –  User rating = 1, 2, 3: positive  –  User rating = 4, 5: negative  –  44% positive rate


•  MCEM-VAR and MCEM-ARS are slightly better than SGD for 1 partition  •  A lot of effort was spent on tuning the L2 penalty parameters and learning rate for SGD


Big  difference  between  VAR  and  ARS  for  imbalanced  data!  


Yahoo! Front Page Today Module Recommendation


How to Handle New Articles

•  New articles can be generated at any time during the day

•  Problem: how to quickly build models that reflect user preferences for the fresh articles

•  Offline large-scale matrix factorization cannot be real-time!


Online Logistic Regression (OLR)  §  User i with features xi, article j  §  Binary response y (click/non-click)  §  Model and prior (equations lost in this transcript)

§  Using Laplace approximation or variational Bayesian methods to obtain the posterior

§  The posterior becomes the prior for the next batch  §  Can approximate the prior and posterior covariances as diagonal for high-dimensional xi  (a hedged sketch of one batch update follows below)
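A minimal sketch of one OLR batch update under the assumptions stated on the slide (Gaussian prior per article, Laplace approximation, diagonal covariance for high-dimensional xi). The function and variable names, the toy data, and the Newton-based MAP solver are my own illustrative choices, not the production implementation.

```python
import numpy as np

def olr_update(m, P, X, y, iters=25):
    """One batch update of online Bayesian logistic regression.
    Prior: w ~ N(m, diag(1/P)), with P the per-coordinate prior precision.
    Returns the (diagonal) Laplace approximation of the posterior, which
    becomes the prior for the next batch."""
    w = m.copy()
    for _ in range(iters):                                  # Newton's method for the MAP estimate
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (y - p) - P * (w - m)
        H = -(X.T * (p * (1 - p))) @ X - np.diag(P)
        w -= np.linalg.solve(H, grad)
    p = 1 / (1 + np.exp(-X @ w))
    # Diagonal approximation of the posterior precision (as suggested for high-dimensional xi)
    P_new = P + np.einsum('ij,i,ij->j', X, p * (1 - p), X)
    return w, P_new

# Toy usage: one article, 3 user features, two mini-batches of feedback
rng = np.random.default_rng(0)
m, P = np.zeros(3), np.ones(3)                              # N(0, I) prior
for _ in range(2):
    X = rng.normal(size=(500, 3))
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, -1.0, 0.2])))))
    m, P = olr_update(m, P, X, y)
print(m)
```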


User Features for OLR  •  Age, gender, job position, etc. for logged-in users

•  General behavior targeting (BT) features  –  Music? Finance? Politics?

•  User profiles from historical view/click behavior on previous items in the data, e.g.  –  Item-profile: use previously clicked item ids as the user profile  –  Category-profile: use item category affinity scores as the profile. The score can simply be the user's historical CTR on each category.

–  Are there better ways to generate user profiles?  –  Yes! By matrix factorization!


Matrix Factorization for User Profiles in OLR

•  In the offline user-profile-building period, obtain the user factor (latent profile) for user i

•  Online modeling using OLR  –  If a user has a profile (warm start), use the estimated user factor as the user covariates

–  If not (cold start), use the factor predicted from the user's features by the regression prior (e.g. G(xi)) as the user covariates


Large Yahoo! Front Page Data  •  Data for building user profiles: 8M users with at least 10 clicks (heavy users) in June 2011, 1B events

•  Data for training and testing the OLR model: randomly served data with 2.4M clicks in July 2011

•  Heavy users contributed around 30% of the clicks  •  User covariates / features for OLR:

–  Intercept-only (MOST POPULAR)  –  124 behavior targeting features (BT-ONLY)  –  BT + top 1000 clicked article ids (ITEM-PROFILE)  –  BT + user profile with CTR on 43 binary content categories (CATEGORY-PROFILE)

–  BT + user latent factors from matrix factorization


Offline Evaluation Metric Related to Clicks

•  For model M and J live items (articles) at any time, define S(M)  (the formula is lost in this transcript; a hedged reconstruction follows below)

•  If M = random (constant) model, E[S(M)] = #clicks

•  Unbiased estimate of expected total clicks (Langford et al. 2008)
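The definition of S(M) is lost in extraction. Below is a hedged reconstruction of the usual replay estimator on randomly served data; it is consistent with the two bullets above (a random or constant model gives E[S(M)] = #clicks), but it is my reading rather than a copy of the slide.

```latex
S(M) \;=\; \sum_{t} y_t \cdot J \cdot
  \mathbf{1}\{\, M \text{ would have served the item } j_t \text{ that was actually (randomly) served at time } t \,\}
% Under uniform random serving each match occurs with probability 1/J, so E[S(M)] = \sum_t y_t = \#\text{clicks}
% for a random model, and S(M) is an unbiased estimate of M's expected total clicks.
```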


Click Lift Performance for Different User Profiles

Warm start: users with at least one sample in the training data.  Cold start: users with no data in the training data.


Conclusion  

•  ARS applied to a probabilistic matrix factorization framework for imbalanced binary data

•  Model fitting approach for large-scale data using Map-Reduce

•  Identifiability is the key for parallel matrix factorization (cold-start)

•  Our proposed approach significantly outperforms baseline methods such as variational approximation


Bag of Little Bootstraps

Kleiner  et  al.  2012  


Motivation: Challenges from Data Analysis

•  Example: how to generate confidence intervals for a model-performance metric like AUC?

•  Single machine:  –  Bootstrap the test data 100 times  –  Compute std(AUC) from the AUCs on the 100 bootstrapped test sets

•  How do we do it for large-scale data?  –  Bootstrapping is non-trivial!


Bootstrap (Efron, 1979)  •  A re-sampling based method to obtain the statistical distribution of sample estimators

•  Why are we interested?  –  Re-sampling is embarrassingly parallelizable

•  For example:  –  Standard deviation of the mean of N samples (μ)  –  For i = 1 to r do

•  Randomly sample with replacement N times from the original sample -> bootstrap data i

•  Compute the mean of the i-th bootstrap data -> μi

–  Estimate of Sd(μ) = Sd([μ1, …, μr])  –  r is usually a large number, e.g. 200  (a runnable sketch follows below)
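A tiny single-machine sketch of exactly this recipe; the toy data and parameter values are illustrative.

```python
import numpy as np

def bootstrap_sd_of_mean(x, r=200, seed=0):
    """Plain bootstrap: resample the N points with replacement r times
    and take the standard deviation of the resampled means."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(r)]
    return float(np.std(means, ddof=1))

x = np.random.default_rng(1).normal(size=10_000)
print(bootstrap_sd_of_mean(x), 1 / np.sqrt(len(x)))   # both should be close to 0.01 here
```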


Bootstrap  for  Big  Data  

•  Can have r nodes running in parallel, each sampling one bootstrap data set

•  However…  –  N can be very large  –  The data may not fit into memory  –  Collecting N samples with replacement on each node can be computationally expensive


M out of N Bootstrap (Bickel et al. 1997)

•  Obtain SdM(μ) by sampling M samples with replacement for each bootstrap, where M < N

•  Apply an analytical correction to SdM(μ) to obtain Sd(μ), using prior knowledge of the convergence rate of the sample estimates

•  However…  –  Prior knowledge is required  –  The choice of M is critical to performance  –  Finding the optimal value of M needs more computation


Bag of Little Bootstraps (BLB)  •  Example: standard deviation of the mean  •  Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)

•  For each data set p = 1 to S do  –  For i = 1 to r do

•  Draw N samples with replacement from the data of size b  •  Compute the mean of the resampled data, μpi

–  Compute Sdp(μ) = Sd([μp1, …, μpr])  •  Estimate of Sd(μ) = Avg([Sd1(μ), …, SdS(μ)])


Bag of Little Bootstraps (BLB)  •  Interest: ξ(θ), where θ is an estimate obtained from the size-N data  –  ξ is some function of θ, such as the standard deviation, …

•  Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)

•  For each data set p = 1 to S do  –  For i = 1 to r do

•  Draw N samples with replacement from the data of size b  •  Compute the estimate on the resampled data, θpi

–  Compute ξp(θ) = ξ([θp1, …, θpr])  •  Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])


Bag of Little Bootstraps (BLB)  •  Interest: ξ(θ), where θ is an estimate obtained from the size-N data  –  ξ is some function of θ, such as the standard deviation, …

•  Generate S sampled data sets, each obtained by randomly sampling without replacement a subset of size b (or partition the original data into S partitions, each of size b)

•  For each data set p = 1 to S do  –  For i = 1 to r do

•  Draw N samples with replacement from the data of size b  •  Compute the estimate on the resampled data, θpi

–  Compute ξp(θ) = ξ([θp1, …, θpr])  •  Estimate of ξ(θ) = Avg([ξ1(θ), …, ξS(θ)])

(Slide annotations mapping the steps onto Map-Reduce roles: Mapper, Reducer, Gateway)


Why  is  BLB  Efficient  

•  Before:  –  Drawing N samples with replacement from size-N data is expensive when N is large

•  Now:  –  Draw N samples with replacement from size-b data  –  b can be several orders of magnitude smaller than N (e.g. b = N^γ, γ in [0.5, 1))

–  Equivalent to a multinomial sampler with dim = b  –  Storage = O(b), computational complexity = O(b)  (see the sketch below)
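A hedged, single-machine sketch of BLB for the sd-of-the-mean example, using the multinomial trick just described. S, r, γ and the toy data are illustrative choices (the real-data slide later in this part uses r = 50, S = 5, γ = 0.7 on 150GB).

```python
import numpy as np

def blb_sd_of_mean(data, S=5, r=100, gamma=0.7, seed=0):
    """Bag of Little Bootstraps estimate of sd(mean): resampling N points from a
    size-b subset only needs a length-b multinomial count vector, never N points."""
    rng = np.random.default_rng(seed)
    N = len(data)
    b = int(N ** gamma)
    sds = []
    for _ in range(S):
        subset = rng.choice(data, size=b, replace=False)        # one "little" subsample
        means = []
        for _ in range(r):
            counts = rng.multinomial(N, np.full(b, 1.0 / b))    # weights of a size-N resample
            means.append(np.dot(counts, subset) / N)
        sds.append(np.std(means, ddof=1))
    return float(np.mean(sds))

x = np.random.default_rng(1).normal(size=100_000)
print(blb_sd_of_mean(x), 1 / np.sqrt(len(x)))   # should be close to the true sd of the mean
```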


Simulation Experiment

•  95% CI of logistic regression coefficients  •  N = 20000, 10 explanatory variables  •  Relative error = |estimated CI width − true CI width| / true CI width

•  BLB-γ: BLB with b = N^γ  •  BOFN-γ: b out of N sampling with b = N^γ

•  BOOT: naïve bootstrap


Simulation Experiment


Real  Data  

•  95% CI of logistic regression coefficients  •  N = 6M, 3000 explanatory variables  •  Data size = 150GB, r = 50, S = 5, γ = 0.7

(Result figures: data stored on disk vs. data stored in memory using Spark)


Hyper-parameter Selection

•  From empirical experiments in the paper  –  S >= 3 and r >= 50 are sufficient for low relative errors

•  Adaptively selecting r and S  –  Increase r and S until the estimated value converges  –  For r: continue to process resamples and update ξp(θ) until it has ceased to change significantly.

–  For S: continue to process more subsamples (i.e., increase S) until BLB's output value has stabilized.


Hyper-parameter Selection

•  Adaptively selecting r and S  –  Increase r and S until the estimated value converges

(Figure comparing BLB-0.7 with adaptive BLB; not reproduced in this transcript)


Summary  of  BLB  

•  A  new  algorithm  for  bootstrapping  on  big  data  

•  Advantages  –  Fast and efficient  –  Easy to parallelize  –  Easy to understand and implement  –  Friendly to Hadoop, making it routine to perform statistical calculations on big data


Part V: Future of Distributed Computing Infrastructure


Problem  of  ADMM  in  Hadoop  

•  ADMM in Hadoop can take hours to converge

•  Is there a better way to handle the iterative learning process?


Spark  overview  


§  What is Spark –  A fast cluster computing system compatible with Hadoop

§  Works with HDFS –  Improves efficiency through in-memory computing primitives –  Written in Scala, provides APIs in Scala, Java and Python

§  Often 2-10× less code with Scala

§  What can it do –  Developed for iterative algorithms

§  Load data into memory and query it repeatedly more quickly than with disk-based systems like Hadoop

–  Retains the attractive properties of MapReduce  §  Fault tolerance, data locality, scalability  (a minimal PySpark sketch follows below)
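A minimal PySpark sketch of the "load once, iterate in memory" pattern (assumes a local pyspark installation; the toy data, the logistic-regression objective, and all parameter values are illustrative, not the tutorial's actual workload).

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[2]", "lr-sketch")
rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, 0.5])
points = [(x, int(rng.uniform() < 1 / (1 + np.exp(-x @ w_true))))
          for x in rng.normal(size=(20_000, 3))]
data = sc.parallelize(points, numSlices=8).cache()   # loaded once, kept in memory
n = data.count()

w = np.zeros(3)
for _ in range(10):   # each pass reuses the cached partitions instead of re-reading disk
    grad = data.map(lambda p: (p[1] - 1 / (1 + np.exp(-p[0] @ w))) * p[0]) \
               .reduce(lambda a, b: a + b)
    w += 0.5 * grad / n
print(w)
sc.stop()
```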


Iterative Process in Hadoop

(Diagram: every iteration chains Disk → Mappers → Disk → Reducers, so intermediate results hit disk on each pass)


Iterative Process in Spark

(Diagram: the data is read from disk once; subsequent Mapper/Reducer stages exchange intermediate results through memory)


Spark cluster speed-up: preliminary results

(Figure: seconds per iteration (0-140) vs. number of partitions (300-1200), for clusters with 120 CPUs and 816 CPUs)


Initial Spark Result Summary

✓ Spark is very efficient for large-scale machine learning purposes  –  ADMM on Ads data: 20 mins (Spark) vs. 4-5 hours (Hadoop)

–  Provides an offline experimental system to vet various modeling ideas with high agility

–  Models get trained faster and thus can react to temporal shifts of the online system more quickly



GraphLab  

•  An open-source, graph-based, high-performance distributed computation framework in C++

•  http://graphlab.org/  •  HDFS integration  •  Major design points:

–  Sparse data with local dependencies  –  Iterative algorithms  –  Potentially asynchronous execution among nodes


GraphLab  

•  Graph-parallel  •  Map-Reduce: computation applied to independent records

•  GraphLab: dependent records stored as vertices in a large distributed data-graph

•  Computation runs in parallel on each vertex and can interact with neighboring vertices


GraphX  

•  Combines the advantages of both data-parallel and graph-parallel systems

•  Distributed graph computation on Spark


Final  Remarks  


Summary  •  Large-scale machine learning is still a new area with a lot of open problems

•  Advancing distributed computing infrastructure plays a big role in the algorithms

•  In today's tutorial, we covered  –  Distributed computing and Map-Reduce  –  Web recommender systems: our motivating problem  –  Large-scale logistic regression  –  Parallel matrix factorization  –  Bag of little bootstraps: one analysis tool for big data  –  Future of distributed computing infrastructure


Bibliography
Agarwal, D. and Chen, B. (2009). Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 19–28. ACM.
Agarwal, D., Chen, B., and Elango, P. (2010). Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 703–712. ACM.
Bell, R., Koren, Y., and Volinsky, C. (2007). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 95–104. ACM.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 265–285.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bickel, P. J., Götze, F., and van Zwet, W. R. (2012). Resampling fewer than n observations: gains, losses, and remedies for losses (pp. 267–297). Springer New York.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.


Bibliography
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1–26.
Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. (2012). The big data bootstrap. arXiv preprint arXiv:1206.6415.
Khanna, R., Zhang, L., Agarwal, D., and Chen, B. (2013). Parallel matrix factorization for binary response. IEEE BigData 2013.
Mnih, A. and Salakhutdinov, R. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Zhang, L., Agarwal, D., and Chen, B. (2011). Generalizing matrix factorization through flexible regression priors. In Proceedings of the fifth ACM conference on Recommender systems, 13–20. ACM.