Regina Barzilay, "Information Extraction from...

78
Information Extraction for Social Media. Regina Barzilay, Christina Sauper, Aria Haghighi, and Ted Benson

Upload: yandex

Post on 02-Dec-2014


Category:

Technology



DESCRIPTION

January 31, seminar "MIT Day at Yandex". Regina Barzilay, "Information Extraction from Social Media".
- Machine learning methods applied to extracting information from online user-generated content.
- A set of information extraction tasks, such as aspect-based analysis of reviews and building an event database from tweets.
- Automatic induction of a document's content structure from a large, very noisy stream of user-generated content.
- Automatic aggregation of review content and extraction of events from the Twitter message stream.

TRANSCRIPT

Page 1: Regina Barzilay, "Information Extraction from Social Media"

Information Extraction for Social Media

Regina Barzilay

Christina Sauper, Aria Haghighi, and Ted Benson

Page 2

Selecting a Hotel

Page 3

Selecting a Hotel

Page 4

Selecting a Hotel

Page 5

User-generated Content

• Large amounts of user-generated content

• Increasingly important in decision making

• Time-consuming to read it all

NLP can help!

Page 6

The Power of Word Counts

Simple statistical models are effective for many information extraction tasks

• Bag-of-words approaches for classification

[Figure: example words: stocks, trading, financial, bank, cloudy, cold, plants, storm]
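The bag-of-words idea on this slide can be sketched with a tiny multinomial Naive Bayes classifier. This is only an illustration, not the system from the talk: the training sentences, the labels "finance" and "weather", and the function names are invented for the example, with the vocabulary borrowed from the words shown on the slide.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability; word order is ignored."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical, for illustration only).
TOY_DOCS = [
    ("stocks trading financial bank", "finance"),
    ("bank stocks fell in heavy trading", "finance"),
    ("cloudy cold storm", "weather"),
    ("cold storm damaged plants", "weather"),
]
word_counts, label_counts = train_naive_bayes(TOY_DOCS)
```

Calling `classify("a cold and cloudy storm", word_counts, label_counts)` scores each label purely from word counts, which is the point of the slide: no syntax or structure is consulted.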

Page 7

The Power of Word Counts

Simple statistical models are effective for many information extraction tasks

• Sequence labeling for semantic role labeling

Word:  the   earthquake  injured  three       people

Label: NONE  EVENT       NONE     CASUALTIES  CASUALTIES
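A sequence labeler assigns one tag per token and can score adjacent tag pairs. A minimal Viterbi decoding sketch for the example sentence follows. The scores are set by hand purely for illustration (a real model would learn them), and the names `LEXICON`, `emission`, and `transition` are assumptions, not part of the talk.

```python
def viterbi(tokens, tags, emission, transition):
    """Return the highest-scoring tag sequence under additive (log-space) scores."""
    # best[i][t] = best score of any tagging of tokens[:i+1] that ends in tag t
    best = [{t: emission(tokens[0], t) for t in tags}]
    back = []
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transition(p, t))
            scores[t] = best[-1][prev] + transition(prev, t) + emission(token, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

TAGS = ["NONE", "EVENT", "CASUALTIES"]
# Hand-set toy lexicon (hypothetical); unknown words prefer NONE.
LEXICON = {"earthquake": "EVENT", "three": "CASUALTIES", "people": "CASUALTIES"}

def emission(token, tag):
    return 2.0 if LEXICON.get(token, "NONE") == tag else 0.0

def transition(prev_tag, tag):
    # Small bonus for staying in the same tag, so multi-word spans hold together.
    return 0.5 if prev_tag == tag else 0.0

tokens = "the earthquake injured three people".split()
labels = viterbi(tokens, TAGS, emission, transition)
```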

Page 8

The Power of Word Counts

Page 9

The Power of Word Counts

"Every time I fire a linguist, the performance of the speech recognizer goes up" (F. Jelinek)

Page 10

Beyond Wall Street Journal

Moving from formal text…

…to social media

Page 11

Our Approach

• Model document structure as part of the extraction process

• Exploit large amount of raw data to supplement annotations

Page 12

So so so good! One of my favorite restaurants in Boston! I have to take off a single star, because there are a couple dishes I didn't enjoy, but if you go and order well, this is a 6 star experience! You start with bread and the most delicious olive oil. It has a strong olive taste, I had to forcefully stop myself from finishing our basket. My suggestion is to order family style, and based on the number of people you have, order main entrees for half your party size and 3-4 small plates for the other half. (ex. Party of 4 = 2 entrees and 6-8 apps) The entrees really are to die for, every one I've had has been delicious. Such a great combo of flavors and spices, they some seriously artistic creations! The best small plates are the Sultans Delight (fall apart lamb and unreal baba), the spiced carrots (seems simple, but they are amazing!), the falafel....ok, there are a bunch that are great, but those are some favorites! I would skip the deviled eggs, multiple people talked them up to me before I went, really not that exciting. The chick peas in vermicelli, also not great, in fact I really did not like it with the fake orange taste. It reminded me of one of those chocolate oranges you can break apart. The black eyed pea soup was nothing special. For dessert, please get the baked Alaska, it was unbelievable!

A new favorite! #1. The hosts seated us almost immediately (even though we didn't have a reservation on a Friday night). #2. The food was amazing (N.B. the bread in the basket goes really well with the warmed, spiced olive oil from one of our meze plates), We had the olives w/ za'atar (a lot for two people, $5), quail kebabs (delicious, tender, spicy, 2 for $13), monkfish curry (yummy, $26) and a tea-ser dessert with sour cherries that tastes a lot like vermicelli/milk dessert (12). #3. The service was excellent--perfectly timed plates, etc. #4. Ana Sortun was there! She's awesome and she was hands-on--adding things to dishes, etc. I would go back--I still like the hummus at Sofra and wasn't sure that Oleana would be more impressive--but it was excellent. Great place for a special occasion meal or a night out with tourists.

The suspense factor as each surprise dish was delivered added to the whole experience, and the timing was flawless--we'd finish a dish, have ample time to enjoy it, a few minutes for some sips of wine and conversation, and just as there was a breath, the next dish would arrive. The whole meal felt like a well-orchestrated performance rather than just a meal. As for the food--there's not enough to be said. I realize this is a little cliched, but honestly, I haven't tasted food like this since living in the bay area, with creative combinations and well-designed dishes and contrasts that surprise the palette and are a delight to eat; not simply good, but joy-inducing. The highlight of the night was a dish of crab cakes with asparagus--they had a small poached quail egg on each cake, combined with a lemon flavor from a little juice and zest it almost made a meringue that was incredible. This is without a doubt the best single dish I have had at any restaurant in Boston, and I'd go back to Oleana just for this. Or any other of the dishes we had that night, frankly. Some were better than others, but each was unique and in some way surprising and fun. Absolutely fantastic experience all around, and so far the best overall restaurant experience I've had in the Boston area. These people understand that a meal is more than just about the food, it's about the service, the wine, the scenery, and on top of all that, the flavors and combinations of culinary delights. Oleana turned an otherwise ordinary night into an experience I won't forget, and I can't wait to return.

I wandered in here on a whim with a friend a while back, completely underdressed but on the lookout for a good meal. When we went around the side to the entrance, someone called to us from the roof--"watch out, there's glass on the floor, I dropped a light bulb. Can you go inside and grab someone for me and ask them to bring a screwdriver?" Sure, no problem--so I went inside and told the hostess, "Hey your maintenance guy needs a screwdriver up on the roof." She laughed, "Oh... that's not a maintenance guy, that's the owner." And from that moment I knew this place was special. It's a good sign when the owner of a restaurant is up on the roof changing lightbulbs--it's clear what kind of care and attention goes into every detail. If only I had known then how that would translate to the entire experience, and especially the food. Once out on the patio, we waited for only a few minutes by the fountain before being seated. I had a glass of wine and was just enjoying the patio, the bread, and the otherworldly feel of the place. You can't be stressed out here; it's designed perfectly to be almost a Shangri-La of spaces, and all in the middle of Inman square. Unexpected. Impeccable. At this point we were so confident in the holistic quality of the restaurant that we decided to trust the chef and go all out. We ordered a tasting menu, and two other mezos/appetizers that sounded good from the specials menu. If you're here, I *highly* recommend this. You probably won't end up spending any more than if you had ordered individually, but you'll taste some incredible things you might not have thought to get.

Oleana serves inspired, well prepared food from the best possible ingredients. The menu is well priced for the quality. The wine list is very food friendly, includes many organic and bio-dynamic wines, and is also reasonably priced. It's a great place for vegetarians. A vegetarian tasting menu was available the night I went - it was superb and plentiful (I could not finish it). The service is friendly, prompt, and helpful. The space is relaxing and casually elegant. I was there on a Tuesday evening when two men were quietly playing lovely world music. As I was visiting the Boston area with my family, we brought along our children (8&10). While I would not call Oleana family-friendly, they were accommodating of our children. I should note one of my children is a highly selective eater, but they are both used to eating in high-end restaurants and a very well behaved in nice restaurants (or so we are told). Given that I live in the San Francisco Bay Area, I'm spoiled by excellent vegetarian friendly restaurants. Now I have a spot to return to in the Boston area that meets expectations.

This is a fantastic restaurant in Cambridge. The decor, music and smells will make you feel like you are in another world. I was a little skeptical when I heard Turkish Food, but those feelings were quickly squashed when I had the food. We started with the Fried Mussels with Hot Peppers and Turkish Tarator Sauce. The Mussels were fried to perfection and the batter was very light. I could have eaten a thousand of them, but I love fried food. The Vermont Quail was very tasty as well. The quail was very tender, which is hard to do because those little guys are so small. The last starter was the Sultans Delight. The Tamarind Glazed Beef was so tender. The Smokey Eggplant Puree went well with the dish and was even better slathered on the bread. We shared an entree for the evening. I highly suggest the Azuluna Pork, Crispy Pea Paella, Fried Fiddleheads and Paprika Sauce. The pork was just as tender as the beef and the accompaniments went so well with the meat. The meal was very flavorful and seasoned to perfection. I wish I could have gotten dessert, but I was so full. I did see them bring some out and they looked wonderful and decadent. I had the Sangria to drink, it was refreshing because it was humid outside, but it was not the best in the world. I think I might try something new if I ever go back again. The servers are great, so nice and knowledgeable about the menu. It really means a lot to me to see someone get excited about a menu.

Delicious. Delightful. Worth it. I've eaten at Oleana several times and the food is always very good. The service is typically really solid - I've had some dinners where the waitstaff was really top-notch and other times when it's been good but not spectacular. Either way the hummus and falafel small plates (meze) are just SO tasty. Definitely a great place to celebrate a special event or just when you need a particular pick-me-up. Not the fussiest or the fanciest food (this is a compliment from me!) nor the most elegant ambiance (wish it was a little more quiet),, but clearly a very special place for a great meal! Plus you NEED to get the Baked Alaska (YUM). It should be a requirement for going to the place. (I would give the food 5 stars; the overall ambiance - read: it can be pretty loud - and waitstaff variability knocked the overall experience to a 4).

Had dinner here on Friday night and it was superb!! Cute ambiance...great for intimate dinner or date place. Dim lighting, nice decor, not too loud. Service was excellent as well as all the recommendations. Started with the Moroccan-style Octopus and Fatoush. Tasty, light, unique flavors, great presentation. Everything was quickly eaten up with smiles. For entrees, had the Beef Kabob special (which was the hit!, beef with delicious flavors and cooked perfectly med-rare tender), Cod, and Lamb. Everything was tasty yet light, with delicate complimenting flavors. Dessert was the winner-Passion fruit Bisteeya....goodness what is this?? IT was to DIE for and a perfect ending to the meal. I think we literally ate this in 2 seconds and contemplated ordering a second. It's light, tart, fluffy, creamy, and thirst quenching all at the same time. Needless to say Oleana became a favorite in one evening...the food is just very unique. I love have great flavors without feeling like I gained 10 pounds eating a wonderful dinner. I'll definitely be back soon!

http://condensr.com

Page 13

Motivating Example

Aspect: Snippets

atmosphere: “stylish decor”, “awesome art”

food: “loved it!”, “tasty calzones!”

service: “fast and friendly”, “impatient waiters”

Importance of Context:

Ordered chicken parm and loved it! Friend had the veal. The service was ...

... by local artists.

food

Page 14

Multi-Aspect Summarization

Sequence Labeling Task

I ordered lunch from them the other day and I was [FOOD pleasantly surprised]. Our waiter dazzled me with his blue eyes and genuine smile, and all the waiters were [SERVICE extremely professional and efficient].

Content Topic Model

I ordered lunch from them the other day and I was pleasantly surprised. Our waiter dazzled me with his blue eyes and genuine smile, and all the waiters were extremely professional and efficient.

Page 15

The Big Disconnect

Discourse Modeling

- Topic Models

- Rhetorical Structure Analysis

Analysis Applications

- Information Extraction

- Sentiment Analysis

Page 16

Approach Overview

Task Labels: Observed

I had the shrimp salad and was [FOOD pleasantly surprised]. The [ATMOSPHERE decor was tasteful] and staff was [SERVICE extremely professional and efficient].

words

labels

Page 17

Approach Overview

Goal: Analysis applications sensitive to document structure

Task Labels: Observed

Content Labels: Latent

Page 18

Approach Overview

• Jointly learn structure and task parameters

– Topics are latent variables shaped by task

• Principled way to incorporate unlabeled data

– More unlabeled data, better performance

Page 19

Factorization

[Equation not preserved in the transcript. Its annotated parts: product over sentences; topic transition term; bag-of-words term; CRF term]
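The equation itself did not survive in the transcript. One factorization consistent with the four annotations on the slide (product over sentences, topic transitions, bag-of-words, CRF) is sketched below. All symbols are assumptions introduced here for illustration: T_i is the latent topic of sentence i, s_i its words, y_i its task labels, θ the content-model parameters, and φ the task (CRF) parameters.

```latex
P(T, s, y) \;=\; \prod_{i}
  \underbrace{P_\theta(T_i \mid T_{i-1})}_{\text{topic transition}}\;
  \underbrace{\prod_{w \in s_i} P_\theta(w \mid T_i)}_{\text{bag-of-words}}\;
  \underbrace{P_\phi(y_i \mid s_i, T_i)}_{\text{CRF}}
```

Read this way, the content model explains the words of each sentence through its topic, while the CRF sees the topic only as extra conditioning context for the labels.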

Page 20

Multi-Aspect Summarization

Content Model: Sentence-Level HMM

{ ... chicken, parm, ordered, loved, ... }

Task: Token-Level conditional random field

Ordered chicken parm and loved it

Page 21

Augmenting CRF with Topics

Add context features

[Figure: CRF diagram; the visible label is "topic 3"]
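One simple way to "add context features" is to append the sentence's topic, and its conjunction with the current word, to each token's feature set. The sketch below shows only that idea. The feature names and the function `token_features` are hypothetical and are not taken from the released code.

```python
def token_features(tokens, i, sentence_topic=None):
    """Feature dict for token i; sentence_topic is the content-model topic id, if any."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    feats = {
        "bias": 1.0,
        "word=" + word: 1.0,
        "prev=" + prev: 1.0,
    }
    if sentence_topic is not None:
        # Context features: the topic alone, and the topic conjoined with the word.
        feats["topic=%d" % sentence_topic] = 1.0
        feats["topic=%d|word=%s" % (sentence_topic, word)] = 1.0
    return feats

tokens = "Ordered chicken parm and loved it".split()
plain = token_features(tokens, 4)
with_topic = token_features(tokens, 4, sentence_topic=3)
```

With the conjunction feature, the same word ("loved") can receive different weights depending on whether the sentence is in a food-like or service-like topic.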

Page 22

Joint Learning Objective

[Equation not preserved in the transcript. Annotated part: content and task parameters]

Page 23

Joint Learning

E-Step:

Can be computed using Forward-Backward algorithm
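For reference, a generic forward-backward pass over a chain of sentence topics might look like the sketch below. It is the textbook algorithm, not the talk's exact E-step: here `emit_probs[i][k]` stands for whatever per-sentence likelihood the model assigns to sentence i under topic k, and the numbers in the example are arbitrary.

```python
def forward_backward(init, trans, emit_probs):
    """Posterior state marginals for an HMM.

    init[k]: P(state_0 = k); trans[j][k]: P(next = k | current = j);
    emit_probs[i][k]: likelihood of observation i under state k.
    No rescaling is done, so this is only suitable for short chains.
    """
    n, K = len(emit_probs), len(init)
    alpha = [[0.0] * K for _ in range(n)]
    beta = [[1.0] * K for _ in range(n)]
    for k in range(K):
        alpha[0][k] = init[k] * emit_probs[0][k]
    for i in range(1, n):
        for k in range(K):
            alpha[i][k] = emit_probs[i][k] * sum(
                alpha[i - 1][j] * trans[j][k] for j in range(K))
    for i in range(n - 2, -1, -1):
        for j in range(K):
            beta[i][j] = sum(
                trans[j][k] * emit_probs[i + 1][k] * beta[i + 1][k] for k in range(K))
    evidence = sum(alpha[n - 1])
    return [[alpha[i][k] * beta[i][k] / evidence for k in range(K)] for i in range(n)]

# Two topics, three sentences (toy numbers).
posteriors = forward_backward(
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    emit_probs=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
)
```

Each row of `posteriors` is a distribution over topics for one sentence; these are the expected counts that an M-step would consume.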

Page 24

Joint Learning

M-Step:

For the content parameters: standard normalization of counts from the E-Step.

For the task parameters: weighted conditional likelihood objective.
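The "standard normalization" half of the M-step amounts to dividing each expected count by the total for its conditioning context. A minimal sketch, assuming the expected counts are stored as nested dictionaries (the data layout and the optional `smoothing` argument are assumptions made here):

```python
def normalize_counts(expected_counts, smoothing=0.0):
    """Turn expected counts {context: {outcome: count}} into conditional probabilities."""
    probs = {}
    for context, counts in expected_counts.items():
        total = sum(counts.values()) + smoothing * len(counts)
        probs[context] = {o: (c + smoothing) / total for o, c in counts.items()}
    return probs

# Expected topic-to-topic transition counts from an E-step (toy numbers).
transition_probs = normalize_counts({"topic0": {"topic0": 3.0, "topic1": 1.0}})
```

The task parameters have no such closed form; a weighted conditional likelihood is typically maximized with a gradient-based optimizer, which is not shown here.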

Page 25

Supervised Objective

Labeled data for content and task parameters

Page 26

Semi-Supervised Objective

Labeled data for content and task parameters

Unlabeled data for content parameters

Page 27

Data set

• Amazon TV reviews

– Train: 35 reviews

– Test: 24 reviews

– Unlabeled: 12,600 reviews

• Yelp restaurant reviews

– Train: 48 reviews

– Test: 48 reviews

– Unlabeled: 33,000 reviews

Page 28

Information Extraction

Goal: Extract phrases from review text in pre-specified categories

Input: User-generated review text, labeled training data

Output: Labeled phrases in each category

I came here with my husband for the tasting menu, and we were not disappointed. We got to sit at the chef’s table, which overlooked the kitchen. The service was polite and knowledgeable, the atmosphere was elegant and energetic and the food was wonderfully creative and delicious.

FOOD SERVICE ATMOSPHERE PRICE OVERALL

Page 29

Systems

• NoCM: Just the CRF, no content model

• IndepCM: Estimate content model parameters first, then use them in the CRF

• JointCM: Estimate content and CRF parameters jointly using EM

Page 30

Results

Token F-measure Evaluation
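The results chart is not preserved, but the metric can be illustrated. The sketch below computes a micro-averaged token-level F-measure over non-null labels. The exact protocol used in the talk (averaging scheme, treatment of the null label) is not stated in the transcript, so those choices are assumptions.

```python
def token_f_measure(gold, predicted, null_label="NONE"):
    """Micro-averaged token-level precision, recall, and F1 over non-null labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p and g != null_label)
    n_pred = sum(1 for p in predicted if p != null_label)
    n_gold = sum(1 for g in gold if g != null_label)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["NONE", "FOOD", "FOOD", "NONE", "SERVICE"]
pred = ["NONE", "FOOD", "NONE", "SERVICE", "SERVICE"]
precision, recall, f1 = token_f_measure(gold, pred)
```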

Page 31

Impact of Unlabeled Data

Setup: Using the Amazon corpus, fix the amount of labeled data, vary the amount of unlabeled data

[Figure: chart; x-axis: Number of Unlabeled Reviews (0; 6,300; 12,600); value labels in order: 41.5, 47.3, 47.8; y-axis ticks: 38, 44, 50]

Page 32

Multi-Aspect Sentiment Ranking

Task: Predict sentiment (1-10) for each aspect

Aspect   Rating

picture  9.0

audio    9.5

extra    7.0

Approach:

• Same objective as summarization

• Different E- and M-Steps [See paper]

Page 33

Multi-Aspect Sentiment Ranking

DVD Review Domain

L2 Error: Lower is better

Page 34

Paper & Code

• Paper: http://groups.csail.mit.edu/rbg/code/content_structure/sauper-emnlp-10.pdf

• Code: http://groups.csail.mit.edu/rbg/code/content_structure/code.tar.gz

• Data: http://groups.csail.mit.edu/rbg/code/content_structure/data.tgz

Page 35

Agree to Disagree

+ The fried oysters were very good

− The catfish tasted dry and bland and boring

+ The star of the plate was the grits

+ The gnocchi with mushrooms was outstanding

+ The catfish approaches perfection

+ The shrimp and grits are nothing less than spectacular

[Figure residue: star ratings and the labels #1 and #2]

Page 36

Review Aggregation

• Hundreds of reviews for each product

• Opinions vary widely

→ Need to aggregate statistics

• Histograms show sentiment distribution, but it’s not enough

Page 37

Aspect-based Analysis

Prior work: Use a set of predefined domain-specific product aspects (e.g., Snyder and Barzilay 2007)

→ Coarse level analysis

Page 38

Informative Aggregation

Useful information:

– What’s the best dish at this restaurant?

– What do people dislike about this restaurant?

– Which dishes do people disagree about?

Page 39

Informative Aggregation

Aggregation of product-specific aspects

Japanese Restaurant

We had a great time last night at this restaurant. The sushi was so incredibly fresh. We had a bad experience at the bar, though. My chocolate martini was absolutely terrible. We will be back, but we’ll skip the drinks.

Wow, I can’t believe how much this place has changed! They used to be mediocre, but now they never fail to amaze. We started off at the bar with awesome sake bombs. When we got to our table, the sushi was fantastic.

I have such mixed things to say about this restaurant. On one hand, their sushi is unquestionably the best in the city. On the other, the atmosphere isn’t that great. Plus, their drinks are completely watered down.

Relevant aspects: User sentiment

Sushi: 100% positive

Chicken: 33% positive
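Once every snippet carries an aspect and a sentiment label, the per-aspect percentages are a simple count. In the sketch below the labels are assigned by hand from the three example reviews (an assumption made for illustration; in the talk they are the model's output), and the function name `percent_positive` is hypothetical.

```python
from collections import defaultdict

def percent_positive(labeled_snippets):
    """labeled_snippets: iterable of (aspect, sentiment), sentiment is '+' or '-'."""
    positive, total = defaultdict(int), defaultdict(int)
    for aspect, sentiment in labeled_snippets:
        total[aspect] += 1
        if sentiment == "+":
            positive[aspect] += 1
    return {aspect: round(100 * positive[aspect] / total[aspect]) for aspect in total}

# Hand-labeled from the three reviews above (illustrative only).
SNIPPETS = [
    ("sushi", "+"),   # "The sushi was so incredibly fresh."
    ("sushi", "+"),   # "the sushi was fantastic"
    ("sushi", "+"),   # "their sushi is unquestionably the best in the city"
    ("drinks", "-"),  # "My chocolate martini was absolutely terrible."
    ("drinks", "+"),  # "awesome sake bombs"
    ("drinks", "-"),  # "their drinks are completely watered down"
]
summary = percent_positive(SNIPPETS)
```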

Page 40

Corpus-driven Aspect Definition

Define aspects dynamically based on reviews

Bakery

- Cookies

- Cakes

- Pies

Japanese Restaurant

- Sushi

- Sake

- Dessert

→ Aspects specific to each product

Page 41

Corpus-driven Aspect Definition

Allows comparison across multiple reviews

– Consensus (both positive and negative): What’s the best/worst aspect of this product?

Bakery

I buy all of my baked goods from this bakery. Their bread is so delicious! It’s also good for all kinds of baked goods. They also have some truly beautiful cakes on display. Even their cookies are great!

I picked up a birthday cake for my son here yesterday. It was the most amazing cake I’ve ever seen! The decorations were outstanding, and all the kids loved the chocolate icing. I’ll definitely come back!

This place is nice for some baked goods, but some things are really nasty. The loaf of bread I bought was stale! They were happy to take it back and give me another, but I’ll be watching next time.

…truly beautiful cakes on display.

…most amazing cake I’ve ever seen!

Page 42

Corpus-driven Aspect Definition

Allows comparison across multiple reviews

– Consensus (both positive and negative): What’s the best/worst aspect of this product?

– Conflicts of opinion: What aspects do people disagree about?

Bakery (same three reviews as on the previous slide)

Their bread is so delicious!

The loaf of bread I bought was stale!
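Given per-aspect shares of positive snippets, consensus and conflict can be told apart with two thresholds. This is a deliberately crude sketch: the thresholds of 25 and 75 and the function name `summarize_aspects` are arbitrary choices made here, not values from the talk.

```python
def summarize_aspects(percent_pos, low=25, high=75):
    """Classify each aspect by its share of positive snippets (thresholds are arbitrary)."""
    summary = {}
    for aspect, pct in percent_pos.items():
        if pct >= high:
            summary[aspect] = "consensus: positive"
        elif pct <= low:
            summary[aspect] = "consensus: negative"
        else:
            summary[aspect] = "conflict"
    return summary

# Bakery example: two positive cake snippets; one positive and one negative bread snippet.
bakery = summarize_aspects({"cakes": 100, "bread": 50})
```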

Page 43

Task: Input

Input:

– Food-related snippets from restaurant reviews

• Concise description of a user’s opinion

– Automatically extracted from full review text (Sauper et al. 2010)

– Segmented by restaurant, but no additional annotation

We went to the restaurant, and the sushi was incredibly fresh.

Japanese Restaurant

the sushi was so incredibly fresh

best chicken katsu in town

drinks are fun, fresh, and delicious

Bakery

I’d recommend the apple pie

the bread was disappointingly stale

chocolate torte is the stuff of dreams

Page 44

Task: Output

Output:

– Relevant aspects for each restaurant

– Aspect label for each snippet

– Sentiment label for each snippet

Mexican Restaurant

Burrito

+ they had a decent burrito

− the burrito was mediocre at best

− the burrito was heavily cilantroed

Salsa

+ the salsa is incredible

+ the mango salsa is perfectly diced

+ hola free chips & salsa

Page 45

Possible Solution

Use clustering based on lexical similarity

Problem: Clusters and aspects are not aligned!

Partial output of state-of-the-art clustering system:

the martinis were very good

the martinis were tasty

the wine list was pricey

their wine selection is horrible

the sushi was the best I’d ever had

best paella I’d ever had

the fillet was the best steak we’d ever had

it’s the best soup I’ve ever had
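The misalignment is easy to reproduce with any surface-overlap measure. The sketch below uses Jaccard word overlap as a stand-in for "lexical similarity" (the clustering system referred to on the slide is not specified, so this choice is an assumption). The third snippet comes from the Japanese restaurant examples earlier in the deck.

```python
def jaccard(a, b):
    """Word-overlap similarity between two snippets (0 = disjoint, 1 = identical sets)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

sushi_best = "the sushi was the best I'd ever had"
paella_best = "best paella I'd ever had"
sushi_fresh = "the sushi was so incredibly fresh"

cross_aspect = jaccard(sushi_best, paella_best)  # different dishes, shared sentiment phrasing
same_aspect = jaccard(sushi_best, sushi_fresh)   # same dish, different sentiment phrasing
```

Here the two "best ... I'd ever had" snippets overlap more than the two sushi snippets do, so a similarity-driven clustering groups by sentiment wording rather than by dish.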

Page 46

Our Solution

• Jointly model aspect and sentiment

• Leverage data to distinguish sentiment and aspect

Bakery (sentiment words; aspect words)

Review 1: delicious, fresh; pies, cookies

Review 2: fantastic, amazing; cakes, pies

Review 3: beautiful, stale; cakes, bread

Japanese (sentiment words; aspect words)

Review 1: fantastic, smooth; salmon, sake

Review 2: beautiful, fresh; maki, salmon

Review 3: delicious, bland; maki, miso

Page 47

Model: Overview

• Each snippet has an aspect and a sentiment

• Each word is drawn from a topic distribution:

– Aspects are specific to a single product (example words: dessert, pizza, pad thai)

– Sentiment is global across all products (example words: great, horrible, amazing)

– Background distribution is global (example words: our, was, food)

• Transition distribution encodes word topic transitions

They had wonderful appetizers.

Page 48: Регина Барзилай "Извлечение информации из социальных медиа"

Model: Generative Story

1.  Global distributions
2.  Restaurant-level distributions
3.  Snippet-level latent structure
4.  Words

Page 49: Регина Барзилай "Извлечение информации из социальных медиа"

Model: Generative Story

Globally,

a.  Background distribution (B): word distribution for stop words and in-domain white noise
b.  Sentiment distributions (+, −): word distributions over positive and negative sentiment words; small bias for seed words
c.  Transition distribution (Λ): first-order Markov distribution of word topic transitions

Page 50: Регина Барзилай "Извлечение информации из социальных медиа"

Model: Generative Story

For each restaurant,

a.  Aspect distributions (1, 2, …, K): word distribution for each aspect
b.  Aspect-sentiment binomials (φ1, φ2, …, φK): probability of positive vs. negative sentiment for each aspect
c.  Aspect multinomial (ψ): probability of each aspect

Page 51: Регина Барзилай "Извлечение информации из социальных медиа"

Model: Generative Story

For each snippet from a restaurant,

a.  Aspect: chosen from aspect multinomial
b.  Sentiment: chosen from aspect-sentiment binomial
c.  Sequence of word topics: Background, Aspect, or Sentiment, selected from transition distribution

[Figure: aspect 2 drawn from ψ; sentiment + drawn from φ2; word topic sequence B B A S S drawn from Λ.]

Page 52: Регина Барзилай "Извлечение информации из социальных медиа"

Model: Generative Story

For each word,

a.  Word: chosen from topic-specific distribution based on word topic sequence

[Figure: aspect 2, sentiment +, and background B distributions; word topic sequence B B A S S; generated snippet “The pizza was really great”.]
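The four-step generative story can be sketched as a sampler. This is only an illustration under stated assumptions: every probability and vocabulary entry in `PARAMS` is invented, the priors and the seed-word bias are left out, and the names `draw`, `sample_snippet`, and `PARAMS` are hypothetical rather than taken from the released code.

```python
import random


def draw(rng, dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_snippet(rng, params, length):
    """Sample one snippet: aspect, sentiment, word topic sequence, words."""
    # a. Aspect chosen from the restaurant's aspect multinomial (psi).
    aspect = draw(rng, params["psi"])
    # b. Sentiment chosen from that aspect's binomial (phi = P(positive)).
    polarity = "+" if rng.random() < params["phi"][aspect] else "-"
    # c. Word topics follow a first-order Markov chain (Lambda).
    topics, prev = [], "START"
    for _ in range(length):
        prev = draw(rng, params["transition"][prev])
        topics.append(prev)
    # Each word comes from the distribution selected by its topic.
    source = {
        "B": params["background"],
        "A": params["aspects"][aspect],
        "S": params["sentiment"][polarity],
    }
    words = [draw(rng, source[t]) for t in topics]
    return aspect, polarity, topics, words


# Toy parameters, invented for illustration only.
PARAMS = {
    "background": {"the": 0.4, "was": 0.3, "our": 0.3},
    "sentiment": {"+": {"great": 0.5, "amazing": 0.5}, "-": {"horrible": 1.0}},
    "aspects": {
        "pizza": {"pizza": 0.7, "crust": 0.3},
        "dessert": {"dessert": 0.6, "cake": 0.4},
    },
    "psi": {"pizza": 0.5, "dessert": 0.5},
    "phi": {"pizza": 0.8, "dessert": 0.3},
    "transition": {
        "START": {"B": 0.7, "A": 0.2, "S": 0.1},
        "B": {"B": 0.3, "A": 0.4, "S": 0.3},
        "A": {"B": 0.6, "A": 0.2, "S": 0.2},
        "S": {"B": 0.2, "A": 0.1, "S": 0.7},
    },
}

aspect, polarity, topics, words = sample_snippet(random.Random(0), PARAMS, length=5)
print(aspect, polarity, topics, words)
```

Inference runs this story in reverse: given only the words, it recovers the aspect, the sentiment, and the topic of each word.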

Page 53: Регина Барзилай "Извлечение информации из социальных медиа"

Standard Variational Inference

•  Desired posterior: over model parameters and latent structure, given observed data

Page 54: Регина Барзилай "Извлечение информации из социальных медиа"

Standard Variational Inference

•  Desired posterior: over model parameters and latent structure, given observed data
•  Optimizing directly is intractable
•  Instead, optimize variational objective with mean-field factorization, s.t. the variational distribution factorizes
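The slide's own equations did not survive extraction. For orientation, the generic mean-field setup can be written as below. The notation is assumed, not taken from the slide: θ for model parameters, z for latent structure, x for observed data, and q for the variational distribution. The fully factorized form is one common choice and may differ from the factorization actually used.

```latex
\[
q^{*} \;=\; \arg\min_{q}\;
\mathrm{KL}\bigl(q(\theta, z)\,\big\|\,P(\theta, z \mid x)\bigr)
\quad \text{s.t.} \quad
q(\theta, z) \;=\; q(\theta)\,\prod_{i} q(z_i)
\]
```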

Page 55: Регина Барзилай "Извлечение информации из социальных медиа"

Data Set

Food-related snippets from Yelp restaurant reviews (Sauper et al. 2010)

–  13,879 total snippets
–  328 restaurants
–  42.1 snippets per restaurant (high variance)
–  7.8 words per snippet

Seed words for sentiment distributions

–  42 positive, 33 negative
–  Relevant to domain (e.g., “delicious”)

Page 56: Регина Барзилай "Извлечение информации из социальных медиа"

Experiments: Aspect Clustering

•  Gold standard
   –  Clusters over 3,250 snippets
   –  Collected via Mechanical Turk
•  Baseline
   –  CLUTO clustering weighted by TF*IDF
•  MUC cluster evaluation metric
   –  Based on number of cluster merges and splits required to achieve gold data
•  Both systems allowed 10 clusters per restaurant
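The MUC metric can be sketched as follows, using the standard link-based formulation: recall counts how many links within each gold cluster survive in the predicted partition, and precision swaps the two roles. The function names and toy clusters are hypothetical; this is not the evaluation script used for the reported numbers.

```python
def muc_score(key, response):
    """Link-based MUC score of `response` measured against `key` clusters."""
    cluster_of = {item: idx for idx, cluster in enumerate(response) for item in cluster}
    preserved = possible = 0
    for cluster in key:
        # Parts this key cluster is split into by the response clusters;
        # an item absent from the response forms a part of its own.
        parts = {cluster_of.get(item, ("unassigned", item)) for item in cluster}
        preserved += len(cluster) - len(parts)
        possible += len(cluster) - 1
    return preserved / possible if possible else 0.0


def muc_f1(gold, predicted):
    """Harmonic mean of MUC recall and MUC precision."""
    recall = muc_score(gold, predicted)
    precision = muc_score(predicted, gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two gold aspects, one predicted cluster mixes them.
gold = [{"m1", "m2", "m3"}, {"c1", "c2", "c3"}]
predicted = [{"m1", "m2"}, {"m3", "c1"}, {"c2", "c3"}]
score = muc_f1(gold, predicted)
print(round(score, 4))
```

In the toy example each gold cluster needs one merge to be recovered (recall 2/4) and one predicted cluster needs a split (precision 2/3).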

Page 57: Регина Барзилай "Извлечение информации из социальных медиа"

Experiments: Aspect Clustering

MUC F1
Baseline | 69.3
Our model | 75.5

Example clusters

Our model:
the martinis are very good
the martini selection looked delicious
the s’mores martini sounded excellent

Baseline:
the martinis are very good
the mozzarella was very fresh
the fish and various meets were well made

Baseline:
the carrot cake was delicious
it was rich, creamy, and delicious
the pasta bolognese was rich and robust

Our model:
the carrot cake was delicious
the best carrot cake I’ve ever eaten
carrot cake was deliciously moist

Page 58: Регина Барзилай "Извлечение информации из социальных медиа"

Error Analysis

Number of sentiment and aspect errors approximately equal

Aspect errors
−  Similar aspect words in different contexts
   the cream cheese wasn’t bad
   ice cream was just delicious

Sentiment errors
−  Rare sentiment words
   belgian frites are very crave-able
   the blackened chicken was meh
   chicken enchiladas are yummy
−  Negation, sometimes
   the cream cheese wasn’t bad

Page 59: Регина Барзилай "Извлечение информации из социальных медиа"

Paper & Code

•  Paper: http://groups.csail.mit.edu/rbg/code/content_attitude/sauper-acl-11.pdf
•  Code: http://groups.csail.mit.edu/rbg/code/content_attitude/code.tar.gz

Page 60: Регина Барзилай "Извлечение информации из социальных медиа"

The Task

•  Goal: Automatic construction of event records from Twitter
•  Input: Stream of Twitter messages
•  Output: Table of event records

Example messages:
Seated at @carnegiehall waiting for @CraigyFerg’s show
@DJPaulyD absolutely killed it at Terminal 5 last night.
Craig, nice seeing you #noelnight this weekend @becksdavis!

Artist | Venue
Craig Ferguson | Carnegie Hall
DJ Pauly D | Terminal 5

Page 61: Регина Барзилай "Извлечение информации из социальных медиа"

Example Output

Artist | Venue
Sunday Gospel Brunch | B.B. King Blues Club
J. Cole | Highline Ballroom
Hall & Oates | Beacon Theater
Jeff Tweedy | Bowery Ballroom
Jim Gaffigan | Best Buy Theater
Amos Lee | Bardavon Opera House

Page 62: Регина Барзилай "Извлечение информации из социальных медиа"

IE for Social Media: Challenges

•  Messages are short ⇒ An individual message may not contain all event fields.
•  Messages are expressed in colloquial language ⇒ The mapping between messages and event records is not obvious.

Example record: Artist: Craig Ferguson; Venue: Carnegie Hall

Seated at @carnegiehall waiting for @CraigyFerg’s show

RT @leerader : getting REALLY stoked for #CraigyAtCarnegie sat night.

Page 63: Регина Барзилай "Извлечение информации из социальных медиа"

IE for Social Media: Opportunity

Significant redundancy in Twitter stream:

Seated at @carnegiehall waiting for @CraigyFerg’s show
@DJPaulyD absolutely killed it at Terminal 5 last night.
Craig, nice seeing you #noelnight this weekend @becksdavis!

Approach: Drive event extraction by modeling agreement in message stream.

Page 64: Регина Барзилай "Извлечение информации из социальных медиа"

Model Functionality

•  Message level analysis: Tag words in message with event-field labels.

Message (x) | @YonderMountain | rocking | Mercury | Lounge
Label (y) | artist | none | venue | venue

Page 65: Регина Барзилай "Извлечение информации из социальных медиа"

Model Functionality

•  Message level analysis: Tag words in message with event-field labels.
•  Message clustering: Group messages based on events.
•  Event records: Induce canonical value for each field.

Record (R): Artist: Craig Ferguson; Venue: Carnegie Hall
Aligned messages (A):
Craig Ferguson, what a riot! Carnegie is in stitches
#CraigAtCarnie is starting now! #iamsoexcited

Record (R): Artist: Radiohead; Venue: Coliseum
Aligned messages (A):
Going to see Radiohead at the Coliseum tonight!
Pumped for R A D I O H E A D !!!

Page 66: Регина Барзилай "Извлечение информации из социальных медиа"

Model Overview

Source of supervision: Example event records

–  Alignment between records and messages not observed.
–  Message level field annotations not observed.

Example records:
July 16, 5:30pm at American Folk Art Museum
Jun 17, 8:00 PM at Izod Center
Jun 17, 8:00 PM at Tarrytown Music Hall

Page 67: Регина Барзилай "Извлечение информации из социальных медиа"

[Figure 3: Factor graph representation of our model. Circles represent variables and squares represent factors. For readability, we depict the graph broken out as a set of templates; the full graph is the combination of these factor templates applied to each variable. See Section 4 for further details. Template labels: $\phi_{SEQ}$ over $X_i$, $Y_i$; $\phi_{UNQ}$ over the $\ell$th field (across records); $\phi_{POP}$; $\phi_{CON}$ over the $k$th record.]

[…] over pairwise cliques:

\[
\phi_{SEQ}(x, y) = \exp\{\theta_{SEQ}^{T} f_{SEQ}(x, y)\}
= \exp\Bigl\{\theta_{SEQ}^{T} \sum_{j} f_{SEQ}(x, y_j, y_{j+1})\Bigr\}
\]

This factor is meant to encode the typical message contexts in which fields are evoked (e.g. going to see X tonight). Many of the features characterize how likely a given token label, such as ARTIST, is for a given position in the message sequence conditioning arbitrarily on message text context.

The feature function $f_{SEQ}(x, y)$ for this component encodes each token’s identity; word shape [2]; whether that token matches a set of regular expressions encoding common emoticons, time references, and venue types; and whether the token matches a bag of words observed in artist names (scraped from Wikipedia; 21,475 distinct tokens from 22,833 distinct names) or a bag of words observed in New York City venue names (scraped from NYC.com; 304 distinct tokens from 169 distinct names) [3]. The only edge feature is label-to-label.

[2] e.g.: xxx, XXX, Xxx, or other
[3] These are just features, not a filter; we are free to extract any artist or venue regardless of their inclusion in this list.

4.2 Record Uniqueness Factor

One challenge with Twitter is the so-called echo chamber effect: when a topic becomes popular, or “trends,” it quickly dominates the conversation online. As a result some events may have only a few referent messages while other more popular events may have thousands or more. In such a circumstance, the messages for a popular event may collect to form multiple identical record clusters. Since we fix the number of records learned, such behavior inhibits the discovery of less talked-about events. Instead, we would rather have just two records: one with two aligned messages and another with thousands. To encourage this outcome, we introduce a potential that rewards fields for being unique across records.

The uniqueness potential $\phi_{UNQ}(R^{\ell})$ encodes the preference that each of the values $R^{\ell}_{1}, \ldots, R^{\ell}_{K}$ for each field $\ell$ do not overlap textually. This factor factorizes over pairs of records:

\[
\phi_{UNQ}(R^{\ell}) = \prod_{k \neq k'} \phi_{UNQ}(R^{\ell}_{k}, R^{\ell}_{k'})
\]

where $R^{\ell}_{k}$ and $R^{\ell}_{k'}$ are the values of field $\ell$ for two records $R_k$ and $R_{k'}$. The potential over this pair of values is given by:

\[
\phi_{UNQ}(R^{\ell}_{k}, R^{\ell}_{k'}) = \exp\{-\theta_{SIM}^{T} f_{SIM}(R^{\ell}_{k}, R^{\ell}_{k'})\}
\]

where $f_{SIM}$ computes the likeness of the two values at the token level:

\[
f_{SIM}(R^{\ell}_{k}, R^{\ell}_{k'}) = \frac{|R^{\ell}_{k} \cap R^{\ell}_{k'}|}{\max(|R^{\ell}_{k}|, |R^{\ell}_{k'}|)}
\]

This uniqueness potential does not encode any preference for record values; it simply encourages each field $\ell$ to be distinct across records.

4.3 Term Popularity Factor

The term popularity factor $\phi_{POP}$ is the first of two factors that guide the clustering of messages. Because speech on Twitter is colloquial, we would like these clusters to be amenable to many variations of the canonical record properties that are ultimately learned. The $\phi_{POP}$ factor accomplishes this by representing a lenient compatibility score between a […]

Model Overview

•  (y) Message level analysis
•  (A) Message clustering
•  (R) Event records

Learn jointly in factor graph model:

\[
P(R, A, y \mid x) \;\propto\;
\underbrace{(\phi_{SEQ})}_{\text{Sequence Labeling}}\,
\underbrace{(\phi_{UNQ})}_{\text{Record Uniqueness}}\,
\underbrace{(\phi_{POP})}_{\text{Term Popularity}}\,
\underbrace{(\phi_{CON})}_{\text{Record Consistency}}
\]

Page 68: Регина Барзилай "Извлечение информации из социальных медиа"

Sequence  Labeling  Factor  

@YonderMountain | rocking | Mercury | Lounge
artist | none | venue | venue

•  Similar to chain CRF
•  Features on token and label
   –  Wikipedia match, context, etc.

\[
\phi_{SEQ}(x, y) = \exp\{\theta_{SEQ}^{T} f_{SEQ}(x, y)\}
\]

Example features: IsWikipediaMatch, word+1=“rocking”, IsUserMention, ….
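As a rough sketch of how such a log-linear potential can be evaluated, the code below sums weighted token features and label-to-label edge features, then exponentiates. The feature names echo the slide, but the feature templates, the weights in `theta`, and the helper names are assumptions made for illustration.

```python
import math


def token_features(tokens, j, label, artist_words):
    """Binary features for token j paired with a candidate label."""
    tok = tokens[j]
    feats = {f"word={tok.lower()}|{label}": 1.0}
    if tok.startswith("@"):
        feats[f"IsUserMention|{label}"] = 1.0
    if tok.lstrip("@#").lower() in artist_words:
        feats[f"IsWikipediaMatch|{label}"] = 1.0
    if j + 1 < len(tokens):
        feats[f"word+1={tokens[j + 1].lower()}|{label}"] = 1.0
    return feats


def seq_potential(tokens, labels, theta, artist_words):
    """exp(theta . f(x, y)) with token features and label-to-label edges."""
    score = 0.0
    for j, label in enumerate(labels):
        for name, value in token_features(tokens, j, label, artist_words).items():
            score += theta.get(name, 0.0) * value
        if j + 1 < len(labels):
            score += theta.get(f"{label}->{labels[j + 1]}", 0.0)
    return math.exp(score)


tokens = ["@YonderMountain", "rocking", "Mercury", "Lounge"]
artist_words = {"yondermountain"}  # stand-in for the Wikipedia word list
theta = {  # hand-set weights, purely illustrative
    "IsUserMention|artist": 1.5,
    "IsWikipediaMatch|artist": 0.5,
    "word+1=rocking|artist": 1.0,
    "venue->venue": 0.5,
}

good = seq_potential(tokens, ["artist", "none", "venue", "venue"], theta, artist_words)
bad = seq_potential(tokens, ["none", "none", "none", "none"], theta, artist_words)
print(good > bad)
```

In the actual model the weights are learned, and this factor is combined with the record-level factors rather than decoded on its own.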

Page 69: Регина Барзилай "Извлечение информации из социальных медиа"

Term Popularity Factor

Artist | Venue
Dave Matthews Band | Slims

Message: Dave Matthews at Slims

[Labels shown on the slide: venue, venue, artist, artist]

\[
\phi_{POP}(x, y, R^{\ell}_{A} = v) = \sum_{j} \max_{k} \mathrm{Sim}(x_j, y_j, v_k)
\]

•  Match each labeled message token to best record value token
•  Token matching is IDF-weighted
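A minimal sketch of the sum-of-maxima score follows. The definition of Sim is an assumption: here a message token labeled with the field scores its IDF weight when it exactly matches a record value token, and zero otherwise. The token labels, IDF values, and the name `pop_score` are likewise invented for the example.

```python
def pop_score(tokens, labels, field, record_value, idf):
    """Sum over labeled message tokens of their best match in the record value."""
    value_tokens = [v.lower() for v in record_value.split()]
    total = 0.0
    for tok, lab in zip(tokens, labels):
        if lab != field:
            continue  # only tokens tagged with this field contribute
        word = tok.lower()
        # Best-matching record token; exact match scores the token's IDF weight.
        total += max(
            (idf.get(word, 1.0) if word == v else 0.0 for v in value_tokens),
            default=0.0,
        )
    return total


tokens = ["Dave", "Matthews", "at", "Slims"]
labels = ["artist", "artist", "none", "venue"]  # assumed labeling
idf = {"dave": 2.0, "matthews": 3.0, "band": 1.5}  # made-up IDF weights

artist_score = pop_score(tokens, labels, "artist", "Dave Matthews Band", idf)
venue_score = pop_score(tokens, labels, "venue", "Slims", idf)
mismatch = pop_score(tokens, labels, "artist", "Radiohead", idf)
print(artist_score, venue_score, mismatch)
```

Because each message token only needs one good match, a message that mentions part of a record value ("Dave Matthews") still scores well against the full value ("Dave Matthews Band").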

Page 70: Регина Барзилай "Извлечение информации из социальных медиа"

Record  Uniqueness  Factor  

XiYi

�SEQ

R�k

R�k+1

R�k�1

�UNQ �th field(across records)

�POP

R�k

Ai

Yi

Xi

Ai

Yi

Xi

R�k R�+1

k

�CON

k

kth record

Figure 3: Factor graph representation of our model. Circles represent variables and squares represent factors. Forreadability, we depict the graph broken out as a set of templates; the full graph is the combination of these factortemplates applied to each variable. See Section 4 for further details.

over pairwise cliques:

⇥SEQ(x, y) = exp{�TSEQfSEQ(x, y)}

= exp

�⇧

⇤�TSEQ

j

fSEQ(x, yj , yj+1)

⇥⌃

This factor is meant to encode the typical messagecontexts in which fields are evoked (e.g. going to seeX tonight). Many of the features characterize howlikely a given token label, such as ARTIST, is for agiven position in the message sequence conditioningarbitrarily on message text context.

The feature function fSEQ(x, y) for this compo-nent encodes each token’s identity; word shape2;whether that token matches a set of regular expres-sions encoding common emoticons, time references,and venue types; and whether the token matches abag of words observed in artist names (scraped fromWikipedia; 21,475 distinct tokens from 22,833 dis-tinct names) or a bag of words observed in NewYork City venue names (scraped from NYC.com;304 distinct tokens from 169 distinct names).3 Theonly edge feature is label-to-label.

4.2 Record Uniqueness FactorOne challenge with Twitter is the so-called echochamber effect: when a topic becomes popular, or“trends,” it quickly dominates the conversation on-line. As a result some events may have only a fewreferent messages while other more popular eventsmay have thousands or more. In such a circum-stance, the messages for a popular event may collectto form multiple identical record clusters. Since we

2e.g.: xxx, XXX, Xxx, or other3These are just features, not a filter; we are free to extract

any artist or venue regardless of their inclusion in this list.

fix the number of records learned, such behavior in-hibits the discovery of less talked-about events. In-stead, we would rather have just two records: onewith two aligned messages and another with thou-sands. To encourage this outcome, we introduce apotential that rewards fields for being unique acrossrecords.

The uniqueness potential $\phi_{UNQ}(R^\ell)$ encodes the preference that the values $R^\ell_1, \ldots, R^\ell_K$ for each field $\ell$ do not overlap textually. This factor factorizes over pairs of records:

$$\phi_{UNQ}(R^\ell) = \prod_{k \neq k'} \phi_{UNQ}(R^\ell_k, R^\ell_{k'})$$

where $R^\ell_k$ and $R^\ell_{k'}$ are the values of field $\ell$ for two records $R_k$ and $R_{k'}$. The potential over this pair of values is given by:

$$\phi_{UNQ}(R^\ell_k, R^\ell_{k'}) = \exp\{-\theta_{SIM}^{T} f_{SIM}(R^\ell_k, R^\ell_{k'})\}$$

where $f_{SIM}$ computes the likeness of the two values at the token level:

$$f_{SIM}(R^\ell_k, R^\ell_{k'}) = \frac{|R^\ell_k \cap R^\ell_{k'}|}{\max(|R^\ell_k|, |R^\ell_{k'}|)}$$

This uniqueness potential does not encode any preference for record values; it simply encourages each field $\ell$ to be distinct across records.

4.3 Term Popularity Factor

The term popularity factor $\phi_{POP}$ is the first of two factors that guide the clustering of messages. Because speech on Twitter is colloquial, we would like these clusters to be amenable to many variations of the canonical record properties that are ultimately learned. The $\phi_{POP}$ factor accomplishes this by representing a lenient compatibility score between a …

$$\phi_{UNQ}(R^\ell) = \prod_{k \neq k'} \phi_{UNQ}(R^\ell_k, R^\ell_{k'})$$

$$\phi_{UNQ}(R^\ell_k, R^\ell_{k'}) = \exp\{-\mathrm{Sim}(R^\ell_k, R^\ell_{k'})\}$$

• Discourage similar record values

Artist: Yonder Mountain Band

Artist: Yonder Mountain
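To make the uniqueness penalty concrete, here is a minimal Python sketch of the token-level similarity and the resulting pairwise potential for two artist values like those on the slide. It assumes whitespace tokenization, case-insensitive matching, and a single scalar weight `theta_sim` set to 1.0; none of these choices come from the talk, and the function names are illustrative only.

```python
import math


def f_sim(value_a: str, value_b: str) -> float:
    """Token-level overlap: |A ∩ B| / max(|A|, |B|).

    Assumption: values are tokenized on whitespace and compared
    case-insensitively as sets of tokens.
    """
    tokens_a = set(value_a.lower().split())
    tokens_b = set(value_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / max(len(tokens_a), len(tokens_b))


def uniqueness_potential(value_a: str, value_b: str, theta_sim: float = 1.0) -> float:
    """exp(-theta_sim * f_sim): smaller when the two values overlap more.

    Assumption: theta_sim is a single scalar weight chosen for illustration.
    """
    return math.exp(-theta_sim * f_sim(value_a, value_b))


# Two of the three tokens are shared, so the similarity is 2/3.
similar = f_sim("Yonder Mountain Band", "Yonder Mountain")
penalized = uniqueness_potential("Yonder Mountain Band", "Yonder Mountain")
# No shared tokens: similarity 0, potential exp(0) = 1 (no penalty).
distinct = uniqueness_potential("Yonder Mountain Band", "Slims")
```

Under these assumptions, two records whose artist fields nearly coincide receive a lower potential than two records with unrelated values, which is the behavior the bullet describes.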

70  

Page 71: Regina Barzilay "Information Extraction from Social Media"

Record Consistency Factor

[Figure: factor templates $\phi_{SEQ}$, $\phi_{UNQ}$ ($\ell$th field, across records), $\phi_{POP}$, and $\phi_{CON}$ ($k$th record) over the variables $X_i$, $Y_i$, $A_i$, and $R^\ell_k$]

Figure 3: Factor graph representation of our model. Circles represent variables and squares represent factors. For readability, we depict the graph broken out as a set of templates; the full graph is the combination of these factor templates applied to each variable. See Section 4 for further details.


• Encourage all record values to be in single message

• Active when there is some match for all record fields

$$\phi_{CON}(x, y, R_A) = \mathbb{I}\big[\phi_{POP}(x, y, R^\ell_A) > 0,\ \forall \ell\big]$$

Record: Artist = Dave Matthews Band, Venue = Slims

Message: Dave Matthews at Slims

venue   venue   artist   artist
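The indicator above can be sketched in a few lines of Python. The talk does not define $\phi_{POP}$ on this slide, so the sketch substitutes a hypothetical stand-in: a field counts as matched when some message token labeled with that field also occurs in the record's value. The per-token labels on the example message are likewise an assumption made for illustration.

```python
def pop_match(tokens, labels, field, value):
    """Hypothetical stand-in for phi_POP > 0.

    Assumption: a field matches when at least one token carrying that
    field's label appears among the tokens of the record value.
    """
    value_tokens = set(value.lower().split())
    return any(
        label == field and token.lower() in value_tokens
        for token, label in zip(tokens, labels)
    )


def consistency_indicator(tokens, labels, record):
    """Return 1 only if every field of the aligned record has some match."""
    return int(all(pop_match(tokens, labels, f, v) for f, v in record.items()))


record = {"artist": "Dave Matthews Band", "venue": "Slims"}
tokens = ["Dave", "Matthews", "at", "Slims"]
# Assumed labeling of the example message (not taken from the slide).
labels = ["artist", "artist", "none", "venue"]

active = consistency_indicator(tokens, labels, record)
# If the venue token is left unlabeled, one field has no match.
inactive = consistency_indicator(tokens, ["artist", "artist", "none", "none"], record)
```

With the assumed labels, both fields are matched and the factor is active; dropping the venue label leaves one field unmatched and switches it off.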

71  

Page 72: Regina Barzilay "Information Extraction from Social Media"

Inference

• Variational mean-field inference to approximate posterior

$$P(R, A, y \mid x) \approx Q(R, A, y) = \left( \prod_{k=1}^{K} \prod_{\ell} q(R^\ell_k) \right) \left( \prod_{i=1}^{n} q(A_i)\, q(y_i) \right)$$
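As a small illustration of what the fully factorized form means, the following Python sketch evaluates $Q$ for one joint assignment as a product of independent per-variable marginals. It does not perform the variational updates themselves; the variable keys and all probabilities are made-up toy values.

```python
import math


def log_q(assignment, marginals):
    """log Q(assignment) when Q is a product of per-variable marginals."""
    return sum(math.log(marginals[var][val]) for var, val in assignment.items())


# Toy marginals (assumed values): one record field R^artist_1, one
# alignment variable A_1, and one label sequence y_1.
marginals = {
    ("R", 1, "artist"): {"Dave Matthews Band": 0.8, "Yonder Mountain Band": 0.2},
    ("A", 1): {1: 0.9, 2: 0.1},
    ("y", 1): {("artist", "artist", "none", "venue"): 0.5,
               ("none", "none", "none", "venue"): 0.5},
}

assignment = {
    ("R", 1, "artist"): "Dave Matthews Band",
    ("A", 1): 1,
    ("y", 1): ("artist", "artist", "none", "venue"),
}

# 0.8 * 0.9 * 0.5, computed in log space.
q_value = math.exp(log_q(assignment, marginals))
```

Because each variable has its own marginal, the joint score of any assignment is just the product of the individual terms, which is what makes the approximation tractable.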

72  

Page 73: Regina Barzilay "Information Extraction from Social Media"

Experiments: Dataset

Twitter data: Three weekends of filtered messages:
• Authors from New York
• Concert-related messages (MIRA-based classifier)

Resulting dataset: 5,800 messages
• Training: 2,184 messages (one weekend)
• Test: 3,662 messages (two weekends)

Gold event records:
• New York City events from NYC.com
• 11 events in training, 31 events in test

73  

Page 74: Regina Barzilay "Information Extraction from Social Media"

Experiment: Baselines

Voting methodology of Mann and Yarowsky (2005):
• Aggregate output of baseline IE predictions of each message
• Select top K events based on number of votes

Baseline IE predictors:
• List baseline: String overlap with given list of artists and venues (Wikipedia)
• CRF Voting baseline: Extract record for each labeled pair of fields
• CRF Low-Threshold: CRF voting, but extract records with a lower extraction threshold
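The voting step can be sketched as follows. This is a minimal Python illustration, assuming each baseline extractor returns a list of (artist, venue) pairs per message and that a message votes at most once for any given record; both are assumptions, and the example predictions are invented.

```python
from collections import Counter


def top_k_by_votes(per_message_records, k):
    """Aggregate per-message extractions and keep the k most-voted records."""
    votes = Counter()
    for records in per_message_records:
        # Assumption: a message votes once per distinct record it yields.
        votes.update(set(records))
    return [record for record, _ in votes.most_common(k)]


# Invented per-message predictions from some baseline extractor.
predictions = [
    [("Dave Matthews Band", "Slims")],
    [("Dave Matthews Band", "Slims")],
    [("Yonder Mountain Band", "Slims")],
    [],  # a message with no extracted record casts no vote
]

top_one = top_k_by_votes(predictions, 1)
top_two = top_k_by_votes(predictions, 2)
```

Only the ranking step is shown here; how each baseline produces its per-message records (list overlap or CRF labeling) is outside this sketch.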

74  

Page 75: Regina Barzilay "Information Extraction from Social Media"

Precision

[Figure: precision (manual evaluation), y-axis 0.2 to 0.9, versus number of records kept, x-axis 10 to 50. Series: Low Thresh, CRF, List, Our Work, Our Work + Con.]

75  

Page 76: Regina Barzilay "Information Extraction from Social Media"

Recall

[Figure: recall against gold event records, y-axis 0.2 to 0.7, versus k as a multiple of the number of gold records, x-axis 1 to 5. Series: Low Thresh, CRF, List, Our Work.]
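For readers who want to reproduce this kind of curve, here is a minimal Python sketch of recall against gold records when the number of kept predictions k is set as a multiple of the gold set size. It assumes predictions are already ranked, that the gold set is non-empty, and that a prediction counts only on an exact match; the matching criterion actually used in the experiments is not stated on the slide.

```python
def recall_at_multiple(ranked_predictions, gold_records, multiple):
    """Recall of the top-k predictions, with k = multiple * number of gold records.

    Assumptions: gold_records is non-empty and matching is exact equality.
    """
    k = int(multiple * len(gold_records))
    kept = set(ranked_predictions[:k])
    gold = set(gold_records)
    return len(kept & gold) / len(gold)


# Invented toy data: two gold records and four ranked predictions.
gold = [("Dave Matthews Band", "Slims"), ("Yonder Mountain Band", "Slims")]
ranked = [
    ("Dave Matthews Band", "Slims"),
    ("Dave Matthews", "Slims"),
    ("Yonder Mountain", "Slims"),
    ("Yonder Mountain Band", "Slims"),
]

recall_1x = recall_at_multiple(ranked, gold, 1)  # keeps the top 2 predictions
recall_2x = recall_at_multiple(ranked, gold, 2)  # keeps the top 4 predictions
```

Raising the multiple can only keep more predictions, so recall is non-decreasing along the x-axis.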

76  

Page 77: Regina Barzilay "Information Extraction from Social Media"

Paper & Code

• Paper: http://people.csail.mit.edu/regina/my_papers/twitter_acl2011.pdf

• Code: http://groups.csail.mit.edu/rbg/code/twitter

77  

Page 78: Regina Barzilay "Information Extraction from Social Media"

Conclusion  

• Social media presents unique challenges and opportunities for NLP technologies

• Linguistically-rich models can compensate for noise inherent in social media streams

• Joint modeling of rich linguistic relations boosts prediction accuracy

78