
Discerning Intelligence from Text (at the UofA)

Denilson Barbosa  [email protected]

Web search is changing

• … from IR-style document retrieval

• … to question-answering over entities extracted from the web

• The answer is the Chowan Braves, and it can be found in Lou Saban's Wikipedia page (ranked #1), obituary (ranked #4), and so on…

Which team did Lou Saban coach last? Which was Lou Saban's last team?

in good company…

An explicit answer

it is not all bad news…

Structured knowledge (harnessed from the Web)

Surface-level relation extraction

After his departure from Buffalo, Saban returned to coach college football teams including Miami, Army, and UCF.

(Pipeline figure: Documents → {Split Sentences, Recognize Entities, Resolve Coreferences, Find Relations} → Triple store.)

<"Lou Saban", departed from, "Buffalo Bills">
<"Lou Saban", coach, "Miami Hurricanes">
<"Lou Saban", coach, "Army Black Knights football">
<"Lou Saban", coach, "University of Central Florida">
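A minimal sketch of such a surface-level pipeline, assuming spaCy with its en_core_web_sm model is installed. The verb-between-mentions heuristic and the triple format are illustrative stand-ins, not the actual system, and the coreference-resolution step is omitted.

# Sketch of a surface-level extraction pipeline: split sentences, recognize
# entities, and emit <subject, predicate, object> triples into a simple store.
# Assumes spaCy + en_core_web_sm; the relation heuristic is illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")           # sentence splitting, NER, tagging

def extract_triples(text):
    triples = []
    doc = nlp(text)
    for sent in doc.sents:                   # split sentences
        ents = list(sent.ents)               # recognize entities
        for i in range(len(ents) - 1):
            e1, e2 = ents[i], ents[i + 1]
            # crude "relation": the verbs appearing between the two mentions
            verbs = [t.lemma_ for t in doc[e1.end:e2.start] if t.pos_ == "VERB"]
            if verbs:
                triples.append((e1.text, "_".join(verbs), e2.text))
    return triples

if __name__ == "__main__":
    sentence = ("After his departure from Buffalo, Saban returned to coach "
                "college football teams including Miami, Army, and UCF.")
    for t in extract_triples(sentence):
        print(t)   # surface triples such as (Saban, return_coach, Miami); model-dependent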

From triples to a KB…

• There is a very, very, very long way…
§ Predicate disambiguation into semantic "relations"…
§ Named entity disambiguation…
§ Assigning entities to classes…
§ Grouping classes into a hierarchy…
§ Ordering facts temporally…

• It would have been virtually impossible without Wikipedia

In this talk…

• Work on entity linking with random walks … [CIKM'2014]

• A bit of the work on open relation extraction, less on disambiguation
§ SONEX (clustering-based) [TWEB'2012]
§ EXEMPLAR (dependency-based) [EMNLP'2013]
§ With Tree Kernels [NAACL'2013]
§ EFFICIENS (cost-constrained)

• A bit of our work on understanding disputes in Wikipedia [Hypertext'2012] [ACM TIST'2015]

In this talk…

• Work on entity linking … [CIKM'2014]

Entity Linking

The entity graph

• We perform disambiguation against a graph whose nodes are ids of entities in the KB, each with its respective context (i.e., text!)

(Figure: the example sentence about Lou Saban, with its mentions — Buffalo, Saban, Miami, Army, UCF — linked to candidate entities: Nick Saban, Lou Saban, Buffalo Bills, Buffalo Bulls, UCF Knights football, Army Black Knights football, Miami Heat, Miami Hurricanes, Miami Dolphins, University of Central Florida, US Army.)

The Entity Graph ≠ the KB: the Entity Graph has text about the entities; the KB has facts and assertions.

The entity graph

• Typically built from Wikipedia

• Nodes are Wikipedia articles
§ All known names
§ Context: the whole article
§ Metadata: types, keyphrases, type compatibility…

• Edges: E1–E2 iff:
§ There is a wikilink from E1 to E2
§ There is an article E3 that mentions E1 and E2 close to each other

• Alias dictionary:
§ Mapping from names to ids
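A toy sketch of such an entity graph, assuming networkx; the node attributes, edge reasons, and alias entries below are illustrative data, not the real Wikipedia-derived graph.

# Sketch of an entity-graph construction in the spirit described above: nodes are
# KB entities carrying their article text as context, edges connect entities that
# are wikilinked or co-mentioned, plus an alias dictionary from names to ids.
import networkx as nx
from collections import defaultdict

G = nx.Graph()
G.add_node("Lou Saban", context="American football coach ...", types={"PERSON"})
G.add_node("Nick Saban", context="college football head coach ...", types={"PERSON"})
G.add_node("Buffalo Bills", context="NFL team based in Buffalo ...", types={"ORG"})
G.add_node("Miami Hurricanes", context="college football program ...", types={"ORG"})

# Edge rule 1: a wikilink from E1's article to E2's article
G.add_edge("Lou Saban", "Buffalo Bills", reason="wikilink")
# Edge rule 2: some third article mentions E1 and E2 close to each other
G.add_edge("Lou Saban", "Miami Hurricanes", reason="co-mention")

# Alias dictionary: surface names -> candidate entity ids
aliases = defaultdict(set)
for name, entity in [("Saban", "Lou Saban"), ("Saban", "Nick Saban"),
                     ("Buffalo", "Buffalo Bills"), ("Miami", "Miami Hurricanes")]:
    aliases[name].add(entity)

print(sorted(aliases["Saban"]))          # candidate ids for the mention "Saban"
print(list(G.neighbors("Lou Saban")))    # graph neighbours used for relatedness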

Entity linking – main steps

• Candidate selection: find a small set of good candidates for each mention → using the alias dictionary

• Mention disambiguation: assign each mention to one of its candidates


Candidate Selection

• On the KB: alias-dictionary expansion
§ Saban : {Nick Saban, Lou Saban, Saban Capital Group, …}

• On the document:
§ Lookups: alias dictionary / Wikipedia disambiguation pages
§ Co-reference resolution [Cucerzan'07]
§ Acronym expansion [Zhang et al.'10, Zhang et al.'11] (ABC → Australian Broadcasting Corporation)
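A rough sketch of candidate selection over a small, hypothetical alias dictionary. The "expand a short name to a longer in-document name" trick and the naive acronym expansion below are illustrative simplifications of the lookups listed above.

# Sketch of candidate selection for the mentions of one document.
ALIASES = {   # hypothetical alias dictionary: surface form -> candidate entities
    "Saban": {"Nick Saban", "Lou Saban", "Saban Capital Group"},
    "Lou Saban": {"Lou Saban"},
    "UCF": {"UCF Knights football", "University of Central Florida"},
    "University of Central Florida": {"University of Central Florida"},
}

def expand(mention, doc_mentions):
    # document-level expansion: prefer a longer mention that contains this one
    longer = [m for m in doc_mentions if mention != m and mention in m]
    if longer:
        return min(longer, key=len)
    # naive acronym expansion: match initials of multi-word names (e.g. UCF)
    for m in doc_mentions:
        words = m.split()
        initials = "".join(w[0] for w in words
                           if w.lower() not in {"of", "the", "for", "and"}).upper()
        if len(words) > 1 and initials == mention.upper():
            return m
    return mention

def candidates(mention, doc_mentions):
    key = expand(mention, doc_mentions)
    return ALIASES.get(key) or ALIASES.get(mention, set())

doc_mentions = ["Lou Saban", "Saban", "UCF", "University of Central Florida"]
print(candidates("Saban", doc_mentions))   # {'Lou Saban'}: narrowed by the fuller in-document name
print(candidates("UCF", doc_mentions))     # acronym expanded to the full university name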

Local mention disambiguation, e.g., [Cucerzan'2007]

• Disambiguate each mention in isolation

ent(m) = argmax_{e ∈ candidates(m)} [ α · prior(m, e) + β · sim(m, e) ]

• prior(m, e): e.g., freq(e|m), indegree(e), length(context(e))
• sim(m, e): e.g., cosine/Dice/KL(context(m), context(e))
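A minimal sketch of this local scoring rule, with toy priors and bag-of-words contexts; the weights alpha and beta and all data below are illustrative, not values from the cited work.

# Sketch of local mention disambiguation: each mention is scored in isolation as
# alpha * prior(m, e) + beta * cosine(context(m), context(e)).
from math import sqrt

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values())) or 1.0
    nv = sqrt(sum(x * x for x in v.values())) or 1.0
    return dot / (nu * nv)

def disambiguate_local(mention_ctx, cands, alpha=0.5, beta=0.5):
    # cands: entity -> (prior probability freq(e|m), bag-of-words context of e)
    def score(e):
        prior, ent_ctx = cands[e]
        return alpha * prior + beta * cosine(mention_ctx, ent_ctx)
    return max(cands, key=score)

mention_ctx = {"coach": 2, "football": 1, "buffalo": 1}
cands = {
    "Nick Saban": (0.55, {"alabama": 3, "coach": 2, "football": 2}),
    "Lou Saban":  (0.10, {"buffalo": 3, "bills": 2, "coach": 2, "football": 1}),
}
print(disambiguate_local(mention_ctx, cands))
# 'Nick Saban' wins here: a high prior can override the local context signal.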

Local mention disambiguation

• Problematic assumption: mentions are independent of each other
   Saban = Nick Saban ⇒ Miami = Miami Dolphins
   Saban = Lou Saban ⇒ Miami = Miami Hurricanes


Global mention disambiguation, e.g., [Hoffart'2011]

• Disambiguate all mentions at once

ent(m) = argmax_{e ∈ candidates(m)} [ α · prior(m, e) + β · sim(m, e) + γ · coherence(G_ent) ]

Global mention disambiguation

• Coherence captures the assumption that the input document has a single theme or topic
§ E.g., rock music, or the World Cup final match
§ NP-hard optimization in general


Global mention disambiguation

• [Hoffart et al. 2011]: a dense-subgraph problem

• Greedy algorithm: remove non-taboo entities until a minimal subgraph with the highest weight is found

• Entity–entity weights: overlap of anchor words, overlap of links, type similarity
• Mention–entity weights: sim(m, e), keyphraseness(m, e)
• Plus post-processing

Global mention disambiguation

• REL-RW: Robust entity linking with Random Walks [CIKM'2014]

• Global notion of entity–entity similarity

• Greedy algorithm: iteratively disambiguate mentions, starting with the easiest ones

Random Walks as context representation

• Random walks capture indirect relatedness between nodes in the graph

(Entity-graph figure: k candidate entities among n nodes in total.)

Random Walks as context representation

• One vector for each entity, and another for the whole document

• Similarity is measured using Zero-KL Divergence

(Figure: relatedness between entities, computed from an Entity Semantic Signature and a Document Semantic Signature.)
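A sketch of a Zero-KL-style divergence between two signatures (probability vectors over graph nodes). It assumes the common formulation in which entries where q is zero contribute a fixed penalty gamma instead of an infinite log-ratio; treat the exact form and parameters used in REL-RW as the ones defined in the CIKM'2014 paper.

# Sketch of a Zero-KL-style divergence and a derived similarity score.
from math import log

def zero_kl(p, q, gamma=20.0):
    # assumption: q_i == 0 contributes a fixed penalty gamma (per the usual Zero-KL idea)
    div = 0.0
    for node, p_i in p.items():
        if p_i <= 0.0:
            continue
        q_i = q.get(node, 0.0)
        div += p_i * (log(p_i / q_i) if q_i > 0.0 else gamma)
    return div

def similarity(p, q, gamma=20.0):
    # turn the divergence into a larger-is-better score (illustrative transform)
    return 1.0 / (1.0 + zero_kl(p, q, gamma))

sig_entity   = {"Lou Saban": 0.4, "Buffalo Bills": 0.3, "Miami Hurricanes": 0.3}
sig_document = {"Lou Saban": 0.5, "Buffalo Bills": 0.2, "UCF Knights football": 0.3}
print(similarity(sig_entity, sig_document))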

Semantic Signatures of Entities

• Run a random walk that restarts from the entity with probability α (e.g., 0.15), until convergence

• Repeat for the candidate entities of the mentions only

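A sketch of the semantic-signature computation as a random walk with restart (personalized PageRank) by power iteration; the toy adjacency structure is illustrative, and in the real system this is run once per candidate entity over the full entity graph.

# Random walk with restart from one source entity, run until (approximate) convergence.
def semantic_signature(graph, source, alpha=0.15, iters=100, tol=1e-10):
    nodes = list(graph)
    prob = {n: 0.0 for n in nodes}
    prob[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, p in prob.items():
            share = (1.0 - alpha) * p / len(graph[n])   # spread mass to neighbours
            for m in graph[n]:
                nxt[m] += share
        nxt[source] += alpha                            # restart mass returns to the source
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob

graph = {  # undirected toy entity graph as an adjacency dict
    "Lou Saban": ["Buffalo Bills", "Miami Hurricanes"],
    "Nick Saban": ["Miami Dolphins"],
    "Buffalo Bills": ["Lou Saban", "Miami Dolphins"],
    "Miami Hurricanes": ["Lou Saban"],
    "Miami Dolphins": ["Nick Saban", "Buffalo Bills"],
}
print(semantic_signature(graph, "Lou Saban"))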

Semantic Signatures of Documents

• (From [Milne & Witten 2008]): if there are unambiguous mentions, use only their entities to find the signature of the document

• Otherwise, use all candidate entities

(Figure: the running example again, with the unambiguous mention(s) highlighted.)

Algorithm

• Find candidate entities for each mention

• Compute prior(m, e) and context(m, e)

• Sort mentions by ambiguity (i.e., number of candidates)

• Go through each mention m in ascending order:
§ SS_d = semantic signature of the document
§ Assign to m the candidate e with the highest combined score prior(m, e) * context(m, e) + sim(SS_e, SS_d)
§ Update the set of entities for the document
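A compact sketch of this greedy loop. It assumes helper callables prior(m, e), context(m, e), a signature(...) builder, and a similarity sim(p, q) such as the Zero-KL-based one sketched earlier; the names and the fallback choices are illustrative, not the exact REL-RW implementation.

# Greedy, iterative mention disambiguation: easiest mentions first, re-scoring
# the remaining ones against the growing set of resolved document entities.
def link_mentions(mentions, candidates, prior, context, signature, sim):
    # mentions: list of surface mentions; candidates: mention -> set of entity ids
    order = sorted(mentions, key=lambda m: len(candidates[m]))   # ascending ambiguity
    resolved = {}                                                # mention -> entity
    for m in order:
        # document entities so far; initially fall back to all candidates
        # (the paper prefers unambiguous mentions when available)
        doc_entities = set(resolved.values()) or {
            e for mm in mentions for e in candidates[mm]}
        ss_doc = signature(doc_entities)
        best = max(
            candidates[m],
            key=lambda e: prior(m, e) * context(m, e) + sim(signature({e}), ss_doc))
        resolved[m] = best                                       # update the entity set, repeat
    return resolved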

Algorithm (worked example)

Mentions [ambiguity]: UCF [33], Saban [45], Buffalo [317], Miami [343], Army [402]

Candidates [PriorProb, CtxSim, SemSim]:
UCF: UCF Knights football [0.133, 0.18, 0.50]; University of Central Florida [0.167, 0.13, 0.52]; UCF Knights basketball [0.041, 0.13, 0.34]
Saban: Lou Saban [0.009, 0.28, 0.41]; Nick Saban [0.009, 0.15, 0.54]; Saban Capital Group [0.545, 0.13, 0.20]
Buffalo: Buffalo, New York [0.467, 0.07, 0.54]; Buffalo Bills [0.021, 0.09, 0.58]; Buffalo Bulls football [0.024, 0.11, 0.50]
Miami: Miami [0.632, 0.07, 0.61]; Miami Hurricanes football [0.029, 0.12, 0.58]; Miami Dolphins [0.011, 0.10, 0.56]
Army: Army [0.155, 0.04, 0.34]; Army Black Knights football [0.062, 0.09, 0.52]; Maryland Terrapins football [0.001, 0.07, 0.56]

Step 1: use all candidates for SS_d; UCF is resolved to UCF Knights football. E_d = {UCF Knights football}

Step 2: Saban is resolved to Lou Saban [0.009, 0.28, 0.51] (vs. Nick Saban [0.009, 0.15, 0.58], Saban Capital Group [0.545, 0.13, 0.18]). E_d = {UCF Knights football, Lou Saban}

Step 3: Buffalo is resolved to Buffalo Bills [0.021, 0.09, 0.95] (vs. Buffalo Bulls football [0.024, 0.11, 0.71], Buffalo, New York [0.467, 0.07, 0.42]). E_d = {UCF Knights football, Lou Saban, Buffalo Bills}

Step 4: Miami is resolved to Miami Hurricanes football [0.029, 0.12, 0.93] (vs. Miami Dolphins [0.011, 0.10, 0.98], Miami [0.632, 0.07, 0.48]). E_d = {UCF Knights football, Lou Saban, Buffalo Bills, Miami Hurricanes football}

Step 5: Army is resolved to Army Black Knights football [0.062, 0.09, 0.74] (vs. Maryland Terrapins football [0.001, 0.07, 0.83], Army [0.155, 0.04, 0.28]).

"Paul, John, Ringo, and George"

Mention ambiguities (number of candidates): 35, 379, 807, and 1699.

Candidates [PriorProb, CtxSim, SemSim]:
Paul: Paul the Apostle [0.354, 0.06, 0.42]; Paul McCartney [0.055, 0.06, 0.51]; Paul I of Russia [0.026, 0.04, 0.25]; Paul Field [0.001, 0.12, 0.25]
John: Gospel of John [0.154, 0.03, 0.33]; John, King of England [0.066, 0.03, 0.28]; John the Apostle [0.038, 0.04, 0.33]; John Lennon [0.007, 0.05, 0.53]
Ringo: Ringo (album) [0.297, 0.09, 0.30]; Ringo Starr [0.266, 0.08, 0.42]; Ringo Rama [0.010, 0.14, 0.27]; Johnny Ringo [0.010, 0.18, 0.20]
George: George Costanza [0.07, 0.06, 0.24]; George Harrison [0.011, 0.03, 0.47]; George Scott [0.002, 0.10, 0.21]; George Moore [0.001, 0.10, 0.20]

Step 1: Ringo is resolved to Ringo Starr [0.266, 0.08, 0.42]. E_d = {Ringo Starr}

Step 2: Paul is resolved to Paul McCartney [0.055, 0.06, 1.25] (vs. Paul the Apostle [0.354, 0.06, 0.24], Paul Field [0.001, 0.12, 0.22], Paul I of Russia [0.026, 0.04, 0.19]). E_d = {Ringo Starr, Paul McCartney}

Step 3: George is resolved to George Harrison [0.011, 0.03, 1.30] (vs. George Costanza [0.07, 0.06, 0.24], George Scott [0.002, 0.10, 0.20], George Moore [0.001, 0.10, 0.19]). E_d = {Ringo Starr, Paul McCartney, George Harrison}

Step 4: John is resolved to John Lennon [0.007, 0.05, 1.20] (vs. Gospel of John [0.154, 0.03, 0.22], John the Apostle [0.038, 0.04, 0.23], John, King of England [0.066, 0.03, 0.21]).

Entity Linking – Evaluation

• Public benchmarks

• Synthetic Wikipedia dataset
§ Generated based on the popularity of entities; e.g., dataset 0.3-0.4 means the accuracy of the prior probability is between 0.3 and 0.4
§ 8 datasets: 0.3-0.4, 0.4-0.5, …, 0.9-1.0, 1.0-1.1
§ 40 documents, each with 20-40 entities

• Evaluation metrics
§ Accuracy
§ Micro F1: average F1 per mention
§ Macro F1: average F1 per document

Dataset    # of mentions   # of articles
MSNBC      739             20
AQUAINT    727             50
ACE2004    306             57

Entity Linking – Evaluation

• Results on MSNBC, AQUAINT, ACE2004

• The prior probability is a strong baseline
• Benchmarks are biased towards popular entities
• Representativeness? (e.g., the long tail of mentions on the Web)

                 MSNBC                       AQUAINT                     ACE2004
System       Accuracy  F1@MI  F1@MA    Accuracy  F1@MI  F1@MA    Accuracy  F1@MI  F1@MA
PriorProb      85.98   86.50  87.15      84.87   87.27  87.16      84.82   85.49  87.13
Prior-Type     81.86   82.81  83.84      83.22   85.57  85.08      80.93   84.04  86.08
Local          77.43   77.91  72.30      66.44   68.32  68.09      61.48   61.96  56.95
Cucerzan       87.80   88.34  87.76      76.62   78.67  78.22      78.99   79.30  78.22
M&W            68.45   78.43  80.37      79.92   85.13  84.84      75.54   81.29  84.25
Han'11         87.65   88.46  87.93      77.16   79.46  78.80      72.76   73.48  66.80
AIDA           76.83   78.81  76.26      52.54   56.47  56.46      77.04   80.49  84.13
GLOW           65.55   75.37  77.33      75.65   83.14  82.97      75.49   81.91  83.18
RI             88.57   90.22  90.87      85.01   87.72  87.74      82.35   86.60  87.13
REL-RW         91.62   92.18  92.10      88.45   90.82  90.51      84.43   87.68  89.23

Entity Linking – Evaluation

• REL-RW, Han'11: graph-based measure
• Cucerzan, AIDA: lexical measure
• M&W, RI, GLOW: link-based measure

Entity Linking – Evaluation

• Different configurations
§ The iterative process performs best
§ Unambiguous mentions are more informative than using all candidates
§ Robust performance with different weighting schemes

Robust Entity Linking with Random Walks

• Intuition: less popular entities are more likely to be well linked (in the graph) than well described (i.e., to have a lot of text)

• Our semantic similarity has a natural interpretation and relies more on the graph than on the document content

• Mention disambiguation without a global notion of coherence

• Use a greedy iterative approach: disambiguate the "easiest" mention, re-compute everything, repeat

• Robust against bad parameter choices

• No learning! The previous state of the art [Hoffart'2011, Milne&Witten'2008] learns similarity weights

• Future work: improving the first step of the algorithm
§ Maybe exploring a few alternatives

OPEN RELATION EXTRACTION

• Shallow (SONEX): clustering-based [ACM TWEB'12]

• Not-so-shallow (dependency parsing):
§ Reified networks / nested relations [ICWSM'11]
§ Rule-based (EXEMPLAR) [EMNLP'13]
§ Tree kernels on dependency parses [NAACL'13]

• Improving ORE with text normalization [LREC'14]

• Cost-constrained ORE: EFFICIENS

ORE "one sentence at a time"

• Relation extraction as a sequence prediction task
§ TextRunner [Banko and Etzioni, 2008]: a small list of part-of-speech tag sequences that account for a large number of relations in a large corpus
§ ReVerb [Fader et al., 2011]: an even shorter list of patterns

• The ReVerb/TextRunner tools have extracted over one billion facts from the Web

• Efficient: no need to store the whole corpus

• Brittle: multiple synonyms of the same relation are extracted

Frequency   Pattern              Example
38%         E1 Verb E2           X established Y
23%         E1 NP Prep E2        X settlement with Y
16%         E1 Verb Prep E2      X moved to Y
9%          E1 Verb to Verb E2   X plans to acquire Y
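A toy sketch of this style of pattern matching over a POS-tagged sentence. Only the first and third patterns from the table (E1 Verb E2, E1 Verb Prep E2) are covered, and the tagging is hard-coded rather than produced by a tagger; it is not the TextRunner or ReVerb pattern set.

# Match "E Verb E" and "E Verb Prep E" over coarse tags; entities are tagged 'E'.
import re

def extract(tokens):
    """tokens: list of (word, coarse tag); one character per token in the tag string."""
    tags = "".join({"E": "E", "VERB": "V", "ADP": "P"}.get(t, "x") for _, t in tokens)
    triples = []
    for m in re.finditer(r"EVP?E", tags):
        i, j = m.start(), m.end()
        rel = " ".join(w for w, _ in tokens[i + 1:j - 1])
        triples.append((tokens[i][0], rel, tokens[j - 1][0]))
    return triples

tagged = [("X", "E"), ("moved", "VERB"), ("to", "ADP"), ("Y", "E")]
print(extract(tagged))   # [('X', 'moved to', 'Y')]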

(Figure: several sentence fragments connecting <PER>Greg Francis</PER> and <ORG>University of Alberta</ORG>: "new Golden Bears basketball head coach at", "from Canada Basketball to replace retiring legend Don Horwood at", "currently the head basketball coach at", "enters his 12th season at"; plus "<PER>Jim Larranaga</PER> is the head coach of <ORG>George Mason University</ORG>".)

ORE on "all sentences" at once

• [Hasegawa et al., 2004] use hierarchical agglomerative clustering of all triples (E1, C, E2) in the corpus
§ The context C derives from all sentences connecting the entities E1 and E2
§ The clustering is done on the context vectors (not the entities)

• All triples (and thus, entity pairs) in the same cluster belong to the same relation
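A sketch of that clustering step with scikit-learn and SciPy as stand-ins: one tf-idf context vector per entity pair, then average-link hierarchical clustering cut at a distance threshold. The pair contexts and threshold are toy values, not SONEX's.

# Hierarchical agglomerative clustering of per-pair context vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

pair_contexts = {
    ("Obama", "David Plouffe"): "campaign manager campaign chairman",
    ("McCain", "Rick Davis"): "campaign chairman campaign",
    ("Pakistan", "Pervez Musharraf"): "military dictator ruler president",
    ("Zimbabwe", "Robert Mugabe"): "military dictator strongman president",
}

pairs = list(pair_contexts)
X = TfidfVectorizer().fit_transform([pair_contexts[p] for p in pairs])

# average-link HAC over cosine distance; the cut threshold plays the role of
# the clustering threshold tuned on labelled pairs
Z = linkage(X.toarray(), method="average", metric="cosine")
labels = fcluster(Z, t=0.5, criterion="distance")

for pair, c in zip(pairs, labels):
    print(c, pair)    # pairs in the same cluster are taken to share one relation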

SONEX

• Offline (HAC clustering) [ACM TWEB 2012]

• Online (Buckshot): cluster a sample and classify the remaining sentences, one at a time
§ No discernible loss in accuracy, but much higher scalability
§ Also allows the same entity pair to belong to multiple relations

SONEX: clustering features

• Clustering features derived from the context words between entities
§ Unigrams: stemmed words, excluding stop words [Allan, 1998]
§ Bigrams: sequences of two (unigram) words (e.g., Vice President)
§ Part-of-speech patterns: a small number of relation-independent linguistic patterns from TextRunner [Banko and Etzioni, 2008]

• Weights: term frequency (tf), inverse document frequency (idf), and domain frequency (df) [SIGIR'10]

df_i(t) = f_i(t) / Σ_{1 ≤ j ≤ n} f_j(t)
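A small sketch of the domain-frequency weight defined by the formula above: how concentrated a context term t is in one domain (entity-type pair) relative to all n domains. The counts below are toy numbers.

# Domain frequency: f_i(t) over the sum of f_j(t) across all domains.
from collections import Counter

freq = {  # toy counts f_i(t): domain (entity-type pair) -> context-term counts
    ("LOC", "PER"): Counter({"coach": 40, "president": 55}),
    ("MISC", "PER"): Counter({"coach": 35, "album": 20}),
    ("ORG", "PER"): Counter({"coach": 5, "ceo": 30}),
}

def domain_frequency(term, domain):
    total = sum(c[term] for c in freq.values())
    return freq[domain][term] / total if total else 0.0

print(domain_frequency("coach", ("LOC", "PER")))   # 40 / 80 = 0.5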

SONEX: clustering threshold

• Used ~400 unique entity pairs from Freebase, manually annotated with true relations

• Picked the clustering threshold and features with the best fit

SONEX: importance of domain

• DF works really well, except when MISC types are involved

• Example: coach
§ LOC–PER domain: (England, Fabio Capello); (Croatia, Slaven Bilic)
§ MISC–PER domain: (Titans, Jeff Fisher); (Jets, Eric Mangini)

• Overall, df alone improved the F-measure by 12%

SONEX: from clusters to relations

• Clusters are sets of entity pairs with similar contexts

• We find relation names by looking for prominent terms in the context vectors
§ Most frequent feature
§ Centroid of the cluster

Relation              Cluster
Campaign Chairman     McCain : Rick Davis; Obama : David Plouffe
Strongman President   Zimbabwean : Robert Mugabe; Venezuelan : Hugo Chavez
Chief Architect       Kia Behnia : BMC Software; Brendan Eich : Mozilla
Military Dictator     Pakistan : Pervez Musharraf; Zimbabwe : Robert Mugabe
Coach                 Tennessee : Rick Neuheisel; Syracuse : Jim Boeheim

SONEX: from clusters to relations

• Evaluate relation labels by computing the agreement between the Freebase term and the chosen label
§ Scale: 1 (no agreement) to 5 (full agreement)

SONEX vs. ReVerb: qualitative comparison

• Local methods like ReVerb do not understand synonyms, and return all variants of the same relation for the same pair

• Global methods like SONEX produce a single relation for each entity pair, derived from all sentences connecting the pair

• Trade-off: local methods are best when there are multiple relations, whereas global methods aim at the most representative relation
§ The biggest problem is that we can't tell which is the case just by looking at the corpus

Barack Obama married Michelle Obama / Barack Obama's spouse is Michelle Obama
<"Barack Obama", married, "Michelle Obama">
<"Barack Obama", spouseOf, "Michelle Obama">

SONEX vs. ReVerb: clustering analysis

• Purity: homogeneity of clusters
§ Fraction of instances that belong together

• Inverse purity: specificity of clusters
§ Maximal intersection with the relations
§ Also known as the overall F-score [Larsen and Aone, 1999]

System   Purity   Inv. Purity
ReVerb   0.97     0.22
SONEX    0.96     0.77
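A sketch of these two measures for a clustering of entity pairs against gold relation labels; the assignments below are toy data. Note how many tiny clusters give perfect purity but poor inverse purity, which is the pattern in the table above.

# Purity and inverse purity of a clustering against gold labels.
from collections import Counter

def purity(clusters, gold):
    n = sum(len(members) for members in clusters.values())
    hit = sum(max(Counter(gold[x] for x in members).values())
              for members in clusters.values())
    return hit / n

def inverse_purity(clusters, gold):
    by_class = {}
    for cid, members in clusters.items():
        for x in members:
            by_class.setdefault(gold[x], []).append(cid)
    n = sum(len(cids) for cids in by_class.values())
    hit = sum(max(Counter(cids).values()) for cids in by_class.values())
    return hit / n

gold = {"p1": "coach", "p2": "coach", "p3": "spouse", "p4": "spouse"}
clusters = {"c1": ["p1", "p2"], "c2": ["p3"], "c3": ["p4"]}
print(purity(clusters, gold), inverse_purity(clusters, gold))   # 1.0 0.75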

Deep vs. shallow NLP in ORE

• Adding NLP machinery increases cost, but brings better results

• What is the right trade-off?

EXEMPLAR [EMNLP'13]

• Rule-based method to identify the precise connection between the arguments and the predicate via dependency parsing

(Figure: dependency parse of "NFL approves Falcons' new stadium in Atlanta", with edges nsubj, amod, poss, dobj, and prep_in.)

<NFL, approves_stadium_in, Atlanta>
<NFL, approves_stadium_of, Falcons>
<NFL, approves_stadium, of: Falcons, in: Atlanta>

Open source! https://github.com/U-Alberta/exemplar/

EXEMPLAR

• Standard NLP pipeline to break documents into sentences and find named entities (or just noun phrases)

• Find triggers (nouns or verbs not tagged as part of an entity mention)

• Find candidate arguments (dependencies of triggers)

EXEMPLAR

• Apply rules to assign roles to the candidate arguments
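A rough sketch of dependency-based trigger and role assignment in this spirit, using spaCy as a stand-in parser. The two rules below (nsubj/dobj dependents and prep → pobj chains) are simplified placeholders for the published rule set; see the EXEMPLAR repository for the real rules.

# Trigger detection + rule-based role assignment over a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract(sentence):
    doc = nlp(sentence)
    ent_tokens = {t.i for e in doc.ents for t in e}
    results = []
    for tok in doc:
        if tok.pos_ in ("VERB", "NOUN") and tok.i not in ent_tokens:   # trigger
            subj, objs = None, []
            for child in tok.children:                                 # candidate arguments
                if child.dep_ in ("nsubj", "nsubjpass") and child.i in ent_tokens:
                    subj = child
                elif child.dep_ in ("dobj", "obj") and child.i in ent_tokens:
                    objs.append((None, child))
                elif child.dep_ == "prep":                             # prep -> pobj chain
                    objs += [(child.text, gc) for gc in child.children
                             if gc.dep_ == "pobj" and gc.i in ent_tokens]
            for prep, obj in objs:
                rel = tok.lemma_ + ("_" + prep if prep else "")
                results.append((subj.text if subj else None, rel, obj.text))
    return results

print(extract("NFL approves Falcons' new stadium in Atlanta."))   # parse-dependent output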

EXEMPLAR – Evaluation

• Binary mode

• Each sentence is annotated with:
§ Two named entities
§ A trigger
§ A window of allowed tokens

• An extraction is deemed correct if it contains the trigger and allowed tokens only

I've got a media call about [ORG Google] ->'s {acquisition} of<- [ORG YouTube] ->today<-

EXEMPLAR – Evaluation

(Plot: F-measure vs. seconds per sentence, log scale, comparing EXEMPLAR[S], EXEMPLAR[M], ReVerb, SONEX, OLLIE, LUND, SWIRL, and PATTY.)

EXEMPLAR – Further evaluations

Text Normalization

• ORE systems, even with dependency parsing, make trivial mistakes as soon as sentences become slightly more involved

• Problem: an entity (Mrs. Clinton) is too far from the trigger (received), so EXEMPLAR (or ReVerb, SONEX) will miss the extraction

• Solutions:
§ Modify the relation extraction system → BAD: fixing one issue often breaks other kinds of extraction
§ Modify the input text → GOOD

Mrs. Clinton, who won the Senate race in New York in 2000, received $91,000 from the Kushner family and partnerships, while Mr. Gore received $66,000 for his presidential campaign that year.

Text Normalization [LREC'14]

• Focused on relative clauses and participle phrases

• Method:
§ Apply chunking [Jurafsky and Martin '08] to break up sentences
§ Classify all pairs of chunks as:
  • Connected: part of the same clause, and should be merged (usually to correct mistakes in chunking)
  • Dependent: in different clauses, but one depends on the other
  • Disconnected: clauses that are "left alone"
§ Apply the original ORE system

Mrs. Clinton | won the Senate race in New York in 2000
Mrs. Clinton + received $91,000 from the Kushner family and partnerships
Mr. Gore received $66,000 for his presidential campaign that year

Text Normalization

• The ideal case is when we have a dependency parse of the sentence
§ But this adds cost

• Used a Naïve Bayes classifier that looks at the parts of speech of the two words at each "end" of the chunks, and decides the relationship between them
§ Training data: 37,015 parse trees from the Wall Street Journal section of OntoNotes
§ Accuracy (10-fold cross-validation):
  • 77% overall
  • 85% for disconnected
  • 75% for connected
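A sketch of such a chunk-pair classifier with scikit-learn: a Naive Bayes model over the POS tags of the two words at each end of a pair of adjacent chunks. The feature names, labels, and tiny training set are illustrative only; the real model was trained on the WSJ/OntoNotes parse trees mentioned above.

# Naive Bayes over boundary POS tags of a chunk pair: connected / dependent / disconnected.
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def features(left_chunk_pos, right_chunk_pos):
    # POS of the last two tokens of the left chunk and first two of the right one
    return {"l-2": left_chunk_pos[-2], "l-1": left_chunk_pos[-1],
            "r+1": right_chunk_pos[0], "r+2": right_chunk_pos[1]}

train = [
    (features(["NNP", "NNP"], ["WP", "VBD"]), "dependent"),      # "... Clinton | who won ..."
    (features(["NN", "IN"], ["CD", ","]), "connected"),
    (features([",", "NNP"], ["VBD", "NN"]), "disconnected"),
    (features(["NNP", ","], ["VBD", "JJ"]), "dependent"),
]
X, y = zip(*train)

clf = make_pipeline(DictVectorizer(), MultinomialNB())
clf.fit(list(X), list(y))
print(clf.predict([features(["NNP", "NNP"], ["WP", "VBD"])])[0])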

Text Normalization Improves ORE

(Plot: F-measure vs. seconds per sentence, log scale; ReVerb+NB-SR, ReVerb+DEP-SR, EXEMPLAR[S]+NB-SR, and EXEMPLAR[S]+DEP-SR added alongside EXEMPLAR[S], EXEMPLAR[M], ReVerb, SONEX, OLLIE, LUND, SWIRL, and PATTY.)

EFFICIENS

• Towards budget-conscious ORE
§ There are many sentences where shallow methods are just fine, so there is no need to fire up more expensive machinery

• Goal: apply each method based on need:
§ First apply ORE based on POS tagging
§ Predict whether more is needed; if so, apply ORE based on dependencies
§ If still more is needed, apply ORE using semantic role labeling
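A sketch of a cost-aware cascade in this spirit: run the cheap POS-based extractor first and escalate to dependency-based and then SRL-based extraction only when a predictor says the sentence needs it. The extractors and "needs more" predictors are placeholders, not EFFICIENS itself.

# Budget-conscious extraction cascade with pluggable extractors and predictors.
def cascade(sentence, pos_ore, dep_ore, srl_ore, needs_dep, needs_srl):
    triples = pos_ore(sentence)            # cheapest pass: POS patterns
    if needs_dep(sentence, triples):       # e.g. long sentence, or nothing extracted yet
        triples = dep_ore(sentence)        # dependency-parse-based pass
        if needs_srl(sentence, triples):
            triples = srl_ore(sentence)    # most expensive: semantic role labeling
    return triples

# toy wiring: escalate whenever the previous stage found nothing
result = cascade(
    "After his departure from Buffalo, Saban returned to coach Miami.",
    pos_ore=lambda s: [],
    dep_ore=lambda s: [("Saban", "coach", "Miami")],
    srl_ore=lambda s: [("Saban", "coach", "Miami")],
    needs_dep=lambda s, t: not t or len(s.split()) > 25,
    needs_srl=lambda s, t: not t,
)
print(result)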

CONFLICTS IN WIKIPEDIA

Controversy

• Controversy leads to confusion for the reader and propagates misinformation, bias, prejudice, …

• Example: the article about abortion was the stage for a discussion about breast cancer
§ 4K words on the initial revision of the article alone!

controversy |ˈkäntrəˌvərsē| noun (pl. controversies): disagreement, typically when prolonged, public, and heated

Quality control delegated to the crowd

• If a topic is important to a large enough group of editors, the collaborative editing process will (eventually) lead to a high-quality article

• Editors can tag whole articles or sections as controversial
§ Controversial tags are tags such as {controversial}, {dispute}, {disputed-section}
§ Less than 1% of the articles are tagged as controversial

• Readers can only determine whether an article is controversial based on the tags, inspection of the talk page, or the edit history

• Our goal: automate the quality control process

Wikipedia takes care of the issue

• Sections of controversial articles spawn new articles
• Example: Holiest sites in Islam

Controversy in Wikipedia is different

• Wikipedia's neutral point of view fools sentiment analysis tools

The edit history

• The edit history contains the log of all actions performed in the making of the article:
§ The time-stamp of each commit
§ The action performed in each commit
§ An optional comment explaining the intent of each commit

Why is this a hard problem?

• Controversy has to do with the content of the article, but also with the social process leading to the article

• Statistics don't help
§ Controversial articles tend to be a little longer and have fewer editors, but the overlap between the distributions is too high!
§ Simple scores or classifiers are not likely to work

• NLP is hard
§ Revisions with millions of individual NPOV sentences
§ The talk page is not linked to specific revisions
§ Textual entailment?

Finding hot spots

• Editors and edits mutually assign trust and credibility [Adler et al. 2008]
§ Longevity of an editor's contributions increases that editor's credibility
§ Credibility predicts the likelihood that an edit is accepted

Bipolarity

• Controversial articles tend to look more like bipartite graphs than non-controversial ones [Brandes et al. 2009]

• However… [WWW2011]

Discussions in Wikipedia: exploiting the edit history

• [Kittur et al., 2007] train a classifier to predict the number of controversial tags in an article
§ Features are based on metrics such as the number of authors, the number of versions, the number of anonymous edits, etc.

• [Vuong et al., 2008] build a model to assign a controversy score to articles, assuming a mutually reinforcing relationship between the controversy scores of articles and of their editors

• [Druck et al., 2008] extract features from editor collaboration to establish a notion of editor trust

• Extracting affinity/polarity networks: vertices are editors, edges indicate agreement/disagreement among them
§ [Maniu et al., 2011] [Leskovec et al., 2010]

• Bipolarity [Brandes et al. 2009]: computing how closely a network approximates a bipartite graph

Discussions in Wikipedia [Hypertext'12] [TIST'15]

• Main idea: model the social aspects of the editing of the articles

Finding the attitudes of editors

• Look at all interactions between a pair of editors, and predict whether one would vote for the other
§ Use Wikipedia admin election data for training and testing
§ Consider all articles they worked on together, within 34 revisions of each other

• 87% accuracy
§ Tested on Wikipedia admin election data (~100K votes)
§ High bias towards positive votes
§ Negative impressions persist for a long time

Controversy in Wikipedia

• Once we know how the editors would vote for each other, we look for signs of "fights" within the article's history

• We classify each collaboration network

• Editing stats + light NLP features
§ Various stats capturing the editing behaviour
§ Use of sentiment words (adapted to the domain)

• Social networking features
§ Several stats about the shape of the graph
§ Number and kind of triads
  • "A friend of my friend is my friend"
  • "An enemy of my friend is my enemy"
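A small sketch of triad-based features over a signed editor network: count closed triads by the number of negative edges they contain (0 or 2 negatives are the balanced patterns, 1 or 3 the unbalanced ones). The signed graph below is toy data, not real editor attitudes.

# Census of signed triads in an editor collaboration network.
from itertools import combinations
from collections import Counter

edges = {  # editor pairs -> predicted attitude (+1 would vote for, -1 against)
    ("ann", "bob"): +1, ("bob", "cat"): +1, ("ann", "cat"): +1,
    ("ann", "dan"): -1, ("bob", "dan"): -1, ("cat", "dan"): +1,
}
sign = {frozenset(pair): s for pair, s in edges.items()}
editors = {e for pair in edges for e in pair}

def triad_census(editors, sign):
    census = Counter()
    for trio in combinations(sorted(editors), 3):
        signs = [sign.get(frozenset(p)) for p in combinations(trio, 2)]
        if None in signs:
            continue                        # not a closed triad
        census[signs.count(-1)] += 1        # key: number of negative edges in the triad
    return census

print(triad_census(editors, sign))
# two balanced triads (0 or 2 negative edges) and two unbalanced ones (1 negative edge)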

Controversy in Wikipedia

• Structure classifier on the collaboration networks
§ Logistic regression / WEKA

• Ground truth: 480 articles
§ 240 tagged by editors as controversial
§ 240 from other categories, never tagged as controversial in their history

Controversy in Wikipedia: feature ablation

(Figure: feature ablation results; highlighted features include the "an enemy of my enemy is my friend" triads and the number of triads.)

Controversy in Wikipedia: robustness

Finer-grained notions of controversy

• Often, the source of controversy in an article is a single section or even a single sentence
§ In the article about abortion, the section connecting it to breast cancer is the main dispute

• Finding the text units that contribute the most to the controversy
§ A typical NP-hard optimization problem
§ Good news: if the controversy function is sub-modular and monotonic, there is a nice approximation algorithm
§ Bad news: the best classifiers can't be made into monotonic, sub-modular functions

Controversy in Wikipedia – next

• Summary:
§ Detecting controversy in Wikipedia is challenging
§ Statistical features alone are not enough
§ Social networking theories help a lot in finding disagreement

• Use more NLP machinery
§ Link the discussion in the article history to the discussion/talk page

• Better understand Wikipedia dynamics

• Impact on ORE tools
§ How hard is it to find contradictory information in subsequent versions?

Thank you

• Robust entity linking with Random Walks [CIKM'2014]

• Relation extraction [TWEB'2012], EXEMPLAR (get the code!)

• Understanding conflict in Wikipedia

Zhaochen Guo, PhD (2015)
Filipe Mesquita, PhD (2014)
Yuval Merhav, PhD (2012), Illinois Institute of Technology
Jordan Schmidek, MSc (2014)
Hoda Sepehri-Rad, PhD (2014)
Aibek Makhazanov, MSc (2013)