Text analysis in transparency - a talk at Sunlight Labs

Text analysis in transparency Jonathan Stray Sunlight Labs, May 2 2013

Upload: jonathan-stray

Post on 05-Dec-2014


DESCRIPTION

Video at: http://overview.ap.org/blog/2013/05/video-text-analysis-in-transparency/ How text analysis and natural language processing are being used in journalism, open government, and transparency generally. A survey of existing public projects and the algorithms behind them, then a demonstration of the Overview Project (overviewproject.org), a tool for automatically visualizing the topics in a large document set, designed for investigative journalists. Finally, a discussion of where data-driven transparency is going now, or: what should we work on next?

TRANSCRIPT

Page 1: Text analysis in transparency - a talk at Sunlight Labs

Text analysis in transparency

Jonathan Stray
Sunlight Labs, May 2 2013

Page 2: Text analysis in transparency - a talk at Sunlight Labs

 

Text Analysis for Transparency in the Wild
cool projects, and the tech behind them

An Overview of Overview
the thing I've been working on

What's Next for Data-Driven Transparency?
how does transparency work, anyway?

Page 3: Text analysis in transparency - a talk at Sunlight Labs

What people are doing now

Transparency applications of text analysis, in the wild:

• Document summarization
• Exploration of text collections
• Name standardization
• Plagiarism detection / text flow analysis
• Change surveillance / revision tracking
• Classification / automatic tagging

Page 4: Text analysis in transparency - a talk at Sunlight Labs

Text Analysis for Transparency in the Wild

Page 5: Text analysis in transparency - a talk at Sunlight Labs

Algorithms

• Full text search
• Bag-of-words / TF-IDF
• N-gram language models
• Document similarity functions (cosine distance)
• Fuzzy string matching (shingles, edit distance, ...)
• Text diff
• Clustering (k-means, hierarchical, ...)
• Locality sensitive hashing (MinHash, ...)
• Supervised classification (linear, SVM, ...)
• Topic modeling (LSA, LDA, NMF, ...)

Page 6: Text analysis in transparency - a talk at Sunlight Labs

State of the Union 2011 word cloud, Whitehouse.gov

Page 7: Text analysis in transparency - a talk at Sunlight Labs

State of the Union by decade, Henry Williams

Page 8: Text analysis in transparency - a talk at Sunlight Labs

State of the Union by Decade
Uses: bag of words, TF-IDF

Loads speeches from all years, applies TF-IDF. Sums document vectors by decade, then picks top 10 words.

Not really a principled approach, but seems to give reasonable results... better than word clouds?
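
The by-decade computation described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the whitespace tokenizer, the particular TF-IDF variant (relative term frequency times log inverse document frequency), and the function names are assumptions, not the original project's code.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()                      # number of documents containing each term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def top_words_by_decade(speeches, k=10):
    """speeches: list of (year, text). Sums TF-IDF vectors per decade, keeps top k terms."""
    docs = [text.lower().split() for _, text in speeches]   # naive tokenizer (assumption)
    totals = defaultdict(Counter)
    for (year, _), vec in zip(speeches, tfidf_vectors(docs)):
        decade = year // 10 * 10
        for term, weight in vec.items():
            totals[decade][term] += weight
    return {d: [t for t, _ in c.most_common(k)] for d, c in totals.items()}

speeches = [
    (1961, "space moon space nation"),
    (1965, "space war nation"),
    (1991, "internet economy nation"),
]
result = top_words_by_decade(speeches)
```

A word that appears in every speech ("nation" here) gets an IDF of zero and sinks to the bottom of each decade's list, which is the main thing this approach buys over a raw word cloud.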

Page 9: Text analysis in transparency - a talk at Sunlight Labs

First text summarization algorithm: H.P. Luhn, 1958

Page 10: Text analysis in transparency - a talk at Sunlight Labs

First text summarization algorithm: H.P. Luhn, 1958

Page 11: Text analysis in transparency - a talk at Sunlight Labs

Many Bills, IBM

Page 12: Text analysis in transparency - a talk at Sunlight Labs

Many Bills
Does: legislative text exploration
Using: machine classification via (best guess) bag-of-words, n-grams, TF-IDF

Classifies sections of a bill by topic, and displays them visually.

Allows comparison of multiple bills.

Intended application: finding obscure riders and "pork barrel" projects.

Page 13: Text analysis in transparency - a talk at Sunlight Labs

Churnalism, Sunlight Labs

Page 14: Text analysis in transparency - a talk at Sunlight Labs

Churnalism
Does: bill content explorer
Using: matching (best guess): bag-of-words, n-grams, locality sensitive hashing, fuzzy string matching

Given some text, find all documents which contain a substantial section of that text.

Allows for some difference between source and target.

Highlights diffs.
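
One simple way to ask "does this document contain a substantial section of that text?" is to compare word n-gram shingles. The sketch below is an assumption-laden illustration, not Churnalism's actual implementation (the slide itself only guesses at the methods): it uses 5-word shingles by default, a whitespace tokenizer, and a containment score, and it leaves out the locality sensitive hashing that would be needed to search a large corpus quickly.

```python
def shingles(text, n=5):
    """Set of overlapping n-word sequences in the text (naive whitespace tokenizer)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(query, document, n=5):
    """Fraction of the query's n-word shingles that also occur in the document."""
    q = shingles(query, n)
    if not q:
        return 0.0          # query shorter than n words: nothing to compare
    return len(q & shingles(document, n)) / len(q)

press_release = "the quick brown fox jumps over the lazy dog"
article = "Reporters wrote that the quick brown fox jumps over the lazy dog yesterday"
score = containment(press_release, article, n=3)
```

Containment is asymmetric on purpose: a short press release copied whole into a long article scores high even though the two texts differ greatly in length. Small edits to the copied passage lower the score gradually rather than dropping it to zero.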

Page 15: Text analysis in transparency - a talk at Sunlight Labs

MemeTracker, by Jure Leskovec, Lars Backstrom and Jon Kleinberg

Page 16: Text analysis in transparency - a talk at Sunlight Labs

MemeTracker
Does: web-scale text flow analysis on political quotes
Using: n-grams, fuzzy string matching via edit distance, phylogenetic tree concepts from bioinformatics

Given a quote, track its diffusion and mutation across news outlets and millions of blogs.

Shows attention curves, phrase variations. Allows comparison of different types of media.
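
Fuzzy matching via edit distance can be illustrated with the standard Levenshtein dynamic program, applied here to word lists so that one changed word counts as one edit. This is a generic sketch; how MemeTracker actually tokenizes, thresholds, and groups phrase variants is not stated on the slide, and the example quotes are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))      # distance from the empty prefix of a to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

quote = "you can put lipstick on a pig".split()
variant = "you can put lipstick on a pig but".split()
distance = edit_distance(quote, variant)
```

Two phrases whose word-level distance is small relative to their length are plausible variants of the same quote; the threshold is a tuning choice.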

Page 17: Text analysis in transparency - a talk at Sunlight Labs

Campaign finance donor name standardizer, Chase Davis

Page 18: Text analysis in transparency - a talk at Sunlight Labs

FEC-Standardizer
Does: name standardization
Using: supervised classification via random forests, locality-sensitive hashing on 2-shingles

Standardizes donor identities. That is, finds clusters of donors who are the same person, even with typos, incomplete data, other errors.

95-99% accurate, compared to Center for Responsive Politics reference data.
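
To make "locality-sensitive hashing on 2-shingles" concrete, here is a minimal MinHash sketch over character 2-shingles of a name. It is not the FEC-Standardizer code: the normalization, the use of MD5 as a seeded hash family, and the signature length of 64 are all assumptions, and the random forest classification step is omitted entirely.

```python
import hashlib

def char_shingles(name, n=2):
    """Set of overlapping n-character substrings, after lowercasing and collapsing spaces."""
    s = " ".join(name.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items.
    Assumes `items` is non-empty (the name has at least n characters)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(char_shingles("SMITH,  JOHN"))
b = minhash_signature(char_shingles("smith, john"))
c = minhash_signature(char_shingles("Smith, Jon"))
same = estimated_similarity(a, b)
close = estimated_similarity(a, c)
```

Signatures can be cut into bands and used as hash-table keys, so that only names sharing a band are compared in detail. That is what keeps the matching tractable on a large donor file.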

Page 19: Text analysis in transparency - a talk at Sunlight Labs

NewsDiffs, Eric Price, Jenny 8 Lee, Greg Price

Page 20: Text analysis in transparency - a talk at Sunlight Labs

NewsDiffs
Does: change detection
Using: text diff

Continuously scrapes nytimes.com, cnn.com, politico.com, bbc.co.uk, looking for changes in published stories.

Displays diffs in visual format.
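
The core of change detection is an ordinary text diff between two scraped versions of a story. A minimal sketch using Python's standard `difflib` module follows; the scraping and storage around it are left out, and the two story versions are invented for illustration.

```python
import difflib

def story_diff(old, new):
    """Unified diff between two versions of a story, compared line by line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="earlier", tofile="later", lineterm=""))

old_version = "Mayor denies claim.\nCouncil meets Tuesday."
new_version = "Mayor admits claim.\nCouncil meets Tuesday."
changes = story_diff(old_version, new_version)
for line in changes:
    print(line)
```

Lines starting with "-" were removed and lines starting with "+" were added. Splitting on sentences or words instead of lines gives finer-grained diffs for prose.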

Page 21: Text analysis in transparency - a talk at Sunlight Labs

Docket Wrench, Sunlight Labs

Page 22: Text analysis in transparency - a talk at Sunlight Labs

Docket Wrench
Does: topic analysis / plagiarism detection
Using: (best guess) bag-of-words, n-grams, locality-sensitive hashing, full text search

Analyzes comments on proposed Federal regulation and shows clusters which contain similar text.

Continuously pulls from many different agencies – over 100k dockets! Also visual display of docket activity, browsing, search.

Page 23: Text analysis in transparency - a talk at Sunlight Labs

The Battle for Bystanders: Information, Meaning Contests, and Collective Action in the Egyptian Revolution of 2011, Trey Causey

Page 24: Text analysis in transparency - a talk at Sunlight Labs

The Battle for Bystanders
An analysis of media during the Egyptian revolution of 2011
Using: bag-of-words, topic modeling

Topic modeling across a database of three online news outlets – both state and non-state media – to detect and count stories with various frames, e.g. "danger and instability"

Relies on interpretation of algorithmically generated "topics," which are really distributions over words. No ground-truth / comparison to human raters.

Page 25: Text analysis in transparency - a talk at Sunlight Labs

An Overview of Overview

Page 26: Text analysis in transparency - a talk at Sunlight Labs

The Overview Project

A general purpose document mining system.

Meant to answer the question, "what's in there?"

Better than search – find what you didn't know you're looking for.

Page 27: Text analysis in transparency - a talk at Sunlight Labs

Overview, Associated Press

Page 28: Text analysis in transparency - a talk at Sunlight Labs

Overview
Does: topic exploration
Using: bag-of-words, n-grams, TF-IDF, document similarity, k-means clustering, full text search

Uses the full text of each document to perform hierarchical clustering based on topic.

Visual exploration and tagging, and (soon) integrated full-text search.
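
Document similarity in this kind of pipeline usually means cosine similarity between sparse term vectors. The sketch below uses raw word counts to stay short; swapping in TF-IDF weights would not change the function. It is a generic illustration, not Overview's code, and the example texts are invented.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0   # an empty vector is similar to nothing

doc_a = Counter("budget police budget".split())
doc_b = Counter("police budget".split())
doc_c = Counter("school lunch".split())
similar = cosine_similarity(doc_a, doc_b)
unrelated = cosine_similarity(doc_a, doc_c)
```

Because the score depends only on direction, a long document and a short one about the same subject still come out close to each other.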

Page 29: Text analysis in transparency - a talk at Sunlight Labs
Page 30: Text analysis in transparency - a talk at Sunlight Labs

Topic Tree

Computer sorts documents into folders and sub-folders, based on topic analysis.

Page 31: Text analysis in transparency - a talk at Sunlight Labs

Duplicate/near duplicate detection

66 copies with different names

Page 32: Text analysis in transparency - a talk at Sunlight Labs

Automatic sorting + manual tagging

Deeper in the tree = narrower topic.

When all docs are on the "same" topic, tag it.

Page 33: Text analysis in transparency - a talk at Sunlight Labs

Extracted keywords for folders and docs

Page 34: Text analysis in transparency - a talk at Sunlight Labs

Generate document vectors, just like a search engine. Then cluster the space. Visualization of "types" of search result.
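
"Cluster the space" can be sketched as k-means over unit-length term vectors, assigning each document to the centroid with the highest dot product. This is a toy version under several assumptions: word counts stand in for TF-IDF weights, the first k documents seed the centroids so the result is deterministic, and there is no hierarchy. Overview's real clustering is more elaborate.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def kmeans(vectors, k, iterations=10):
    """Cluster sparse vectors; returns one cluster label per vector. Requires len(vectors) >= k."""
    vectors = [normalize(v) for v in vectors]
    centroids = vectors[:k]              # deterministic seeding (assumption, not best practice)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: most similar centroid wins.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)
            if total:
                centroids[c] = normalize(total)
    return labels

texts = ["police radio police car", "school lunch menu",
         "police car radio", "lunch menu school lunch"]
labels = kmeans([Counter(t.split()) for t in texts], k=2)
```

Running the same procedure again inside each cluster is one way to get the folders and sub-folders of a topic tree.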

Page 35: Text analysis in transparency - a talk at Sunlight Labs

Stories done with Overview

9000 pages FOIA'd from 200 Federal agencies. Data Journalism Awards 2013 finalist.

4500 pages of incident reports from US Dept of State, declassified after FOIA

7000 emails from Tulsa Police Department. Millions wasted on bad computers.

Page 36: Text analysis in transparency - a talk at Sunlight Labs

Lessons learned

• Import is the hardest part! Messy input formats, big uploads, many documents on paper...
• Usability is crucial. People will give up fast.
• #1 FAQ: "how is it sorting my documents?"
• #1 comment: "oh, you mean it's a search engine."
• How do we explain what we're doing to users?

WORKFLOW beats ALGORITHM every time

Page 37: Text analysis in transparency - a talk at Sunlight Labs

What's Next for Data-Driven Transparency

Page 38: Text analysis in transparency - a talk at Sunlight Labs

What should we do next?

Lots of stories we could do. Lots of tools we could build. Lots of data we could analyze.

Are we starting from the right place?

Page 39: Text analysis in transparency - a talk at Sunlight Labs

"Low Hanging Fruit"

Work on the untouched data sets that have obvious interest and potential, like campaign contributions.

Catalog available data. Push for opening more. Create interfaces to existing data sets.

This is a data-driven approach. Risk is "looking for your keys under the street light."

Page 40: Text analysis in transparency - a talk at Sunlight Labs
Page 41: Text analysis in transparency - a talk at Sunlight Labs

"Capacity Building"

Data analysis is hard! Let's make it easier.

Build better software. Reduce duplication of engineering efforts. Teach people to do data work, and improve training methods.

This is a tool- and technique-driven approach. Risk is building capacity that doesn't matter (no one uses it, or it has no impact).

Page 42: Text analysis in transparency - a talk at Sunlight Labs
Page 43: Text analysis in transparency - a talk at Sunlight Labs

"What happens if"

Look for the work that will have the greatest positive effect.

Impact is some combination of supply (we could do this) plus demand (people would want it) plus effectiveness (contributes to agency).

This is an impact-driven approach. Can be very hard to predict or measure.

Page 44: Text analysis in transparency - a talk at Sunlight Labs

How does transparency work?

• Deterrence. Powerful people don't do bad things because they know someone is watching.
• Attention. Focus spotlight on things that shouldn't be (even if they're "known").
• Understanding. Just what is going on there anyway? Secrets vs. mysteries.
• Influence mapping. Who is actually making the rules?

Page 45: Text analysis in transparency - a talk at Sunlight Labs

The anxiety of influence

This is why people care about campaign finance

This is why people care about text flow in lawmaking

This is why people care about political advertising

...

Page 46: Text analysis in transparency - a talk at Sunlight Labs
Page 47: Text analysis in transparency - a talk at Sunlight Labs

Detecting influence

But how does influence work?

Influence over what? (It's a vector, not a scalar)

Algorithmically detectable? Campaign finance data seeks to quantify it. Social network analysis makes claims. Straight up votes still count.

But... do we really understand influence? Are we confusing inputs and results?

Page 48: Text analysis in transparency - a talk at Sunlight Labs
Page 49: Text analysis in transparency - a talk at Sunlight Labs

Some analyses I'd like to see

Externalities of Finance. How do banks make money? What effects does this have on the rest of us? Are internal justifications like "increasing liquidity" good or bad for everyone else? Is the industry actually competitive or just an oligopoly? Connections to politics and other sources of power?

Large scale social network mapping. Start with data from LittleSis.org. Can we actually learn anything about influence from this? Try to develop comparative metrics, and typology of influence – break down by industry? Look at revolving doors, hiring and appointment, money flows, etc.

Page 50: Text analysis in transparency - a talk at Sunlight Labs

Transparency Grand Challenge

Illuminate for citizens how the decisions that affect them actually get made.

(Which requires figuring that out.)

Show them how to use their own influence.

Page 51: Text analysis in transparency - a talk at Sunlight Labs

Grand Challenge Questions

• Is government the right focus?
• What types of influence are there? (tribes, institutions, markets, networks)
• What is the limiting factor to detecting influence? Could be data access, missing tools, lack of public attention, system complexity, or ... ?
• Are we facing secrets (someone doesn't want us to know) or mysteries (it's complicated and no one knows)?
• Do we really know how data relates to influence?
• Who is affected by each type of influence?
• Who are we working for? Have we asked them what they want?