Text Analysis in Transparency - a talk at Sunlight Labs

DESCRIPTION

Video at: http://overview.ap.org/blog/2013/05/video-text-analysis-in-transparency/

How text analysis and natural language processing are being used in journalism, open government, and transparency generally. A survey of existing public projects, and the algorithms behind them. Then a demonstration of the Overview Project (overviewproject.org), a tool for automatically visualizing the topics in a large document set, designed for investigative journalists. Then, a discussion of where data-driven transparency is going now -- or, what should we work on next?

TRANSCRIPT
Text analysis in transparency
Jonathan Stray Sunlight Labs, May 2 2013
Text Analysis for Transparency in the Wild
cool projects, and the tech behind them

An Overview of Overview
the thing I've been working on

What's Next for Data-Driven Transparency?
how does transparency work, anyway?
What people are doing now
Transparency applications of text analysis, in the wild:
• Document summarization
• Exploration of text collections
• Name standardization
• Plagiarism detection / text flow analysis
• Change surveillance / revision tracking
• Classification / automatic tagging
Text Analysis for Transparency in the Wild
Algorithms
• Full text search
• Bag-of-words / TF-IDF
• N-gram language models
• Document similarity functions (cosine distance)
• Fuzzy string matching (shingles, edit distance, ...)
• Text diff
• Clustering (k-means, hierarchical, ...)
• Locality Sensitive Hashing (MinHash, ...)
• Supervised classification (linear, SVM, ...)
• Topic modeling (LSA, LDA, NMF, ...)
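To make one item on this list concrete, here is a minimal sketch of cosine distance over bag-of-words count vectors. The documents are invented toy phrases; a real system would tokenize and weight (e.g. with TF-IDF) more carefully.

```python
import math
from collections import Counter

def cosine_distance(doc_a, doc_b):
    """Cosine distance between two bag-of-words count vectors.
    0.0 = same direction (same word mix), 1.0 = no shared terms."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)

d_near = cosine_distance("the clean water act", "the clean air act")
d_far = cosine_distance("the clean water act", "annual budget report")
print(d_near, d_far)  # the related bills are much closer than the unrelated pair
```

Because the vectors are length-normalized, cosine distance ignores how long each document is, which is why it is the standard similarity measure inside search engines.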
State of the Union 2011 word cloud, Whitehouse.gov
State of the Union by decade, Henry Williams
State of the Union by Decade
Uses: bag of words, TF-IDF
Loads speeches from all years, applies TF-IDF. Sums document vectors by decade, then picks top 10 words.
Not really a principled approach, but seems to give reasonable results... better than word clouds?
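The slide's pipeline (TF-IDF per speech, sum vectors within each decade, take the top 10 words) can be sketched as follows. The `speeches` dict is a tiny invented stand-in; the real project loads the full State of the Union texts, and its exact weighting isn't specified.

```python
import math
from collections import Counter, defaultdict

# Hypothetical input: {year: tokens of that year's speech}
speeches = {
    1981: "freedom economy taxes freedom".split(),
    1985: "freedom trade economy".split(),
    1991: "war gulf troops war".split(),
    1995: "internet jobs war".split(),
}

n = len(speeches)
# document frequency: how many speeches contain each term
df = Counter(t for toks in speeches.values() for t in set(toks))

# TF-IDF vector per speech, summed into one vector per decade
decade_vec = defaultdict(Counter)
for year, toks in speeches.items():
    tf = Counter(toks)
    decade = (year // 10) * 10
    for t, c in tf.items():
        decade_vec[decade][t] += c * math.log(n / df[t])

# top 10 words per decade
for decade, vec in sorted(decade_vec.items()):
    print(decade, [w for w, _ in vec.most_common(10)])
```

Summing TF-IDF vectors isn't statistically principled, as the slide admits, but the IDF term usefully suppresses words that appear in every speech.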
First text summarization algorithm: H.P. Luhn, 1958
Many Bills, IBM
Many Bills
Does: legislative text exploration
Using (best guess): machine classification via bag-of-words, n-grams, TF-IDF
Classifies sections of bills by topic, and displays them visually. Allows comparison of multiple bills.
Intended application: surfacing obscure riders and "pork barrel" projects
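Many Bills' actual classifier isn't public (the "best guess" above), so here is a generic example in the same spirit: a tiny multinomial naive Bayes that tags bill sections by topic from bag-of-words features. The training sentences and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial naive Bayes over bag-of-words features."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs)  # unnormalized log prior
            for w in text.lower().split():
                # add-one (Laplace) smoothing handles unseen words
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayes().fit(
    ["appropriation of funds for highway construction",
     "funds appropriated for bridge and road repair",
     "coverage of medical services and hospital care",
     "health insurance coverage for medical treatment"],
    ["transport", "transport", "health", "health"])
print(clf.predict("funds for road construction"))  # → transport
print(clf.predict("hospital insurance coverage"))  # → health
```

With section-level predictions like these, a bill can be painted as a colored strip of topics, which is what makes off-topic riders visually stand out.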
Churnalism, Sunlight Labs
Churnalism
Does: press release / text reuse detection
Using (best guess): bag-of-words, n-grams, locality sensitive hashing, fuzzy string matching
Given some text, finds all documents which contain a substantial section of that text. Allows for some difference between source and target. Highlights diffs.
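One simple way to implement "find documents containing a substantial section of this text" is overlap of word shingles (overlapping n-grams). This is a sketch of the idea with invented text, not Churnalism's actual code, which also uses hashing so the comparison scales to a whole news corpus.

```python
def shingles(text, n=5):
    """Set of overlapping word n-grams ('shingles') for a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reuse_score(source, target, n=5):
    """Fraction of the source's shingles that also appear in the target.
    A high score suggests the target copied a substantial chunk of the source."""
    s = shingles(source, n)
    if not s:
        return 0.0
    return len(s & shingles(target, n)) / len(s)

press_release = "the company announced record profits for the third quarter of this year"
article = ("in business news the company announced record profits for the third "
           "quarter of this year analysts were surprised")
unrelated = "local team wins championship game in overtime thriller last night"

print(reuse_score(press_release, article))    # high: large copied section
print(reuse_score(press_release, unrelated))  # 0.0
```

Shingles tolerate added or removed surrounding text but break on edits inside a copied passage, which is why fuzzy matching is layered on top in practice.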
MemeTracker, by Jure Leskovec, Lars Backstrom and Jon Kleinberg
MemeTracker
Does: web-scale text flow analysis on political quotes
Using: n-grams, fuzzy string matching via edit distance, phylogenetic tree concepts from bioinformatics
Given a quote, tracks its diffusion and mutation across news outlets and millions of blogs. Shows attention curves and phrase variations. Allows comparison of different types of media.
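Grouping mutated variants of a quote relies on fuzzy matching, and word-level Levenshtein (edit) distance is the classic building block. A minimal sketch with invented quotes follows; MemeTracker's full phrase-clustering is considerably more involved.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance: minimum number of word
    insertions, deletions, and substitutions to turn a into b."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

q1 = "lipstick on a pig"
q2 = "you can put lipstick on a pig"
q3 = "our entire economy is in danger"
print(edit_distance(q1, q2))  # 3: same phrase with three words added
print(edit_distance(q1, q3))  # 6: unrelated quotes
```

A small distance relative to quote length marks two strings as variants of the same "meme," which is what lets mutations be arranged into a phylogenetic-style tree.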
Campaign finance donor name standardizer, Chase Davis
FEC-Standardizer
Does: name standardization
Using: supervised classification via random forests, locality-sensitive hashing on 2-shingles
Standardizes donor identities. That is, finds clusters of donors who are the same person, even with typos, incomplete data, and other errors. 95-99% accurate, compared to Center for Responsive Politics reference data.
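A sketch of the locality-sensitive hashing ingredient: MinHash signatures over character 2-shingles, so near-identical donor names get similar signatures and can be bucketed cheaply. The names are invented, and the real project pairs this with a random-forest classifier for the final same-person decision.

```python
import hashlib

def char_shingles(name, n=2):
    """Set of character 2-shingles ('bigrams') of a normalized name."""
    s = name.lower().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """MinHash: for each of num_hashes seeded hash functions, keep the
    minimum hash over the set. Similar sets yield similar signatures."""
    return [min(int(hashlib.md5(f"{seed}:{sh}".encode()).hexdigest(), 16)
                for sh in shingle_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(char_shingles("John Q. Smith"))
b = minhash_signature(char_shingles("Jon Q Smith"))  # typo variant
c = minhash_signature(char_shingles("Mary Jones"))
print(estimated_jaccard(a, b))  # high: likely the same donor
print(estimated_jaccard(a, c))  # low: different person
```

The payoff is that candidate matches can be found by comparing short fixed-length signatures (or banding them into hash buckets) instead of comparing every pair of millions of names directly.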
NewsDiffs, Eric Price, Jenny 8 Lee, Greg Price
NewsDiffs
Does: change detection
Using: text diff
Continuously scrapes nytimes.com, cnn.com, politico.com, and bbc.co.uk, looking for changes in published stories. Displays diffs in a visual format.
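The core "text diff" step needs nothing beyond Python's standard library; here is a sketch on two invented versions of a story (NewsDiffs itself adds the scraping, storage, and visual presentation around this).

```python
import difflib

before = ("Officials said the policy would take effect next month. "
          "Critics called the plan premature.").split()
after = ("Officials said the policy would take effect next year. "
         "Supporters praised the plan as overdue.").split()

# word-level diff between two scraped versions of the same article
for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, before, after).get_opcodes():
    if op != "equal":
        print(f"{op}: {' '.join(before[i1:i2])!r} -> {' '.join(after[j1:j2])!r}")
```

Diffing on words rather than characters keeps the output readable, which matters when the goal is showing readers exactly which claims were quietly revised.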
Docket Wrench, Sunlight Labs
Docket Wrench
Does: topic analysis / plagiarism detection
Using (best guess): bag-of-words, n-grams, locality-sensitive hashing, full text search
Analyzes comments on proposed Federal regulations and shows clusters which contain similar text. Continuously pulls from many different agencies -- over 100k dockets! Also provides visual display of docket activity, browsing, and search.
The Battle for Bystanders: Information, Meaning Contests, and Collective Action in the Egyptian Revolution of 2011, Trey Causey
The Battle for Bystanders
An analysis of media during the Egyptian revolution of 2011
Using: bag-of-words, topic modeling
Topic modeling across a database of three online news outlets -- both state and non-state media -- to detect and count stories with various frames, e.g. "danger and instability".
Relies on interpretation of algorithmically generated "topics," which are really distributions over words. No ground truth / comparison to human raters.
An Overview of Overview
The Overview Project
A general purpose document mining system. Meant to answer the question, "what's in there?" Better than search -- find what you didn't know you were looking for.
Overview, Associated Press
Overview
Does: topic exploration
Using: bag-of-words, n-grams, TF-IDF, document similarity, k-means clustering, full text search
Uses the full text of each document to perform hierarchical clustering based on topic. Visual exploration and tagging, and (soon) integrated full-text search.
Topic Tree
Computer sorts documents into folders and sub-folders, based on topic analysis.
Duplicate / near-duplicate detection
66 copies with different names
Automatic sorting + manual tagging
Deeper in the tree = narrower topic. When all docs are on the "same" topic, tag it.
Extracted keywords for folders and docs
Generate document vectors, just like a search engine. Then cluster the space. Visualization of "types" of search result.
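A minimal version of "generate document vectors, then cluster the space": length-normalized bag-of-words vectors fed to plain k-means. The documents are invented, the initial centers are supplied by the caller to keep the toy deterministic, and Overview's real pipeline uses TF-IDF weighting and hierarchical clustering rather than this flat sketch.

```python
import math
from collections import Counter

def doc_vector(tokens, vocab):
    """Length-normalized bag-of-words vector over a fixed vocabulary."""
    tf = Counter(tokens)
    v = [tf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def kmeans(vectors, centers, iters=10):
    """Plain k-means: assign each vector to its nearest center, then
    move each center to the mean of its members; repeat."""
    k = len(centers)
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k),
                            key=lambda c: sum((x - y) ** 2
                                              for x, y in zip(v, centers[c])))
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

docs = ["police report incident".split(),
        "police incident report filed".split(),
        "budget spending cuts".split(),
        "budget cuts spending plan".split()]
vocab = sorted({t for d in docs for t in d})
vectors = [doc_vector(d, vocab) for d in docs]
clusters = kmeans(vectors, [vectors[0][:], vectors[-1][:]])
print(clusters)  # → [0, 0, 1, 1]: the two topics separate
```

Run recursively inside each cluster, this same step produces the topic tree of folders and sub-folders shown on the previous slides.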
Stories done with Overview
9000 pages FOIA'd from 200 Federal agencies. Data Journalism Awards 2013 finalist.
4500 pages of incident reports from the US Dept of State, declassified after FOIA
7000 emails from Tulsa Police Department. Millions wasted on bad computers.
Lessons learned
• Import is the hardest part! Messy input formats, big uploads, many documents on paper...
• Usability is crucial. People will give up fast.
• #1 FAQ: "how is it sorting my documents?"
• #1 comment: "oh, you mean it's a search engine."
• How do we explain what we're doing to users?
WORKFLOW beats ALGORITHM
every time
What's Next for Data-Driven Transparency?
What should we do next?
Lots of stories we could do. Lots of tools we could build. Lots of data we could analyze.
Are we starMng from the right place?
"Low Hanging Fruit"
Work on the untouched data sets that have obvious interest and potential, like campaign contributions.
Catalog available data. Push for opening more. Create interfaces to existing data sets. This is a data-driven approach. Risk is "looking for your keys under the street light."
"Capacity Building"
Data analysis is hard! Let's make it easier. Build better software. Reduce duplication of engineering effort. Teach people to do data work, and improve training methods. This is a tool- and technique-driven approach. Risk is building capacity that doesn't matter (no one uses it, or it has no impact).
"What happens if"
Look for the work that will have the greatest positive effect. Impact is some combination of supply (we could do this), demand (people would want it), and effectiveness (it contributes to agency). This is an impact-driven approach. Can be very hard to predict or measure.
How does transparency work?
• Deterrence. Powerful people don't do bad things because they know someone is watching.
• Attention. Focus the spotlight on things that shouldn't be (even if they're "known")
• Understanding. Just what is going on there anyway? Secrets vs. mysteries.
• Influence mapping. Who is actually making the rules?
The anxiety of influence
This is why people care about campaign finance.
This is why people care about text flow in lawmaking.
This is why people care about political advertising.
Detecting influence
But how does influence work? Influence over what? (It's a vector, not a scalar.) Is it algorithmically detectable?
Campaign finance data seeks to quantify it. Social network analysis makes claims. Straight-up votes still count.
But... do we really understand influence? Are we confusing inputs and results?
Some analyses I'd like to see
Externalities of finance. How do banks make money? What effects does this have on the rest of us? Are internal justifications like "increasing liquidity" good or bad for everyone else? Is the industry actually competitive, or just an oligopoly? Connections to politics and other sources of power?
Large scale social network mapping. Start with data from LittleSis.org. Can we actually learn anything about influence from this? Try to develop comparative metrics, and a typology of influence -- break down by industry? Look at revolving doors, hiring and appointments, money flows, etc.
Transparency Grand Challenge
Illuminate for citizens how the decisions that affect them actually get made.
(Which requires figuring that out.)
Show them how to use their own influence.
Grand Challenge Questions
• Is government the right focus?
• What types of influence are there? (tribes, institutions, markets, networks)
• What is the limiting factor to detecting influence? Could be data access, missing tools, lack of public attention, system complexity, or ... ?
• Are we facing secrets (someone doesn't want us to know) or mysteries (it's complicated and no one knows)?
• Do we really know how data relates to influence?
• Who is affected by each type of influence?
• Who are we working for? Have we asked them what they want?