Natural Language Processing in Investigative Journalism

Uploaded by: jonathan-stray

Posted on: 14-Oct-2015

DESCRIPTION

Journalists frequently have far too many documents to read manually, whether it's a 10,000 page response to a Freedom of Information request or 250,000 leaked diplomatic cables. We've spent the last three years applying NLP and visualization techniques to this problem, building a system called Overview which has now been used by journalists all over the world. In this talk I'll show you exactly how Overview's language processing pipeline works. But I'll also talk about how we decided which algorithms to use and how to present the results to the user. Topic modeling is a powerful technique, but all such algorithms are derived by optimizing for statistical properties, not fitness to end-user tasks. We developed Overview through extensive collaboration with journalists and careful user testing, and the experience has taught us a great deal about the problem of making NLP results interpretable to users. Since Overview is open source, you can leverage our work to build your own user-friendly NLP applications with our plugin API.

For the links referenced in the talk see https://bitly.com/JournalismNLP

Original Meetup talk link: http://www.meetup.com/NYC-Machine-Learning/events/188713962/

TRANSCRIPT

  • Natural Language Processing for Investigative Journalism

    Jonathan Stray, NYC ML Meetup, 2014/6/19

  • Links!

    http://bit.ly/JournalismNLP

  • Proof of concept algorithm

    For every document x:
        convert to TF-IDF vector
        label by three words with highest TF-IDF
        color by incident type (from original data)

    For every pair of documents x, y:
        if cosine_distance(x, y) < threshold: add_edge(x, y)

    Plot all documents and edges with force-directed layout
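
    A minimal runnable sketch of this proof of concept, assuming Python with scikit-learn and networkx; the sample documents and the 0.8 threshold are illustrative, not the talk's values, and the incident-type coloring is omitted:

        import networkx as nx
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_distances

        docs = ["police report of the incident ...",
                "witness statement about the robbery ...",
                "budget memo for the fiscal year ..."]

        # TF-IDF vector for every document
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(docs)
        terms = np.array(vectorizer.get_feature_names_out())

        # label each document by its three highest-weighted TF-IDF terms
        labels = [", ".join(terms[np.argsort(row.toarray().ravel())[-3:]])
                  for row in tfidf]

        # add an edge for every pair closer than the threshold -- O(N^2)
        threshold = 0.8
        dist = cosine_distances(tfidf)
        g = nx.Graph()
        g.add_nodes_from(range(len(docs)))
        for x in range(len(docs)):
            for y in range(x + 1, len(docs)):
                if dist[x, y] < threshold:
                    g.add_edge(x, y)

        # force-directed (spring) layout gives each document a 2D position to plot
        positions = nx.spring_layout(g)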

  • Document sets too big for O(N²)

  • Prototype scatterplot

    for each document x:
        let edges[x] = { N random documents }

    for each time step:
        for each document x:
            for each y in edges[x]:
                d1 = cosine_distance(x, y)
                d2 = layout_distance(x, y)
                apply_spring_force(d1, d2)
            for the N/2 edges with largest d1:
                replace that edge in edges[x] with a random document
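
    A rough sketch of this sampled spring layout, assuming the tfidf matrix from the previous snippet; the step size, iteration count, and edges-per-document are my own guesses:

        import numpy as np
        from sklearn.metrics.pairwise import cosine_distances

        rng = np.random.default_rng(0)
        n_docs = tfidf.shape[0]
        n_edges = 10                              # candidate edges kept per document
        pos = rng.random((n_docs, 2))             # random initial 2D positions
        edges = {x: list(rng.choice(n_docs, size=n_edges)) for x in range(n_docs)}

        for step in range(100):
            for x in range(n_docs):
                scored = []
                for y in edges[x]:
                    d1 = cosine_distances(tfidf[x], tfidf[y])[0, 0]  # text distance
                    d2 = np.linalg.norm(pos[x] - pos[y])             # layout distance
                    # spring force: nudge x toward/away from y until d2 matches d1
                    if d2 > 0:
                        pos[x] += 0.01 * (d2 - d1) * (pos[y] - pos[x]) / d2
                    scored.append((d1, y))
                # keep the nearest half of the candidates and resample the rest,
                # so good neighbors persist while bad ones churn
                scored.sort()
                keep = [y for _, y in scored[: n_edges // 2]]
                edges[x] = keep + list(rng.choice(n_docs, size=n_edges - len(keep)))

    This avoids the all-pairs distance computation: each pass touches only N × n_edges pairs instead of N² pairs.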

  • Prototype tree

    let c = { one component with all documents }
    for threshold = [0.1, 0.2, ..., 1.0]:
        c_new = {}
        for x in c:
            pieces = connected_components(x, threshold)
            x.children = pieces
            c_new += pieces
        c = c_new
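
    A sketch of this threshold tree, assuming (as the slide leaves implicit) that documents are connected when their cosine similarity exceeds the threshold, so raising the threshold removes edges and splits components into finer children; it reuses tfidf from the first snippet:

        import networkx as nx
        from sklearn.metrics.pairwise import cosine_similarity

        sim = cosine_similarity(tfidf)

        def components(doc_ids, threshold):
            # connected components of these docs at this similarity threshold
            g = nx.Graph()
            g.add_nodes_from(doc_ids)
            g.add_edges_from((x, y) for i, x in enumerate(doc_ids)
                             for y in doc_ids[i + 1:] if sim[x, y] > threshold)
            return [sorted(c) for c in nx.connected_components(g)]

        tree = {"docs": list(range(sim.shape[0])), "children": []}
        level = [tree]
        for threshold in [t / 10 for t in range(1, 11)]:
            next_level = []
            for node in level:
                if len(node["docs"]) <= 1:
                    continue                      # singletons can't split further
                for piece in components(node["docs"], threshold):
                    child = {"docs": piece, "children": []}
                    node["children"].append(child)
                    next_level.append(child)
            level = next_level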

  • Lots of emails would be meaningless, spam or pictures of cats, so Overview can be used to easily dismiss the majority. Given a set of emails based on a keyword search, the problem is more difficult because most of the emails will be at least somewhat relevant. In this case, Overview was most useful as an organizational tool. I could look at an email, make a note, and easily have it grouped with other similar emails through tagging. I started with a branch of Overview's document tree and started clicking, glancing, noting and tagging. Right off the bat, I found that Overview had grouped together all of the similarly formatted service desk requests. There were hundreds if not thousands of those, so I was able to tag them by the dozens without a second thought while focusing on the meatier emails.

  • And then no one really used it....

  • Sources of user feedback

    Log data (select node, apply tag, view document, ...)
    Emails and other personal contact
    After-use semi-structured interviews
    Think-aloud usability tests with naive users

  • Usability lesson #1

    If the workflow doesn't work, the algorithm doesn't matter

  • Workflow improvements

    Potential users were not able to download and install a command-line system, they couldn't get their documents into it, and they didn't understand how to use it. So:

    Rewritten as a web application
    Import from something other than a CSV
    Split long documents into pages
    UI overhaul: it has to be obvious without reading the manual!

  • Is the tree any good?

  • Evaluation Methods for Topic Models, Wallach et al., 2009

  • Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis, Chuang et al., SIGCHI 2012

  • The curious case of Petroleum Engineering. The top visualization shows a 2D projection of pairwise topical distances between academic departments. In 2005, Petroleum Engineering appears similar to Neurobiology, Medicine, and Biology. Was there a collaboration among those departments? The bottom visualization shows the undistorted distances from Petroleum Engineering to other departments by radial distance. The connection to biology disappears: it was an artifact of dimensionality reduction. The visual encoding of spatial distance in the first view is interpretable, but on its own is not trustworthy.
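
    A toy illustration of that artifact in Python (my own random data, not the paper's): project high-dimensional points to 2D with MDS, then find the pair whose apparent 2D closeness most overstates their true similarity:

        import numpy as np
        from sklearn.manifold import MDS
        from sklearn.metrics import euclidean_distances

        rng = np.random.default_rng(0)
        points = rng.random((30, 50))       # 30 "departments" in a 50-D topic space
        true_d = euclidean_distances(points)

        proj = MDS(n_components=2, random_state=0).fit_transform(points)
        proj_d = euclidean_distances(proj)

        # the pair that the projection squeezes together the most
        i, j = np.unravel_index(np.argmax(true_d - proj_d), true_d.shape)
        print(f"pair {i},{j}: true distance {true_d[i, j]:.2f}, "
              f"projected {proj_d[i, j]:.2f}")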

  • Usability lesson #2

    Your users define the tasks and therefore the measure of quality

  • Current tree

    c = { one component with all documents }
    max_kids = 5
    while c is not empty:
        c_new = {}
        for x in c where size(x) > 1:
            children = adaptive_kmeans(x, max_kids)
            x.children = children
            c_new += children
        c = c_new

  • Adaptive k-means
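
    A sketch of this tree-building loop with one plausible reading of adaptive_kmeans, which the transcript doesn't spell out: as an assumption, try k = 2..max_kids and keep the k with the best silhouette score. Reuses tfidf from the first snippet:

        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        def adaptive_kmeans(vectors, max_kids):
            n = vectors.shape[0]
            if n <= 2:
                return list(range(n))           # too few docs: one per child
            best_labels, best_score = [0] * n, -1.0
            for k in range(2, min(max_kids, n - 1) + 1):
                labels = KMeans(n_clusters=k, n_init=10,
                                random_state=0).fit_predict(vectors)
                score = silhouette_score(vectors, labels)
                if score > best_score:
                    best_labels, best_score = labels, score
            return best_labels

        def build_tree(doc_ids, vectors, max_kids=5):
            node = {"docs": doc_ids, "children": []}
            if len(doc_ids) > 1:
                labels = adaptive_kmeans(vectors[doc_ids], max_kids)
                for k in set(labels):
                    child_ids = [d for d, lab in zip(doc_ids, labels) if lab == k]
                    node["children"].append(build_tree(child_ids, vectors, max_kids))
            return node

        tree = build_tree(list(range(tfidf.shape[0])), tfidf)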

  • Folder labeling

    for each folder x:
        let d = { docs in x }
        let v = sum(TF-IDF vectors of d)
        let t = { 10 terms in v with highest weight }

        prefix each term in t with:
            "ALL"  if the term appears in all of d
            "MOST" if the term appears in >= 70% of d
            "SOME" if the term appears in < 70% of d

  • Types of document-driven stories

    Smoking gun
        basically a search problem
        often hard to formulate a query, so visual exploration can help

    Categorize and count
        "trend story" about quantitative patterns
        find/invent useful categories, then tag and count documents

    Exhaustive reading
        still desirable or necessary for some stories! (for example, to prove that something does not exist)
        to our surprise, wide scope for computer-assisted speedup

  • Added in response to user feedback

    Limit of five children per folder
    ALL / MOST / SOME folder labeling
    Search
    Show untagged documents
    Multiple language support
    Many import and export options
    ...
    Simplify, simplify, simplify!

  • K-means vs. LDA on xkcd

  • Why not "real" topic models? How to display topic model output?

    many systems just use output for distance metric we've already got a tree, we've already rejected MDS popular topics-over-time view not applicable for most users multiple topics per document even more confusing

    LDA interpretability not obviously better K-means, LDA, NMF are mathematically related anyway Need hierarchical, O(N) algorithm But ultimately... So far, usability problems data modeling problems Just haven't gotten around to trying

  • What we're building now

  • Coming soon: named entity recognition

  • NER accuracy is really low!

    Test: OpenCalais on 5 random articles from various sources, scored against hand-tagged entities
        Overall PRECISION = 77%
        Overall RECALL = 30%
    ...and journalism inputs can be from any domain!
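
    A minimal sketch of this kind of evaluation: per-document sets of extracted entities scored against hand-tagged ones (the entity names below are illustrative):

        def precision_recall(extracted, gold):
            tp = len(extracted & gold)                  # correct extractions
            precision = tp / len(extracted) if extracted else 0.0
            recall = tp / len(gold) if gold else 0.0
            return precision, recall

        extracted = {"New York", "Jonathan Stray", "Acme Corp"}
        gold = {"New York", "Jonathan Stray", "Overview", "Associated Press"}
        p, r = precision_recall(extracted, gold)
        print(f"precision {p:.0%}, recall {r:.0%}")     # precision 67%, recall 50%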

  • Initially populate using NER, but let user edit entities, aliases, and entity tags on each document.

  • Usability lesson #3

    The user doesn't care about having an accurate algorithm.

    They care about getting clean data out.

  • Plugin API

    Custom visualizations

    Plugin calls to Overview:
    - get document text
    - write/read persistent objects
    - read/write document metadata

    Overview calls to plugin:
    - display visualization (render HTML/JS for an iframe)
    - selection changed

    In development now, coming this summer!
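
    A hypothetical sketch of a plugin's overall shape, based only on the call list above; it assumes Python with Flask, and the actual Overview API endpoints are deliberately not shown, since inventing them here would be guesswork:

        from flask import Flask

        app = Flask(__name__)

        @app.route("/show")
        def show():
            # Overview loads this page in an iframe to display the visualization;
            # the served JS would call back to Overview for document text and
            # metadata, and receive selection-changed notifications.
            return """<html><body>
              <div id="viz"></div>
              <script>/* custom visualization goes here */</script>
            </body></html>"""

        if __name__ == "__main__":
            app.run(port=3000)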

  • Your Visualization Here

  • I mean, I really feel the reason that what occurred in Homicide occurred was because of the incident in, I believe, the Brentwood area with that stand-by pay. And, you know, what I'm beginning to learn in this business is that payback is not nice sometimes.

  • An interactive NLP testbed

    Conjecture: across many domains, it's much faster for a human to correct an algorithm than to do the whole task by hand.

    Plugins can read document text, write document metadata, and interact with the user. Perfect for hybrid human-computer tasks.

  • Thank you! For links to everything referenced in this talk please go to:

    http://bit.ly/JournalismNLP

    Find us at:

    overviewproject.org

    github.com/overview

    [email protected]