Natural Language Processing in Investigative Journalism

Uploaded by: jonathan-stray

Posted on: 14-Oct-2015

DESCRIPTION

Journalists frequently have far too many documents to read manually, whether it's a 10,000 page response to a Freedom of Information request or 250,000 leaked diplomatic cables. We've spent the last three years applying NLP and visualization techniques to this problem, building a system called Overview which has now been used by journalists all over the world. In this talk I'll show you exactly how Overview's language processing pipeline works. But I'll also talk about how we decided which algorithms to use and how to present the results to the user. Topic modeling is a powerful technique, but all such algorithms are derived by optimizing for statistical properties, not fitness to end-user tasks. We developed Overview through extensive collaboration with journalists and careful user testing, and the experience has taught us a great deal about the problem of making NLP results interpretable to users. Since Overview is open source, you can leverage our work to build your own user-friendly NLP applications with our plugin API.

For the links referenced in the talk see https://bitly.com/JournalismNLP

Original Meetup talk link: http://www.meetup.com/NYC-Machine-Learning/events/188713962/

TRANSCRIPT

  • Natural Language Processing for Investigative Journalism

    Jonathan Stray, NYC ML Meetup, 2014/6/19

  • Links!

    http://bit.ly/JournalismNLP

  • Proof of concept algorithm

    For every document x:
        convert to TF-IDF vector
        label by three words with highest TF-IDF
        color by incident type (from original data)

    For every pair of documents x, y:
        if cosine_distance(x, y) < threshold: add_edge(x, y)

    Plot all documents and edges with force-directed layout
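
    A minimal runnable sketch of this proof of concept, assuming Python with scikit-learn and networkx; the sample documents and the 0.8 threshold are illustrative, not the talk's values, and the incident-type coloring is omitted:

        import networkx as nx
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_distances

        docs = ["police report of the incident ...",
                "witness statement about the robbery ...",
                "budget memo for the fiscal year ..."]

        # TF-IDF vector for every document
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(docs)
        terms = np.array(vectorizer.get_feature_names_out())

        # label each document by its three highest-weighted TF-IDF terms
        labels = [", ".join(terms[np.argsort(row.toarray().ravel())[-3:]])
                  for row in tfidf]

        # add an edge for every pair closer than the threshold -- O(N^2)
        threshold = 0.8
        dist = cosine_distances(tfidf)
        g = nx.Graph()
        g.add_nodes_from(range(len(docs)))
        for x in range(len(docs)):
            for y in range(x + 1, len(docs)):
                if dist[x, y] < threshold:
                    g.add_edge(x, y)

        # force-directed (spring) layout gives each document a 2D position to plot
        positions = nx.spring_layout(g)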

  • Document sets too big for O(N²)

  • Prototype scatterplot

    for each document x:
        let edges[x] = { N random documents }

    for each time step:
        for each document x:
            for each y in edges[x]:
                d1 = cosine_distance(x, y)
                d2 = layout_distance(x, y)
                apply_spring_force(d1, d2)
            for the N/2 edges with largest d1:
                replace that edge in edges[x] with a random document
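
    A rough sketch of this sampled spring layout, assuming the tfidf matrix from the previous snippet; the step size, iteration count, and edges-per-document are my own guesses:

        import numpy as np
        from sklearn.metrics.pairwise import cosine_distances

        rng = np.random.default_rng(0)
        n_docs = tfidf.shape[0]
        n_edges = 10                              # candidate edges kept per document
        pos = rng.random((n_docs, 2))             # random initial 2D positions
        edges = {x: list(rng.choice(n_docs, size=n_edges)) for x in range(n_docs)}

        for step in range(100):
            for x in range(n_docs):
                scored = []
                for y in edges[x]:
                    d1 = cosine_distances(tfidf[x], tfidf[y])[0, 0]  # text distance
                    d2 = np.linalg.norm(pos[x] - pos[y])             # layout distance
                    # spring force: nudge x toward/away from y until d2 matches d1
                    if d2 > 0:
                        pos[x] += 0.01 * (d2 - d1) * (pos[y] - pos[x]) / d2
                    scored.append((d1, y))
                # keep the nearest half of the candidates and resample the rest,
                # so good neighbors persist while bad ones churn
                scored.sort()
                keep = [y for _, y in scored[: n_edges // 2]]
                edges[x] = keep + list(rng.choice(n_docs, size=n_edges - len(keep)))

    This avoids the all-pairs distance computation: each pass touches only N × n_edges pairs instead of N² pairs.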

  • Prototype tree

    let c = { one component with all documents }
    for threshold = [0.1, 0.2, ..., 1.0]:
        c_new = {}
        for x in c:
            pieces = connected_components(x, threshold)
            x.children = pieces
            c_new += pieces
        c = c_new
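
    A sketch of this threshold tree, assuming (as the slide leaves implicit) that documents are connected when their cosine similarity exceeds the threshold, so raising the threshold removes edges and splits components into finer children; it reuses tfidf from the first snippet:

        import networkx as nx
        from sklearn.metrics.pairwise import cosine_similarity

        sim = cosine_similarity(tfidf)

        def components(doc_ids, threshold):
            # connected components of these docs at this similarity threshold
            g = nx.Graph()
            g.add_nodes_from(doc_ids)
            g.add_edges_from((x, y) for i, x in enumerate(doc_ids)
                             for y in doc_ids[i + 1:] if sim[x, y] > threshold)
            return [sorted(c) for c in nx.connected_components(g)]

        tree = {"docs": list(range(sim.shape[0])), "children": []}
        level = [tree]
        for threshold in [t / 10 for t in range(1, 11)]:
            next_level = []
            for node in level:
                if len(node["docs"]) <= 1:
                    continue                      # singletons can't split further
                for piece in components(node["docs"], threshold):
                    child = {"docs": piece, "children": []}
                    node["children"].append(child)
                    next_level.append(child)
            level = next_level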

  • Lots of emails would be meaningless, spam or pictures of cats, so Overview can be used to easily dismiss the majority. Given a set of emails based on a keyword search, the problem is more difficult because most of the emails will be at least somewhat relevant. In this case, Overview was most useful as an organizational tool. I could look at an email, make a note, and easily have it grouped with other similar emails through tagging. I started with a branch of Overview's document tree and started clicking, glancing, noting and tagging. Right off the bat, I found that Overview had grouped together all of the similarly formatted service desk requests. There were hundreds if not thousands of those, so I was able to tag them by the dozens without a second thought while focusing on the meatier emails.

  • And then no one really used it....

  • Sources of user feedback

    Log data (select node, apply tag, view document, ...)
    Emails and other personal contact
    After-use semi-structured interviews
    Think-aloud usability tests with naive users

  • Usability lesson #1

    If the workflow doesn't work, the algorithm doesn't matter

  • Workflow improvements

    Potential users were not able to download and install a command-line system, they couldn't get their documents into it, and they didn't understand how to use it. So:

    Rewritten as a web application
    Import from something other than a CSV
    Split long documents into pages
    UI overhaul: it has to be obvious without reading the manual!

  • Is the tree any good?

  • Evaluation Methods for Topic Models, Wallach et al., 2009

  • Interpretation and Trust: Designing Model-Driven Visualizations for Text Analysis, Chuang et al., SIGCHI 2012

  • The curious case of Petroleum Engineering. The top visualization shows a 2D projection of pairwise topical distances between academic departments. In 2005, Petroleum Engineering appears similar to Neurobiology, Medicine, and Biology. Was there a collaboration among those departments? The bottom visualization shows the undistorted distances from Petroleum Engineering to other departments by radial distance. The connection to biology disappears: it was an artifact of dimensionality reduction. The visual encoding of spatial distance in the first view is interpretable, but on its own is not trustworthy.
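
    A toy illustration of that artifact in Python (my own random data, not the paper's): project high-dimensional points to 2D with MDS, then find the pair whose apparent 2D closeness most overstates their true similarity:

        import numpy as np
        from sklearn.manifold import MDS
        from sklearn.metrics import euclidean_distances

        rng = np.random.default_rng(0)
        points = rng.random((30, 50))       # 30 "departments" in a 50-D topic space
        true_d = euclidean_distances(points)

        proj = MDS(n_components=2, random_state=0).fit_transform(points)
        proj_d = euclidean_distances(proj)

        # the pair that the projection squeezes together the most
        i, j = np.unravel_index(np.argmax(true_d - proj_d), true_d.shape)
        print(f"pair {i},{j}: true distance {true_d[i, j]:.2f}, "
              f"projected {proj_d[i, j]:.2f}")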

  • Usability lesson #2

    Your users define the tasks and therefore the measure of quality

  • Current tree

    c = { one component with all documents }
    max_kids = 5
    while c is not empty:
        c_new = {}
        for x in c where size(x) > 1:
            children = adaptive_kmeans(x, max_kids)
            x.children = children
            c_new += children
        c = c_new

  • Adaptive k-means
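
    A sketch of this tree-building loop with one plausible reading of adaptive_kmeans, which the transcript doesn't spell out: as an assumption, try k = 2..max_kids and keep the k with the best silhouette score. Reuses tfidf from the first snippet:

        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        def adaptive_kmeans(vectors, max_kids):
            n = vectors.shape[0]
            if n <= 2:
                return list(range(n))           # too few docs: one per child
            best_labels, best_score = [0] * n, -1.0
            for k in range(2, min(max_kids, n - 1) + 1):
                labels = KMeans(n_clusters=k, n_init=10,
                                random_state=0).fit_predict(vectors)
                score = silhouette_score(vectors, labels)
                if score > best_score:
                    best_labels, best_score = labels, score
            return best_labels

        def build_tree(doc_ids, vectors, max_kids=5):
            node = {"docs": doc_ids, "children": []}
            if len(doc_ids) > 1:
                labels = adaptive_kmeans(vectors[doc_ids], max_kids)
                for k in set(labels):
                    child_ids = [d for d, lab in zip(doc_ids, labels) if lab == k]
                    node["children"].append(build_tree(child_ids, vectors, max_kids))
            return node

        tree = build_tree(list(range(tfidf.shape[0])), tfidf)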

  • Folder labeling

    for each folder x:
        let d = { docs in x }
        let v = sum(TF-IDF vectors of d)
        let t = { 10 terms in v with highest weight }

        prefix each term in t with:
            "ALL"  if the term appears in all of d
            "MOST" if the term appears in >= 70% of d
            "SOME" if the term appears in < 70% of d

  • Types of document-driven stories

    Smoking gun
        basically a search problem
        often hard to formulate a query, so visual exploration can help

    Categorize and count
        "trend story" about quantitative patterns
        find/invent useful categories, then tag and count documents

    Exhaustive reading
        still desirable or necessary for some stories! (for example, to prove that something does not exist)
        to our surprise, wide scope for computer-assisted speedup

  • Added in response to user feedback

    Limit of five children per folder
    ALL / MOST / SOME folder labeling
    Search
    Show untagged documents
    Multiple language support
    Many import and export options
    ...
    Simplify, simplify, simplify!

  • K-means vs. LDA on xkcd

  • Why not "real" topic models? How to display topic model output?

    many systems just use output for distance metric we've already got a tree, we've already rejected MDS popular topics-over-time view not applicable for most users multiple topics per document even more confusing

    LDA interpretability not obviously better K-means, LDA, NMF are mathematically related anyway Need hierarchical, O(N) algorithm But ultimately... So far, usability problems data modeling problems Just haven't gotten around to trying

  • What we're building now

  • Coming soon: named entity recognition

  • NER accuracy is really low!

    Test: OpenCalais on 5 random articles from various sources, scored against hand-tagged entities
        Overall PRECISION = 77%
        Overall RECALL = 30%
    ...and journalism inputs can be from any domain!
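
    A minimal sketch of this kind of evaluation: per-document sets of extracted entities scored against hand-tagged ones (the entity names below are illustrative):

        def precision_recall(extracted, gold):
            tp = len(extracted & gold)                  # correct extractions
            precision = tp / len(extracted) if extracted else 0.0
            recall = tp / len(gold) if gold else 0.0
            return precision, recall

        extracted = {"New York", "Jonathan Stray", "Acme Corp"}
        gold = {"New York", "Jonathan Stray", "Overview", "Associated Press"}
        p, r = precision_recall(extracted, gold)
        print(f"precision {p:.0%}, recall {r:.0%}")     # precision 67%, recall 50%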

  • Initially populate using NER, but let user edit entities, aliases, and entity tags on each document.

  • Usability lesson #3

    The user doesn't care about having an accurate algorithm.

    They care about getting clean data out.

  • Plugin API

    Custom visualizations

    Plugin calls to Overview:
    - get document text
    - write/read persistent objects
    - read/write document metadata

    Overview calls to plugin:
    - display visualization (render HTML/JS for an iframe)
    - selection changed

    In development now, coming this summer!
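
    A hypothetical sketch of a plugin's overall shape, based only on the call list above; it assumes Python with Flask, and the actual Overview API endpoints are deliberately not shown, since inventing them here would be guesswork:

        from flask import Flask

        app = Flask(__name__)

        @app.route("/show")
        def show():
            # Overview loads this page in an iframe to display the visualization;
            # the served JS would call back to Overview for document text and
            # metadata, and receive selection-changed notifications.
            return """<html><body>
              <div id="viz"></div>
              <script>/* custom visualization goes here */</script>
            </body></html>"""

        if __name__ == "__main__":
            app.run(port=3000)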

  • Your Visualization Here

  • I mean, I really feel the reason that what occurred in Homicide occurred was because of the incident in, I believe, the Brentwood area with that stand-by pay. And, you know, what I'm beginning to learn in this business is that payback is not nice sometimes.

  • An interactive NLP testbed

    Conjecture: across many domains, it's much faster for a human to correct an algorithm than to do the whole task by hand.

    Plugins can read document text, write document metadata, and interact with the user. Perfect for hybrid human-computer tasks.

  • Thank you! For links to everything referenced in this talk please go to:

    http://bit.ly/JournalismNLP

    Find us at:

    overviewproject.org

    github.com/overview

    [email protected]