

Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 1–13, Baltimore, Maryland, USA, June 27, 2014. © 2014 Association for Computational Linguistics

MITEXTEXPLORER: Linked brushing and mutual information for exploratory text data analysis

Figure 1: Screenshot of MITEXTEXPLORER, analyzing geolocated tweets.

Brendan O’Connor
Machine Learning Department
Carnegie Mellon University
[email protected]
http://brenocon.com

Abstract

In this paper I describe a preliminary experimental system, MITEXTEXPLORER, for textual linked brushing, which allows an analyst to interactively explore statistical relationships between (1) terms, and (2) document metadata (covariates). An analyst can graphically select documents embedded in a temporal, spatial, or other continuous space, and the tool reports terms with strong statistical associations for the region. The user can then drill down to specific term and term groupings, viewing further associations, and see how terms are used in context. The goal is to rapidly compare language usage across interesting document covariates.

I illustrate examples of using the tool on several datasets: geo-located Twitter messages, presidential State of the Union addresses, the ACL Anthology, and the King James Bible.

1 Introduction: Can we “just look” at statistical text data?

Exploratory data analysis (EDA) is an approach to extract meaning from data, which emphasizes learning about a dataset through an iterative process of many analyses which suggest and refine possible hypotheses. It is vital in early stages of a data analysis for data cleaning and sanity checks, which are crucial to help ensure a dataset will be useful. Exploratory techniques can also suggest possible hypotheses or issues for further investigation.



Figure 2: Anscombe Quartet. (Source: Wikipedia)

The classical approach to EDA, as pioneered in works such as Tukey (1977) and Cleveland (1993) (and other work from the Bell Labs statistics group during that period) emphasizes visual analysis under nonparametric, model-free assumptions, in which visual attributes are a fairly direct reflection of numerical or categorical aspects of data. As a simple example, consider the well-known Anscombe Quartet (1973), a set of four bivariate example datasets. The Pearson correlation, a very widely used measure of dependence that assumes a linear Gaussian model of the data, finds that each dataset has an identical amount of dependence (r = 0.82). However, a scatterplot instantly reveals that very different dependence relationships hold in each dataset (Figure 2). The scatterplot is possibly the simplest visual analysis tool for investigating the relationship between two variables, in which the variables’ numerical values are mapped to horizontal and vertical space. While the correlation coefficient is a model-based analysis tool, the scatterplot is model-free (or at least, it is effective under an arguably wider range of data generating assumptions), which is crucial for this example.
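The quartet’s shared correlation is easy to check numerically; a minimal sketch (the data values are the published quartet; NumPy is assumed available):

```python
import numpy as np

# Anscombe's quartet, as published (Anscombe, 1973).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
# Pearson r for each dataset: all four come out near 0.816,
# despite the very different shapes visible in a scatterplot.
rs = [np.corrcoef(x, y)[0, 1] for x, y in datasets]
```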

This nonparametric, visual approach to EDA has been encoded into many data analysis packages, including the now-ubiquitous R language (R Core Team, 2013), which descends from earlier software by the Bell Labs statistics group (Becker and Chambers, 1984). In R, tools such as histograms, boxplots, barplots, dotplots, mosaicplots, etc. are built-in, basic operators in the language. (Wilkinson (2006)’s grammar of graphics more extensively systematizes this approach; see also Wickham (2010); Bostock et al. (2011).)

Figure 3: Linked brushing with the analysis software GGobi. More references at source: http://www.infovis-wiki.net/index.php?title=Linking_and_Brushing

In the meantime, textual data has emerged as a resource of increasing interest for many scientific, business, and government data analysis applications. Consider the use case of automated content analysis (a.k.a. text mining) as a tool for investigating social scientific and humanistic questions (Grimmer and Stewart, 2013; Jockers, 2013; Shaw, 2012; O’Connor et al., 2011). The content of the data is under question: analysts are interested in what/when/how/by-whom different concepts, ideas, or attitudes are expressed in a corpus, and the trends in these factors across time, space, author communities, or other document-level covariates (often called metadata). Comparisons of word statistics across covariates are absolutely essential to many interesting questions or social measurement problems, such as

• What topics tend to get censored by the Chinese government online, and why (Bamman et al., 2012; King et al., 2013)? Covariates: whether a message is deleted by censors, time/location of message.

• What drives media bias? Do newspapers slant their coverage in response to what readers want (Gentzkow and Shapiro, 2010)? Covariates: political preferences of readers, competitiveness of media markets.

There exist dozens, if not more, of other examples in social scientific and humanities research; see references in O’Connor et al. (2011); O’Connor (2014).

In this work, I focus on the question: What should be the baseline exploratory tools for textual data, to discover important statistical associations between text and document covariates? Ideally, we’d like to “just look” at the data, in the spirit of scatterplotting the Anscombe Quartet. An analysis tool to support this should not require any statistical model assumptions, and should display the data in as direct a form as possible.

For low-dimensional, non-textual data, the base functionality of R prescribes a broad array of useful defaults: one-dimensional continuous data can be histogrammed (hist(x)), or kernel density plotted (plot(density(x))), while the relationship between two dimensions of continuous variables can be viewed as a scatterplot (plot(x,y)); or perhaps a boxplot for discrete x and continuous y (boxplot(x,y)); and so on. Commercial data analysis systems such as Excel, Stata, Tableau, JMP, etc., have similar functionality.
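As an illustration only (the paper’s examples are R built-ins), roughly equivalent baseline plots can be sketched in Python with matplotlib; the data here is invented:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)
g = rng.integers(0, 3, size=200)  # a discrete covariate with 3 levels

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].hist(x, bins=20)                         # ~ R's hist(x)
axes[1].scatter(x, y, s=5)                       # ~ R's plot(x, y)
axes[2].boxplot([y[g == k] for k in range(3)])   # ~ R's boxplot for discrete x
fig.savefig("baseline_plots.png")
```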

These visual tools can be useful for analyzing derived content statistics from text—for example, showing a high-level topic or sentiment frequency trending over time—but they cannot visualize the text itself. Text data consists of a linear sequence of high-dimensional discrete variables (words). The most aggressive and common analysis approach, bag-of-words, eliminates the problematic sequential structure by reducing a document to a high-dimensional vector of counts over words. But still, none of the above visual tools makes sense for visualizing a word distribution; many popular tools simply crash or become very slow when given word count data. And besides the issues of discrete high-dimensionality, text is unique in that it has to be manually read in order to more reliably understand its meaning. Natural language processing tools can sometimes extract partial views of text meaning, but full understanding is a long ways off; and the quality of available NLP tools varies greatly across corpora and languages. A useful exploratory tool should be able to work with a variety of levels of sophistication in NLP tooling, and allow the user to fall back to manual reading when necessary.

2 MITEXTEXPLORER: linked brushing for text and covariate correlations

The analysis tool presented here, MITEXTEXPLORER, is designed for exploratory analysis of relationships between document covariates—such as time, space, or author community—against textual variables—words, or other units of meaning, that can be counted per document. Unlike topic model approaches to analyzing covariate-text relationships (Mimno, 2012; Roberts et al., 2013), there is no dimension reduction of the terms. Instead, interactivity allows a user to explore more of the high-dimensional space, by specifying a document selection (Q) and/or a term selection (T). We are inspired by the linking and brushing family of techniques in interactive data visualization, in which an analyst can select a group of data points under a query in one covariate space, and see the same data selection in a different covariate space (Figure 3; see Buja et al. (1996), and e.g. Becker and Cleveland (1987); Buja et al. (1991); Martin and Ward (1995); Cook and Swayne (2007)). In our case, one of the variables is text.

The interface consists of several linked views, which contain:

(A) a view of the documents in a two-dimensional covariate space (e.g. scatterplot),

(B) an optional list of pinned terms,

(C) document-associated terms: a view of the relatively most frequent terms for the current document selection,

(D) term-associated terms: a view of terms that relatively frequently co-occur with the current term selection; and

(E) a keyword-in-context (KWIC) display of textual passages for the current term selection.

Figure 1 shows the interface viewing a corpus of 201,647 geo-located Twitter messages from 2,000 users during 2009–2012, which have been tagged with their author’s spatial coordinates through a mobile phone client and posted publicly; for data analysis, their texts have been lowercased and tokenized appropriately (Owoputi et al., 2013; O’Connor et al., 2010). Since this type of corpus contains casual, everyday language, it is a dataset that may illuminate geographic patterns of slang and lexical variation in local dialects (Eisenstein et al., 2012, 2010).

The document covariate display (A) uses (longitude, latitude) positions as the 2D space. The corpus has been preprocessed to define a document as the concatenation of messages from a single author, with its position the average location of the author’s messages. When the interface loads, all points in (A) are initially gray, and all other panels are blank.
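A sketch of this preprocessing step (the tuple schema here is hypothetical, not the tool’s actual format):

```python
from collections import defaultdict

def aggregate_by_author(tweets):
    """Collapse messages into one 'document' per author: the concatenated
    text, positioned at the author's average (longitude, latitude)."""
    by_author = defaultdict(list)
    for t in tweets:  # each tweet: (author, lon, lat, text) -- hypothetical schema
        by_author[t[0]].append(t)
    docs = {}
    for author, ts in by_author.items():
        lon = sum(t[1] for t in ts) / len(ts)
        lat = sum(t[2] for t in ts) / len(ts)
        text = " ".join(t[3] for t in ts)
        docs[author] = (lon, lat, text)
    return docs

tweets = [("a", -122.4, 37.8, "hella foggy"), ("a", -122.2, 37.7, "bart delayed"),
          ("b", -118.2, 34.1, "sunny in la")]
print(aggregate_by_author(tweets)["a"])  # author "a" at the averaged position
```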

2.1 Covariate-driven queries

A core interaction, brushing, consists of using the mouse to select a rectangle in the (x,y) covariate space. Figure 1 shows a selection around the Bay Area metropolitan area (blue rectangle). Upon selection, the document-driven term display (C) is updated to show the relatively most frequent terms in the document selection. Let Q denote the set of documents that are selected by the current covariate query. The tool ranks terms w by their (exponentiated) pointwise mutual information, a.k.a. lift, for Q:

    lift(w; Q) = p(w|Q) / p(w)  ( = p(w, Q) / [p(w) p(Q)] )    (1)

This quantity measures how much more frequent the term is in the queryset, compared to the baseline global probability in the corpus (p(w)). Probabilities are calculated with simple MLE relative frequencies, i.e.

    p(w|Q) / p(w) = [ Σ_{d∈Q} n_dw / Σ_{d∈Q} n_d ] / [ n_w / N ]    (2)

where d denotes a document ID, n_dw the count of word w in document d, and N the number of tokens in the corpus. PMI gives results that are much more interesting than results from ranking w on raw probability within the query set (p(w|Q)), since that simply shows grammatical function words or other terms that are common both in the queryset and across the corpus, and not distinctive for the queryset.¹

A well-known weakness of PMI is over-emphasis on rare terms; terms that appear only in the queryset, even if they appear only once, will attain the highest PMI value. One way to address this is through a smoothing prior / pseudocounts / regularization, or through statistical significance ranking (see §3). For simplicity, we use a minimum frequency threshold filter. The user interface allows minimums for either local or global term frequencies, and to easily adjust them, which naturally shifts the emphasis between specific and generic language. All methods to protect against rare probabilistic events necessarily involve such a tradeoff parameter that the user ought to experiment with; given this situation, we might prefer a transparent mechanism instead of mathematical priors (though see also §3).

¹ The term “lift” is used in business applications (Provost and Fawcett, 2013), while PMI has been used in many NLP applications to measure word associations.
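As a sketch (not the tool’s actual implementation), the lift ranking with the minimum-count filter can be computed directly from sufficient count statistics; the data below is a toy example:

```python
from collections import Counter

def rank_by_lift(doc_counts, query_ids, min_count=2):
    """Rank terms w by exponentiated PMI (lift) for a document selection Q:
    lift(w; Q) = p(w|Q) / p(w), using MLE relative frequencies (Eq. 2)."""
    global_counts = Counter()
    for c in doc_counts.values():
        global_counts.update(c)
    N = sum(global_counts.values())       # total tokens in the corpus
    q_counts = Counter()
    for d in query_ids:
        q_counts.update(doc_counts[d])
    nQ = sum(q_counts.values())           # tokens in the query selection
    scored = []
    for w, nqw in q_counts.items():
        if nqw < min_count:
            continue  # frequency threshold against PMI's rare-term bias
        lift = (nqw / nQ) / (global_counts[w] / N)
        scored.append((lift, w))
    return sorted(scored, reverse=True)

docs = {
    "d1": Counter({"hella": 3, "the": 5}),
    "d2": Counter({"hella": 2, "the": 6, "beach": 2}),
    "d3": Counter({"the": 7, "snow": 3}),
    "d4": Counter({"the": 6, "snow": 2}),
}
print(rank_by_lift(docs, {"d1", "d2"}))  # "hella" ranks above the generic "the"
```

Note that since exponentiated PMI equals p(Q|w)/p(Q), sorting by lift gives the same order as sorting by p(Q|w).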

Figure 1 shows that hella is the highest ranked term for this spatial selection (and frequency threshold), occurring 7.8 times more frequently compared to the overall corpus; this comports with surveyed intuitions of Californian English speakers (Bucholtz et al., 2007). For full transparency to the user, the local and global term counts are shown in the table. (Since hella occurred 18 times in the queryset and 90 times globally, this implies the simple conditional probability p(Q|w) = 18/90; and indeed, ranking on p(Q|w) is equivalent to ranking on PMI, since exponentiated PMI is p(Q|w)/p(Q).) The user can also sort by local count to see the raw most-frequent term report for the document selection. As the user reshapes the query box, or drags it around the space, the terms in panel (C) are updated.

Not shown are options to change the term frequency representation. For exposition here, probabilities are formulated as counts of tokens, but this can be problematic for social media data, since a single user might use a term a very large number of times. The above analysis is conducted with an indicator representation of terms per user, so all frequencies refer to the probability that a user uses the term at least once. However, the other examples in this paper use token-level frequencies, which seem to work fine. It is an interesting statistical analysis question how to derive a single range of methods to work across these situations.

2.2 Term selection and KWIC views

Terms in the table (C) can be clicked and selected, forming a term selection as a set of terms T. This action drives several additional views:

(A) documents containing the term are highlighted in the document covariate display (here, in red),

(E) examples of the term’s usage, in Keyword-in-Context style with vertical alignment for the query term; and

(D) other terms that frequently co-occur with T (§2.3).

Figure 4: KWIC examples of “la” usage in tweets selected in Figure 1.

The KWIC report in (E) shows examples of the term’s usage. For example, why is the term “la” in the PMI list? My initial thought was that this was an example of “LA”, short for “Los Angeles”. But clicking on “la” instantly disproves this hypothesis—Figure 4 shows the Los Angeles sense, but also the “la la la” sense, as well as the Spanish function word.

The KWIC alignment makes it easier to rapidly browse examples, and to make a rough assessment of their word sense or how they are used. Figure 5 compares how the term “God” is used by U.S. presidents Ronald Reagan and Barack Obama, in a corpus of State of the Union speeches, from two different displays of the tool. The predominant usage is the invocation of “God bless America” or similar, nearly ornamental, expressions, but Reagan also has substantive usages, such as references to the role of religion in schools. The vertical alignment of the right-side context words makes it easy to see the “God bless” word sense. I initially found this example simply by browsing the covariate space, and noticing “god” as a frequent term for Reagan, though still occurring for other presidents; the KWIC drilldown better illuminated these distinctions, and suggests differences in political ideologies between the presidents.
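A minimal sketch of such a vertically aligned KWIC report (token matching here is naive whitespace splitting, unlike the paper’s tokenizers):

```python
def kwic(docs, term, width=20):
    """Return keyword-in-context lines with the query term vertically aligned:
    left context right-justified to a fixed width, then the term, then right context."""
    lines = []
    for text in docs:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok.lower() == term:
                left = " ".join(tokens[:i])[-width:]
                right = " ".join(tokens[i + 1:])[:width]
                lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

docs = ["may God bless the United States",
        "in God we trust",
        "God bless you all"]
for line in kwic(docs, "god"):
    print(line)  # every "God" occurrence starts at the same column
```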

In lots of exploratory text analysis work, especially in the topic modeling literature, it is common to look at word lists produced by a statistical analysis method and think about what they might mean. At least in my experience doing this, I’ve often found that seeing examples of words in context has disproved my initial intuitions. Hopefully, supporting this activity in an interactive user interface might make exploratory analysis more effective. Currently, the interface simply shows a sample of in-context usages from the document queryset; it would be interesting to perform grouping and stratified sampling based on local contextual statistics. Summarizing local context by frequencies could be done as a trie visualization (Wattenberg and Viegas, 2008); see §5.

2.3 Term-association queries

When a term is selected, its interaction with covariates is shown by highlighting documents in (A) that contain the term. This can be thought of as another document query: instead of being specified as a region in the covariate space, it is specified as a fragment of the discrete lexical space. As illustrated in much previous work (e.g. Church and Hanks (1990); Turney (2001, 2002)), word-to-word PMI scores can find other terms with similar meanings, or having interesting semantic relationships, to the target term.²

This panel ranks terms u by their association with the query term v. The simplest method is to analyze the relative frequencies of terms in documents that contain v,

    bool-tt-epmi(u, v) = p(w_i = u | v ∈ supp(d_i)) / p(w_i = u)

Here, the subscript i denotes a token position in the entire corpus, for which there is a wordtype w_i and a document ID d_i. In this notation, the covariate PMI in §2.1 would be p(w_i = u | d_i ∈ Q) / p(w_i = u). supp(d_i) denotes the set of terms that occur at least once in document d_i.

This measure is a very simple extension of the document covariate selection mechanism, and easy to understand. However, it is less satisfying for longer documents, since a larger number of occurrences of v do not lead to a stronger association score. A possible extension is to consider the joint random event of selecting two tokens i and j in the corpus, and consider if the two tokens being in the same document is informative for whether the tokens are the words (u, v), i.e.

² For finding terms with similar semantic meaning, distributional similarity may be more appropriate (Turney and Pantel, 2010); this could be interesting to incorporate into the software.


Figure 5: KWIC examples of “God” in speeches by Reagan versus Obama.

    PMI[(w_i, w_j) = (u, v); d_i = d_j],

    freq-tt-epmi(u, v) = p(w_i = u, w_j = v | d_i = d_j) / p(w_i = u, w_j = v)

In terms of word counts, this expression has the form

    freq-tt-epmi(u, v) = [ Σ_d n_du n_dv / (n_u n_v) ] · [ N² / Σ_d n_d² ]

The right-side term is a normalizing constant invariant to u and v. The left-side term is interesting: it can be viewed as a similarity measure, where the numerator is the inner product of the inverted term-document vectors n_{·,u} and n_{·,v}, and the denominator is the product of their ℓ1 norms. This is a very similar form to cosine similarity, which is another normalized inner product, except its denominator is the product of the vectors’ ℓ2 norms.

Term-to-term associations allow a navigation of the term space, complementing the views of terms driven by document covariates. This part of the tool is still at a more preliminary stage of development. One important enhancement would be adjustment of the context window size allowed for co-occurrences; the formulations above assume a context window the size of the document. Medium-sized context windows might capture more focused topical content, especially in very long discourses such as speeches; and the smallest context windows, of size 1, should be more like collocation detection (though see §3; this is arguably better done with significance tests, not PMI).

2.4 Pinned terms

The term PMI views of (C) and (D) are very dynamic, which can cause interesting terms to disappear when their supporting query is changed. It is often useful to select terms to be constantly viewed when the document covariate queries change.

Any term can be double-clicked to be moved to the table of pinned terms (B). The set of terms here does not change as the covariate query is changed; a user can fix a set of terms and see how their PMI scores change while looking at different parts of the covariate space. One possible use of term pinning is to manually build up clusters of terms—for example, topical or synonymous term sets—whose aggregate statistical behavior (i.e. as a disjunctive query) may be interesting to observe. Manually built sets of keywords are a very useful form of text analysis; in fact, the WordSeer corpus analysis tool has explicit support to help users create them (Shrikumar, 2013).

3 Statistical term association measures

There exist many measures of the statistical strength of an association between a term and a document covariate, or between two terms. A number of methods are based on significance testing, looking for violations of a null hypothesis that term frequencies are independent. For collocation detection, which aims to find meaningful non-compositional lexical items through frequencies of neighboring words, likelihood ratio (Dunning, 1993) and chi-square tests have been used (see review in Manning and Schütze (1999)). For term-covariate associations, chi-square tests were used by Gentzkow and Shapiro (2010) to find politically loaded phrases often used by members of one political party; this same method is often used as a feature selection method for supervised learning (Guyon and Elisseeff, 2003).

The approach we take here is somewhat different, being a point estimate approach, analyzing the estimated difference (and giving poor results when counts are small). Some related work for topic model analysis, looking at statistical associations between words and latent topics (as opposed to between words and observed covariates in this work) includes Chuang et al. (2012b), whose term saliency function measures one word’s associations against all topics; a salient term tends to have most of its probability mass in a small set of topics. The measure is a form of mutual information,³ and may be useful for our purposes here if the user wishes to see a report of distinctive terms for a group of several different observed covariate values at once. Blei and Lafferty (2009) rank words per topic by a measure inspired by TFIDF, which like PMI downweights words that are generically common across all topics.

Finally, hierarchical priors and regularizers can also be used; for example, by penalizing the log-odds parameterization of term probabilities (Eisenstein et al., 2011; Taddy, 2013). These methods are better in that they incorporate both protection against small count situations, while paying attention to effect size, as well as allowing overlapping covariates and regression control variables; but unfortunately, they are more computationally intensive, as opposed to the above measures which all work directly from sufficient count statistics. An association measure that fulfilled all these desiderata would be very useful. For term-covariate analysis, Monroe et al. (2008) contains a review of many different methods, from both political science as well as computer science; they also propose a hierarchical prior method, and to rank by statistical significance via the asymptotic standard error of the terms’ odds ratios. Given the large amount of previous work using the significance approach, it merits further exploration for this system.

³ This is apparent as follows, using notation from their section 3.1:

    saliency(w) = p(w) Σ_T p(T|w) log[ p(T|w) / p(T) ]
                = Σ_T p(w, T) log[ p(w, T) / (p(w) p(T)) ]

This might be called a “half-pointwise” mutual information: between a specific word w and the topic random variable T. Mutual information is Σ_w saliency(w).
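The footnote’s closing identity — that saliency sums over words to the mutual information between word and topic — can be checked numerically; a sketch over an arbitrary small joint distribution p(w, T):

```python
import numpy as np

# A small joint distribution p(w, T): rows = words, columns = topics.
pwt = np.array([[0.20, 0.05],
                [0.05, 0.30],
                [0.20, 0.20]])
pw = pwt.sum(axis=1)  # marginal p(w)
pt = pwt.sum(axis=0)  # marginal p(T)

# saliency(w) = p(w) * sum_T p(T|w) log[ p(T|w) / p(T) ]
pt_given_w = pwt / pw[:, None]
saliency = pw * (pt_given_w * np.log(pt_given_w / pt)).sum(axis=1)

# Mutual information I(W; T) = sum_{w,T} p(w,T) log[ p(w,T) / (p(w) p(T)) ]
mi = (pwt * np.log(pwt / (pw[:, None] * pt))).sum()
# saliency.sum() equals mi, as the footnote states.
```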

4 Phrase selection

The simplest approach to defining the terms is to use all words (unigrams). This can be insightful, but single words are both too coarse and too narrow a unit of analysis. They can be too narrow when there are multiple ways of saying the same thing, such as synonyms—for example, while we have evidence about differing usages of the term “god” in presidential rhetoric, in order to make a claim about religious themes, we might need to find other terms such as “creator”, “higher power”, etc. Another problematic case is alternate names or anaphoric references to an entity. In general, any NLP tool that extracts interesting discrete variable indicators of word meaning could be used for mutual information and covariate exploratory analysis—for example, a coreference system’s entity ID predictions could be browsed by the system as the term variables. (More complex concepts, of course, would also require more UI support.)

At the same time, words can be too coarse compared to the longer phrases they are contained within, which often contain more interesting and distinctive concepts: for example, “death tax” and “social security” are important concepts in U.S. politics that get missed under a unigram analysis. In fact, Sim et al. (2013)’s analysis of U.S. politicians’ speeches found that domain experts had a hard time understanding unigrams out-of-context, but bigrams and trigrams worked much better; Gentzkow and Shapiro (2010) similarly focus on partisan political phrases.

Figure 6: MITEXTEXPLORER for paper titles in the ACL Anthology (Radev et al., 2009). Y-axis is venue (conference or journal name), X-axis is year of publication. Unlike the other figures, docvar-associated terms are sorted alphabetically.

Figure 7: MITEXTEXPLORER for the King James Bible. Y-axis is book, X-axis is chapter (truncated to 39).

It sometimes works to simply add overlapping n-grams as more terms, but sometimes odd phrases get selected that cross constituent boundaries from their source sentences, and are thus not totally meaningful. I’ve experimented with a very strong filtering approach to phrase selection: besides using all unigrams, take all n-grams up to length 5 that have nominal part-of-speech patterns: either the sequence consists of zero or more adjectives followed by one or more noun tokens, or all tokens were classified as names by a named entity recognition system.⁴ This tends to yield (partial) constituents, and nouns tend to be more interesting than other content words (perhaps because they are relatively less reliant on predicate-argument structure to express their semantics—as opposed to adjectives or verbs, say—and a bag-of-terms analysis does not allow expression of argument structure). However, for many corpora, POS or NER taggers work poorly—for example, I’ve seen paper titles from the ACL Anthology have capitalized prepositions tagged as names—so simpler stopword heuristics are necessary.

⁴ For traditional text, the tool currently uses Stanford CoreNLP; for Twitter, CMU ARK TweetNLP.
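The nominal-pattern filter can be sketched as a check over POS tag sequences (Penn-style tag names assumed here for illustration; the paper’s actual taggers are Stanford CoreNLP and ARK TweetNLP):

```python
def is_nominal_phrase(tags):
    """True iff the tag sequence (length 1-5) is zero or more adjectives
    followed by one or more noun tokens, or consists entirely of name tags."""
    if 1 <= len(tags) <= 5:
        if all(t == "NNP" for t in tags):        # all tokens tagged as names
            return True
        i = 0
        while i < len(tags) and tags[i] == "JJ": # skip leading adjectives
            i += 1
        # remainder must be one or more noun tags (NN, NNS, NNP, ...)
        return i < len(tags) and all(t.startswith("NN") for t in tags[i:])
    return False

print(is_nominal_phrase(["JJ", "NN", "NN"]))  # e.g. "social security tax" -> True
print(is_nominal_phrase(["NN", "IN", "NN"]))  # e.g. "war on terror" -> False
```

The second example shows the limitation discussed below: NP-PP constructs like “war on terror” are rejected by this pattern and would need a real noun phrase recognizer.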

The phrase selection approach could be improved in many ways; for example, a real noun phrase recognizer could get important (NP PP) constructs like “war on terror.” Furthermore, Chuang et al. (2012a) find that while these sorts of syntactic features are helpful in choosing useful keyphrases to summarize scientific abstracts, it is also very useful to add in collocation detection scores. Similarly to the PMI calculations used here, likelihood ratio or chi-square collocation detection statistics are also very rapid to compute and may benefit from interactive adjustment of decision thresholds. More generally, any type of lexicalized linguistic structures could potentially be used, such as dependency paths or constituents from a syntactic parser, or predicate-argument structures from a semantic parser. Linguistic structures extracted from more sophisticated NLP tools may indeed be better-generalized units of linguistic meaning compared to words and phrases, but they will still bear the same high-dimensionality issues for data analysis purposes.

5 Related work: Exploratory text analysis

Many systems and techniques have been developed for interactive text analysis. Two such systems, WordSeer and Jigsaw, have been under development for several years, each having had a series of user experiments and feedback. Recent and interesting review papers and theses are available for both of them.

The WordSeer system (Shrikumar, 2013)⁵ contains many different interactive text visualization tools, including syntax-based search, and was initially designed for the needs of text analysis in the humanities; the WordSeer 3.0 system includes a word frequency analysis component that can compare word frequencies along document covariates. Interestingly, Shrikumar found in user studies with literary experts that data comparisons and annotation/note-taking support were very important capabilities to add to the system. Unique to the work in this paper is the emphasis on conditioning on document covariates to analyze relative word frequencies, and encouraging the user to change the statistical parameters that govern text correlation measurements. (The term pinning and term-to-term association techniques are certainly less developed than previous work.)

⁵ http://wordseer.berkeley.edu/

Another text analysis system is Jigsaw (Gorg et al., 2013),6 originally developed for investigative analysis (as in law enforcement or intelligence), which again has many features. It emphasizes visualizations based on entity extractions, such as for names, places, and dates. Gorg et al. note that errors in entity extraction were a major problem for users; this may be an argument for first getting something to work with simple words/phrases before tackling more complex units of meaning. A section of the review paper is entitled "Reading the documents still matters", pointing out that analysts did not want just to visualize high-level relationships, but also wanted to read documents in context; this capability was added to later versions of Jigsaw, and it supports the emphasis here on the KWIC display.

Both these systems also use variants of Wattenberg and Viegas (2008)'s word tree visualization, which presents sequential word frequencies as a tree (i.e., what computational linguists might call a trie representation of a high-order Markov model). The "God bless" word sense example from §2 indicates that such statistical summarization of local contextual information may be useful to integrate; it is worth considering how to reconcile this with the important need for document covariate analysis, while being efficient with the use of space.
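As a rough illustration of the trie idea underlying word trees (a sketch, not Wattenberg and Viegas's implementation; sentence data and function name are made up), one can count every phrase continuing a chosen root term:

```python
from collections import defaultdict

def word_tree_counts(sentences, root):
    """Count every phrase that begins with the root term: a trie
    over word sequences, i.e. a high-order Markov frequency table
    of the kind a word tree visualization displays."""
    counts = defaultdict(int)
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w != root:
                continue
            # increment every prefix of the continuation after root
            for j in range(i + 1, len(words) + 1):
                counts[tuple(words[i:j])] += 1
    return counts

sents = ["God bless you", "God bless America", "God bless America again"]
tree = word_tree_counts(sents, "god")
print(tree[("god", "bless")])             # 3
print(tree[("god", "bless", "america")])  # 2
```

Each key is a path from the root of the tree; the branching of counts at each depth is exactly what the visualization lays out spatially.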

Many other systems, especially ones designed for literary content analysis, emphasize concordances and keyword searches within a text; for example, Voyeur/Voyant (Rockwell et al., 2010),7 which also features some document covariate analysis through temporal trend analyses for individual terms. Another class of approaches emphasizes the use of document clustering or topic models (Gardner et al., 2010; Newman et al., 2010; Grimmer and King, 2011; Chaney and Blei, 2013), while Overview8 emphasizes hierarchical document clustering paired with manual tagging.

6http://www.cc.gatech.edu/gvu/ii/jigsaw/

7http://voyant-tools.org/, http://hermeneuti.ca/voyeur

Finally, considerable research has examined exploratory visual interfaces for information retrieval, in which a user specifies an information need in order to find relevant documents or passages from a corpus (Hearst (2009), Ch. 10). Information retrieval problems have some similarities to text-as-data analysis in the need for an exploratory process of iterative refinement, but the text-as-data perspective differs in that it requires an analyst to understand content and contextual factors across multiple or many documents.

6 Future work

The current MITEXTEXPLORER system is an extremely simple prototype to explore what sorts of "bare words" text-and-covariates analyses are possible. Several major changes will be necessary for more serious use.

First, essential basic capabilities must be added, such as a search box the user can use to search and filter the term list.

Second, the document covariate display needs to support more than just scatterplots. When there are hundreds or more documents, summarization is necessary in the form of histograms, kernel density plots, or other tools. For example, for a large corpus of documents over time, a lineplot or temporal histogram is more appropriate, where each timestep has a document count. The ACL Anthology scatterplot (Figure 6, Radev et al. (2009)), which has hundreds of overplotted points at each (year, venue) position, makes clear the limitations of the current approach.
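Such a temporal histogram reduces to binning documents by a time covariate and counting, per bin, both the total documents and those containing a selected term. A minimal sketch, with made-up data and names:

```python
from collections import Counter

def temporal_histogram(doc_years, term_doc_ids):
    """Bin documents by a time covariate, returning per-year pairs of
    (documents containing the selected term, total documents) --
    the data behind a temporal histogram or line plot."""
    totals = Counter(doc_years)
    with_term = Counter(year for doc_id, year in enumerate(doc_years)
                        if doc_id in term_doc_ids)
    # Counter returns 0 for years where the term never appears
    return {year: (with_term[year], totals[year]) for year in sorted(totals)}

# hypothetical corpus: one publication year per document
years = [2008, 2008, 2009, 2009, 2009, 2010]
term_docs = {1, 3, 4}  # ids of documents containing the selected term
hist = temporal_histogram(years, term_docs)
print(hist)  # {2008: (1, 2), 2009: (2, 3), 2010: (0, 1)}
```

Returning both counts per bin keeps relative frequency available, which matters when document volume itself varies over time.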

Better visual feedback for term selections here could be useful; for example, sizing document points monotonically with the term's frequency (rather than just presence/absence), or using stacked line plots. Certain visual depictions of frequency may be difficult, however, given the Zipfian distribution of word frequencies.

Furthermore, document structures may be thought of as document covariates. A single book has interesting internal variation that could be analyzed itself. Figure 7 shows the King James Bible, which has a hierarchical structure of book, chapter, and verse. Here, the (y,x) coordinates represent books and chapters. A more specialized display for book-level structures, or other discourse structures, may be appropriate for book-length texts.

8https://www.overviewproject.org/, http://overview.ap.org/

Finally, a major goal of this work is to use analysis methods that can be computed on the fly, but the current prototype only works with small datasets. Hierarchical spatial indexing techniques (e.g., r-trees) may make it possible to interactively compute sums for covariate PMI scoring over very large numbers of documents. Text indexing is also important for term-driven queries and KWIC views. Techniques from ad-hoc data querying systems may be necessary for further scale (e.g., Melnik et al. (2010)).
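For the special case of a fixed spatial grid, precomputed 2D prefix sums already give constant-time term-count queries over any selected rectangle; the following sketch (a simpler stand-in for r-tree indexing, with a made-up grid of term counts) illustrates the idea:

```python
def build_prefix_sums(grid):
    """Build a 2D prefix-sum table so any rectangular region's term
    count can be recovered in O(1), instead of re-summing documents
    on every brushing selection."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ps[r + 1][c + 1] = (grid[r][c] + ps[r][c + 1]
                                + ps[r + 1][c] - ps[r][c])
    return ps

def region_sum(ps, r0, c0, r1, c1):
    """Total count in rows r0..r1, cols c0..c1 (inclusive)."""
    return ps[r1 + 1][c1 + 1] - ps[r0][c1 + 1] - ps[r1 + 1][c0] + ps[r0][c0]

# hypothetical grid: term occurrence counts per spatial cell
grid = [[1, 2, 0],
        [0, 3, 1],
        [4, 0, 2]]
ps = build_prefix_sums(grid)
print(region_sum(ps, 0, 0, 1, 1))  # 1+2+0+3 = 6
```

Per-term tables like this would make the numerator counts for PMI scoring of a brushed region cheap; r-trees generalize the same idea to irregular point sets.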

Many other directions are possible. The prototype tool, as described in §2, will be available as open-source software at: http://brenocon.com/MiTextExplorer. It is a desktop application written in Java.

Acknowledgments

Thanks to Michael Heilman and Bryan Routledge for many discussions and creative text analysis scripts that inspired this work. Thanks also to the anonymous reviewers for very helpful feedback. This research was supported in part by NSF grant IIS-1211277 and CAREER grant IIS-1054319.

References

Francis J. Anscombe. Graphs in statistical analysis. The American Statistician, 27(1):17–21, 1973.

David Bamman, Brendan O'Connor, and Noah A. Smith. Censorship and deletion practices in Chinese social media. First Monday, 17(3), 2012.

Richard A. Becker and John M. Chambers. S: An interactive environment for data analysis and graphics. CRC Press, 1984.

Richard A. Becker and William S. Cleveland. Brushing scatterplots. Technometrics, 29(2):127–142, 1987.

David M. Blei and John D. Lafferty. Topic models. Text mining: classification, clustering, and applications, 10:71, 2009.

Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3: Data-driven documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011. URL http://vis.stanford.edu/papers/d3.

Mary Bucholtz, Nancy Bermudez, Victor Fung, Lisa Edwards, and Rosalva Vargas. Hella Nor Cal or totally So Cal? The perceptual dialectology of California. Journal of English Linguistics, 35(4):325–352, 2007. URL http://people.duke.edu/~eec10/hellanorcal.pdf.

Andreas Buja, John Alan McDonald, John Michalak, and Werner Stuetzle. Interactive data visualization using focusing and linking. In Proceedings of IEEE Visualization '91, pages 156–163. IEEE, 1991.

Andreas Buja, Dianne Cook, and Deborah F. Swayne. Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics, 5(1):78–99, 1996.

Allison J.B. Chaney and David M. Blei. Visualizing topic models. In Proceedings of ICWSM, 2013.

Jason Chuang, Christopher D. Manning, and Jeffrey Heer. "Without the clutter of unimportant words": Descriptive keyphrases for text visualization. ACM Trans. on Computer-Human Interaction, 19:1–29, 2012a. URL http://vis.stanford.edu/papers/keyphrases.

Jason Chuang, Christopher D. Manning, and Jeffrey Heer. Termite: Visualization techniques for assessing textual topic models. In Advanced Visual Interfaces, 2012b. URL http://vis.stanford.edu/papers/termite.

K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

William S. Cleveland. Visualizing data. Hobart Press, 1993.

Dianne Cook and Deborah F. Swayne. Interactive and dynamic graphics for data analysis: with R and GGobi. Springer, 2007.

Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962.

J. Eisenstein, A. Ahmed, and E.P. Xing. Sparse additive generative models of text. In Proceedings of ICML, pages 1041–1048, 2011.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277–1287, 2010.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. Mapping the geographical diffusion of new words. In NIPS Workshop on Social Network and Social Media Analysis, 2012. URL http://arxiv.org/abs/1210.5268.

M.J. Gardner, J. Lutes, J. Lund, J. Hansen, D. Walker, E. Ringger, and K. Seppi. The topic browser: An interactive tool for browsing topic models. In NIPS Workshop on Challenges of Data Visualization. MIT Press, 2010.

Matthew Gentzkow and Jesse M. Shapiro. What drives media slant? Evidence from U.S. daily newspapers. Econometrica, 78(1):35–71, 2010.

Carsten Gorg, Zhicheng Liu, and John Stasko. Reflections on the evolution of the Jigsaw visual analytics system. Information Visualization, 2013.

Justin Grimmer and Gary King. General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences, 108(7):2643–2650, 2011.

Justin Grimmer and Brandon M. Stewart. Text as Data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3):267–297, 2013. URL http://www.stanford.edu/~jgrimmer/tad2.pdf.

Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

Marti Hearst. Search user interfaces. Cambridge University Press, 2009.

Matthew L. Jockers. Macroanalysis: Digital methods and literary history. University of Illinois Press, 2013.

Gary King, Jennifer Pan, and Margaret E. Roberts. How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107:1–18, 2013.


Christopher D. Manning and Hinrich Schutze. Foundations of statistical natural language processing. MIT Press, 1999.

Allen R. Martin and Matthew O. Ward. High dimensional brushing for interactive exploration of multivariate data. In Proceedings of the 6th Conference on Visualization '95, page 271. IEEE Computer Society, 1995.

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.

David Mimno. Topic regression. PhD thesis, University of Massachusetts Amherst, 2012.

B. L. Monroe, M. P. Colaresi, and K. M. Quinn. Fightin' Words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372, 2008.

D. Newman, T. Baldwin, L. Cavedon, E. Huang, S. Karimi, D. Martinez, F. Scholer, and J. Zobel. Visualizing search results and document collections using topic maps. Web Semantics: Science, Services and Agents on the World Wide Web, 8(2):169–175, 2010.

Brendan O'Connor. Statistical Text Analysis for Social Science. PhD thesis, Carnegie Mellon University, 2014.

Brendan O'Connor, Michel Krieger, and David Ahn. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010.

Brendan O'Connor, David Bamman, and Noah A. Smith. Computational text analysis for social science: Model assumptions and complexity. In Second Workshop on Computational Social Science and the Wisdom of Crowds (NIPS 2011), 2011.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL, 2013.

Foster Provost and Tom Fawcett. Data Science for Business. O'Reilly Media, 2013.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/. ISBN 3-900051-07-0.

Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. The ACL Anthology network corpus. In Proc. of ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries, 2009.

Margaret E. Roberts, Brandon M. Stewart, and Edoardo M. Airoldi. Structural topic models. Working paper, 2013. URL http://scholar.harvard.edu/bstewart/publications/structural-topic-models.

Geoffrey Rockwell, Stefan G. Sinclair, Stan Ruecker, and Peter Organisciak. Ubiquitous text analysis. paj: The Journal of the Initiative for Digital Humanities, Media, and Culture, 2(1), 2010.

Ryan Shaw. Text-mining as a research tool, 2012. URL http://aeshin.org/textmining/.

Aditi Shrikumar. Designing an Exploratory Text Analysis Tool for Humanities and Social Sciences Research. PhD thesis, University of California at Berkeley, 2013.

Yanchuan Sim, Brice Acree, Justin H. Gross, and Noah A. Smith. Measuring ideological proportions in political speeches. In Proceedings of EMNLP, 2013.

Matt Taddy. Multinomial inverse regression for text analysis. Journal of the American Statistical Association, 108(503):755–770, 2013.

John W. Tukey. Exploratory data analysis. 1977.

P. D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424, 2002.

P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188, 2010. ISSN 1076-9757.

Peter Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning, 2001. URL http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=5765594.

Martin Wattenberg and Fernanda B. Viegas. The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics, 14(6):1221–1228, 2008.

Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1):3–28, 2010. doi: 10.1198/jcgs.2009.07098.

Leland Wilkinson. The grammar of graphics. Springer, 2006.
