eric e monson, text->data 08 nov 2012

Post on 03-Jul-2015


Visualizing Text: Tools & Techniques

Eric E Monson, PhD (Duke VTG)

Katherine de Vos Devine (Duke AAHVS)

8 Nov 2012

Why do we visualize?

To reveal patterns: clusters, trends, gaps & outliers

Anscombe’s Quartet

      I              II             III             IV
   x      y       x      y       x      y       x      y
  10   8.04      10   9.14      10   7.46       8   6.58
   8   6.95       8   8.14       8   6.77       8   5.76
  13   7.58      13   8.74      13  12.74       8   7.71
   9   8.81       9   8.77       9   7.11       8   8.84
  11   8.33      11   9.26      11   7.81       8   8.47
  14   9.96      14   8.10      14   8.84       8   7.04
   6   7.24       6   6.13       6   6.08       8   5.25
   4   4.26       4   3.10       4   5.39      19  12.50
  12  10.84      12   9.13      12   8.15       8   5.56
   7   4.82       7   7.26       7   6.42       8   7.91
   5   5.68       5   4.74       5   5.73       8   6.89

Summary statistics (nearly identical across the four data sets):

  Mean of x        9
  Variance of x    11
  Mean of y        7.50
  Variance of y    4.122 or 4.127
  XY correlation   0.816
  Linear fit       y = 3.00 + 0.500x

Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician 27 (1): 17–21.
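The summary statistics can be checked directly from the table. The sketch below is not from the talk; it recomputes them in plain Python from the values listed above. The names `QUARTET` and `summarize` are illustrative only.

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the table above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I":   (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return sample statistics and the least-squares line for one data set."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx, "var_x": variance(xs),
        "mean_y": my, "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree to two or three significant figures, which is the point of the example: the numbers alone do not distinguish the data sets, while a plot does.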

[Figure: scatter plots of data sets I, II, III and IV, drawn on identical axes (x from 4 to 18, y from 4 to 12)]

modified from http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg

Form of presentation can reveal patterns

Cleveland (1994)

Why do we visualize?

Efficient human visual system

“Preattentive” cues – fast!

Chris Healey (NC State) web examples – http://www.csc.ncsu.edu/faculty/healey/PP/

Why do we visualize?

Communication & Exploration

Vis for Communication & Exploration

• Medium – Print, slide show, poster, web

• Telling a story – Clear & guided

Text Visualization – Difficult & Fascinating

• Not preattentive

• Difficult to abstract

• Occlusion destroys comprehension

• Context gives meaning

Text Processing (not covering)

Types of visualization

Term counting – Wordle

Term counting + context – NYTimes

Document Comparison – Juxta

Document Comparison – Juxta Commons

Terms in context – PoemViz (Indiana SILS)

Document Network (entities) – Jigsaw Visual Analytics

Topic Visualization – Many Bills (IBM)

Many Eyes – IBM collaborative web vis

• Pros

- Some of the best vis people in the world did the original development

- Wide variety of visualizations, some that don’t exist anywhere else

- Best-practice graphics

- Nice model for crowd vis

• Cons

- Experimental, and not clear that IBM is still supporting it even though usage keeps increasing

Phrase Net – Many Eyes (* and *)

Word Tree – Many Eyes

Research questions driving this work

Katherine de Vos Devine

I am pursuing a JD/PhD in Art History, specializing in twentieth-century and contemporary art and fashion. Broadly, my research focuses on appropriation. Specifically, I focus on practices of avant-garde artists and fashion designers that are characterized as adapting, borrowing, or interpreting from other individuals, cultures, or the past, as well as the ways in which new technologies permit and encourage appropriation.

My dissertation will focus on the regulation and enforcement of appropriation through formal (legal) and informal (social) rules, contrast the market for fine art (high regulation of appropriation) with the market for high-end fashion (low regulation of appropriation), and explore the ways in which regulation of behaviors and technologies distinguishes these markets.

Research questions driving this work

Eric E Monson

Can we build some tools that will help Katherine explore her data more easily and quickly, while allowing her to ask new questions and see new patterns that would not have been possible without the technology?

– What tends to work is a combination of many small tools.

– Focus on prototypes so we can see what is actually useful

Two main pieces we worked on

• Text Archive

- Sources: Google Scholar court decisions for now

- Working archive

- Full-text & faceted searchable DB

• Visualizations

- Web interface

- Words in context

Text vis details: Concordance

In [6]: dec.concordance('piracy', width=120, lines=70)
Displaying 70 of 232 matches:
owledge could not be used without incurring the guilt of piracy of the book .' " Feist Publications , 499 U . S . at 350
cent infringement , noting that "[ o ] pen and unabashed piracy is not a mark of good faith " and that " in [ those ] ci
 reproductions of records or tapes , which is known as " piracy ,"[- 3 -] could be prosecuted or face civil liability fo
e substance of the Sound Recording Act of 1971 [- 3 -] " Piracy ," which refers to an unauthorized duplication of a perf
7 , 3129 n . 2 , 87 L . Ed . 2d 152 ( 1985 ) [- 4 -] See Piracy and Counterfeiting Amendments Act of 1982 , Pub . L . No
e here a conflict of policies : ( a ) that of preventing piracy of copyrighted matter and ( b ) that of enforcing the an
the plaintiff if we deny it relief . As the defendants ' piracy is unmistakably clear , while the plaintiffs ' infractio
th that last conclusion we disagree . Open and unabashed piracy is not a mark of good faith ; and we think the ' claimed
eld found that state laws on trade secrets and recording piracy were not preempted by the Copyright Act . See Kewanee Oi
e the review for it , such a use will be deemed in law a piracy . Id . at 550 ( quoting Folsom v . Marsh , 9 F . Cas . 3
making this factual determination , a layman must detect piracy " without any aid or suggestion or critical analysis by
so to treat the concept of " publication " as to prevent piracy . They tend to bear out Judge Putnam ' s suggestion in L
vely short " period of one year would actually encourage piracy by making it easier for malefactors to evade detection .
 Prostar is a Texas corporation suing for alleged signal piracy conducted in a Louisiana establishment . More generally
 conducted on a national and international scale . Cable piracy consequently differs from many of the cases where courts
nd that application of Louisiana conversion law to cable piracy claims brought under 47 U . S . C . ? ? 553 and 605 woul
sions " in their efforts to investigate and pursue cable piracy . A single federal standard would eliminate these practi
 S . position in trade negotiations with countries where piracy is not uncommon " and " rais [ ing ] the like [ li ] hoo
, Note , A Trade Based Response to Intellectual Property Piracy : A Comprehensive Plan to Aid the Motion Picture Industr
so to treat the concept of ` publication ' as to prevent piracy ." We think the authorities he cites and others warrant
 with more or less colorable alterations to disguise the piracy . Paraphrasing is copying and an infringement , if carri
ord convinces me of Millard ' s transparent and shocking piracy of plaintiff ' s publications . For the wrongful and cle
he trade as ' disklegging ,' ' bootlegging ' or record ' piracy .' Krug sold these records to dealer customers including
lions of dollars in losses suffered as a result of the " piracy and bootlegging " of the industry ' s products . Anderso
both to protect consumers and to prevent tape and record piracy . While tapes and records are doubtless speech , as Ande
e . Disclosure of the manufacturer also protects against piracy . Anderson contends that this latter interest of the sta
tion , which might adequately serve the state ' s anti - piracy interest , would largely defeat its consumer - protectio
ty . The primary purpose of Sec . 653w is to prevent the piracy of the works of these performers and manufacturers ; the
tition for writ of habeas corpus is AFFIRMED . [- 1 -] " Piracy is the term used for unauthorized duplication of origina
rized duplication of original commercial products ." See Piracy and Counterfeiting Amendments Act of 1982 , S . Rep . No
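The call above has the shape of NLTK's `Text.concordance(word, width, lines)`; that `dec` is an NLTK `Text` object is an assumption, since the slide does not say. To show what such a listing does, here is a minimal keyword-in-context sketch using only the standard library. The tokenizer is deliberately crude, and the function name and sample sentence are illustrative, not the project's code.

```python
import re

def concordance(text, word, width=80, lines=25):
    """Return keyword-in-context lines with `word` centered in about `width` characters."""
    tokens = re.findall(r"\w+|[^\w\s]", text)   # crude tokenizer: words and punctuation marks
    half = (width - len(word) - 2) // 2         # characters of context on each side
    target = word.lower()
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        # `half` tokens always cover at least `half` characters of context.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        out.append(f"{left:>{half}} {tok} {right}")
        if len(out) >= lines:
            break
    return out

sample = ("Open and unabashed piracy is not a mark of good faith. "
          "Disclosure of the manufacturer also protects against piracy.")
for line in concordance(sample, "piracy", width=60):
    print(line)
```

Because the keyword always starts at the same column, the eye can scan the left and right contexts independently, which is what makes the listing useful for reading words in context.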

Text vis overview: Word Tree
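A word tree roots at a chosen term and merges the phrases that follow it into a branching structure, so shared continuations appear once with a count. The sketch below is an assumption about how such a tree could be built as nested dictionaries; it is not the Many Eyes or project implementation, it ignores sentence boundaries, and `build_word_tree` and `show` are hypothetical names.

```python
import re

def build_word_tree(text, root, depth=3):
    """Nested dict: each node maps a following word to {'count': n, 'children': {...}}."""
    tokens = re.findall(r"\w+", text.lower())
    tree = {}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        node = tree
        # Walk the next `depth` words, creating or reusing one branch per word.
        for nxt in tokens[i + 1:i + 1 + depth]:
            entry = node.setdefault(nxt, {"count": 0, "children": {}})
            entry["count"] += 1
            node = entry["children"]
    return tree

def show(tree, indent=0):
    """Print the tree, most frequent branches first."""
    for word, entry in sorted(tree.items(), key=lambda kv: -kv[1]["count"]):
        print(" " * indent + f"{word} ({entry['count']})")
        show(entry["children"], indent + 2)

sample = ("Piracy is not a mark of good faith. Piracy is the term used for "
          "unauthorized duplication. Piracy of the book was alleged.")
tree = build_word_tree(sample, "piracy")
show(tree)
```

Pruning, listed under future work, would amount to dropping branches whose count falls below a threshold before drawing.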

Data stages (iterative)

• Gather – Google Scholar scraping (shhh...)

• Parse – HTML content & metadata

• Clean – parsing mistakes & regularization

• Analyze / Transform – topic (subject) modeling

• Visualize – build online prototypes

New tools to learn (project as excuse)

• MongoDB – doc-centered NoSQL database

• Google Refine – data cleaning / regularization

• Apache Solr – Lucene-based search DB / server

• PHP

• D3.js – JavaScript data / DOM / vis library

• (already knew some: Python, BeautifulSoup, Mallet)

‣ So far, prototype with 18k+ copyright & trademark court decisions (1900-2011)

MongoDB – Working DB

• Pros

- Scalable, high-performance, open-source

- No Schema! – Easy!

- JSON – native object / dict in JS & Python

- Indexed queries, rich operators, geospatial

- GridFS for large binary files

- Easy dumps and CSV export

• Cons

- No in-DB joins
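Without in-database joins, references between documents are resolved in application code. The sketch below illustrates one way to do that, assuming each stored decision carries a `filename` whose stem is its case ID and a `referenced` list of cited case IDs. An in-memory list stands in for the collection, the helper names are hypothetical, and the toy records are invented.

```python
def case_id(doc):
    """Derive the case ID from the stored filename, e.g. '123.html' -> '123'."""
    return doc["filename"].rsplit(".", 1)[0]

def resolve_references(doc, collection):
    """Application-side join: look up every document cited in doc['referenced']."""
    by_id = {case_id(d): d for d in collection}
    refs = doc.get("referenced", [])
    found = [by_id[ref] for ref in refs if ref in by_id]
    missing = [ref for ref in refs if ref not in by_id]   # cited but not in the archive
    return found, missing

# Toy stand-in for the docs collection; IDs and names are invented.
docs = [
    {"filename": "100.html", "name": "A v. B", "year": 1941, "referenced": ["200", "300", "999"]},
    {"filename": "200.html", "name": "C v. D", "year": 1930, "referenced": []},
    {"filename": "300.html", "name": "E v. F", "year": 1925, "referenced": ["200"]},
]

found, missing = resolve_references(docs[0], docs)
print([d["name"] for d in found], missing)
```

Against a live database the dictionary lookup would typically become a second query for the cited IDs; the cost of the missing join is that extra round trip.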

MongoDB – Working DB

> db.docs.findOne({referenced:{$size:4}})
{
  "_id" : ObjectId("4f406d8d47b2301618000091"),
  "content" : "\n121 F.2d 575 (1941)\nCORCORAN\r\nv.\r\nCOLUMBIA BROADCASTING SYSTEM, Inc., et al.\n No. 9664.\nCircuit Court of Appeals, Ninth Circuit.\nJune 30, 1941.\nBlase A. Bonpane, of Hollywood, Cal., for appellant.\nFrederick Leuschner and Richard Harper Graham, both of Los Angeles, Cal., for appellee Montgomery Ward & Co.\nBefore DENMAN, MATHEWS, and HEALY, Circuit Judges.\nHEALY, Circuit Judge.\nThe appeal is from a judgment awarding attorneys' fees in a suit for infringement of copyright, the allowance being made under the claimed authority of § 40 of the Copyright Act (Act of March 4, 1909, c. 320, 35 Stats. 1084, 17 U.S.C.A. § 40), providing that the court \"may award to the prevailing party a reasonable attorney's fee as part of [...] a consolidation of two cases.\n\n",
  "court" : "United States Court of Appeals, Ninth Circuit.",
  "court_level" : 4,
  "dates" : { "unlabeled" : ISODate("1941-06-30T00:00:00Z") },
  "docket" : "",
  "docket_url" : "",
  "file_ref" : { "$ref" : "fs.files", "$id" : ObjectId("4f35d3a88b4cff037f000122") },
  "filename" : "17703661253263627975.html",
  "media_type" : "google_scholar_case",
  "name" : "CORCORAN v. COLUMBIA BROADCASTING SYSTEM, Inc., et al.",
  "numbers" : [ "121 F.2d 575 (1941)" ],
  "ref_summary" : "Corcoran v. Columbia Broadcasting System, 121 F. 2d 575 - Circuit Court of Appeals, 9th Circuit 1941",
  "referenced" : [ "15974969240519593564", "7943512795249075682", "9986939737036121771", "5029666492191827803" ],
  "solr_term_freqs" : [ 2, 1, ..., 3, 1 ],
  "solr_term_list" : [ "1", "1084", ..., "work", "would" ],
  "subjects" : {
    "television" : 0, "fashion" : 0, "art" : 0,
    "publishing" : 0.0032362460624426603,
    "comics" : 0, "photography" : 0, "toys" : 0, "architecture" : 0,
    "sports" : 0, "maps" : 0, "theater" : 0,
    "music" : 0.0032362460624426603,
    "advertising" : 0, "internet" : 0, "videogames" : 0, "design" : 0,
    "film" : 0, "software" : 0
  },
  "tags" : [ "copyright" ],
  "url" : "scholar.google.com/scholar_case?case=17703661253263627975",
  "year" : 1941
}

Google Refine – Data cleaning

• Pros

- Free

- Useful

- Tools that no other package covers

- Training at Data & GIS Services

• Cons

- Clustering algorithms & parameters opaque
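One way to make the clustering less opaque is to look at the simplest method Refine offers, key collision on a "fingerprint" key. The sketch below is a simplified Python approximation of that idea, not Refine's own code: it lowercases, strips punctuation, and sorts the unique tokens, then groups values that share a key. The function names are illustrative and all but the first sample string are invented variants.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Simplified key: lowercase, drop punctuation, sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Group values whose fingerprints collide; return only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

courts = [
    "United States Court of Appeals, Ninth Circuit.",
    "United States Court of Appeals,  Ninth Circuit",
    "Ninth Circuit, United States Court of Appeals",
    "United States Court of Appeals, Second Circuit.",
]
for group in cluster(courts):
    print(group)
```

Each cluster is a set of candidate spellings for one entity; a person still decides which spelling becomes the regularized value.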


Apache Solr – Searching

• Pros

- Lucene, fast & open-source

- Indexed full-text, faceted, “snippets” returned on searches

- Control over text processing (stemming)

- Rich document handling (PDF, Word)

• Cons

- Not as transparent or flexible as MongoDB (no command line, no embedded documents)

- Java running in a servlet container (Tomcat)

- Install & config a bit technical
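Full-text search, facets, and snippets are all requested through parameters on a single Solr select call. The sketch below only builds such a URL and sends nothing. `q`, `rows`, `wt`, `facet`, `facet.field`, `hl`, and `hl.fl` are standard Solr request parameters, while the host, the core name `cases`, and the field names `content`, `court`, and `year` are assumptions made for illustration.

```python
from urllib.parse import urlencode

def solr_select_url(base, query, facet_fields, snippet_field, rows=10):
    """Build a select URL asking for matches, facet counts, and highlighted snippets."""
    params = [
        ("q", query),             # full-text query
        ("rows", rows),           # number of hits to return
        ("wt", "json"),           # response format
        ("facet", "true"),        # turn on facet counts
        ("hl", "true"),           # turn on highlighting ("snippets")
        ("hl.fl", snippet_field), # field to take snippets from
    ]
    # One facet.field parameter per facet; Solr accepts the key repeated.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983/solr/cases",
                      "content:piracy", ["court", "year"], "content")
print(url)
```

The JSON response to a request like this carries the matching documents, the per-facet counts, and the highlighted fragments in separate sections, which is what a faceted search page needs.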

Topic modeling (LDA – mallet)

• Want to search / filter by topic

• Don’t want to manually label all cases

• Topic Modeling – Latent Dirichlet Allocation

- Topics are weighted groups of words

- Documents are weighted groups of topics

- Humans give topics names later
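The two "weighted groups" can be pictured as two small tables. The sketch below is a toy illustration with invented numbers, not output from Mallet: it stores topic-word and document-topic weights as dictionaries, combines them to get the probability of a word in a document, and filters documents by topic weight, which is the search use case named above.

```python
# Toy model: every number below is invented for illustration.
topics = {  # topic -> word weights (each row sums to 1)
    "music":      {"record": 0.5, "piracy": 0.3, "tape": 0.2},
    "publishing": {"book": 0.6, "piracy": 0.1, "author": 0.3},
}
documents = {  # document -> topic weights (each row sums to 1)
    "case_a": {"music": 0.9, "publishing": 0.1},
    "case_b": {"music": 0.2, "publishing": 0.8},
}

def word_probability(doc, word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in documents[doc].items())

def filter_by_topic(topic, threshold=0.5):
    """Names of documents whose weight for `topic` is at least `threshold`."""
    return sorted(doc for doc, mix in documents.items()
                  if mix.get(topic, 0.0) >= threshold)

print(word_probability("case_a", "piracy"))
print(filter_by_topic("music"))
```

In a fitted model the topic keys would be numbered rather than named; labels such as "music" are what a person assigns after reading each topic's top words.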


D3 – Visualization (Mike Bostock)

• Pros

- Lightweight & Fast

- Web (JavaScript & SVG)

- Almost infinite flexibility

- Attach data to DOM

- Transitions & Interactivity

- Free & open-source

• Cons

- Need programming expertise

- Learning curve & no pre-canned visualizations

Prototype examples

Future work

• More Sources!! – NYTimes & Vogue top priority

• Additional parsing – metadata

• Interaction & Tree Pruning – like Many Eyes

• Multi-tree comparisons

• Timelines – theme river
