HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES
Angela Zoss
Data Visualization Coordinator
226 Perkins Library
angela.zoss@duke.edu
Duke University Libraries, Digital Scholarship
Text > Data, October 25
DOCUMENTS AS CONTEXT
But first,
ANGELA AS CONTEXT
How I learned to love the document.
B.A. courses: Linguistics, Communication
M.S. courses: Communication, Human-Computer Interaction
Employment: arXiv.org Administrator
Ph.D. courses:
• Bibliometrics/Scientometrics
• Computer Mediated Discourse Analysis
• Latent Structure Analysis
• Natural Language Processing
Now,
DOCUMENTS AS CONTEXT
Text analysis from…
• documents down to words (“low-level”)
• words up to documents (“high-level”)
Using documents to learn about language (or other social phenomena)
Analyzing documents as records/proxies of language, social structures, events, etc.
Linguistic studies: morphology, word counts, syntax, etc. …
• language over time (e.g., Google ngram viewer)
• language across corpora (e.g., political speeches)
Underwood, T. (2012). Where to start with text mining.
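A minimal sketch of the word-count idea above, comparing how often a few words occur in groups of documents from different years. The toy corpus, the years, and the helper names (`tokenize`, `relative_frequency`) are assumptions made for illustration; they do not come from any of the studies cited here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(texts, targets):
    """Return, for each target word, its share of all tokens in `texts`."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(t for t in tokens if t in targets)
    return {w: (counts[w] / total if total else 0.0) for w in sorted(targets)}


# Hypothetical toy corpora keyed by year; real studies use far larger samples.
corpus_by_year = {
    1900: ["He said that he would go.", "She told him so."],
    2000: ["They said they would go.", "She told them she would."],
}
pronouns = {"he", "she", "they"}

freq_by_year = {
    year: relative_frequency(texts, pronouns)
    for year, texts in corpus_by_year.items()
}

for year, freqs in sorted(freq_by_year.items()):
    print(year, freqs)
```

Dividing by the total token count keeps groups of different sizes comparable, which matters when some years or corpora contain far more text than others.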
Using documents to learn about language
Historical culturomics of pronoun frequencies
Using documents to learn about language
Universal properties of mythological networks
Using language to learn about documents
Analyzing documents as artifacts themselves, with their own properties and dynamics
Literary, documentary studies:
• Structural/rhetorical/stylistic analysis
• Document categorization, classification
• Detecting clusters of document features (topic modeling)
Underwood, T. (2012). Where to start with text mining.
Using language to learn about documents
Literary Empires, Mapping Temporal and Spatial Settings in Swinburne
Using language to learn about documents
Using Word Clouds for Topic Modeling Results
What are documents?
For this discussion, digital versions of works of spoken or written language
Examples: books, articles, transcripts, emails, tweets…
Documents as context
Documents have:
• form(at)
• style
• provenance
• entities
• intentions
STUDIES OF DOCUMENTS
Why study documents?
• Describe a corpus
• Compare/organize documents
• Locate relevant information/filter out irrelevant information
Describing a corpus
• Finding regularities/differences across groups of documents
• Developing theories of structure, style, etc. that can then be tested or applied
• May be manual (content analysis) or computer-assisted (statistical)
Example: Storylines
http://xkcd.com/657/
Differences of format, genre, participants…
• Articles may have sections, but these will vary by discipline and type of article
• Books may be fiction or non-fiction (or both)
• Transcripts may refer to multiple speakers, non-text content
• …ad infinitum
Example: Literature Fingerprinting
Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp.115-122). doi: 10.1109/VAST.2007.4389004
Organizing documents
Detect similarity between documents and a known category (or simply among themselves)
Supports browsing, sentiment analysis, authorship detection
Example: Bohemian Bookshelf
Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.
Similarity based on…
• common document attributes: authorship, genre
• common language patterns: topics, phrases
• common entity references: characters, citations
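One common way to measure similarity based on shared language patterns is cosine similarity over word counts. The sketch below assumes a plain bag-of-words representation and three made-up documents; it is an illustration of the general idea, not the method used in any of the examples in this talk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, invented for this example.
docs = {
    "doc_a": "the whale and the sea",
    "doc_b": "the sea and the ship",
    "doc_c": "stocks and bonds",
}
bags = {name: bag_of_words(text) for name, text in docs.items()}

sim_ab = cosine_similarity(bags["doc_a"], bags["doc_b"])
sim_ac = cosine_similarity(bags["doc_a"], bags["doc_c"])
print(f"doc_a vs doc_b: {sim_ab:.3f}")
print(f"doc_a vs doc_c: {sim_ac:.3f}")
```

The same function works for the other bases of similarity if the counts are swapped out, for example counting cited works or character names instead of words.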
Example: Quantitative Formalism
Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).
Example: Clinton’s DNC Speech
http://b.globe.com/TogUqq
Example: View DHQ
http://digitalliterature.net/viewDHQ/vis3.html
Classification
• assigning an object to a single class
• often supervised, using an existing classification scheme and a tagged corpus
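As a rough sketch of supervised classification, the code below trains a multinomial Naive Bayes classifier with add-one smoothing on a tiny tagged corpus and assigns a new text to exactly one class. Naive Bayes is chosen here only because it is short to write down; the labels, the training texts, and the function names `train` and `classify` are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(tagged_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in tagged_docs:
        tokens = tokenize(text)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the single class with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                count = word_counts[label][token] + 1  # add-one smoothing
                score += math.log(count / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tagged corpus: the existing scheme is {"mystery", "adventure"}.
tagged_corpus = [
    ("the detective found the clue", "mystery"),
    ("a murder in the old house", "mystery"),
    ("the ship sailed across the sea", "adventure"),
    ("pirates attacked the ship at sea", "adventure"),
]
model = train(tagged_corpus)
print(classify(model, "the detective searched the house"))
print(classify(model, "a ship at sea"))
```

The classifier can only ever answer with a label that appears in the tagged corpus, which is the sense in which the classification scheme has to exist before the analysis starts.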
Example: Relative signatures
Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).
Categorization
• assigning documents to one or more categories
• suggestive of unsupervised clustering techniques
• design choices made to fit particular tasks or goals
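A small sketch of unsupervised grouping, assuming single-linkage clustering over Jaccard similarity of word sets: any two documents whose similarity reaches a threshold end up in the same group. The documents, the threshold values, and the helper names are made up for illustration. Unlike the "one or more categories" case above, this simplified version puts each document in exactly one group.

```python
import re
from itertools import combinations


def word_set(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold):
    """Single-linkage grouping: link pairs with Jaccard >= threshold."""
    sets = {doc_id: word_set(text) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the group's representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical untagged documents.
docs = {
    "d1": "gene expression in yeast cells",
    "d2": "gene expression in human cells",
    "d3": "galaxy formation and dark matter",
    "d4": "dark matter in galaxy clusters",
}
print(cluster(docs, threshold=0.4))
print(cluster(docs, threshold=0.1))
```

The threshold is one of the design choices mentioned above: at 0.4 the biology and astronomy documents separate, while at 0.1 the shared word "in" is enough to chain everything into a single group.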
Example: UCSD Map of Science
Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.
Example: NIH Map Viewer
https://app.nihmaps.org/nih/browser/
Reference systems, infrastructure
What do we gain by adding structure?
What do we lose?
SUMMARIZING DOCUMENTS
Text is only one component of a document.
Research questions often push us to be creative with how we operationalize constructs.
The richness of language and documents is best preserved by using multiple, complementary approaches.