TRANSCRIPT
INFOVIS8803DV > SPRING 17
TEXT & DOCUMENT
VISUALIZATION
Prof. Rahul C. Basole
CS/MGT 8803-DV > February 20, 2017
Text is Everywhere
• We (may) use documents as a primary information artifact in our lives
• Our access to documents/information has grown tremendously
• And the amount of information has grown!
How can DataVis help users ….
… in gathering, understanding, using, comparing information from
• Document collections (macro-level)?
– Such as every research article in the field of AI
• Individual or a few documents (micro-level)?
– Such as a thesaurus or a book or speech
– Shakespeare, Bible, etc.
Example Macro-Level Tasks
• Which documents contain text on topic XYZ?
• Are there other documents that might be close enough to be
worthwhile?
• How do documents fit into a larger context?
• What documents might be of interest to me?
• Which documents have a negative/angry tone?
Example Micro-Level Tasks
• What are the main themes of a document?
• How are certain words or themes distributed through a document?
• How does one document compare to or relate to other documents?
• In what contexts is the word “inflation” used with the word
“spending?”
Visualizing TedX Talks
https://www.ted.com/talks/eric_berlow_and_sean_gourley_mapping_ideas_worth_spreading
Related Topic
Information Retrieval
• Information Retrieval (IR) is the search process that locates
particular entities based on selection criteria
– Google search algorithms
– Library catalog search
• We will not discuss IR algorithms
• We will discuss how DataVis can help
– Understand what can be retrieved
– Understand what has been retrieved
– Browse
– Formulate more precise queries
– etc.
Challenge
• Text is nominal (discrete) data with a huge (infinite) cardinality
• The “Raw data → Data Table” mapping step is important/central
Process for Text/Document InfoVis
Raw Data (Documents)
→ Analysis Algorithms: decomposition, statistics, similarity, clustering, relevance, thesaurus, word count, etc.
→ Data Tables for InfoVis: vectors, keywords, etc.
→ Visualization: 2D, 3D display
Challenge (cont’d)
• Unstructured text does NOT have any explicit meta-data.
– Just that infinitely big collection of nominal data
– Meta-data is sometimes extracted from raw text
• What Jigsaw calls “entity extraction”
• Google News extracts dates
• Contrast to structured text of an online library with explicit meta-data such as
– Author name
– Year of publication
– Title
– ISBN number
– Library of Congress number
– Publisher name
– etc.
Document Collections
• How do you present the contents/semantics/themes/etc. of the
documents to someone who does not have time to read them all?
• Who are the users?
– How often do YOU use Google/Yahoo/Bing???
– Students, researchers, news people, everyday people, CIA/FBI?
Outline
• Macro-level
– Searching larger document collections
• Unstructured – no meta-data
• Structured – explicit meta-data
– Search history
• Micro-level
– Inter-document methods for smaller document collections
• How do retrieved documents relate to a query?
• How do retrieved documents relate to one another?
– Intra-document methods
• Word usage, sentiment, grammatical style, …
Macro-Level: Large Unstructured
• Note: LARGE does not mean entire WWW!!
• A number of systems endeavor to give a “big picture view” – the
“gist” of a large collection of documents
– ThemeRiver
– Themescape
– WebThemes
– Galaxies
– Feature Maps/Self Organizing Maps (SOM)
Feature Maps (Self Organizing Maps)
• Developed by Teuvo Kohonen
(thus sometimes called Kohonen
Maps)
• Expresses complex, non-linear
relationships between high
dimensional data items into simple
geometric relationships on a 2D
display
• Creates clusters of “like” things
Self-organizing map of 83 Finnish newsgroups and postings; brighter areas indicate more documents. Think of it as a top view of a ThemeScape, but organized with a different method.
http://websom.hut.fi/websom/
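The node-update idea behind a Kohonen map can be sketched in a few lines of Python. This is a toy 1-D SOM, not the WEBSOM implementation; every name and parameter here is illustrative.

```python
import math
import random

def train_som(data, grid_size=4, dim=2, epochs=300, lr0=0.5, radius0=2.0, seed=0):
    """Train a tiny 1-D self-organizing map: grid_size nodes, each holding
    a weight vector in the input space. Similar inputs end up mapped to
    nearby nodes on the grid."""
    rng = random.Random(seed)
    nodes = [[rng.random() for _ in range(dim)] for _ in range(grid_size)]
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                       # learning rate decays
        radius = max(radius0 * (1 - frac), 0.5)     # neighborhood shrinks
        x = rng.choice(data)
        # best-matching unit (BMU): node closest to the input
        bmu = min(range(grid_size),
                  key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(dim)))
        # pull the BMU and its grid neighbors toward the input
        for i in range(grid_size):
            influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            for d in range(dim):
                nodes[i][d] += lr * influence * (x[d] - nodes[i][d])
    return nodes

def map_to_node(nodes, x):
    """Map an input to the index of its best-matching node."""
    return min(range(len(nodes)),
               key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(len(x))))
```

In practice the grid is 2-D and inputs are high-dimensional document vectors, but the update rule is the same.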
Basic Idea to Create Maps
• Break each document into its words
• Two documents are “similar” if they share many words
– See later slide on Vector Space Analysis
• Use mass-spring graph-like algorithm for clustering similar
documents together and pushing dissimilar documents far apart
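In its simplest form, “two documents are similar if they share many words” is just set overlap, e.g. the Jaccard coefficient. A hypothetical sketch (the vector space analysis on the next slides is the fuller version):

```python
def tokenize(text, stopwords={"a", "an", "the", "of", "and"}):
    """Lowercase, split on whitespace, drop common words."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - stopwords

def jaccard(doc_a, doc_b):
    """Similarity = shared words / all words (0 = disjoint, 1 = identical)."""
    a, b = tokenize(doc_a), tokenize(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```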
How do you compare similarity of two documents?
• One way: Vector Space Analysis
• Step 1
– For each document
• Make list of each unique word in document
– Throw out common words (a, an, the, …)
– Make different forms the same (bake, bakes, baked)
• Store count of how many times each word appeared
• Alphabetize, make into a vector
– One per document
Vector Space Analysis
• To compare two docs, determine how closely their two vectors point in the same direction
• Step 2 – Form the inner (dot) product of each doc’s vector with every other vector
– Gives the similarity of each document to every other one
• Step 3 – Use a mass-spring layout algorithm to position a representation of each document
– Dot product determines closeness
• Themescape makes mountains from clusters
Note: There are some similarities to how search engines work
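Steps 1–3 can be sketched in Python. This toy version normalizes the dot product (cosine similarity) so long documents don’t dominate, and skips stemming; it is an illustrative sketch, not Themescape’s actual code.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def doc_vector(text):
    """Step 1: per-document word-count vector (here a dict keyed by word).
    Stemming ('bakes' -> 'bake') is omitted for brevity."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine_similarity(v1, v2):
    """Step 2: dot product, normalized by vector lengths."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

Step 3 would feed the pairwise similarities into a mass-spring layout so that high-similarity pairs attract.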
… but not all Words Equal
• Not all terms or words are equally useful
• Often apply TFIDF
– Term Frequency, Inverse Document Frequency
• Weight of a word goes up if it appears often in a document, but not often in the collection
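A minimal TFIDF computation, assuming the textbook weighting tf × log(N/df); many variants exist in practice.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one dict per document mapping
    word -> tf * idf, where idf = log(N / number of docs containing the word).
    A word that appears in every document gets weight 0."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights
```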
What about Understanding Small Information
Spaces?
• SMART – System for the Mechanical Analysis and Retrieval of Text
• VIBE
• Text Themes
SMART System
• Uses vector space model for documents
– May break document into chapters and sections and deal with those as
atoms
• Plot document atoms on circumference of circle
– Atom - document, or section, or paragraph
• Draw line between items if their similarity exceeds some threshold
value
Salton et al, Automatic Analysis, Theme Generation, and
Summarization of Machine-Readable Texts, Science June 1994
SMART System
• Four documents shown
• Lines give similarity between documents, if above 0.20
• Items evenly spaced
• Doesn’t give viewer idea of
how big each
section/document is
• Very early system by Jerry
Salton, the father of
Information Retrieval
SMART – another example
• Connections between
paragraphs in a single
document
• No weights shown
– Clutter problem
– How about dynamic query
on weights?
VIBE System
• Smaller sets of documents than whole library
– Example: Set of 100 documents retrieved from a web search
– Idea is to understand how contents of documents relate to each other
• Visualize Keywords and Documents
– Show relation of each Doc to Keywords
– “Similar” Doc’s cluster together
Olsen et al
Info Process & Mgmt ‘93
VIBE Pro’s and Con’s
• Effectively communicate
relationships
• Straightforward methodology
and VIS are easy to follow
• Can show relatively large
collections
• Not showing much about a
document
– Could encode info in Doc
Marks
• Single items lose “detail” in the
presentation
• Starts to break down with large
number of keywords
Radial Visualization (VIBE-like)
• VIBE expanded to more terms
• Au et al., New Paradigms in Information Visualization, International ACM Information Retrieval Conference (2000), 307-309
Visualizing Document Collections
• VIBE and Radial Visualization present documents with respect to a
small set of user-specified query terms
• Oriented toward unstructured text, but search terms could be meta-
data
• Problematic if set of visualized documents gets big
• Can be used with results of query to large or small set of documents
• Details-on-Demand easy to add
Macro-Level
How do we visualize Large Structured Data?
• ResultMap
• PaperLens
• FacetMaps
Consider the Following Problem
• Digital libraries are increasingly large:
– Finding relevant documents is difficult
– Analyzing, comparing, and structuring the documents is not easy
• Current digital libraries do not support:
– Providing more than simple statistical facts
– Making correlations among paper topics and authors over time
– Gaining insight into paper trends, themes, and patterns
Structured Info Spaces: ResultMaps
• Problem
– Understand information space and
how retrieved results fit into that
space
• Solution - ResultMaps
– Based on TreeMaps
– Highlight documents retrieved via
text query
– Linkage from highlight to
document in retrieval list
ResultMap
• Experimental evaluation
– Compared with a Google-style results list
– Not much help
– However, some evidence that ResultMaps are subjectively preferred and help users understand the overall document collection structure
Structured InfoSpaces: PaperLens
a) Popularity of topic
b) Selected authors
c) Author list
d) Degrees of
separation of links
e) Paper list
f) Year-by-year top ten
cited papers/ authors –
can be sorted by topic
http://www.cs.umd.edu/hcil/paperlens/PaperLens-Video.mov
Here is a challenge for you …
• Visualize an entire book
– What does that mean?
– What do you need to consider?
– What would you do?
Tag Clouds and Word Clouds
• Tag Clouds represent explicit meta-data about a document or
website or picture (typically user-assigned – “crowd-sourced
keywords”)
• Word Clouds, in contrast, are generated from the words of a
document or document collection or website.
Note: Tag Clouds and Word Clouds are used in the same ways, but we will always refer to Word Clouds, not Tag Clouds
Word Cloud Design Parameters
• Alphabetical order / Prominent in center / …
• Same orientation / different orientation
• Font
• Color / Monochrome
• Foreground/background
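The core encoding decision, mapping word frequency to font size, might look like this. An illustrative linear mapping; real word-cloud tools often use square-root scaling so that area, rather than height, tracks frequency.

```python
def font_sizes(freqs, min_pt=10, max_pt=48):
    """Linearly map word frequencies to font sizes in points.
    The most frequent word gets max_pt, the least frequent min_pt."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1           # avoid divide-by-zero when all equal
    return {w: min_pt + (f - lo) * (max_pt - min_pt) / span
            for w, f in freqs.items()}
```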
Break Lots of Gestalt Rules!
• Longer words grab more attention than shorter ones
– A big long word takes up 4 times the area of a word that is half as long and half as tall
• White space implies meaning when there is none intended
• Big words in center get extra (maybe too much) attention
• Eye moves around erratically, no alignments to aid scanning
• Words of same color may or may not be related (similarity)
• In the worst case, blue may appear in some other color
• Words in same orientation may or may not be related (similarity, common fate)
• Proximity provides no information or worse, misleading information
• Position in scanning sequence has saliency (remembering, forgetting) effects
• Visual comparisons difficult
Meaningful Associations Confused
Find the country names in this cloud FAST!!
Alternative: “Semantic” Layout
Hassan-Monteroa & Herrero-Solana, Improving Tag-Clouds as Visual Information Retrieval Interfaces,
InSciT2006
Tags are grouped based on clustering and co-occurrence analysis – words that co-occur close to one another in the text are placed together in the cloud.
WordClouds with Tableau
http://kb.tableau.com/articles/howto/creating-a-word-cloud
Concordances & Frequency Lists
• A concordance is an alphabetical list of the principal words used in
a book or body of work, with their immediate contexts
• A frequency list is a sorted list of words together with their
frequencies
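Both artifacts are easy to compute; a minimal sketch using whitespace tokenization only (no lemmatization or stop-word handling):

```python
from collections import Counter

def concordance(text, keyword, window=2):
    """Key Word In Context (KWIC): every occurrence of keyword together
    with `window` words of context on each side."""
    words = text.lower().split()
    return [" ".join(words[max(0, i - window): i + window + 1])
            for i, w in enumerate(words) if w == keyword]

def frequency_list(text):
    """(word, count) pairs sorted by descending frequency."""
    return Counter(text.lower().split()).most_common()
```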
Word Correlation
http://www.neoformix.com/2007/ATextExplorer.html
Figure annotations:
• Dynamic graph: selected word shown in concordance
• Concordance: selected word in all contexts
• Distribution of all central words in the document
• Color-coded central words
• Frequently used words: can be added to the graph
Concordance: WordTree
• Shows the context of a word or words
– Follow a word with all the phrases that follow it
• Font size shows frequency of appearance
• Continue each branch until hitting a unique phrase
• Clicking on a phrase makes it the focus
• Ordered alphabetically, by frequency, or by first appearance
Wattenberg & Viégas
TVCG ‘08
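The branching data behind a word tree – every phrase that follows the focus word, grouped by the next word – can be sketched as follows (an illustration of the idea, not the authors’ implementation):

```python
from collections import defaultdict

def word_tree(sentences, root):
    """Group every phrase that follows `root` by its next word, with one
    entry per occurrence -- branch size drives the font-size encoding."""
    branches = defaultdict(list)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            if w == root and i + 1 < len(words):
                branches[words[i + 1]].append(" ".join(words[i + 1:]))
    # larger branches first, mirroring the order-by-frequency option
    return dict(sorted(branches.items(), key=lambda kv: -len(kv[1])))
```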
Phrase Nets
• Examine unstructured text documents
• Presents pairs of terms from phrases such as
– X and Y (as in “pride and prejudice”)
– X’s Y (as in “Jim’s trains”)
– X at Y (as in “Macy’s at Lenox”)
– X (is|are|was|were) Y
• Uses special graph layout algorithm with compression and
simplification
van Ham et al
TVCG ‘09
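Extracting the “X and Y” pairs that form a Phrase Net’s edge list is a simple pattern match. A sketch of the extraction step only; the actual system also applies graph compression and simplification.

```python
import re

def phrase_pairs(text, connector="and"):
    """Extract (X, Y) word pairs joined by a literal connector word --
    the edge list a Phrase Net graph is built from."""
    pattern = re.compile(r"\b(\w+) " + re.escape(connector) + r" (\w+)\b",
                         re.IGNORECASE)
    return [(x.lower(), y.lower()) for x, y in pattern.findall(text)]
```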
Document Correlation
• Understanding relationship between two (or more) documents
• What kinds of relationships might one want to understand?
Document Correlation: Jigsaw
Links indicate documents with common term
http://www.iilabgt.org/listview/
Adding DataVis to Google
Hoeber & Yang, Comparative Study of Web Search
Interfaces, 2006 Conference on Web Intelligence (ACM
Digital Library)
Figure annotations: concepts related to search terms; search terms; items in window.
HotMap and Concept Highlighter tested somewhat better. See paper for details.
Understanding Relevance: TileBars
• Goal
– Minimize time and effort for deciding which documents to view in detail
• Idea
– Show the role of the query terms in the retrieved documents, making
use of document structure
• Graphical representation of term distribution and overlap
• Simultaneously indicate:
– Relative document length
– Frequency of term sets in document
– Distribution of term sets with respect to the document and each other
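The grid of shaded squares a TileBar row encodes can be computed by segmenting a document and counting term hits per segment. An illustrative sketch of the idea, not Hearst’s code (which segments by document structure rather than equal-size chunks):

```python
def tile_bar(doc_words, terms, n_tiles=8):
    """Split a document into n_tiles equal segments and count how many
    times any query term hits each segment -- one row of a TileBar."""
    tiles = [0] * n_tiles
    size = max(1, -(-len(doc_words) // n_tiles))   # ceiling division
    terms = {t.lower() for t in terms}
    for i, w in enumerate(doc_words):
        if w.lower() in terms:
            tiles[min(i // size, n_tiles - 1)] += 1
    return tiles
```

Darker squares (higher counts) toward one end of the row show where in the document the term set concentrates.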
How do we think about all of this?
• Remember this outline?
• Macro-level – searching larger document collections
– Unstructured – no meta-data
– Structured – explicit meta-data
– Search history
• Micro-level
– Inter-document methods for smaller document collections
• How do retrieved documents relate to a query?
• How do retrieved documents relate to one another?
– Intra-document methods
• Word usage, grammatical style, …
• With the caveat that some methods can be used in multiple ways
Another Way of Remembering:
Information Space Browsing Model
Organize
Relate
Skim
Read
Understand
Navigate
Search
• How can DataVis Help Each of These Steps?
• (Some DataVis methods may help multiple steps)
How can DataVis Facilitate Search?
• Understand the overall “gist” of an information space
– Understand the document collection space (usually a large doc space)
• Example: Themescape
– Understand how search results relate to the information space (usually a smaller doc space)
• Example: ResultMap
• Understand search history / go back to previous searches
– Examples: Graphical history, Sparkler
How can DataVis Help Understand Documents?
What is this document about?
• Examples
– Key Word in Context, Word Clouds, Phrase Nets
– Relevance to query: Hearst’s TileBars, Veerasamy & Belkin
How can DataVis Help Organize & Relate?
What is a collection of docs all about?
• Examples
– Key Word in Context
– Veerasamy & Belkin (relevance to query)
– Hearst’s TileBars (relevance to query)
– Word Clouds
– ThemeScape
How does this doc relate to others?
• Examples
– Veerasamy & Belkin (relevance to query)
– Hearst’s TileBars (relevance to query)
– Word Clouds
– ResultMap
How can DataVis Help Navigate (Web) Linkages?
Web Linkage Graphs
Takeaways
• It’s a huge and important space: From searching everything (WWW) to analyzing a single document. There are many opportunities for creativity.
• From big picture overview of many docs to query-related views to detailed views of a few docs to within a single doc. Think about using the usual suspects of interaction (Details-on-demand, Dynamic queries, Semantic zoom, Animation, Brush/Link)
• Please think about the following:
– What are user activities with Text and Documents? How can InfoVis support those activities?
– Which methods scale from one or a few documents to thousands of docs on up to the WWW? Why? Why not?
– How do we know which methods are good and which are not so good?
– Are there places where using InfoVis does not make sense? What are they?
HW5: Text
• The purpose of this assignment is to provide you with further experience in analyzing and understanding multivariate datasets. The particular focus of this HW is a dataset that is rich with textual data. It is a document collection that consists of a set of reviews of a Samsung TV from amazon.com.
• Draw/sketch/show your design on a piece of paper or a few pages (don't go overboard). Feel free to annotate the sketch with small comments or captions to explain what it is and how it would work. On a separate page, explain your visualization design in a paragraph or two, how it would start, what the interaction would be, etc.
• More details on course webpage.
• Bring two (2) copies of the text visualization to class and submit HW5 on T-Square.