TRANSCRIPT
INFOVIS8803DV > SPRING 17
TEXT & DOCUMENT
VISUALIZATION
Prof. Rahul C. Basole
CS/MGT 8803-DV > February 20, 2017
Text is Everywhere
• We (may) use documents as a primary information artifact in our lives
• Our access to documents/information has grown tremendously
• And the amount of information has grown!
How can DataVis help users ….
… in gathering, understanding, using, comparing information from
• Document collections (macro-level)?
– Such as every research article in the field of AI
• Individual or a few documents (micro-level)?
– Such as a thesaurus or a book or speech
– Shakespeare, Bible, etc.
Example Macro-Level Tasks
• Which documents contain text on topic XYZ?
• Are there other documents that might be close enough to be
worthwhile?
• How do documents fit into a larger context?
• What documents might be of interest to me?
• Which documents have a negative/angry tone?
Example Micro-Level Tasks
• What are the main themes of a document?
• How are certain words or themes distributed through a document?
• How does one document compare to or relate to other documents?
• In what contexts is the word “inflation” used with the word
“spending?”
Visualizing TedX Talks
https://www.ted.com/talks/eric_berlow_and_sean_gourley_mapping_ideas_worth_spreading
Related Topic
Information Retrieval
• Information Retrieval (IR) is the search process that locates
particular entities based on selection criteria
– Google search algorithms
– Library catalog search
• We will not discuss IR algorithms
• We will discuss how DataVis can help
– Understand what can be retrieved
– Understand what has been retrieved
– Browse
– Formulate more precise queries
– etc.
Challenge
• Text is nominal (discrete) data with a huge (infinite) cardinality
• The “Raw data → Data Table” mapping step is important/central
Process for Text/Document InfoVis
Raw Data (Documents)
→ Analysis Algorithms: decomposition, statistics, similarity, clustering, relevance, thesaurus, word count, etc.
→ Data Tables for InfoVis: vectors, keywords, etc.
→ Visualization: 2D, 3D display
Challenge (cont’d)
• Unstructured text does NOT have any explicit meta-data.
– Just that infinitely big collection of nominal data
– Meta-data is sometimes extracted from raw text
• What Jigsaw calls “entity extraction”
• Google News extracts dates
• Contrast to structured text of an online library with explicit meta-data such as
– Author name
– Year of publication
– Title
– ISBN number
– Library of Congress number
– Publisher name
– etc.
Document Collections
• How do you present the contents/semantics/themes/etc. of the
documents to someone who does not have time to read them all?
• Who are the users?
– How often do YOU use Google/Yahoo/Bing???
– Students, researchers, news people, everyday people, CIA/FBI?
Outline
• Macro-level
– Searching larger document collections
• Unstructured – no meta-data
• Structured – explicit meta-data
– Search history
• Micro-level
– Inter-document methods for smaller document collections
• How do retrieved documents relate to a query?
• How do retrieved documents relate to one another?
– Intra-document methods
• Word usage, sentiment, grammatical style, …
Macro-Level: Large Unstructured
• Note: LARGE does not mean entire WWW!!
• A number of systems endeavor to give a “big picture view” – the
“gist” of a large collection of documents
– ThemeRiver
– Themescape
– WebThemes
– Galaxies
– Feature Maps/Self Organizing Maps (SOM)
Feature Maps (Self Organizing Maps)
• Developed by Teuvo Kohonen
(thus sometimes called Kohonen
Maps)
• Expresses complex, non-linear
relationships between high
dimensional data items into simple
geometric relationships on a 2D
display
• Creates clusters of “like” things
Self-organizing map of 83 Finnish newsgroups and postings; brighter areas indicate more documents. Think of it as a top view of a ThemeScape, but organized with a different method.
http://websom.hut.fi/websom/
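The node-update idea behind a Kohonen map can be sketched in a few lines of Python. This is a toy 1-D SOM, not the WEBSOM implementation; every name and parameter here is illustrative.

```python
import math
import random

def train_som(data, grid_size=4, dim=2, epochs=300, lr0=0.5, radius0=2.0, seed=0):
    """Train a tiny 1-D self-organizing map: grid_size nodes, each holding
    a weight vector in the input space. Similar inputs end up mapped to
    nearby nodes on the grid."""
    rng = random.Random(seed)
    nodes = [[rng.random() for _ in range(dim)] for _ in range(grid_size)]
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                       # learning rate decays
        radius = max(radius0 * (1 - frac), 0.5)     # neighborhood shrinks
        x = rng.choice(data)
        # best-matching unit (BMU): node closest to the input
        bmu = min(range(grid_size),
                  key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(dim)))
        # pull the BMU and its grid neighbors toward the input
        for i in range(grid_size):
            influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            for d in range(dim):
                nodes[i][d] += lr * influence * (x[d] - nodes[i][d])
    return nodes

def map_to_node(nodes, x):
    """Map an input to the index of its best-matching node."""
    return min(range(len(nodes)),
               key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(len(x))))
```

In practice the grid is 2-D and inputs are high-dimensional document vectors, but the update rule is the same.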
Basic Idea to Create Maps
• Break each document into its words
• Two documents are “similar” if they share many words
– See later slide on Vector Space Analysis
• Use mass-spring graph-like algorithm for clustering similar
documents together and pushing dissimilar documents far apart
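In its simplest form, “two documents are similar if they share many words” is just set overlap, e.g. the Jaccard coefficient. A hypothetical sketch (the vector space analysis on the next slides is the fuller version):

```python
def tokenize(text, stopwords={"a", "an", "the", "of", "and"}):
    """Lowercase, split on whitespace, drop common words."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - stopwords

def jaccard(doc_a, doc_b):
    """Similarity = shared words / all words (0 = disjoint, 1 = identical)."""
    a, b = tokenize(doc_a), tokenize(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```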
How do you compare similarity of two documents?
• One way: Vector Space Analysis
• Step 1
– For each document
• Make list of each unique word in document
– Throw out common words (a, an, the, …)
– Make different forms the same (bake, bakes, baked)
• Store count of how many times each word appeared
• Alphabetize, make into a vector
– One per document
Vector Space Analysis
• To compare two docs, determine how closely their two vectors point in the same direction
• Step 2 – Form the inner (dot) product of each doc’s vector with every other vector
– Gives the similarity of each document to every other one
• Step 3 – Use a mass-spring layout algorithm to position a representation of each document
– Dot product determines closeness
• Themescape makes mountains from clusters
Note: There are some similarities to how search engines work
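Steps 1–3 can be sketched in Python. This toy version normalizes the dot product (cosine similarity) so long documents don’t dominate, and skips stemming; it is an illustrative sketch, not Themescape’s actual code.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def doc_vector(text):
    """Step 1: per-document word-count vector (here a dict keyed by word).
    Stemming ('bakes' -> 'bake') is omitted for brevity."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine_similarity(v1, v2):
    """Step 2: dot product, normalized by vector lengths."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

Step 3 would feed the pairwise similarities into a mass-spring layout so that high-similarity pairs attract.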
… but not all Words Equal
• Not all terms or words are equally useful
• Often apply TFIDF
– Term Frequency, Inverse Document Frequency
• Weight of a word goes up if it appears often in a document, but not often in the collection
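A minimal TFIDF computation, assuming the textbook weighting tf × log(N/df); many variants exist in practice.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one dict per document mapping
    word -> tf * idf, where idf = log(N / number of docs containing the word).
    A word that appears in every document gets weight 0."""
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights
```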
What about Understanding Small Information
Spaces?
• SMART – System for the Mechanical Analysis and Retrieval of Text
• VIBE
• Text Themes
SMART System
• Uses vector space model for documents
– May break document into chapters and sections and deal with those as
atoms
• Plot document atoms on circumference of circle
– Atom - document, or section, or paragraph
• Draw line between items if their similarity exceeds some threshold
value
Salton et al, Automatic Analysis, Theme Generation, and
Summarization of Machine-Readable Texts, Science June 1994
SMART System
• Four documents shown
• Lines give similarity between documents, if above 0.20
• Items evenly spaced
• Doesn’t give viewer idea of
how big each
section/document is
• Very early system by Jerry
Salton, the father of
Information Retrieval
SMART – another example
• Connections between
paragraphs in a single
document
• No weights shown
– Clutter problem
– How about dynamic query
on weights?
VIBE System
• Smaller sets of documents than whole library
– Example: Set of 100 documents retrieved from a web search
– Idea is to understand how contents of documents relate to each other
• Visualize Keywords and Documents
– Show relation of each Doc to Keywords
– “Similar” Doc’s cluster together
Olsen et al
Info Process & Mgmt ‘93
VIBE Pro’s and Con’s
• Effectively communicate
relationships
• Straightforward methodology
and VIS are easy to follow
• Can show relatively large
collections
• Not showing much about a
document
– Could encode info in Doc
Marks
• Single items lose “detail” in the
presentation
• Starts to break down with large
number of keywords
Radial Visualization (VIBE-like)
• VIBE expanded to more terms
• Au et al., New Paradigms in Information Visualization, International ACM Information Retrieval Conference (2000), 307-309
Visualizing Document Collections
• VIBE and Radial Visualization present documents with respect to a
small set of user-specified query terms
• Oriented toward unstructured text, but search terms could be meta-
data
• Problematic if set of visualized documents gets big
• Can be used with results of query to large or small set of documents
• Details-on-Demand easy to add
Macro-Level
How do we visualize Large Structured Data?
• ResultMap
• PaperLens
• FacetMaps
Consider the Following Problem
• Digital libraries are increasingly large:
– Finding relevant documents is difficult
– Analyzing, comparing, and structuring the documents is not easy
• Current digital libraries do not support:
– Providing more than simple statistical facts
– Making correlations among paper topics and authors over time
– Gaining insight into paper trends, themes, and patterns
Structured Info Spaces: ResultMaps
• Problem
– Understand information space and
how retrieved results fit into that
space
• Solution - ResultMaps
– Based on TreeMaps
– Highlight documents retrieved via
text query
– Linkage from highlight to
document in retrieval list
ResultMap
• Experimental evaluation
– Compared with a Google-style results list
– Not much help
– However, some evidence that ResultMaps are subjectively preferred and help users understand the overall document collection structure
Structured InfoSpaces: PaperLens
a) Popularity of topic
b) Selected authors
c) Author list
d) Degrees of
separation of links
e) Paper list
f) Year-by-year top ten
cited papers/ authors –
can be sorted by topic
http://www.cs.umd.edu/hcil/paperlens/PaperLens-Video.mov
Here is a challenge for you …
• Visualize an entire book
– What does that mean?
– What do you need to consider?
– What would you do?
Tag Clouds and Word Clouds
• Tag Clouds represent explicit meta-data about a document or
website or picture (typically user-assigned – “crowd-sourced
keywords”)
• Word Clouds, in contrast, are generated from the words of a
document or document collection or website.
Note: Tag Clouds and Word Clouds are used in the same ways, but we will always refer to Word Clouds, not Tag Clouds
Word Cloud Design Parameters
• Alphabetical order / Prominent in center / …
• Same orientation / different orientation
• Font
• Color / Monochrome
• Foreground/background
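The core encoding decision, mapping word frequency to font size, might look like this. An illustrative linear mapping; real word-cloud tools often use square-root scaling so that area, rather than height, tracks frequency.

```python
def font_sizes(freqs, min_pt=10, max_pt=48):
    """Linearly map word frequencies to font sizes in points.
    The most frequent word gets max_pt, the least frequent min_pt."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1           # avoid divide-by-zero when all equal
    return {w: min_pt + (f - lo) * (max_pt - min_pt) / span
            for w, f in freqs.items()}
```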
Break Lots of Gestalt Rules!
• Longer words grab more attention than shorter ones
– A big long word takes up 4 times the area of a word that is half as long and half as tall
• White space implies meaning when there is none intended
• Big words in center get extra (maybe too much) attention
• Eye moves around erratically, no alignments to aid scanning
• Words of same color may or may not be related (similarity)
• In the worst case, blue may appear in some other color
• Words in same orientation may or may not be related (similarity, common fate)
• Proximity provides no information or worse, misleading information
• Position in scanning sequence has saliency (remembering, forgetting) effects
• Visual comparisons difficult
Meaningful Associations Confused
Find the country names in this cloud FAST!!
Alternative: “Semantic” Layout
Hassan-Monteroa & Herrero-Solana, Improving Tag-Clouds as Visual Information Retrieval Interfaces,
InSciT2006
Tags are grouped based on clustering and co-occurrence analysis – words that co-occur close to one another in the text are placed together in the cloud.
WordClouds with Tableau
http://kb.tableau.com/articles/howto/creating-a-word-cloud
Concordances & Frequency Lists
• A concordance is an alphabetical list of the principal words used in
a book or body of work, with their immediate contexts
• A frequency list is a sorted list of words together with their
frequencies
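Both artifacts are easy to compute; a minimal sketch using whitespace tokenization only (no lemmatization or stop-word handling):

```python
from collections import Counter

def concordance(text, keyword, window=2):
    """Key Word In Context (KWIC): every occurrence of keyword together
    with `window` words of context on each side."""
    words = text.lower().split()
    return [" ".join(words[max(0, i - window): i + window + 1])
            for i, w in enumerate(words) if w == keyword]

def frequency_list(text):
    """(word, count) pairs sorted by descending frequency."""
    return Counter(text.lower().split()).most_common()
```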
Word Correlation
http://www.neoformix.com/2007/ATextExplorer.html
Figure annotations:
• Dynamic graph: selected word shown in concordance
• Concordance: selected word in all contexts
• Distribution of all central words in the document
• Color-coded central words
• Frequently used words: can be added to the graph
Concordance: WordTree
• Shows the context of a word or words
– Follow a word with all the phrases that follow it
• Font size shows frequency of appearance
• Continue each branch until hitting a unique phrase
• Clicking on a phrase makes it the focus
• Ordered alphabetically, by frequency, or by first appearance
Wattenberg & Viégas
TVCG ‘08
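The branching data behind a word tree – every phrase that follows the focus word, grouped by the next word – can be sketched as follows (an illustration of the idea, not the authors’ implementation):

```python
from collections import defaultdict

def word_tree(sentences, root):
    """Group every phrase that follows `root` by its next word, with one
    entry per occurrence -- branch size drives the font-size encoding."""
    branches = defaultdict(list)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            if w == root and i + 1 < len(words):
                branches[words[i + 1]].append(" ".join(words[i + 1:]))
    # larger branches first, mirroring the order-by-frequency option
    return dict(sorted(branches.items(), key=lambda kv: -len(kv[1])))
```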
Phrase Nets
• Examine unstructured text documents
• Presents pairs of terms from phrases such as
– X and Y (as in “pride and prejudice”)
– X’s Y (as in “Jim’s trains”)
– X at Y (as in “Macy’s at Lenox”)
– X (is|are|was|were) Y
• Uses special graph layout algorithm with compression and
simplification
van Ham et al
TVCG ‘09
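Extracting the “X and Y” pairs that form a Phrase Net’s edge list is a simple pattern match. A sketch of the extraction step only; the actual system also applies graph compression and simplification.

```python
import re

def phrase_pairs(text, connector="and"):
    """Extract (X, Y) word pairs joined by a literal connector word --
    the edge list a Phrase Net graph is built from."""
    pattern = re.compile(r"\b(\w+) " + re.escape(connector) + r" (\w+)\b",
                         re.IGNORECASE)
    return [(x.lower(), y.lower()) for x, y in pattern.findall(text)]
```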
Document Correlation
• Understanding relationship between two (or more) documents
• What kinds of relationships might one want to understand?
Document Correlation: Jigsaw
Links indicate documents with common term
http://www.iilabgt.org/listview/
Adding DataVis to Google
Hoeber & Yang, Comparative Study of Web Search
Interfaces, 2006 Conference on Web Intelligence (ACM
Digital Library)
Figure annotations: concepts related to search terms; search terms; items in window.
HotMap and Concept Highlighter tested somewhat better. See paper for details.
Understanding Relevance: TileBars
• Goal
– Minimize time and effort for deciding which documents to view in detail
• Idea
– Show the role of the query terms in the retrieved documents, making
use of document structure
• Graphical representation of term distribution and overlap
• Simultaneously indicate:
– Relative document length
– Frequency of term sets in document
– Distribution of term sets with respect to the document and each other
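The grid of shaded squares a TileBar row encodes can be computed by segmenting a document and counting term hits per segment. An illustrative sketch of the idea, not Hearst’s code (which segments by document structure rather than equal-size chunks):

```python
def tile_bar(doc_words, terms, n_tiles=8):
    """Split a document into n_tiles equal segments and count how many
    times any query term hits each segment -- one row of a TileBar."""
    tiles = [0] * n_tiles
    size = max(1, -(-len(doc_words) // n_tiles))   # ceiling division
    terms = {t.lower() for t in terms}
    for i, w in enumerate(doc_words):
        if w.lower() in terms:
            tiles[min(i // size, n_tiles - 1)] += 1
    return tiles
```

Darker squares (higher counts) toward one end of the row show where in the document the term set concentrates.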
How do we think about all of this?
• Remember this outline?
• Macro-level – searching larger document collections
– Unstructured – no meta-data
– Structured – explicit meta-data
– Search history
• Micro-level
– Inter-document methods for smaller document collections
• How do retrieved documents relate to a query?
• How do retrieved documents relate to one another?
– Intra-document methods
• Word usage, grammatical style, …
• With the caveat that some methods can be used in multiple ways
Another Way of Remembering:
Information Space Browsing Model
Organize
Relate
Skim
Read
Understand
Navigate
Search
• How can DataVis Help Each of These Steps?
• (Some DataVis methods may help multiple steps)
How can DataVis Facilitate Search?
• Understand the overall “gist” of an information space
– Understand the document collection space (usually a large doc space)
• Example: Themescape
– Understand how search results relate to the information space (usually a smaller doc space)
• Example: ResultMap
• Understand search history / go back to previous searches
– Examples: Graphical history, Sparkler
How can DataVis Help Understand Documents?
What is this document about?
• Examples
– Key Word in Context, Word Clouds, Phrase Nets
– Relevance to query: Hearst’s TileBars, Veerasamy & Belkin
How can DataVis Help Organize & Relate?
What is a collection of docs all about?
• Examples
– Key Word in Context
– Veerasamy & Belkin (relevance to query)
– Hearst’s TileBars (relevance to query)
– Word Clouds
– ThemeScape
How does this doc relate to others?
• Examples
– Veerasamy & Belkin (relevance to query)
– Hearst’s TileBars (relevance to query)
– Word Clouds
– ResultMap
How can DataVis Help Navigate (Web) Linkages?
Web Linkage Graphs
Takeaways
• It’s a huge and important space: From searching everything (WWW) to analyzing a single document. There are many opportunities for creativity.
• From big picture overview of many docs to query-related views to detailed views of a few docs to within a single doc. Think about using the usual suspects of interaction (Details-on-demand, Dynamic queries, Semantic zoom, Animation, Brush/Link)
• Please think about the following:
– What are user activities with Text and Documents? How can InfoVis support those activities?
– Which methods scale from one or a few documents to thousands of docs on up to the WWW? Why? Why not?
– How do we know which methods are good and which are not so good?
– Are there places where using InfoVis does not make sense? What are they?
HW5: Text
• The purpose of this assignment is to provide you with further experience in analyzing and understanding multivariate datasets. The particular focus of this HW is a dataset that is rich with textual data. It is a document collection that consists of a set of reviews of a Samsung TV from amazon.com.
• Draw/sketch/show your design on a piece of paper or a few pages (don't go overboard). Feel free to annotate the sketch with small comments or captions to explain what it is and how it would work. On a separate page, explain your visualization design in a paragraph or two, how it would start, what the interaction would be, etc.
• More details on course webpage.
• Bring two (2) copies of the text visualization to class and submit HW5 on T-Square.