
INFOVIS 8803DV > SPRING 17

TEXT & DOCUMENT VISUALIZATION

Prof. Rahul C. Basole

CS/MGT 8803-DV > February 20, 2017

Text is Everywhere

• We use documents as a primary information artifact in our lives

• Our access to documents and information has grown tremendously

• And the amount of information itself has grown!

Library

World Wide Web

Email Archives

Blogs

Wiki

How can DataVis help users …

… in gathering, understanding, using, and comparing information from

• Document collections (macro-level)?

– Such as every research article in the field of AI

• Individual or a few documents (micro-level)?

– Such as a thesaurus, a book, or a speech

– Shakespeare, the Bible, etc.

Example Macro-Level Tasks

• Which documents contain text on topic XYZ?

• Are there other documents that might be close enough to be worthwhile?

• How do documents fit into a larger context?

• What documents might be of interest to me?

• Which documents have a negative/angry tone?

Example Micro-Level Tasks

• What are the main themes of a document?

• How are certain words or themes distributed through a document?

• How does one document compare to or relate to other documents?

• In what contexts is the word “inflation” used with the word “spending”?

Recall: Newsmap

http://newsmap.jp/

TextArc

http://textarc.org

Visualizing TED Talks

https://www.ted.com/talks/eric_berlow_and_sean_gourley_mapping_ideas_worth_spreading

Related Topic: Information Retrieval

• Information Retrieval (IR) is the search process that locates particular entities based on selection criteria

– Google search algorithms

– Library catalog search

• We will not discuss IR algorithms

• We will discuss how DataVis can help

– Understand what can be retrieved

– Understand what has been retrieved

– Browse

– Formulate more precise queries

– etc.

Text Visualization Browser

http://textvis.lnu.se/

What is the Challenge?

Challenge

• Text is nominal (discrete) data with a huge (effectively infinite) cardinality

• The “Raw Data → Data Table” mapping step is important/central

Process for Text/Document InfoVis

Raw Data (Documents) → Analysis Algorithms (decomposition, statistics, similarity, clustering, relevance, thesaurus, word count, etc.) → Data Tables for InfoVis (vectors, keywords, etc.) → Visualization (2D, 3D display)

Challenge (cont’d)

• Unstructured text does NOT have any explicit meta-data.

– Just that infinitely big collection of nominal data

– Meta-data is sometimes extracted from raw text

• What Jigsaw calls “entity extraction”

• Google News extracts dates

• Contrast to the structured text of an online library with explicit meta-data such as

– Author name

– Year of publication

– Title

– ISBN number

– Library of Congress number

– Publisher name

– etc.

Document Collections

• How do you present the contents/semantics/themes/etc. of the documents to someone who does not have time to read them all?

• Who are the users?

– How often do YOU use Google/Yahoo/Bing???

– Students, researchers, news people, everyday people, CIA/FBI?

Outline

• Macro-level

– Searching larger document collections

• Unstructured – no meta-data

• Structured – explicit meta-data

– Search history

• Micro-level

– Inter-document methods for smaller document collections

• How do retrieved documents relate to a query?

• How do retrieved documents relate to one another?

– Intra-document methods

• Word usage, sentiment, grammatical style, …


Macro-Level: Large Unstructured

• Note: LARGE does not mean the entire WWW!!

• A number of systems endeavor to give a “big picture view” – the “gist” of a large collection of documents

– ThemeRiver

– ThemeScape

– WebTheme

– Galaxies

– Feature Maps / Self-Organizing Maps (SOM)

ThemeRiver

ThemeScape

3D landscape: height/color encode document density

Patent Data ThemeScape

WebTheme

Galaxies

Presentation of documents where similar ones cluster together

IN-SPIRE

Feature Maps (Self-Organizing Maps)

• Developed by Teuvo Kohonen (thus sometimes called Kohonen Maps)

• Expresses complex, non-linear relationships between high-dimensional data items as simple geometric relationships on a 2D display

• Creates clusters of “like” things

Self-organizing map of 83 Finnish newsgroups and postings. Think of it as a top view of a ThemeScape, but organized with a different method. Bright areas => more documents.

http://websom.hut.fi/websom/
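The SOM idea above can be sketched in a few lines. This is a minimal 1-D map over 2-D points, not the large 2-D grids over document vectors that a WEBSOM-style system uses; the learning-rate and neighborhood schedules are illustrative choices, not taken from Kohonen's papers.

```python
import random

def train_som(data, n_nodes=10, epochs=200, seed=0):
    # Minimal 1-D self-organizing map over 2-D points: each node holds a
    # 2-D weight vector; the best-matching node and its grid neighbors
    # are pulled toward each presented input.
    rng = random.Random(seed)
    weights = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                 # learning rate decays
        radius = max(1, int(n_nodes / 2 * (1 - epoch / epochs)))  # neighborhood shrinks
        for x in data:
            # Best-matching unit: the node whose weight is closest to x.
            bmu = min(range(n_nodes),
                      key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
            # Pull the BMU and its neighbors on the 1-D grid toward x.
            for i in range(n_nodes):
                if abs(i - bmu) <= radius:
                    for d in range(2):
                        weights[i][d] += lr * (x[d] - weights[i][d])
    return weights

# Two tight clusters; after training, nodes spread between them.
data = [(0.0, 0.0)] * 5 + [(1.0, 1.0)] * 5
nodes = train_som(data)
```

Because every update is a convex combination of a node's weight and an input, the trained weights stay inside the data's bounding box, which is what makes the "top view of a ThemeScape" reading of the map sensible.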

Basic Idea to Create Maps

• Break each document into its words

• Two documents are “similar” if they share many words

– See later slide on Vector Space Analysis

• Use a mass-spring, graph-like algorithm for clustering similar documents together and pushing dissimilar documents far apart

How do you compare similarity of two documents?

• One way: Vector Space Analysis

• Step 1 – For each document

– Make a list of each unique word in the document

• Throw out common words (a, an, the, …)

• Make different forms the same (bake, bakes, baked)

– Store a count of how many times each word appeared

– Alphabetize, make into a vector

• One per document
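Step 1 above (tokenize, drop common words, normalize forms, count) might look like this. The stop-word list is a tiny illustrative one, and the suffix stripping is a crude stand-in for real stemming – a production system would use something like Porter's stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def normalize(word):
    # Crude stand-in for stemming: strip a couple of common suffixes so
    # that forms like "bake" and "bakes" collapse together.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def document_vector(text):
    # Step 1: tokenize, drop stop words, normalize forms, count.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(normalize(w) for w in words if w not in STOP_WORDS)

vec = document_vector("The baker bakes bread. The best bread is fresh bread.")
```

The resulting `Counter` is the per-document word-count vector the slides describe; alphabetizing is only needed when laying the counts out as a fixed-order numeric vector.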

Vector Space Analysis

• To compare two docs, determine how closely their two vectors point in the same direction

• Step 2 – Form the inner (dot) product of each doc’s vector with every other vector

– Gives the similarity of each document to every other one

• Step 3 – Use a mass-spring layout algorithm to position representations of each document

– Dot product → closeness

• ThemeScape makes mountains from the clusters

Note: There are some similarities to how search engines work
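Step 2 can be sketched directly on the word-count vectors from Step 1. Dividing the dot product by the vector lengths (cosine similarity) is a common refinement so that long documents don't dominate; the example vectors here are made up.

```python
import math
from collections import Counter

def dot(u, v):
    # Inner product over the words the two documents share.
    return sum(count * v[word] for word, count in u.items() if word in v)

def cosine_similarity(u, v):
    # Normalize by vector length so long documents don't dominate.
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

d1 = Counter({"inflation": 3, "spending": 2, "budget": 1})
d2 = Counter({"inflation": 1, "spending": 1, "taxes": 4})
sim = cosine_similarity(d1, d2)
```

These pairwise similarities are exactly the quantities a mass-spring layout (Step 3) would turn into spring rest lengths: high similarity, short spring.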

… but not all Words are Equal

• Not all terms or words are equally useful

• Often apply TF-IDF

– Term Frequency × Inverse Document Frequency

• The weight of a word goes up if it appears often in a document, but not often in the collection
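A minimal version of that weighting, using the classic tf × log(N/df) form (there are many smoothed variants; this is the textbook one). The example collection is made up.

```python
import math
from collections import Counter

def tfidf(term, doc, collection):
    # Term frequency: raw count of the term in this document.
    tf = doc[term]
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for d in collection if term in d)
    # A term appearing in every document gets idf = log(1) = 0:
    # it carries no discriminating weight.
    return tf * math.log(len(collection) / df) if df else 0.0

docs = [
    Counter({"cat": 3, "sat": 1}),
    Counter({"cat": 1, "sat": 1, "mat": 2}),
    Counter({"sat": 1, "dog": 2}),
]
```

Here "sat" occurs in all three documents, so its weight is zero everywhere, while "mat" – frequent in one document, rare in the collection – is weighted highly, which is exactly the behavior the bullet above describes.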

What about Understanding Small Information Spaces?

• SMART – System for the Mechanical Analysis and Retrieval of Text

• VIBE

• Text Themes

SMART System

• Uses the vector space model for documents

– May break a document into chapters and sections and deal with those as atoms

• Plot document atoms on the circumference of a circle

– Atom = document, section, or paragraph

• Draw a line between items if their similarity exceeds some threshold value

Salton et al., “Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts,” Science, June 1994
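The SMART rule above – atoms evenly spaced on a circle, a line for every pair whose similarity exceeds a threshold – can be sketched like this. The `vectors` argument is assumed to be per-atom word-count dicts (as built earlier); 0.20 matches the threshold in the example that follows.

```python
import math
from itertools import combinations

def similarity_edges(vectors, threshold=0.20):
    # SMART-style: link two atoms (documents/sections/paragraphs)
    # whenever their cosine similarity exceeds the threshold.
    def dot(u, v):
        return sum(c * v[w] for w, c in u.items() if w in v)
    def cos(u, v):
        denom = math.sqrt(dot(u, u) * dot(v, v))
        return dot(u, v) / denom if denom else 0.0
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        s = cos(vectors[i], vectors[j])
        if s > threshold:
            edges.append((i, j, round(s, 2)))
    return edges

def circle_positions(n, radius=1.0):
    # Atoms evenly spaced on the circumference, as in the SMART plots.
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

edges = similarity_edges([{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"z": 1}])
```

Drawing the edges over `circle_positions(len(vectors))` reproduces the SMART picture; a dynamic-query slider on `threshold` would address the clutter problem noted on the next slides.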

SMART System

• Four documents shown

• Lines give similarity between documents, if above 0.20

• Items evenly spaced

• Doesn’t give the viewer an idea of how big each section/document is

• Very early system by Jerry Salton, the father of Information Retrieval

SMART – Another Example

• Connections between paragraphs in a single document

• No weights shown

– Clutter problem

– How about dynamic query on weights?

VIBE System

• Smaller sets of documents than a whole library

– Example: a set of 100 documents retrieved from a web search

– Idea is to understand how the contents of documents relate to each other

• Visualize Keywords and Documents

– Show the relation of each Doc to Keywords

– “Similar” Docs cluster together

Olsen et al., Info. Process. & Mgmt. ’93

VIBE Visualization


VIBE Pros and Cons

• Effectively communicates relationships

• Straightforward methodology and vis are easy to follow

• Can show relatively large collections

• Not showing much about a document

– Could encode info in Doc Marks

• Single items lose “detail” in the presentation

• Starts to break down with a large number of keywords

Radial Visualization (VIBE-like)

• VIBE expanded to more terms

• Au et al., “New Paradigms in Information Visualization,” International ACM Information Retrieval Conference (2000), 307-309

Visualizing Document Collections

• VIBE and Radial Visualization present documents with respect to a small set of user-specified query terms

• Oriented toward unstructured text, but search terms could be meta-data

• Problematic if the set of visualized documents gets big

• Can be used with the results of a query to a large or small set of documents

• Details-on-Demand is easy to add

Macro-Level: How do we visualize Large Structured Data?

• ResultMap

• PaperLens

• FacetMaps

Consider the Following Problem

• Digital libraries are increasingly large

– Finding relevant documents is difficult

– Analyzing, comparing, and structuring the documents is not easy to do

• Current digital libraries do not support:

– Providing more than simple statistical facts

– Making correlations among paper topics and authors over time

– Gaining insight into paper trends, themes, and patterns

Structured Info Spaces: ResultMaps

• Problem

– Understand the information space and how retrieved results fit into that space

• Solution – ResultMaps

– Based on TreeMaps

– Highlight documents retrieved via text query

– Linkage from highlight to document in the retrieval list

ResultMap

• Experimental evaluation

– Compared with a Google-style results list

– Not much help

– However, some evidence that they are subjectively preferred and help in understanding overall document collection structure

Structured Info Spaces: PaperLens

a) Popularity of topic

b) Selected authors

c) Author list

d) Degrees of separation of links

e) Paper list

f) Year-by-year top ten cited papers/authors – can be sorted by topic

http://www.cs.umd.edu/hcil/paperlens/PaperLens-Video.mov

Micro-Level

How do we visualize one or a few documents?

Here is a challenge for you …

• Visualize an entire book

– What does that mean?

– What do you need to consider?

– What would you do?

What’s with Clouds?

Tag Clouds and Word Clouds

• Tag Clouds represent explicit meta-data about a document or website or picture (typically user-assigned – “crowd-sourced keywords”)

• Word Clouds, in contrast, are generated from the words of a document, document collection, or website.

Note: The ways to use them are the same, but we will always refer to Word Clouds, not Tag Clouds

Word Cloud: 2013 State of the Union Speech

Word Cloud Design Parameters

• Alphabetical order / prominent in center / …

• Same orientation / different orientation

• Font

• Color / monochrome

• Foreground / background

Alpha order / prominent in center / etc.

Same / Different Orientation

Font

Color Palette / Monochrome

Foreground / Background

Lots of Other Design Options

More Design Options: Wordle

http://www.wordle.net

Break Lots of Gestalt Rules!

• Longer words grab more attention than shorter ones

– A big long word takes up 4 times the area of a word that is half as long and half as tall

• White space implies meaning when there is none intended

• Big words in the center get extra (maybe too much) attention

• The eye moves around erratically; no alignments to aid scanning

• Words of the same color may or may not be related (similarity)

• In the worst case, blue may appear in some other color

• Words in the same orientation may or may not be related (similarity, common fate)

• Proximity provides no information or, worse, misleading information

• Position in the scanning sequence has saliency (remembering, forgetting) effects

• Visual comparisons are difficult

Meaningful Associations Confused

Find the country names in this cloud – FAST!!

Alternative: “Semantic” Layout

Tags are grouped based on clustering and co-occurrence analysis – words that co-occur close to one another in the text are placed together in the cloud

Hassan-Montero & Herrero-Solana, “Improving Tag-Clouds as Visual Information Retrieval Interfaces,” InSciT 2006
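The counting step behind such a layout might look as follows. This only shows co-occurrence counting over a sliding token window (the window size is an arbitrary choice here); the cited paper then clusters on these counts to decide placement.

```python
from collections import Counter

def cooccurrence_counts(words, window=5):
    # Count how often each unordered pair of distinct words appears
    # within `window` tokens of each other; high counts suggest the
    # words should be placed together in the cloud.
    counts = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            a, b = sorted((words[i], words[j]))
            if a != b:
                counts[(a, b)] += 1
    return counts

counts = cooccurrence_counts(["cat", "dog", "cat", "fish"], window=2)
```

Feeding these pair counts into any clustering routine yields groups of related words, which a semantic layout then renders as neighboring rows or regions instead of an arbitrary scatter.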

Tag Cloud Alternatives (provided by Martin Wattenberg)

Word Clouds with Tableau

http://kb.tableau.com/articles/howto/creating-a-word-cloud

Concordances & Frequency Lists

• A concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts

• A frequency list is a sorted list of words together with their frequencies
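Both structures are easy to build from raw text. A sketch of each – the keyword-in-context (KWIC) lines here use a fixed word-count window rather than the sentence-aware contexts a real concordance tool would produce:

```python
import re
from collections import Counter

def frequency_list(text):
    # Frequency list: words sorted by descending count.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

def concordance(text, keyword, width=3):
    # KWIC-style concordance: every occurrence of the keyword with
    # `width` words of context on each side.
    words = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

text = "the cat sat on the mat and the cat ran"
freq = frequency_list(text)
lines = concordance(text, "cat", width=2)
```

Sorting `lines` alphabetically (by the keyword's right context, say) gives the classic concordance view; `freq` is the companion frequency list shown beside it in tools like the one on the next slide.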

Concordance & Frequency List Together

www.concordancesoftware.co.uk

Visual Thesaurus

http://www.visualthesaurus.com/

Concordances & Word Frequencies

Word Correlation

http://www.neoformix.com/2007/ATextExplorer.html

• Dynamic graph: selected word shown in concordance

• Concordance: selected word in all contexts

• Distribution of all central words in the document

• Color-coded central words

• Frequently used words: can be added to the graph

Concordance: ManyEyes’ WordTree


Concordance: WordTree

• Shows the context of a word or words

– Follow a word with all the phrases that follow it

• Font size shows frequency of appearance

• Continue a branch until hitting a unique phrase

• Clicking on a phrase makes it the focus

• Ordered alphabetically, by frequency, or by first appearance

Wattenberg & Viégas, TVCG ’08
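The branching structure behind a Word Tree is just a trie of the phrases that follow the root word. A sketch – unlike the real WordTree, the phrases here simply run to the end of the text instead of stopping at sentence boundaries, and font sizes would be derived from subtree sizes:

```python
def word_tree(text, root):
    # For every occurrence of `root`, thread the words that follow it
    # into a nested dict; shared prefixes share a path, so each branch
    # continues until the phrases diverge (or become unique).
    words = text.lower().split()
    tree = {}
    for i, w in enumerate(words):
        if w == root:
            node = tree
            for nxt in words[i + 1:]:
                node = node.setdefault(nxt, {})
    return tree

tree = word_tree("i have a dream that one day i have a hope", "have")
```

Here both occurrences of "have" are followed by "a", so the tree has a single child at the top that then splits into the "dream …" and "hope" branches – exactly the shared-prefix branching the visualization draws.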

Phrase Nets

• Examine unstructured text documents

• Presents pairs of terms from phrases such as

– X and Y (as in “pride and prejudice”)

– X’s Y (as in “Jim’s trains”)

– X at Y (as in “Macy’s at Lenox”)

– X (is|are|was|were) Y

• Uses a special graph layout algorithm with compression and simplification

van Ham et al., TVCG ’09
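Extracting the term pairs for one of these patterns is a one-regex job. This sketch handles a single literal connector like "and" or "at"; for the "(is|are|was|were)" pattern you would pass an alternation instead of escaping a literal, and the paper's graph layout then operates on the resulting pair list.

```python
import re

def phrase_net_pairs(text, connector="and"):
    # Extract (X, Y) pairs matching the pattern "X <connector> Y",
    # e.g. "pride and prejudice" -> ("pride", "prejudice").
    pattern = rf"\b(\w+)\s+{re.escape(connector)}\s+(\w+)\b"
    return re.findall(pattern, text.lower())

pairs = phrase_net_pairs("Pride and Prejudice; war and peace.")
```

Each pair becomes a directed edge X → Y in the phrase net; the compression and simplification steps then merge repeated edges and prune low-frequency ones before layout.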

Phrase Net Examples

Document Correlation

• Understanding the relationship between two (or more) documents

• What kinds of relationships might one want to understand?

Document Correlation: NYTimes

Document Correlation: Jigsaw

Links indicate documents with a common term

http://www.iilabgt.org/listview/

Adding DataVis to Google

Hoeber & Yang, “Comparative Study of Web Search Interfaces,” 2006 Conference on Web Intelligence (ACM Digital Library)

• Concepts related to search terms

• Search terms

• Items in window

HotMap and Concept Highlighter tested somewhat better. See the paper for details.

Adding DataVis to Google?

Understanding Relevance: TileBars

• Goal

– Minimize the time and effort for deciding which documents to view in detail

• Idea

– Show the role of the query terms in the retrieved documents, making use of document structure

• Graphical representation of term distribution and overlap

• Simultaneously indicates:

– Relative document length

– Frequency of term sets in the document

– Distribution of term sets with respect to the document and each other
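The data behind one TileBar row reduces to per-segment hit counts. A sketch – real TileBars segment by document structure (TextTiling), whereas this splits into equal-size word windows and quietly drops any tail words that don't fill a segment:

```python
def tilebar(doc_words, term_sets, n_tiles=10):
    # One row per term set: split the document into n_tiles equal
    # segments and count that term set's hits in each. In the
    # visualization, darker squares mean more hits, and the row's
    # length reflects relative document length.
    size = max(1, len(doc_words) // n_tiles)
    bars = []
    for term_set in term_sets:
        terms = {t.lower() for t in term_set}
        bars.append([
            sum(1 for w in doc_words[i * size:(i + 1) * size] if w.lower() in terms)
            for i in range(n_tiles)
        ])
    return bars

words = ["tax"] * 5 + ["x"] * 5
bars = tilebar(words, [["tax"]], n_tiles=2)
```

Stacking one such row per query term set, per document, gives the full TileBars display: length, frequency, and distribution are all readable from the same small grid.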

How do we think about all of this?

• Remember this outline?

• Macro-level – searching larger document collections

– Unstructured – no meta-data

– Structured – explicit meta-data

– Search history

• Micro-level

– Inter-document methods for smaller document collections

• How do retrieved documents relate to a query?

• How do retrieved documents relate to one another?

– Intra-document methods

• Word usage, grammatical style, …

• With the caveat that some methods can be used in multiple ways


Another Way of Remembering: Information Space Browsing Model

Steps: Organize, Relate, Skim, Read, Understand, Navigate, Search

• How can DataVis help each of these steps?

• (Some DataVis methods may help multiple steps)

How can DataVis Facilitate Search?

• Understand the overall “gist” of an Information Space

– Understand the document collection space (usually a large doc space)

• Example: ThemeScape

– Understand how search results relate to the information space (usually a smaller doc space)

• Example: ResultMap

• Understand search history / go back to previous searches

– Examples: Graphical history, Sparkler

How can DataVis Help Understand Documents?

What is this document about?

• Examples

– Key Word in Context, Word Clouds, Phrase Nets

– Relevance to Query – Hearst’s TileBars, Veerasamy & Belkin

How can DataVis Help Organize & Relate?

What is a collection of docs all about?

• Examples: Key Word in Context, Veerasamy & Belkin (relevance to query), Hearst’s TileBars (relevance to query), Word Clouds, ThemeScape

How does this doc relate to others?

• Examples: Veerasamy & Belkin (relevance to query), Hearst’s TileBars (relevance to query), Word Clouds, ResultMap

How can DataVis Help Navigate (Web) Linkages?

Web Linkage Graphs

Takeaways

• It’s a huge and important space: from searching everything (the WWW) to analyzing a single document. There are many opportunities for creativity.

• From big-picture overviews of many docs, to query-related views, to detailed views of a few docs, to views within a single doc. Think about using the usual suspects of interaction (details-on-demand, dynamic queries, semantic zoom, animation, brushing & linking).

• Please think about the following:

– What are user activities with text and documents? How can InfoVis support those activities?

– Which methods scale from one or a few documents to thousands of docs on up to the WWW? Why? Why not?

– How do we know which methods are good and which are not so good?

– Are there places where using InfoVis does not make sense? What are they?

HW5: Text

• The purpose of this assignment is to provide you with further experience in analyzing and understanding multivariate datasets. The particular focus of this HW is a dataset that is rich with textual data. It is a document collection that consists of a set of reviews of a Samsung TV from amazon.com.

• Draw/sketch/show your design on a piece of paper or a few pages (don’t go overboard). Feel free to annotate the sketch with small comments or captions to explain what it is and how it would work. On a separate page, explain your visualization design in a paragraph or two: how it would start, what the interaction would be, etc.

• More details on the course webpage.

• Bring two (2) copies of the visualization to class and submit HW5 on T-Square.