analyzing unstructured text with topic models

Post on 03-Jan-2016


Page 1: Analyzing unstructured text with topic models

Analyzing unstructured text with topic models

Mark Steyvers

Dep. of Cognitive Sciences & Dep. of Computer ScienceUniversity of California, Irvine

collaborators: Padhraic Smyth, UC Irvine; Tom Griffiths UC Berkeley

Page 2: Analyzing unstructured text with topic models

Analyzing Unstructured Text

• NYT: 330,000 articles
• Enron: 250,000 emails
• Medline: 16 million articles
• NSF/NIH: 100,000 grants
• AOL queries: 20,000,000 queries, 650,000 users
• Pennsylvania Gazette (1728-1800): 80,000 articles

Page 3: Analyzing unstructured text with topic models

Topic Models and Text Analysis

• Can answer a number of questions:

What is in this corpus?

What is in this document, paragraph, or sentence?

What does this person/group of people write about?

What tags are appropriate for this document?

What are the topical trends over time?

Page 4: Analyzing unstructured text with topic models

Topic Models

• Automatic and unsupervised extraction of semantic themes from large text collections.

• Widely used model in machine learning and text mining

– pLSI Model: Hofmann (1999)

– LDA Model: Blei, Ng, and Jordan (2001, 2003)

– LDA with Gibbs sampling: Griffiths and Steyvers (2003, 2004)

Page 5: Analyzing unstructured text with topic models

Basic Assumptions

• Each topic is a distribution over words

• Each document is a mixture of topics

• Each word in a document originates from a single topic

Page 6: Analyzing unstructured text with topic models

Model

$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$, summing over topics z

• P(w | z): a topic is a probability distribution over words

• P(z | d): topic weights for each document

• Both are learned automatically from the text corpus
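The mixture formula can be checked by hand on a tiny vocabulary. The sketch below is only an illustration: the two topics, their word probabilities, and the document weights are made-up numbers, not values from the talk.

```python
# Hypothetical topic-word distributions P(w | z); numbers are illustrative only.
topic_word = {
    "topic 1": {"MONEY": 0.4, "LOAN": 0.3, "BANK": 0.3, "RIVER": 0.0, "STREAM": 0.0},
    "topic 2": {"MONEY": 0.0, "LOAN": 0.0, "BANK": 0.3, "RIVER": 0.4, "STREAM": 0.3},
}

def word_probabilities(topic_word, doc_topic):
    """Return P(w | d) = sum over topics z of P(w | z) * P(z | d)."""
    vocab = next(iter(topic_word.values())).keys()
    return {
        w: sum(doc_topic[z] * topic_word[z][w] for z in doc_topic)
        for w in vocab
    }

# Hypothetical topic weights P(z | d) for one document.
doc_topic = {"topic 1": 0.4, "topic 2": 0.6}
p_word = word_probabilities(topic_word, doc_topic)
```

Because BANK has the same probability under both topics here, its probability in the document does not depend on the topic weights, while MONEY and RIVER shift with them.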

Page 7: Analyzing unstructured text with topic models

Toy Example

Topics:
Topic 1: MONEY, LOAN, BANK
Topic 2: RIVER, STREAM, BANK

Topic weights: 1.0 (document 1), .4 and .6 (document 2), 1.0 (document 3)

Documents and topic assignments:
Document 1: MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ....
Document 2: RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ....
Document 3: RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2....
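The toy example follows a simple generative recipe: for every word position, pick a topic according to the document's topic weights, then pick a word from that topic. A minimal sketch of that recipe, assuming uniform word probabilities within each topic (the slide does not give them) and a hypothetical helper name `generate_document`:

```python
import random

# Assumed word distributions: uniform within each topic (not stated on the slide).
TOPIC_WORD = {
    1: {"MONEY": 1 / 3, "LOAN": 1 / 3, "BANK": 1 / 3},
    2: {"RIVER": 1 / 3, "STREAM": 1 / 3, "BANK": 1 / 3},
}

def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_document(doc_topic, n_words, rng):
    """Return a list of (word, topic) pairs, mirroring labels such as BANK1 or BANK2."""
    tokens = []
    for _ in range(n_words):
        z = draw(doc_topic, rng)        # choose a topic for this word position
        w = draw(TOPIC_WORD[z], rng)    # choose a word from that topic
        tokens.append((w, z))
    return tokens

rng = random.Random(0)
money_doc = generate_document({1: 1.0, 2: 0.0}, 12, rng)   # only topic 1
mixed_doc = generate_document({1: 0.4, 2: 0.6}, 12, rng)   # mixture of both topics
```

Note that BANK can be generated by either topic, which is why the same word appears with both labels in the mixed document on the slide.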

Page 8: Analyzing unstructured text with topic models

Statistical Inference

Topics: ?

Topic weights: ?

Documents and topic assignments: ?
Document 1: MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....
Document 2: RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....
Document 3: RIVER? BANK? STREAM? BANK? RIVER? BANK?....

Page 9: Analyzing unstructured text with topic models

Statistical Inference

• Exact inference is intractable

• Markov chain Monte Carlo (MCMC) with Gibbs sampling

• scalable to large document collections (e.g. all of Wikipedia)

• parallelizable

• Form of dimensionality reduction

– Number of topics T = 50…2000
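To make the Gibbs sampling bullet concrete, here is a minimal pure-Python sketch of a collapsed Gibbs sampler for LDA, run on three toy documents. It is not the talk's implementation: the function name `gibbs_lda`, the hyperparameter values `alpha` and `beta`, and the iteration count are assumptions chosen for illustration.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # tokens of word w in topic k
    topic_total = [0] * n_topics                               # all tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given every other assignment.
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word

vocab = ["MONEY", "LOAN", "BANK", "RIVER", "STREAM"]
word_id = {w: i for i, w in enumerate(vocab)}
corpus = [
    "MONEY BANK BANK LOAN BANK MONEY BANK".split(),
    "RIVER MONEY BANK STREAM BANK BANK MONEY".split(),
    "RIVER BANK STREAM BANK RIVER BANK".split(),
]
docs = [[word_id[w] for w in doc] for doc in corpus]
assignments, doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
```

Each sweep touches one token at a time and only updates three count tables, which is part of why the approach scales to large collections.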

Page 10: Analyzing unstructured text with topic models

Example Topics from the New York Times

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY, DOW_INDUSTRIALS, GRAPH_TRACKS, EXPECTED, BILLION, NASDAQ_COMPOSITE_INDEX, EST_02, PHOTO_YESTERDAY, YEN, 10, 500_STOCK_INDEX

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES, RESEARCH, STOCK, BUSINESS, ANALYST, WALL_STREET_FIRMS, SALOMON_SMITH_BARNEY, CLIENTS, INVESTMENT_BANKING, INVESTMENT_BANKERS, INVESTMENT_BANKS

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN, AMERICAN, ATTACK, NEW_YORK_REGION, NEW, MILITARY, NEW_YORK, WORLD, NATIONAL, QAEDA, TERRORIST_ATTACKS

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART, CHAPTER_11, FILING, COOPER, BILLIONS, COMPANIES, BANKRUPTCY_PROCEEDINGS, DEBTS, RESTRUCTURING, CASE, GROUP

Page 11: Analyzing unstructured text with topic models

Learning multiple meanings of words

• PRINTING, PAPER, PRINT, PRINTED, TYPE, PROCESS, INK, PRESS, IMAGE, PRINTER, PRINTS, PRINTERS, COPY, COPIES, FORM, OFFSET, GRAPHIC, SURFACE, PRODUCED, CHARACTERS

• PLAY, PLAYS, STAGE, AUDIENCE, THEATER, ACTORS, DRAMA, SHAKESPEARE, ACTOR, THEATRE, PLAYWRIGHT, PERFORMANCE, DRAMATIC, COSTUMES, COMEDY, TRAGEDY, CHARACTERS, SCENES, OPERA, PERFORMED

• TEAM, GAME, BASKETBALL, PLAYERS, PLAYER, PLAY, PLAYING, SOCCER, PLAYED, BALL, TEAMS, BASKET, FOOTBALL, SCORE, COURT, GAMES, TRY, COACH, GYM, SHOT

• JUDGE, TRIAL, COURT, CASE, JURY, ACCUSED, GUILTY, DEFENDANT, JUSTICE, EVIDENCE, WITNESSES, CRIME, LAWYER, WITNESS, ATTORNEY, HEARING, INNOCENT, DEFENSE, CHARGE, CRIMINAL

• HYPOTHESIS, EXPERIMENT, SCIENTIFIC, OBSERVATIONS, SCIENTISTS, EXPERIMENTS, SCIENTIST, EXPERIMENTAL, TEST, METHOD, HYPOTHESES, TESTED, EVIDENCE, BASED, OBSERVATION, SCIENCE, FACTS, DATA, RESULTS, EXPLANATION

• STUDY, TEST, STUDYING, HOMEWORK, NEED, CLASS, MATH, TRY, TEACHER, WRITE, PLAN, ARITHMETIC, ASSIGNMENT, PLACE, STUDIED, CAREFULLY, DECIDE, IMPORTANT, NOTEBOOK, REVIEW

Page 12: Analyzing unstructured text with topic models

Demographic Analysis of Search Queries

Page 13: Analyzing unstructured text with topic models

AOL dataset

• Dataset:

- 20,000,000+ web queries

- 650,000+ users

• Users were given “anonymous” user-id

– No demographics in this dataset

Page 14: Analyzing unstructured text with topic models

Example query log from user #2178

ID | Query | Date/Time | URL clicked
2178 | dog eats uncooked pasta | 2006-05-26 15:31:56 |
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.twodogpress.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.canismajor.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://kitchen.robbiehaf.com
2178 | inducing dog vomiting | 2006-05-26 15:32:46 | http://www.dog-first-aid-101.com
2178 | inducing dog vomiting | 2006-05-26 15:38:36 |
2178 | walmart | 2006-05-12 12:39:52 | http://www.walmart.com
2178 | sears | 2006-05-12 12:44:22 | http://www.sears.com
2178 | target | 2006-05-12 17:05:36 | http://www.target.com
2178 | babycenter.com | 2006-05-12 17:43:59 | http://www.babycenter.com
2178 | google | 2006-05-16 10:54:39 | http://www.google.com
2178 | fit pregnancy | 2006-05-16 15:34:23 |
2178 | baby center | 2006-05-16 15:37:22 |
2178 | yahoo.com | 2006-05-18 17:11:05 | http://www.yahoo.com
2178 | applebee's carside | 2006-05-19 19:21:08 | http://www.applebees.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynames.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.babynamesworld.com
2178 | baby names | 2006-05-20 15:02:38 | http://www.thinkbabynames.com
2178 | mortgage calculator | 2006-05-24 14:39:05 | http://www.bankrate.com
2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com
2178 | us zip codes | 2006-05-25 21:26:47 | http://www.usps.com

Page 15: Analyzing unstructured text with topic models

Another Query Database…

• Not publicly available

• Dataset

– 250,000+ users

– 411,000+ queries

• Age and gender of users are known:

– age brackets: 0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+

Page 16: Analyzing unstructured text with topic models

Topic modeling of queries

• Each user searches for a mixture of topics

• Each topic is a probability distribution over query words
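Treating each user as a "document" means pooling that user's queries into one bag of words before fitting the topic model. The sketch below shows one way this could be done; the function name `user_documents` and the choice to count a query once per timestamp (the log repeats a row for every clicked URL) are assumptions, not details given in the talk.

```python
from collections import Counter, defaultdict

def user_documents(rows):
    """Pool query words per user. rows: iterable of (user_id, query, timestamp)."""
    seen = set()
    docs = defaultdict(Counter)
    for user_id, query, timestamp in rows:
        key = (user_id, query, timestamp)
        if key in seen:          # same query logged again for another clicked URL
            continue
        seen.add(key)
        docs[user_id].update(query.lower().split())
    return docs

rows = [
    (2178, "inducing dog vomiting", "2006-05-26 15:32:46"),
    (2178, "inducing dog vomiting", "2006-05-26 15:32:46"),
    (2178, "baby center", "2006-05-16 15:37:22"),
    (2178, "baby names", "2006-05-20 15:02:38"),
]
docs_by_user = user_documents(rows)
```

The resulting word counts per user can then be fed to a topic model in place of ordinary documents.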

Page 17: Analyzing unstructured text with topic models

Four example topics (out of 200)

• brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural

• schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical

• memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance

• disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals

• auto, car, parts, cars, used, ford, honda, truck, toyota

• webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien

• party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts

• hannah, montana, zac, efron, disney, high school, musical, miley cyrus, hilary duff

Each topic is a probability distribution over words; the most likely words are listed first.

Page 18: Analyzing unstructured text with topic models

User = mixture of topics

• brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural

• schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical

• memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance

• disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals

• auto, car, parts, cars, used, ford, honda, truck, toyota

• hannah, montana, zac, efron, disney, high school, musical, miley cyrus, hilary duff

• webmd, cymbalta, xanax, gout, vicodin, effexor, prednisone, lexapro, ambien

• party, store, wedding, birthday, jewelry, ideas, cards, cake, gifts

User #7654: 80% and 20% (a mixture of two topics)

User #246: 100% (a single topic)

Page 19: Analyzing unstructured text with topic models

Topic Analysis

• Find likely topics for each demographic bucket

• Find likely demographics given topics

• What’s on the mind of people in different age-groups?
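Both directions of this analysis can be sketched from per-user topic weights. In the sketch below, P(topic | bucket) is estimated by averaging the topic weights of users in each bucket, and P(bucket | topic) follows from Bayes' rule. The averaging estimator, the function names, and the toy numbers are assumptions for illustration; the talk does not specify how these quantities were computed.

```python
def topic_given_group(user_topics, user_group):
    """Average user topic weights within each group: an estimate of P(topic | group)."""
    sums, counts = {}, {}
    for user, weights in user_topics.items():
        group = user_group[user]
        acc = sums.setdefault(group, [0.0] * len(weights))
        for k, weight in enumerate(weights):
            acc[k] += weight
        counts[group] = counts.get(group, 0) + 1
    return {g: [s / counts[g] for s in acc] for g, acc in sums.items()}

def group_given_topic(p_topic_group, group_prior, topic):
    """Bayes' rule: P(group | topic) is proportional to P(topic | group) * P(group)."""
    joint = {g: p_topic_group[g][topic] * group_prior[g] for g in p_topic_group}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

# Hypothetical users with weights over two topics.
user_topics = {"u1": [0.8, 0.2], "u2": [0.6, 0.4], "u3": [0.1, 0.9]}
user_group = {"u1": "13-17", "u2": "13-17", "u3": "45-54"}
group_prior = {"13-17": 2 / 3, "45-54": 1 / 3}   # share of users in each bucket

p_topic_group = topic_given_group(user_topics, user_group)
p_group_topic0 = group_given_topic(p_topic_group, group_prior, topic=0)
```

The same functions work unchanged if the buckets combine age and gender, for example "13-17 female".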

Page 20: Analyzing unstructured text with topic models

“poems” topic (Topic 6)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: poems, love_poems, quotes, poetry, love_quotes, famous_quotes, lyrics, love, funny_quotes, friendship_poems, best_love_poems, funny_poems, inspirational_quotes, love_songs, shakespeare

Page 21: Analyzing unstructured text with topic models

“myspace” topic (Topic 2)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: myspace, google, my_space, yahoo, mysapce, about_blank, my, photobucket, http_google, ww.myspace, myspace_com_blogs, http_myspace, myspace.co, w_myspace, myspcae

Page 22: Analyzing unstructured text with topic models

“sports” topic (Topic 29)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: espn, nfl, nfl_draft, nba, 2006_nfl_mock_draft, 2006_nfl_draft, mlb, reggie_bush, nfl_mock_draft, dallas_cowboys, vince_young, fox_sports, lakers, raiders, espn_sports

Page 23: Analyzing unstructured text with topic models

“MTV” topic (Topic 92)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: bet, chris_brown, mtv, lyrics, ciara, 50_cent, ti, proof, bow_wow, chamillionaire, t.i., beyonce, atl, allhiphop, lil_wayne

Page 24: Analyzing unstructured text with topic models

“Clothing Stores” topic (Topic 111)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: old_navy, victoria_secret, hollister, american_eagle, gap, abercrombie, aeropostale, forever_21, victorias_secret, express, charlotte_russe, hot_topic, target, abercrombiefitch, wet_seal

Page 25: Analyzing unstructured text with topic models

“Hairstyles” topic (Topic 173)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: hairstyles, hair_styles, prom_hairstyles, pictureshairstyles, haircuts, sally_beauty_supply, celebrity_hairstyles, hair, short_hairstyles, cosmopolitan, prom_updos, prom_hair_styles, short_hair_styles, picturesprom_hairstyles, prom_hair

Page 26: Analyzing unstructured text with topic models

“recipes” topic (Topic 10)

[Figure: probability of the topic by age group (0-12, 13-17, 18-20, 21-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65+), shown separately for male and female users]

Top words: food_network, recipes, foodnetwork, foodtv, martha_stewart, kraft, betty_crocker, food_tv, food_network_recipes, allrecipes, easter_recipes, epicurious, rachel_ray, kraft_foods, chicken_recipes

Page 27: Analyzing unstructured text with topic models

Results

• Topic models give quick summaries of demographic trends in query datasets

• Other potential applications:

– e.g. blogs, social networking sites, email, etc

– clinical data, e.g. therapy discussions

Page 28: Analyzing unstructured text with topic models

Analyzing Emails: who writes on what topics?

Page 29: Analyzing unstructured text with topic models

Enron email data

250,000 emails

5000 authors

1999-2002

Page 30: Analyzing unstructured text with topic models

Author-topic models

• We can learn the association between authors of documents and topics

• Assume each author works on a mixture of topics
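One way to read the assumption above is as a generative recipe: each word of a document is attributed to one of its authors, a topic is drawn from that author's mixture, and the word is drawn from the topic. The sketch below follows the published author-topic model in choosing the author uniformly at random; that detail, the function name, and all names and probabilities in the example are assumptions for illustration, not content from the slides.

```python
import random

def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_author_topic_doc(authors, author_topic, topic_word, n_words, rng):
    """Return (word, topic, author) triples for one document."""
    tokens = []
    for _ in range(n_words):
        author = rng.choice(authors)               # assumed: uniform over the document's authors
        topic = draw(author_topic[author], rng)    # topic from that author's mixture
        word = draw(topic_word[topic], rng)        # word from that topic
        tokens.append((word, topic, author))
    return tokens

# Hypothetical authors and topics.
author_topic = {
    "author_a": {"sports": 1.0, "accounting": 0.0},
    "author_b": {"sports": 0.2, "accounting": 0.8},
}
topic_word = {
    "sports": {"FOOTBALL": 0.5, "GAME": 0.5},
    "accounting": {"AUDIT": 0.5, "FIRM": 0.5},
}
rng = random.Random(1)
solo_doc = generate_author_topic_doc(["author_a"], author_topic, topic_word, 10, rng)
joint_doc = generate_author_topic_doc(["author_a", "author_b"], author_topic, topic_word, 10, rng)
```

Inference runs this recipe in reverse: given the words and the author lists, it estimates each author's topic mixture.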

Page 31: Analyzing unstructured text with topic models

ENRON Email: who writes on certain topics?

WORD PROB. | WORD PROB. | WORD PROB. | WORD PROB.
HOLIDAY 0.0857 | TEXANS 0.0145 | GOD 0.0357 | AMAZON 0.0312
PARTY 0.0368 | WIN 0.0143 | LIFE 0.0272 | GIFT 0.0226
YEAR 0.0316 | FOOTBALL 0.0137 | MAN 0.0116 | CLICK 0.0193
SEASON 0.0305 | FANTASY 0.0129 | PEOPLE 0.0103 | SAVE 0.0147
COMPANY 0.0255 | SPORTSLINE 0.0129 | CHRIST 0.0092 | SHOPPING 0.0140
CELEBRATION 0.0199 | PLAY 0.0123 | FAITH 0.0083 | OFFER 0.0124
ENRON 0.0198 | TEAM 0.0114 | LORD 0.0079 | HOLIDAY 0.0122
TIME 0.0194 | GAME 0.0112 | JESUS 0.0075 | RECEIVE 0.0102
RECOGNIZE 0.019 | SPORTS 0.011 | SPIRITUAL 0.0066 | SHIPPING 0.0100
MONTH 0.018 | GAMES 0.0109 | VISIT 0.0065 | FLOWERS 0.0099

SENDER PROB. | SENDER PROB. | SENDER PROB. | SENDER PROB.
chairman & ceo 0.131 | cbs sportsline com 0.0866 | crosswalk com 0.2358 | amazon com 0.1344
*** 0.0102 | houston texans 0.0267 | wordsmith 0.0208 | jos a bank 0.0266
*** 0.0046 | houstontexans 0.0203 | *** 0.0107 | sharperimageoffers 0.0136
*** 0.0022 | sportsline rewards 0.0175 | doctor dictionary 0.0101 | travelocity com 0.0094
general announcement 0.0017 | pro football 0.0136 | *** 0.0061 | barnes & noble com 0.0089

TOPIC 109 TOPIC 66 TOPIC 182 TOPIC 113

... But also over senders (authors) of email. Most likely authors listed at the top

Page 32: Analyzing unstructured text with topic models

Enron email: two example topics (T=100)

WORD PROB.

BUSH 0.0227

LAY 0.0193

MR 0.0183

WHITE 0.0153

ENRON 0.0150

HOUSE 0.0148

PRESIDENT 0.0131

ADMINISTRATION 0.0115

COMPANY 0.0090

ENERGY 0.0085

SENDER PROB.

NELSON, KIMBERLY (ETS) 0.3608

PALMER, SARAH 0.0997

DENNE, KAREN 0.0541

HOTTE, STEVE 0.0340

DUPREE, DIANNA 0.0282

ARMSTRONG, JULIE 0.0222

LOKEY, TEB 0.0194

SULLIVAN, LORA 0.0073

VILLARREAL, LILLIAN 0.0040

BAGOT, NANCY 0.0026

TOPIC 10

WORD PROB.

ANDERSEN 0.0241

FIRM 0.0134

ACCOUNTING 0.0119

SEC 0.0065

SETTLEMENT 0.0062

AUDIT 0.0054

CORPORATE 0.0053

FINANCIAL 0.0052

JUSTICE 0.0052

INFORMATION 0.0050

SENDER PROB.

HILTABRAND, LESLIE 0.1359

WELLS, TORI L. 0.0865

DUPREE, DIANNA 0.0825

ARMSTRONG, JULIE 0.0316

DENNE, KAREN 0.0208

SULLIVAN, LORA 0.0072

[email protected] 0.0026

WILSON, DANNY 0.0016

HU, SYLVIA 0.0013

MATHEWS, LEENA 0.0012

TOPIC 32

Page 33: Analyzing unstructured text with topic models

Detecting Papers on Unusual Topics for Authors

• We can calculate perplexity (unusualness) for words in a document given an author
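Perplexity is conventionally the exponential of the negative average log-probability per word, so a document the author's topics explain poorly scores high. A minimal sketch under that standard definition; the function name, the probability floor for unseen words, and the toy distributions are assumptions for illustration.

```python
import math

def perplexity(words, author_topic, topic_word, floor=1e-12):
    """exp(-mean log P(w | author)), with P(w | author) = sum_z P(w | z) P(z | author)."""
    log_likelihood = 0.0
    for w in words:
        p = sum(author_topic[z] * topic_word[z].get(w, 0.0) for z in author_topic)
        log_likelihood += math.log(max(p, floor))   # floor avoids log(0) for unseen words
    return math.exp(-log_likelihood / len(words))

# Hypothetical author who writes only about one topic with four equally likely words.
topic_word = {
    "learning": {"MODEL": 0.25, "DATA": 0.25, "MIXTURE": 0.25, "PRIOR": 0.25},
    "vision": {"IMAGE": 0.5, "PIXEL": 0.5},
}
author_topic = {"learning": 1.0, "vision": 0.0}

typical = perplexity(["MODEL", "DATA", "PRIOR"], author_topic, topic_word)
unusual = perplexity(["IMAGE", "PIXEL", "MODEL"], author_topic, topic_word)
```

Ranking an author's papers by this score puts the most atypical ones first.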

Papers ranked by perplexity for M. Jordan:

Page 34: Analyzing unstructured text with topic models

Author Separation

Can the model attribute words to authors correctly within a document?

A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms

This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions

Written by (1) Scholkopf_B

Written by (2) Darwiche_A

Page 35: Analyzing unstructured text with topic models

Application: Faculty Browser

Page 36: Analyzing unstructured text with topic models

Faculty Browser

• Automatically analyzes computer science papers by UC San Diego and UC Irvine researchers

• Finds topically related researchers

Page 37: Analyzing unstructured text with topic models

one topic

most prolific researchers in this topic

Page 38: Analyzing unstructured text with topic models

topics this researcher is interested in

other researchers with similar topical interests

one researcher

Page 39: Analyzing unstructured text with topic models

Inferred network of researchers connected through topics

Page 40: Analyzing unstructured text with topic models

Modeling Extensions

Page 41: Analyzing unstructured text with topic models

330,000 articles

2000-2002

Entity-topic modeling

Who is mentioned in what context?

Page 42: Analyzing unstructured text with topic models

Three investigations began Thursday into the securities and exchange_commission's choice of william_webster to head a new board overseeing the accounting profession. house and senate_democrats called for the resignations of both judge_webster and harvey_pitt, the commission's chairman. The white_house expressed support for judge_webster as well as for harvey_pitt, who was harshly criticized Thursday for failing to inform other commissioners before they approved the choice of judge_webster that he had led the audit committee of a company facing fraud accusations. “The president still has confidence in harvey_pitt,” said dan_bartlett, bush's communications director …

Extracted Named Entities

Used standard algorithms to extract named entities:

- People
- Places
- Organizations

Page 43: Analyzing unstructured text with topic models

Standard Topic Model with Entities

Basketball | Tour de France | Holidays | Oscars
team 0.028 | tour 0.039 | holiday 0.071 | award 0.026
play 0.015 | rider 0.029 | gift 0.050 | film 0.020
game 0.013 | riding 0.017 | toy 0.023 | actor 0.020
season 0.012 | bike 0.016 | season 0.019 | nomination 0.019
final 0.011 | team 0.016 | doll 0.014 | movie 0.015
games 0.011 | stage 0.014 | tree 0.011 | actress 0.011
point 0.011 | race 0.013 | present 0.008 | won 0.011
series 0.011 | won 0.012 | giving 0.008 | director 0.010
player 0.010 | bicycle 0.010 | special 0.007 | nominated 0.010
coach 0.009 | road 0.009 | shopping 0.007 | supporting 0.010
playoff 0.009 | hour 0.009 | family 0.007 | winner 0.008
championship 0.007 | scooter 0.008 | celebration 0.007 | picture 0.008
playing 0.006 | mountain 0.008 | card 0.007 | performance 0.007
win 0.006 | place 0.008 | tradition 0.006 | nominees 0.007
LAKERS 0.062 | LANCE-ARMSTRONG 0.021 | CHRISTMAS 0.058 | OSCAR 0.035
SHAQUILLE-O-NEAL 0.028 | FRANCE 0.011 | THANKSGIVING 0.018 | ACADEMY 0.020
KOBE-BRYANT 0.028 | JAN-ULLRICH 0.003 | SANTA-CLAUS 0.009 | HOLLYWOOD 0.009
PHIL-JACKSON 0.019 | LANCE 0.003 | BARBIE 0.004 | DENZEL-WASHINGTON 0.006
NBA 0.013 | U-S-POSTAL-SERVICE 0.002 | HANUKKAH 0.003 | JULIA-ROBERT 0.005
SACRAMENTO 0.007 | MARCO-PANTANI 0.002 | MATTEL 0.003 | RUSSELL-CROWE 0.005
RICK-FOX 0.007 | PARIS 0.002 | GRINCH 0.003 | TOM-HANK 0.005
PORTLAND 0.006 | ALPS 0.002 | HALLMARK 0.002 | STEVEN-SODERBERGH 0.004
ROBERT-HORRY 0.006 | PYRENEES 0.001 | EASTER 0.002 | ERIN-BROCKOVICH 0.003
DEREK-FISHER 0.006 | SPAIN 0.001 | HASBRO 0.002 | KEVIN-SPACEY 0.003

Page 44: Analyzing unstructured text with topic models

Computers
Words: computer 0.069, technology 0.026, system 0.015, digital 0.014, chip 0.013, software 0.013, machine 0.011, devices 0.010, machines 0.010, video 0.009
Entity types: Companies 1.000
Companies: IBM 0.074, APPLE 0.061, INTEL 0.059, MICROSOFT 0.053, COMPAQ 0.041, SONY 0.029, DELL 0.019, HP 0.018

Arts
Words: play 0.030, show 0.029, stage 0.022, theater 0.022, director 0.017, production 0.017, performance 0.016, dance 0.014, audience 0.014, festival 0.013
Entity types: Theater 0.960, Music 0.040
Theatre: BROADWAY 0.119, NEW_YORK 0.044, SHAKESPEARE 0.029, THEATER 0.022, LONDON 0.019, GUINNESS 0.018, TONY 0.016, LINCOLN_CTR 0.015
Music: BACH 0.035, BEETHOVEN 0.026, LOUIS_ARMSTRONG 0.019, MOZART 0.019, CARNEGIE_HALL 0.017, LATIN 0.017

Page 45: Analyzing unstructured text with topic models

Example of Extracted Entity-Topic Network

Topics: Muslim_Militance, Mid_East_Conflict, Palestinian_Territories, Pakistan_Indian_War, FBI_Investigation, Detainees, Mid_East_Peace, US_Military, Religion, Terrorist_Attacks, Afghanistan_War

Entities: AL_QAEDA, HAMID_KARZAI, MOHAMMED, MOHAMMED_ATTA, NORTHERN_ALLIANCE, BIN_LADEN, TALIBAN, ZAWAHIRI, YASSER_ARAFAT, EHUD_BARAK, ARIEL_SHARON, HAMAS, AL_HAZMI, KING_HUSSEIN

Page 46: Analyzing unstructured text with topic models

Topic Trends

[Figure: three time-series panels, one each for the Tour-de-France, Anthrax, and Quarterly Earnings topics, spanning Jan00 to Jan03. Each panel shows the proportion of words assigned to the topic for that time slice.]
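The quantity plotted in a trend panel can be computed directly from the sampled topic assignments. A minimal sketch, assuming each word token is available as a (time slice, topic id) pair; the function name and the toy data are hypothetical.

```python
from collections import Counter

def topic_trend(token_assignments, topic):
    """Proportion of word tokens assigned to `topic` in each time slice."""
    totals, hits = Counter(), Counter()
    for time_slice, z in token_assignments:
        totals[time_slice] += 1
        if z == topic:
            hits[time_slice] += 1
    return {t: hits[t] / totals[t] for t in sorted(totals)}

# Hypothetical tokens: (time slice, topic id).
tokens = [
    ("2001-07", 3), ("2001-07", 3), ("2001-07", 8), ("2001-07", 8),
    ("2001-10", 8), ("2001-10", 8), ("2001-10", 8), ("2001-10", 3),
]
trend = topic_trend(tokens, topic=8)
```

With ISO-style slice labels such as "2001-07", sorting the keys also orders the series chronologically.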

Page 47: Analyzing unstructured text with topic models

Learning Topic Hierarchies (example: Psych Review abstracts)

[Figure: learned topic hierarchy; top words of each node, one node per line]

THE, OF, AND, TO, IN, A, IS
RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING
SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC
ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING
GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS
SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS
REASONING, ATTITUDE, CONSISTENCY, SITUATIONAL, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL
IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, SUBMOVEMENT, ORIENTATION, HOLOGRAPHIC
CONDITIONING, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES
A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT
SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING
MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES
DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN

Page 48: Analyzing unstructured text with topic models

[Figure: learned topic hierarchy; top five words of each node, one node per line]

theory, model, data, information, proposed
model, theory, models, word, response
reading, text, readers, meaning, comprehension
bias, associative, matrices, matrix, al
memory, list, item, items, recognition
distributed, grams, associate, associations, paired
strength, familiarity, retroactive, deviation, likelihood
response, instrumental, responses, conditioning, behavior
choice, delays, alternatives, fixed, reward
memory, model, models, information, social
knowledge, skill, reading, access, specific
model, effects, learning, theory, systems
memory, retrieval, serial, storage, working
preference, reinforcement, choice, punishment, contingent
model, theory, information, effects, account
images, perception, according, lightness, objects
visual, imagery, representations, mental, subsystems
movement, eye, position, speed, target
orientation, erotic, bem, sexual, ebe
situational, consistency, cross, temporal, behavior
object, based, neglect, attention, space
stimuli, visual, component, contour, forward
attribute, stochastic, choice, difference, transitivity
masking, metacontrast, type, inhibition, mask
serial, function, latency, position, items
reasoning, bayesian, similarities, statements, gain
similarity, geometric, objects, density, distance
ce, conditioning, principles, reinforcement, reward
model, memory, processes, models, learning
image, components, bound, nearest, neighbor
memory, reasoning, interference, process, background
theory, sentence, james, fit, emotion
model, memory, decision, response, theory
theory, achievement, emotion, motivation, failure
model, cs, avoidance, ucs, conditioning
model, memory, problems, items, theoretical
goodness, approach, representation, holographic, pictorial
letters, model, words, letter, memory
function, psychometric, correlations, individuals, performance
stress, system, immune, arousal, fight
sex, affects, biological, differences, handedness
cognitive, gigerenzer, heuristics, reasoning, biases
child, children, development, field, risk
bayesian, inference, algorithms, authors, frequency
speech, auditory, acoustic, perceptual, sound
action, control, intention, goal, intentions
personality, behavior, trait, consistency, idiographic
surface, representations, surfaces, occluding, contour
psychological, psychology, review, american, association
events, interpersonal, event, impersonal, equilibrium
categories, category, metaphor, object, metaphors
motion, contrast, path, visual, contour
left, cerebral, handedness, speech, human
social, perception, impression, research, approach
sleep, imagery, dreaming, rem, eye
reinforcement, behavior, extinction, matching, partial
binocular, rivalry, stereopsis, monocular, visual
structure, relations, scale, dimensional, keys
risk, conjunction, decision, probabilities, risky
distance, retinal, disparity, image, perceived
perception, visual, direction, rule, adaptation
part, thinking, kind, scientific, activities
behavior, development, evolutionary, genes, comparative
group, intelligence, intellectual, iq, connections
behavior, food, drinking, hypothalamus, physiological
task, resource, performance, processing, anaphors
developmental, social, ethnic, processes, development
fear, anxiety, pain, amygdala, automatic
neural, visual, neurons, behavioral, masking
strategies, problems, term, confirmation, limitations
language, semantic, linguistic, thought, correlations
learning, maps, map, barrier, parallel
statistical, heuristics, knowledge, intuitive, heuristic
face, recognition, faces, damaged, semantic

Page 49: Analyzing unstructured text with topic models

Hidden Markov Topics Model

• Syntactic dependencies: short-range

• Semantic dependencies: long-range

[Figure: graphical model over four consecutive words, with topic assignments z, words w, and syntactic states s]

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
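The division of labour between the two kinds of state can be sketched as a generator: an HMM moves between states, syntactic states emit from their own word lists, and one designated semantic state hands over to the topic model. This is a simplified illustration rather than the published model's full specification; the state names, transition table, and word probabilities below are invented for the example.

```python
import random

def draw(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_sentence(n_words, transitions, class_words, topic_word, doc_topic, rng):
    """Walk the HMM; emit from a topic in the SEMANTIC state, else from the state's class."""
    state, words = "START", []
    for _ in range(n_words):
        state = draw(transitions[state], rng)
        if state == "SEMANTIC":
            topic = draw(doc_topic, rng)                # long-range: document's topic mixture
            words.append(draw(topic_word[topic], rng))
        else:
            words.append(draw(class_words[state], rng))  # short-range: syntactic class
    return words

# Hypothetical HMM with two syntactic classes and one semantic state.
transitions = {
    "START": {"DET": 1.0},
    "DET": {"SEMANTIC": 1.0},
    "SEMANTIC": {"VERB": 1.0},
    "VERB": {"DET": 1.0},
}
class_words = {"DET": {"THE": 1.0}, "VERB": {"IS": 0.5, "SHOWS": 0.5}}
topic_word = {"kernels": {"KERNEL": 0.5, "SVM": 0.5}}
doc_topic = {"kernels": 1.0}

sentence = generate_sentence(4, transitions, class_words, topic_word, doc_topic, random.Random(0))
```

Swapping in a different `doc_topic` changes the content words while the function words and word order stay governed by the HMM.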

Page 50: Analyzing unstructured text with topic models

NIPS Syntax

• MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS
• IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS
• SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST
• USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN
• IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN
• HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY
• #, *, I, X, T, N, -, C, F, P

NIPS Semantics

• EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE
• DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS
• STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *
• MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS
• IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL
• KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET
• NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS

Page 51: Analyzing unstructured text with topic models

Random sentence generation

LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Page 52: Analyzing unstructured text with topic models

Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm