probabilistic topic models mark steyvers department of cognitive sciences university of california,...

96
Probabilistic Topic Models Mark Steyvers Department of Cognitive Sciences University of California, Irvine oint work with: om Griffiths, UC Berkeley adhraic Smyth, UC Irvine ave Newman, UC Irvine

Post on 22-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Probabilistic Topic Models

Mark SteyversDepartment of Cognitive Sciences

University of California, Irvine

Joint work with:Tom Griffiths, UC BerkeleyPadhraic Smyth, UC IrvineDave Newman, UC Irvine

Overview

I Problems of interest

II Probabilistic Topic Models

II Using topic models in machine learning and data mining

III Using topic models in psychology

IV Conclusion

Problems of Interest

What topics does this text collection “span”?

Which documents are about a particular topic?

Who writes about a particular topic?

How have topics changed over time?

How to represent the “gist” of a list of words?

How to model associations between words?

Machine Learning/ Data mining Psychology

Probabilistic Topic Models

Based on pLSI and LDA model(e.g., Hoffman, 2001; Blei, Ng, Jordan, 2003)

A probability model linking documents, topics, and words Topic distribution over words Document mixture of topics

Topics and topic mixtures can be extracted from large collections of text

Example Topics extracted from NIH/NSF grants

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data

Overview

I Problems of interest

II Probabilistic Topic Models

II Using topic models in machine learning and data mining

III Using topic models in psychology

IV Conclusion

Parameters Real WorldData

P( Data | Parameters)

P( Parameters | Data)

Probabilistic Model

Statistical Inference

Document generation as a probabilistic process

Each topic is a distribution over words Each document a mixture of topics Each word chosen from a single topic

1

( ) ( | ) ( )T

i i i ij

P w P w z j P z j

From paramet

ers (j)

Fromparameters

(d)

Prior Distributions

Dirichlet priors encourage sparsity on topic mixtures and topics

θ ~ Dirichlet( )

~ Dirichlet( )

Topic 1 Topic 2

Topic 3

Word 1 Word 2

Word 3

Topics

.4

1.0

.6

1.0

MONEYLOANBANKRIVER

STREAM

RIVERSTREAM

BANKMONEY

LOAN

MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1

BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1

MONEY1 ....

Mixtures θ

Documents and topic

assignments

RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1

MONEY1 RIVER2 MONEY1 BANK2 LOAN1

MONEY1 ....

RIVER2 BANK2 STREAM2 BANK2 RIVER2

BANK2....

Example of generating words

Topics

?

?

MONEY? BANK BANK? LOAN? BANK? MONEY?

BANK? MONEY? BANK? LOAN? LOAN? BANK?

MONEY? ....

Mixtures θ

RIVER? MONEY? BANK? STREAM? BANK? BANK?

MONEY? RIVER? MONEY? BANK? LOAN?

MONEY? ....

RIVER? BANK? STREAM? BANK? RIVER?

BANK?....

Inference

Documents and topic

assignments

?

Bayesian Inference

Three sets of latent variables topic mixtures θ word distributions topic assignments z

Integrate out θ and and estimate topic assignments:

Use MCMC with Gibbs sampling for approximate inference

,

|, z

P w zP z w

P w z

Sum over termsnT

Gibbs Sampling

Start with random assignments of words to topics

Repeat M iterations Repeat for all words i

Sample a new topic assignment for word i conditioned on all other topic assignments( | )i ip z t z

16 Artificial Documents

River Stream Bank Money Loan123456789

10111213141516

Can we recover the original topics and topic mixtures from this data?

docu

men

ts

Starting the Gibbs Sampling

River Stream Bank Money Loan123456789

10111213141516

Assign word tokens randomly to topics (●=topic 1; ●=topic 2 )

River Stream Bank Money Loan123456789

10111213141516

After 1 iteration

River Stream Bank Money Loan123456789

10111213141516

After 4 iterations

River Stream Bank Money Loan123456789

10111213141516

After 32 iterations

stream .40 bank .39bank .35 money .32river .25 loan .29

topic 1 topic 2●● ●●

Choosing the number of topics

Bayesian Model Selection:

choose T to maximize:

Requires integration over all parameters. Approximate by Gibbs sampling

Results with artificial data with 10 topics:

| modelP w

T

5 10 15 20

Log

P(

w |

T )

-2.3e-5

-2.2e-5

-2.1e-5

-2.0e-5

-1.9e-5

-1.8e-5

-1.7e-5

-1.6e-5

Overview

I Problems of interest

II Probabilistic Topic Models

II Using topic models in machine learning and data mining

III Using topic models in psychology

IV Conclusion

Extracting Topics from Email

250,000 250,000 emailsemails

28,000 28,000 authorsauthors

1999-20021999-2002

Enron Email TEXANSWIN

FOOTBALLFANTASY

SPORTSLINEPLAYTEAMGAME

SPORTSGAMES

GODLIFEMAN

PEOPLECHRISTFAITHLORDJESUS

SPIRITUALVISIT

TRAVELROUNDTRIP

SAVEDEALSHOTELBOOKSALE

FARESTRIP

CITIES

FERCMARKET

ISOCOMMISSION

ORDERFILING

COMMENTSPRICE

CALIFORNIAFILED

POWERCALIFORNIAELECTRICITY

UTILITIESPRICESMARKET

PRICEUTILITY

CUSTOMERSELECTRIC

STATEPLAN

CALIFORNIADAVISRATE

BANKRUPTCYSOCALPOWERBONDSMOU

Topic trends from New York Times

TOURRIDER

LANCE_ARMSTRONGTEAMBIKERACE

FRANCEJan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan03

0

5

10

15Tour-de-France

Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

10

20

30COMPANYQUARTERPERCENTANALYSTSHARESALES

EARNING

Quarterly Earnings

ANTHRAXLETTER

MAILWORKEROFFICESPORESPOSTAL

BUILDING Jan00 Jul00 Jan01 Jul01 Jan02 Jul02 Jan030

50

100 Anthrax

330,000 330,000 articlesarticles

2000-20022000-2002

Topic trends in NIPS conference

Year

1986 1988 1990 1992 1994 1996 1998 2000

Pro

port

ion

of w

ords

0.0000

0.0005

0.0010

0.0015

0.0020

Year

1986 1988 1990 1992 1994 1996 1998 2000

Pro

port

ion

of w

ords

0.00

0.01

0.02

0.03KERNELSUPPORTVECTORMARGIN

SVMKERNELS

SPACEDATA

MACHINES

... while SVM’s become more popular.

LAYERNET

NEURALLAYERSNETS

ARCHITECTURENUMBER

FEEDFORWARDSINGLE

Neural networks on the decline ...

NCICancer biology,

detection and diagnosis

NCIAIDS Research

NCICancer

Research Centers

NCICancer

causationNCI

Cancer prevention and control

NCICancer

treatment

NCIResearch

manpower development

NIMHAIDS Research

NIMHExtramural research

NIMHIntramural research

BCSArchaeology,

archeometry, and ...

BCSBehavioral

and cognitive sciences - Other

BCSChild learning

and development

BCSCultural

anthropology

BCSEnvironmental social

and behavioral scienceBCSGeography

and regional science

BCSHuman cognition and perception

BCSInstrumentation

BCSLinguistics

BCSPhysical

anthropology

BCSSocial

psychology

INTAfrica, Near East, and South Asia

INTAmericas

INTCentral

and Eastern Europe

INTEast Asia

and Pacific

INTInternational

activities - Other

INTJapan

and Korea INTWestern Europe

SESDecision, risk,

and management science

SESMethodology, measures,

and statistics

SESEconomics

SESEthics

and values studies

SESInnovation

and organizational change

SESLaw

and social science

SESPolitical science

SESResearch on science

and technology

SESScience

and technology studies

SESSocial and economic

sciences - Other

SESSociologySES

Transformations to quality organizations

BIRBiological

infrastructure - Other

BIRHuman

resources

BIRInstrumentation

BIRResearch resources

DEBEcological

studies

DEBEnvironmental biology - Other

DEBSystematic

& population biology

IBNDevelopmental mechanisms

IBNIntegrative biology

and neuroscience - Other

IBNNeuroscience

IBNPhysiology

and ethology

MCBBiochemical

and biomolecular processes

MCBBiomolecular structure

& function

MCBCell biology

MCBGenetics

MCBMolecular and cellular biosciences - Other

PGRPlant genome research project

NIH

NSF – BIO

NSF – SBE

2D visualization of funding programs – nearby program support similar topics

Funding Amounts per Topic

We have $ funding per grant

We have distribution of topics for each grant

Solve for the $ amount per topic

What are expensive topics?

Funding % Interpretation3.47 research center2.87 cancer control2.26 mental health services2.01 clinical treatment1.87 cancer 1.73 gene sequencing1.61 risk factors1.56 children/parents1.51 tumors1.48 training program1.47 immunology1.43 disorders1.40 patient treatment

Funding % Interpretation0.60 conference/meetings0.56 theory0.55 public policy0.55 collaborative projects0.55 marine environment0.55 decision making0.55 ecological diversity0.53 sexual behavior0.52 markets0.51 science/technology0.49 computer systems0.45 language0.44 archaelogy

High $$$ topics Low $$$ topics

Overview

I Problems of interest

II Probabilistic Topic Models

II Using topic models in machine learning and data mining

III Using topic models in psychology

IV Conclusion

Basic Problems for Semantic Memory

Prediction what fact, concept, or word is next?

Gist extraction What is this set of words about?

Disambiguation What is the sense of this word?

2 1|P w w

| ,P z w context

|P z w

TASA Corpus

Applied model to educational text corpus: TASA

33K docs 6M words

Three documents with the word “play”

(numbers & colors topic assignments)

A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 ( for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ... He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077... J im296 plays166 the game166. J im296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....

Word Association

CUE:

PLAY

(Nelson, McEvoy, & Schreiber USF word association norms)

Word P( word ) Word P( word ) Word Cosine FUN .141 BALL .041 KICKBALL .558 GAME 42 BALL .134 GAME .039 VOLLEYBALL .519 BALL 33 GAME .074 CHILDREN .019 GAMES .492 CHILDREN 30 WORK .067 ROLE .014 COSTUMES .478 SCHOOL 27

GROUND .060 GAMES .014 DRAMA .469 ROLE 25 MATE .027 MUSIC .009 ROLE .465 WANT 24 CHILD .020 BASEBALL .009 PLAYWRIGHT .464 GAMES 23 ENJOY .020 HIT .008 FUN .454 MOTHER 23 WIN .020 FUN .008 ACTOR .448 THINGS 21

ACTOR .013 TEAM .008 REHEARSALS .445 MUSIC 21 FIGHT .013 IMPORTANT .006 GAME .445 HELP 20 HORSE .013 BAT .006 ACTORS .439 FUN 19

KID .013 RUN .006 CHECKERS .431 READ 18 MUSIC .013 STAGE .005 MOLIERE .429 DON 18

Responses

Association networks

BASEBALL

BAT

BALL

GAME

PLAY

STAGE THEATER

Topic model for Word Association

Association as a problem of prediction: Given that a single word is observed, predict

what other words might occur in that context

Under a single topic assumption:

2 1 2 1| | |z

P w w P w z P z w

Response Cue

Model predictions from TASA corpus

Word P( word ) Word P( word ) Word Cosine FUN .141 BALL .041 KICKBALL .558 GAME 42 BALL .134 GAME .039 VOLLEYBALL .519 BALL 33 GAME .074 CHILDREN .019 GAMES .492 CHILDREN 30 WORK .067 ROLE .014 COSTUMES .478 SCHOOL 27

GROUND .060 GAMES .014 DRAMA .469 ROLE 25 MATE .027 MUSIC .009 ROLE .465 WANT 24 CHILD .020 BASEBALL .009 PLAYWRIGHT .464 GAMES 23 ENJOY .020 HIT .008 FUN .454 MOTHER 23 WIN .020 FUN .008 ACTOR .448 THINGS 21

ACTOR .013 TEAM .008 REHEARSALS .445 MUSIC 21 FIGHT .013 IMPORTANT .006 GAME .445 HELP 20 HORSE .013 BAT .006 ACTORS .439 FUN 19

KID .013 RUN .006 CHECKERS .431 READ 18 MUSIC .013 STAGE .005 MOLIERE .429 DON 18

HUMANS TOPICS (T=500)

RANK 9

Median rank of first associate

# Topics

300

500

700

90011

0013

0015

0017

00

Med

ian

Ra

nk

0

10

20

30

40

TOPICS

Latent Semantic Analysis(Landauer & Dumais, 1997)

word-document counts

high dimensional space

SVD

Each word is a single point in semantic space Similarity measured by cosine of angle between

word vectors

BALL

PLAY

STAGE

THEATER

GAME

Median rank of first associate

# Topics

300

500

700

90011

0013

0015

0017

00

Med

ian

Ra

nk

0

10

20

30

40

0

10

20

30

40 Cosine

Inner product

TOPICS LSA

Objections to spatial representations

Tversky (1977) on similarity: asymmetry violation of the triangle inequality rich neighborhood structure

Semantic association has the same properties asymmetry violation of the triangle inequality “small world” neighborhood structure

(Steyvers & Tenenbaum, 2005)

Topics as latent factors to group words

BASEBALL

BAT

BALL

GAME

STAGE THEATER

TOPIC

TOPIC

PLAY

BASEBALL

BAT

BALL

GAME

PLAY

STAGE THEATER

Note: there are no direct connections between words

What about collocations?

Why are these words related? DOW - JONES BUMBLE - BEE WHITE – HOUSE

Suggests at least two routes for association: Semantic Collocation

Integrate collocations into topic model

Related to paradigmatic/syntagmatic distinction

TOPIC MIXTURE

TOPIC

WORD

X

TOPIC

WORD

X

TOPIC

WORDIf x=0, sample a word from the topic

If x=1, sample a word from the distribution based on previous word

......

......

......

Collocation Topic Model

TOPIC MIXTURE

TOPIC

DOW

X=1

JONES

X=0

TOPIC

RISES

Example: “DOW JONES RISES”

JONES is more likely explained as a word following DOW than as word sampled from topic

Result: DOW_JONES recognized as collocation

......

......

......

Collocation Topic Model

Examples Topics from New York Times

WEEKDOW_JONES

POINTS10_YR_TREASURY_YIELD

PERCENTCLOSE

NASDAQ_COMPOSITESTANDARD_POOR

CHANGEFRIDAY

DOW_INDUSTRIALSGRAPH_TRACKS

EXPECTEDBILLION

NASDAQ_COMPOSITE_INDEXEST_02

PHOTO_YESTERDAYYEN10

500_STOCK_INDEX

WALL_STREETANALYSTS

INVESTORSFIRM

GOLDMAN_SACHSFIRMS

INVESTMENTMERRILL_LYNCH

COMPANIESSECURITIESRESEARCH

STOCKBUSINESSANALYST

WALL_STREET_FIRMSSALOMON_SMITH_BARNEY

CLIENTSINVESTMENT_BANKINGINVESTMENT_BANKERS

INVESTMENT_BANKS

SEPT_11WAR

SECURITYIRAQ

TERRORISMNATIONKILLED

AFGHANISTANATTACKS

OSAMA_BIN_LADENAMERICAN

ATTACKNEW_YORK_REGION

NEWMILITARY

NEW_YORKWORLD

NATIONALQAEDA

TERRORIST_ATTACKS

BANKRUPTCYCREDITORS

BANKRUPTCY_PROTECTIONASSETS

COMPANYFILED

BANKRUPTCY_FILINGENRON

BANKRUPTCY_COURTKMART

CHAPTER_11FILING

COOPERBILLIONS

COMPANIESBANKRUPTCY_PROCEEDINGS

DEBTSRESTRUCTURING

CASEGROUP

Terrorism Wall Street Firms

Stock Market

Bankruptcy

Context dependent collocations

Example: WHITE - HOUSE

In context of government, WHITE-HOUSE is a collocation

In context of colors of houses, WHITE - HOUSE is treated as separate words

Overview

I Problems of interest

II Probabilistic Topic Models

II Using topic models in machine learning and data mining

III Using topic models in psychology

IV Conclusion

Psychology Computer Science

Models for information retrieval might be helpful in understanding semantic memory

Research on semantic memory can be useful for developing better information retrieval algorithms

Software

Public-domain MATLAB toolbox for topic modeling on the Web:

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Gibbs Sampler Stability

20 40 60 80 100

10

20

30

40

50

60

70

80

90

100

2

4

6

8

10

12

14

16

Comparing topics from two runs

KL

dis

tan

ce

Topics from Run 1

Re-o

rdere

d T

op

ics

Ru

n 2

BEST KL = .46

WORST KL = 9.40

ACCOUNT .042 ACCOUNT .041ACCOUNTS .028 ACCOUNTS .026

CASH .025 BALANCE .025BALANCE .025 CASH .023AMOUNT .022 AMOUNT .022COLUMN .021 COLUMN .021BUSINESS .018 BUSINESS .017

TOTAL .017 CHECK .016JOURNAL .015 RECORD .016

CHECK .015 TOTAL .016RECORDED .015 ACCOUNTING .016

CREDIT .015 JOURNAL .015RECORD .015 PERIOD .015

GENERAL .014 CREDIT .014PERIOD .013 RECORDED .014

Run 1 Run 2

MONEY .094 MONEY .086GOLD .044 PAY .033POOR .034 BANK .027

FOUND .023 INCOME .027RICH .021 INTEREST .022

SILVER .020 TAX .021HARD .019 PAID .016

DOLLARS .018 TAXES .016GIVE .016 BANKS .015

WORTH .016 INSURANCE .015BUY .015 AMOUNT .011

WORKED .014 CREDIT .010LOST .013 DOLLARS .010

SOON .013 COST .008PAY .013 FUNDS .008

Run 1 Run 2

Extensions of Topic Models

Combining topics and syntax

Topic hierarchies

Topic segmentation no need for document boundaries

Modeling authors as well as documents who wrote this part of the paper?

Words can have high probability in multiple topics

PRINTINGPAPERPRINT

PRINTEDTYPE

PROCESSINK

PRESSIMAGE

PRINTERPRINTS

PRINTERSCOPY

COPIESFORM

OFFSETGRAPHICSURFACE

PRODUCEDCHARACTERS

PLAYPLAYSSTAGE

AUDIENCETHEATERACTORSDRAMA

SHAKESPEAREACTOR

THEATREPLAYWRIGHT

PERFORMANCEDRAMATICCOSTUMES

COMEDYTRAGEDY

CHARACTERSSCENESOPERA

PERFORMED

TEAMGAME

BASKETBALLPLAYERSPLAYERPLAY

PLAYINGSOCCERPLAYED

BALLTEAMSBASKET

FOOTBALLSCORECOURTGAMES

TRYCOACH

GYMSHOT

JUDGETRIAL

COURTCASEJURY

ACCUSEDGUILTY

DEFENDANTJUSTICE

EVIDENCEWITNESSES

CRIMELAWYERWITNESS

ATTORNEYHEARING

INNOCENTDEFENSECHARGE

CRIMINAL

HYPOTHESISEXPERIMENTSCIENTIFIC

OBSERVATIONSSCIENTISTS

EXPERIMENTSSCIENTIST

EXPERIMENTALTEST

METHODHYPOTHESES

TESTEDEVIDENCE

BASEDOBSERVATION

SCIENCEFACTSDATA

RESULTSEXPLANATION

STUDYTEST

STUDYINGHOMEWORK

NEEDCLASSMATHTRY

TEACHERWRITEPLAN

ARITHMETICASSIGNMENT

PLACESTUDIED

CAREFULLYDECIDE

IMPORTANTNOTEBOOK

REVIEW

(Based on TASA corpus)

Problems with Spatial Representations

Violation of triangle inequality

Can find similarity judgments that violate this:

A

B C

AC AB + BC

Violation of triangle inequality

Can find associations that violate this:

A

B C

AC AB + BC

SOCCER FIELD MAGNETIC

No Triangle Inequality with Topics

SOCCER

MAGNETICFIELD

TOPIC 1 TOPIC 2

Topic structure easily explains violations of triangle inequality

Small-World Structure of Associations(Steyvers & Tenenbaum, 2005)

Properties:1) Short path lengths2) Clustering3) Power law degree distributions

Small world graphs arise elsewhere: internet, social relations, biology

BASEBALL

BAT

BALL

GAME

PLAY

STAGE THEATER

Power law degree distributions in Semantic Networks

Power law degree distribution some words are “hubs” in a semantic network

UNDIRECTED ASSOCIATIVE NETWORK

101 102

P(

k )

10-6

10-5

10-4

10-3

10-2

10-1

100

WORDNET

100 101 10210-6

10-5

10-4

10-3

10-2

10-1

100

ROGET'STHESAURUS

100 10110-3

10-2

10-1

100

kkk

k = #incoming links100 101 102

10-4

10-3

10-2

10-1

100

P(

k )

d=50 d=200d=400

k = #incoming links

101 102

10-4

10-3

10-2

10-1

100

P(

k )

Creating Association Networks

TOPICS MODEL:

• Connect i to j when P( w=j | w=i ) > threshold

LSA:

For each word, generate K associates by picking K nearest neighbors in semantic space

=-2.05

Associations in free recall

STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy

RECALL WORDS .....

FALSE RECALL: “Sleep” 61%

Recall as a reconstructive process

Reconstruct study list based on the stored “gist”

The gist can be represented by a distribution over topics

Under a single topic assumption: 1 1| | |n n

z

P w P w z P z w w

Retrieved wordStudy list

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

BEDRESTTIRED

AWAKEWAKE

NAPDREAM

YAWNDROWSYBLANKETSNORE

SLUMBERPEACEDOZE

SLEEPNIGHT

ASLEEPMORNINGHOURS

SLEEPYEYESAWAKENED

Predictions for the “Sleep” list

STUDYLIST

EXTRALIST

(top 8)

w|1nwP

Hidden Markov Topic Model

Hidden Markov Topics Model

Syntactic dependencies short range dependencies Semantic dependencies long-range

z z z z

w w w w

s s s s

Semantic state: generate words from topic model

Syntactic states: generate words from HMM

(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

THE 0.6 A 0.3MANY 0.1

OF 0.6 FOR 0.3BETWEEN 0.1

0.9

0.1

0.2

0.8

0.7

0.3

Transition between semantic state and syntactic states

THE ………………………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

THE LOVE……………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

THE LOVE OF………………

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

THE LOVE OF RESEARCH ……

HEART 0.2 LOVE 0.2SOUL 0.2TEARS 0.2JOY 0.2

z = 1 0.4

SCIENTIFIC 0.2 KNOWLEDGE 0.2WORK 0.2RESEARCH 0.2MATHEMATICS 0.2

z = 2 0.6

x = 1

THE 0.6 A 0.3MANY 0.1

x = 3

OF 0.6 FOR 0.3BETWEEN 0.1

x = 2

0.9

0.1

0.2

0.8

0.7

0.3

Combining topics and syntax

FOODFOODSBODY

NUTRIENTSDIETFAT

SUGARENERGY

MILKEATINGFRUITS

VEGETABLESWEIGHT

FATSNEEDS

CARBOHYDRATESVITAMINSCALORIESPROTEIN

MINERALS

MAPNORTHEARTHSOUTHPOLEMAPS

EQUATORWESTLINESEAST

AUSTRALIAGLOBEPOLES

HEMISPHERELATITUDE

PLACESLAND

WORLDCOMPASS

CONTINENTS

DOCTORPATIENTHEALTH

HOSPITALMEDICAL

CAREPATIENTS

NURSEDOCTORSMEDICINENURSING

TREATMENTNURSES

PHYSICIANHOSPITALS

DRSICK

ASSISTANTEMERGENCY

PRACTICE

BOOKBOOKS

READINGINFORMATION

LIBRARYREPORT

PAGETITLE

SUBJECTPAGESGUIDE

WORDSMATERIALARTICLE

ARTICLESWORDFACTS

AUTHORREFERENCE

NOTE

GOLDIRON

SILVERCOPPERMETAL

METALSSTEELCLAYLEADADAM

OREALUMINUM

MINERALMINE

STONEMINERALS

POTMININGMINERS

TIN

BEHAVIORSELF

INDIVIDUALPERSONALITY

RESPONSESOCIAL

EMOTIONALLEARNINGFEELINGS

PSYCHOLOGISTSINDIVIDUALS

PSYCHOLOGICALEXPERIENCES

ENVIRONMENTHUMAN

RESPONSESBEHAVIORSATTITUDES

PSYCHOLOGYPERSON

CELLSCELL

ORGANISMSALGAE

BACTERIAMICROSCOPEMEMBRANEORGANISM

FOODLIVINGFUNGIMOLD

MATERIALSNUCLEUSCELLED

STRUCTURESMATERIAL

STRUCTUREGREENMOLDS

Semantic topics

PLANTSPLANT

LEAVESSEEDSSOIL

ROOTSFLOWERS

WATERFOOD

GREENSEED

STEMSFLOWER

STEMLEAF

ANIMALSROOT

POLLENGROWING

GROW

GOODSMALL

NEWIMPORTANT

GREATLITTLELARGE

*BIG

LONGHIGH

DIFFERENTSPECIAL

OLDSTRONGYOUNG

COMMONWHITESINGLE

CERTAIN

THEHIS

THEIRYOURHERITSMYOURTHIS

THESEA

ANTHATNEW

THOSEEACH

MRANYMRSALL

MORESUCHLESS

MUCHKNOWN

JUSTBETTERRATHER

GREATERHIGHERLARGERLONGERFASTER

EXACTLYSMALLER

SOMETHINGBIGGERFEWERLOWER

ALMOST

ONAT

INTOFROMWITH

THROUGHOVER

AROUNDAGAINSTACROSS

UPONTOWARDUNDERALONGNEAR

BEHINDOFF

ABOVEDOWN

BEFORE

SAIDASKED

THOUGHTTOLDSAYS

MEANSCALLEDCRIED

SHOWSANSWERED

TELLSREPLIED

SHOUTEDEXPLAINEDLAUGHED

MEANTWROTE

SHOWEDBELIEVED

WHISPERED

ONESOMEMANYTWOEACHALL

MOSTANY

THREETHIS

EVERYSEVERAL

FOURFIVEBOTHTENSIX

MUCHTWENTY

EIGHT

HEYOU

THEYI

SHEWEIT

PEOPLEEVERYONE

OTHERSSCIENTISTSSOMEONE

WHONOBODY

ONESOMETHING

ANYONEEVERYBODY

SOMETHEN

Syntactic classes

BEMAKE

GETHAVE

GOTAKE

DOFINDUSESEE

HELPKEEPGIVELOOKCOMEWORKMOVELIVEEAT

BECOME

MODELALGORITHM

SYSTEMCASE

PROBLEMNETWORKMETHOD

APPROACHPAPER

PROCESS

ISWASHAS

BECOMESDENOTES

BEINGREMAINS

REPRESENTSEXISTSSEEMS

SEESHOWNOTE

CONSIDERASSUMEPRESENT

NEEDPROPOSEDESCRIBESUGGEST

USEDTRAINED

OBTAINEDDESCRIBED

GIVENFOUND

PRESENTEDDEFINED

GENERATEDSHOWN

INWITHFORON

FROMAT

USINGINTOOVER

WITHIN

HOWEVERALSOTHENTHUS

THEREFOREFIRSTHERENOW

HENCEFINALLY

#*IXTN-CFP

EXPERTSEXPERTGATING

HMEARCHITECTURE

MIXTURELEARNINGMIXTURESFUNCTION

GATE

DATAGAUSSIANMIXTURE

LIKELIHOODPOSTERIOR

PRIORDISTRIBUTION

EMBAYESIAN

PARAMETERS

STATEPOLICYVALUE

FUNCTIONACTION

REINFORCEMENTLEARNINGCLASSESOPTIMAL

*

MEMBRANESYNAPTIC

CELL*

CURRENTDENDRITICPOTENTIAL

NEURONCONDUCTANCE

CHANNELS

IMAGEIMAGESOBJECT

OBJECTSFEATURE

RECOGNITIONVIEWS

#PIXEL

VISUAL

KERNELSUPPORTVECTOR

SVMKERNELS

#SPACE

FUNCTIONMACHINES

SET

NETWORKNEURAL

NETWORKSOUPUTINPUT

TRAININGINPUTS

WEIGHTS#

OUTPUTS

NIPS Semantics

NIPS Syntax

Random sentence generation

LANGUAGE:[S] RESEARCHERS GIVE THE SPEECH[S] THE SOUND FEEL NO LISTENERS[S] WHICH WAS TO BE MEANING[S] HER VOCABULARIES STOPPED WORDS[S] HE EXPRESSLY WANTED THAT BETTER VOWEL

Nested Chinese Restaurant Process

Topic Hierarchies In regular topic model, no relations between

topics

Alternative: hierarchical topic organization topic 1

topic 2 topic 3

topic 4 topic 5 topic 6 topic 7

Nested Chinese Restaurant Process Blei, Griffiths, Jordan, Tenenbaum (2004) Learn hierarchical structure, as well as topics

within structure

Example: Psych Review Abstracts

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS

Generative Process

RESPONSESTIMULUS

REINFORCEMENTRECOGNITION

STIMULIRECALLCHOICE

CONDITIONING

SPEECHREADINGWORDS

MOVEMENTMOTORVISUALWORD

SEMANTIC

ACTIONSOCIALSELF

EXPERIENCEEMOTION

GOALSEMOTIONALTHINKING

GROUPIQ

INTELLIGENCESOCIAL

RATIONALINDIVIDUAL

GROUPSMEMBERS

SEXEMOTIONS

GENDEREMOTIONSTRESSWOMENHEALTH

HANDEDNESS

REASONINGATTITUDE

CONSISTENCYSITUATIONALINFERENCEJUDGMENT

PROBABILITIESSTATISTICAL

IMAGECOLOR

MONOCULARLIGHTNESS

GIBSONSUBMOVEMENTORIENTATIONHOLOGRAPHIC

CONDITIONINSTRESS

EMOTIONALBEHAVIORAL

FEARSTIMULATIONTOLERANCERESPONSES

AMODEL

MEMORYFOR

MODELSTASK

INFORMATIONRESULTSACCOUNT

SELFSOCIAL

PSYCHOLOGYRESEARCH

RISKSTRATEGIES

INTERPERSONALPERSONALITY

SAMPLING

MOTIONVISUAL

SURFACEBINOCULAR

RIVALRYCONTOUR

DIRECTIONCONTOURSSURFACES

DRUGFOODBRAIN

AROUSALACTIVATIONAFFECTIVEHUNGER

EXTINCTIONPAIN

THEOF

ANDTOINAIS

Other Slides

Markov chain Monte Carlo

Sample from a Markov chain constructed to converge to the target distribution

Allows sampling from unnormalized posterior

Can compute approximate statistics from intractable distributions

Gibbs sampling one such method, construct Markov chain with conditional distributions

Enron email: two example topics (T=100)

WORD PROB.

BUSH 0.0227

LAY 0.0193

MR 0.0183

WHITE 0.0153

ENRON 0.0150

HOUSE 0.0148

PRESIDENT 0.0131

ADMINISTRATION 0.0115

COMPANY 0.0090

ENERGY 0.0085

SENDER PROB.

NELSON, KIMBERLY (ETS) 0.3608

PALMER, SARAH 0.0997

DENNE, KAREN 0.0541

HOTTE, STEVE 0.0340

DUPREE, DIANNA 0.0282

ARMSTRONG, JULIE 0.0222

LOKEY, TEB 0.0194

SULLIVAN, LORA 0.0073

VILLARREAL, LILLIAN 0.0040

BAGOT, NANCY 0.0026

TOPIC 10

WORD PROB.

ANDERSEN 0.0241

FIRM 0.0134

ACCOUNTING 0.0119

SEC 0.0065

SETTLEMENT 0.0062

AUDIT 0.0054

CORPORATE 0.0053

FINANCIAL 0.0052

JUSTICE 0.0052

INFORMATION 0.0050

SENDER PROB.

HILTABRAND, LESLIE 0.1359

WELLS, TORI L. 0.0865

DUPREE, DIANNA 0.0825

ARMSTRONG, JULIE 0.0316

DENNE, KAREN 0.0208

SULLIVAN, LORA 0.0072

[email protected] 0.0026

WILSON, DANNY 0.0016

HU, SYLVIA 0.0013

MATHEWS, LEENA 0.0012

TOPIC 32

Enron email: two work unrelated topics

WORD PROB.

TRAVEL 0.0161

ROUNDTRIP 0.0124

SAVE 0.0118

DEALS 0.0097

HOTEL 0.0095

BOOK 0.0094

SALE 0.0089

FARES 0.0083

TRIP 0.0072

CITIES 0.0070

SENDER PROB.

TRAVELOCITY MEMBER SERVICES 0.0763

BESTFARES.COM HOT DEALS 0.0502

<[email protected]> 0.0315

LISTS.COOLVACATIONS.COM 0.0151

CHEAP TICKETS 0.0111

EXPEDIA FARE TRACKER 0.0106

TRAVELOCITY.COM 0.0096

[email protected] 0.0088

[email protected] 0.0066

LASTMINUTE.COM 0.0051

TOPIC 38

WORD PROB.

NEWS 0.0245

MAIL 0.0182

NYTIMES 0.0149

YORK 0.0128

PAGE 0.0095

TIMES 0.0090

HEADLINES 0.0079

BUSH 0.0077

DELIVERY 0.0070

HTML 0.0068

SENDER PROB.

THE NEW YORK TIMES DIRECT 0.3438

<[email protected]> 0.0104

THE ECONOMIST 0.0029

@TIMES - INSIDE NYTIMES.COM 0.0015

[email protected] 0.0011

AMAZON.COM DELIVERS BESTSELLERS 0.0009

NYTIMES.COM 0.0009

HYATT, JERRY 0.0008

NEWSLETTER_TEXT 0.0008

CHRIS LONG 0.0007

TOPIC 25

Using Topic Models for Information Retrieval

Clusters v. Topics

Hidden Markov Models in Molecular Biology: New Algorithms and ApplicationsPierre Baldi, Yves C Hauvin, Tim Hunkapiller, Marcella A. McClure

Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, andclassification.

[cluster 88]

model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov

[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling

[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes

Multiple TopicsOne Cluster

word

s

C = F Q

documentsw

ord

stopics

top

ics documents

Topic model

word

s

documents

C = U D VT

word

s

dims

dims

dim

s

dim

s

documents

LSA

normalizedco-occurrence matrix

mixtureweights

mixturecomponents

Testing violation of triangle inequality

Look for triplets of associates w1 w2 w3 such that

P( w2 | w1 ) >

P( w3 | w2 ) >

and measure P( w3 | w1 )

Vary threshold

Documents as Topics Mixtures:a Geometric Interpretation

P(word3)

P(word1)

0

1

1

1

P(word2)

P(word1)+P(word2)+P(word3) = 1

topic 1

topic 2

= observed= observeddocumentdocument

word pixeldocument image

sample each pixel froma mixture of topics

Generating Artificial Data: a visual example

10 “topics”with 5 x 5 pixels

A subset of documents

Inferred Topics

0

1

5

10

50

100

Itera

tion

s

Interpretable decomposition

• SVD gives a basis for the data, but not an interpretable one

• The true basis is not orthogonal, so rotation does no good

Topic model

SVD

INPUT: word-document counts (word order is irrelevant)

OUTPUT:topic assignments to each word P( zi )likely words in each topic P( w | z )likely topics in each document (“gist”) P( z | d )

Summary

Level 1 Level 2 Level 3 Level 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57

Cancer biology, detection and diagnosis 1

AIDS Research 2

Cancer Research Centers 3

Cancer causation 4

Cancer prevention and control 5

Cancer treatment 6

NCI research manpower development 7

AIDS Research 8

Extramural research 9

Intramural research 10

Archaeology, archeometry, and ... 11

Behavioral and cognitive sciences - Other 12

Child learning and development 13

Cultural anthropology 14Environmental social and behavioral science 15

Geography and regional science 16

Human cognition and perception 17

Instrumentation 18

Linguistics 19

Physical anthropology 20

Social psychology 21

Africa, Near East, and South Asia 22

Americas 23

Central and Eastern Europe 24

East Asia and Pacific 25

International activities - Other 26

Japan and Korea 27

Western Europe 28

Decision, risk, and management science 29

Methodology, measures, and statistics 30

Economics 31

Ethics and values studies 32

Innovation and organizational change 33

Law and social science 34

Political science 35

Research on science and technology 36

Science and technology studies 37

Social and economic sciences - Other 38

Sociology 39

Transformations to quality organizations 40

Biological infrastructure - Other 41

Human resources 42

Instrumentation 43

Research resources 44

Ecological studies 45

Environmental biology - Other 46

Systematic & population biology 47

Developmental mechanisms 48Integrative biology and neuroscience - Other 49

Neuroscience 50

Physiology and ethology 51

Biochemical and biomolecular processes 52

Biomolecular structure & function 53

Cell biology 54

Genetics 55

Molecular and cellular biosciences - Other 56

Plant genome research (119) Plant genome research project 57

National Science

Foundation (10580)

Social, Behavioral,

and Economic Sciences

(SBE) (4584)

Biological Sciences

(BIO) (5996)

Behavioral and cognitive sciences (BCS) (1469)

International science and engineering (INT) (formerly International cooperative scientific activities) (1299)

Social and economic sciences (SES) (1816)

Biological infrastructure (BIR/DBI) (1061)

Environmental biology (DEB) (1609)

Integrative biology and neuroscience (IBN/BNS)

(1673)

Molecular and cellular biosciences (MCB/DCB)

(1534)

Dept of Health and

Human Services (11609)

National Institutes of

Health (11609)

National Cancer Institute (7574)

National Institute of Mental Health (4035)

Similarity between 55 funding programs

Example Topics extracted from NIH/NSF grants

brainfmri

imagingfunctional

mrisubjects

magneticresonance

neuroimagingstructural

schizophreniapatientsdeficits

schizophrenicpsychosissubjects

psychoticdysfunction

abnormalitiesclinical

memoryworking

memoriestasks

retrievalencodingcognitive

processingrecognition

performance

diseasead

alzheimerdiabetes

cardiovascularinsulin

vascularblood

clinicalindividuals

Gibbs Sampling

' '' '

( | )i itd wt

i i i it d w tt w

n np z t z

n T n W

count of word w assigned to topic t

count of topic t assigned to doc d

probability that word i is assigned to topic t