Post on 22-Dec-2015
Probabilistic Topic Models
Mark Steyvers
Department of Cognitive Sciences
University of California, Irvine
Joint work with:
Tom Griffiths, UC Berkeley
Padhraic Smyth, UC Irvine
Dave Newman, UC Irvine
Overview
I Problems of interest
II Probabilistic Topic Models
III Using topic models in machine learning and data mining
IV Using topic models in psychology
V Conclusion
Problems of Interest
Machine Learning / Data mining:
What topics does this text collection “span”?
Which documents are about a particular topic?
Who writes about a particular topic?
How have topics changed over time?
Psychology:
How to represent the “gist” of a list of words?
How to model associations between words?
Probabilistic Topic Models
Based on the pLSI and LDA models (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
A probability model linking documents, topics, and words
Topic: distribution over words
Document: mixture of topics
Topics and topic mixtures can be extracted from large collections of text
Example Topics extracted from NIH/NSF grants
Topic: brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural
Topic: schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical
Topic: memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance
Topic: disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals
Important point: these distributions are learned in a completely automated “unsupervised” fashion from the data
II Probabilistic Topic Models
Probabilistic Model: Parameters → Real World Data, P( Data | Parameters )
Statistical Inference: Real World Data → Parameters, P( Parameters | Data )
Document generation as a probabilistic process
Each topic is a distribution over words
Each document is a mixture of topics
Each word is chosen from a single topic
$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$
P( w_i | z_i = j ): from parameters φ(j)
P( z_i = j ): from parameters θ(d)
Prior Distributions
Dirichlet priors encourage sparsity on topic mixtures and topics
θ ~ Dirichlet( α )
φ ~ Dirichlet( β )
[Figure: two simplexes, one over Topic 1, Topic 2, Topic 3 and one over Word 1, Word 2, Word 3]
Example of generating words
Topics (words as listed in the figure):
Topic 1: MONEY LOAN BANK RIVER STREAM
Topic 2: RIVER STREAM BANK MONEY LOAN
Mixtures θ (arrow weights shown in the figure): .4, 1.0, .6, 1.0
Documents and topic assignments (the digit after each word is its topic):
Document 1: MONEY1 BANK1 BANK1 LOAN1 BANK1 MONEY1 BANK1 MONEY1 BANK1 LOAN1 LOAN1 BANK1 MONEY1 ....
Document 2: RIVER2 MONEY1 BANK2 STREAM2 BANK2 BANK1 MONEY1 RIVER2 MONEY1 BANK2 LOAN1 MONEY1 ....
Document 3: RIVER2 BANK2 STREAM2 BANK2 RIVER2 BANK2....
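The generative process in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word sets per topic are inferred from the topic digits on the generated tokens (MONEY, LOAN, BANK for topic 1; RIVER, STREAM, BANK for topic 2), the within-topic word probabilities are assumed uniform, and the mixture (0.6, 0.4) for the mixed document is an assumed reading of the figure weights. The names `TOPICS` and `generate_document` are hypothetical.

```python
import random

# Assumed topic-word distributions: uniform over three words each.
# The slide does not state the word probabilities.
TOPICS = {
    1: {"MONEY": 1 / 3, "LOAN": 1 / 3, "BANK": 1 / 3},
    2: {"RIVER": 1 / 3, "STREAM": 1 / 3, "BANK": 1 / 3},
}


def generate_document(theta, n_words, rng):
    """Draw n_words (word, topic) pairs: z ~ theta, then w ~ P(w | z)."""
    topic_ids = list(theta)
    tokens = []
    for _ in range(n_words):
        # Choose the topic for this token from the document's mixture.
        z = rng.choices(topic_ids, weights=[theta[t] for t in topic_ids])[0]
        # Choose the word from the chosen topic's distribution over words.
        words = list(TOPICS[z])
        w = rng.choices(words, weights=[TOPICS[z][x] for x in words])[0]
        tokens.append((w, z))
    return tokens


rng = random.Random(0)
doc_pure = generate_document({1: 1.0, 2: 0.0}, 12, rng)   # all topic 1
doc_mixed = generate_document({1: 0.6, 2: 0.4}, 12, rng)  # assumed mixture
print(" ".join(f"{w}{z}" for w, z in doc_mixed))
```

Note that BANK can be emitted by either topic, which is why the same word carries different topic digits in the documents above.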
Inference
Topics: ?
Mixtures θ: ?
Documents and topic assignments: ?
Document 1: MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....
Document 2: RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....
Document 3: RIVER? BANK? STREAM? BANK? RIVER? BANK?....
Bayesian Inference
Three sets of latent variables:
topic mixtures θ
word distributions φ
topic assignments z
Integrate out θ and φ and estimate topic assignments:
$P(z \mid w) = \frac{P(w, z)}{\sum_{z} P(w, z)}$
The sum in the denominator runs over T^n terms
Use MCMC with Gibbs sampling for approximate inference
Gibbs Sampling
Start with random assignments of words to topics
Repeat M iterations
Repeat for all words i
Sample a new topic assignment for word i conditioned on all other topic assignments: $P(z_i = t \mid z_{-i})$
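The loop above can be sketched as a collapsed Gibbs sampler. The slide does not spell out the conditional distribution; the sketch below assumes the standard count-based form for LDA with symmetric Dirichlet priors, P(z_i = t | z_-i, w) ∝ (n_wt + β) / (n_t + Wβ) · (n_dt + α), where the counts exclude token i. The function name `gibbs_lda`, the hyperparameter values, and the toy corpus are assumptions made for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling of topic assignments z (theta and phi integrated out)."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic counts
    n_t = [0] * n_topics                                # tokens assigned to each topic
    z = []
    # Start with random assignments of words to topics.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove token i so the counts describe all other assignments.
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                weights = [
                    (n_wt[w][k] + beta) / (n_t[k] + vocab_size * beta)
                    * (n_dt[d][k] + alpha)
                    for k in range(n_topics)
                ]
                # Sample the new assignment in proportion to the weights.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


# Toy corpus over the five words of the running example (word ids 0..4).
RIVER, STREAM, BANK, MONEY, LOAN = range(5)
docs = [
    [RIVER, STREAM, BANK] * 4,
    [BANK, MONEY, LOAN] * 4,
    [RIVER, BANK, MONEY, LOAN, STREAM, BANK] * 2,
]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=5)
```

Normalizing the columns of `n_wt` (plus β) gives an estimate of the word distribution of each topic, and normalizing the rows of `n_dt` (plus α) gives an estimate of each document's topic mixture.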
16 Artificial Documents
[Figure: word tokens for River, Stream, Bank, Money, Loan in each of 16 documents (rows 1 to 16)]
Can we recover the original topics and topic mixtures from this data?
Starting the Gibbs Sampling
Assign word tokens randomly to topics (the dot color marks topic 1 or topic 2)
[Figure: the same 16 documents with every token marked by its random topic assignment]
After 32 iterations
[Figure: the same 16 documents with the topic assignments after 32 iterations]
topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29
Choosing the number of topics
Bayesian Model Selection: choose T to maximize P( w | model )
Requires integration over all parameters. Approximate by Gibbs sampling
Results with artificial data with 10 topics:
[Figure: log P( w | T ) plotted against the number of topics T = 5, 10, 15, 20]
III Using topic models in machine learning and data mining
Extracting Topics from Email
Enron Email: 250,000 emails, 28,000 authors, 1999-2002
Topic: TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
Topic: GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
Topic: TRAVEL ROUNDTRIP SAVE DEALS HOTEL BOOK SALE FARES TRIP CITIES
Topic: FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
Topic: POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
Topic: STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
Topic trends from New York Times
330,000 articles, 2000-2002
Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
[Figure: topic trend for Tour-de-France, Jan00 to Jan03]
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
[Figure: topic trend for Quarterly Earnings, Jan00 to Jan03]
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
[Figure: topic trend for Anthrax, Jan00 to Jan03]
Topic trends in NIPS conference
Neural networks on the decline ...
Topic: LAYER NET NEURAL LAYERS NETS ARCHITECTURE NUMBER FEEDFORWARD SINGLE
[Figure: proportion of words per year for this topic, 1986 to 2000]
... while SVMs become more popular.
Topic: KERNEL SUPPORT VECTOR MARGIN SVM KERNELS SPACE DATA MACHINES
[Figure: proportion of words per year for this topic, 1986 to 2000]
Finding Funding Overlap
Analyze 22,000+ grants active during 2002
55 funding programs from NSF and NIH
Focus on behavioral sciences
Questions of interest:
What are the areas of overlap between funding programs?
What topics are funded?
2D visualization of funding programs – nearby programs support similar topics
[Figure: 2D map of funding programs; the program labels on the map are listed below by agency]
NIH
NCI: Cancer biology, detection and diagnosis; AIDS Research; Cancer Research Centers; Cancer causation; Cancer prevention and control; Cancer treatment; Research manpower development
NIMH: AIDS Research; Extramural research; Intramural research
NSF – SBE
BCS: Archaeology, archeometry, and ...; Behavioral and cognitive sciences - Other; Child learning and development; Cultural anthropology; Environmental social and behavioral science; Geography and regional science; Human cognition and perception; Instrumentation; Linguistics; Physical anthropology; Social psychology
INT: Africa, Near East, and South Asia; Americas; Central and Eastern Europe; East Asia and Pacific; International activities - Other; Japan and Korea; Western Europe
SES: Decision, risk, and management science; Methodology, measures, and statistics; Economics; Ethics and values studies; Innovation and organizational change; Law and social science; Political science; Research on science and technology; Science and technology studies; Social and economic sciences - Other; Sociology; Transformations to quality organizations
NSF – BIO
BIR: Biological infrastructure - Other; Human resources; Instrumentation; Research resources
DEB: Ecological studies; Environmental biology - Other; Systematic & population biology
IBN: Developmental mechanisms; Integrative biology and neuroscience - Other; Neuroscience; Physiology and ethology
MCB: Biochemical and biomolecular processes; Biomolecular structure & function; Cell biology; Genetics; Molecular and cellular biosciences - Other
PGR: Plant genome research project
Funding Amounts per Topic
We have $ funding per grant
We have distribution of topics for each grant
Solve for the $ amount per topic
What are expensive topics?

High $$$ topics
Funding %   Interpretation
3.47        research center
2.87        cancer control
2.26        mental health services
2.01        clinical treatment
1.87        cancer
1.73        gene sequencing
1.61        risk factors
1.56        children/parents
1.51        tumors
1.48        training program
1.47        immunology
1.43        disorders
1.40        patient treatment

Low $$$ topics
Funding %   Interpretation
0.60        conference/meetings
0.56        theory
0.55        public policy
0.55        collaborative projects
0.55        marine environment
0.55        decision making
0.55        ecological diversity
0.53        sexual behavior
0.52        markets
0.51        science/technology
0.49        computer systems
0.45        language
0.44        archaeology
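The slides do not say how the dollar amount per topic is solved for. One simple possibility, shown here purely as an assumption, is to split each grant's funding across topics in proportion to that grant's topic distribution and then sum over grants. The function name `funding_per_topic` and the two-grant example are hypothetical.

```python
def funding_per_topic(grant_funding, grant_topics):
    """Return the percentage of total funding attributed to each topic.

    grant_funding: dollars per grant.
    grant_topics: one topic distribution per grant (each sums to 1).
    Assumption: a grant's dollars are split in proportion to its topic weights.
    """
    n_topics = len(grant_topics[0])
    totals = [0.0] * n_topics
    for dollars, theta in zip(grant_funding, grant_topics):
        for t, p in enumerate(theta):
            totals[t] += dollars * p
    grand_total = sum(totals)
    return [100.0 * x / grand_total for x in totals]


# Two made-up grants: $100 entirely on topic 0, $300 split evenly.
shares = funding_per_topic([100.0, 300.0], [[1.0, 0.0], [0.5, 0.5]])
print(shares)
```

A regression of funding on topic proportions would be another reasonable reading of "solve for"; the proportional split is simply the easiest one to state.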
IV Using topic models in psychology
Basic Problems for Semantic Memory
Prediction: what fact, concept, or word is next? $P(w_2 \mid w_1)$
Gist extraction: what is this set of words about? $P(z \mid \mathbf{w})$
Disambiguation: what is the sense of this word? $P(z \mid w, \text{context})$
Three documents with the word “play”
(numbers and colors indicate topic assignments)
Document 1: A Play082 is written082 to be performed082 on a stage082 before a live093 audience082 or before motion270 picture004 or television004 cameras004 ( for later054 viewing004 by large202 audiences082). A Play082 is written082 because playwrights082 have something ...
Document 2: He was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157 as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document 3: Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166....
Word Association
CUE: PLAY
(Nelson, McEvoy, & Schreiber USF word association norms)
Responses
Word     P( word )   Word        P( word )   Word          Cosine   Word
FUN      .141        BALL        .041        KICKBALL      .558     GAME      42
BALL     .134        GAME        .039        VOLLEYBALL    .519     BALL      33
GAME     .074        CHILDREN    .019        GAMES         .492     CHILDREN  30
WORK     .067        ROLE        .014        COSTUMES      .478     SCHOOL    27
GROUND   .060        GAMES       .014        DRAMA         .469     ROLE      25
MATE     .027        MUSIC       .009        ROLE          .465     WANT      24
CHILD    .020        BASEBALL    .009        PLAYWRIGHT    .464     GAMES     23
ENJOY    .020        HIT         .008        FUN           .454     MOTHER    23
WIN      .020        FUN         .008        ACTOR         .448     THINGS    21
ACTOR    .013        TEAM        .008        REHEARSALS    .445     MUSIC     21
FIGHT    .013        IMPORTANT   .006        GAME          .445     HELP      20
HORSE    .013        BAT         .006        ACTORS        .439     FUN       19
KID      .013        RUN         .006        CHECKERS      .431     READ      18
MUSIC    .013        STAGE       .005        MOLIERE       .429     DON       18
Association networks
[Figure: network linking BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER]
Topic model for Word Association
Association as a problem of prediction: given that a single word is observed, predict what other words might occur in that context
Under a single topic assumption:
$P(w_2 \mid w_1) = \sum_{z} P(w_2 \mid z)\, P(z \mid w_1)$
w_2: response; w_1: cue
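The prediction rule above can be worked through on a toy example. The sketch assumes P(z | w1) is obtained by Bayes' rule from P(w1 | z) and a prior P(z); the two topics, their word probabilities, and the function name `association` are made up for illustration and are not estimates from the TASA corpus.

```python
def association(cue, phi, p_z):
    """Return P(w2 | w1 = cue) = sum_z P(w2 | z) P(z | w1) for every word w2."""
    n_topics = len(phi)
    # Bayes' rule: P(z | w1) is proportional to P(w1 | z) P(z).
    joint = [phi[z].get(cue, 0.0) * p_z[z] for z in range(n_topics)]
    norm = sum(joint)
    posterior = [j / norm for j in joint]
    vocab = set().union(*phi)
    return {
        w: sum(phi[z].get(w, 0.0) * posterior[z] for z in range(n_topics))
        for w in vocab
    }


# Hypothetical topics: a sports sense and a theater sense of PLAY.
phi = [
    {"PLAY": 0.3, "BALL": 0.3, "GAME": 0.4},
    {"PLAY": 0.4, "STAGE": 0.3, "THEATER": 0.3},
]
p_z = [0.5, 0.5]
pred = association("PLAY", phi, p_z)
for word, prob in sorted(pred.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {prob:.3f}")
```

Because the cue is conditioned on and the response is predicted, P(w2 | w1) and P(w1 | w2) generally differ, which is how this formulation can produce asymmetric associations.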
Model predictions from TASA corpus
[Table: the same responses to the cue PLAY as on the Word Association slide, comparing the HUMANS column with the TOPICS (T=500) column]
RANK 9: the first human associate, FUN, is the 9th response predicted by the topic model
Median rank of first associate
[Figure: median rank (y-axis, 0 to 40) against # topics (x-axis, 300 to 1700) for TOPICS]
Latent Semantic Analysis (Landauer & Dumais, 1997)
word-document counts → SVD → high dimensional space
Each word is a single point in semantic space
Similarity measured by cosine of angle between word vectors
[Figure: BALL, PLAY, STAGE, THEATER, GAME as points in semantic space]
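The cosine measure used by LSA can be sketched as follows. The two-dimensional word vectors are hypothetical stand-ins for the rows of a reduced SVD space; only the `cosine` formula itself is standard.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d vectors, not taken from an actual LSA space.
vectors = {
    "PLAY": [0.7, 0.7],
    "GAME": [0.9, 0.2],
    "STAGE": [0.1, 0.9],
}
sim_play_game = cosine(vectors["PLAY"], vectors["GAME"])
sim_game_play = cosine(vectors["GAME"], vectors["PLAY"])
print(round(sim_play_game, 3), round(sim_game_play, 3))
```

The measure is symmetric by construction: swapping the two words never changes the similarity.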
Median rank of first associate
[Figure: median rank (y-axis, 0 to 40) against # topics (x-axis, 300 to 1700) for TOPICS, next to the median rank for LSA with the Cosine and Inner product measures]
Objections to spatial representations
Tversky (1977) on similarity:
asymmetry
violation of the triangle inequality
rich neighborhood structure
Semantic association has the same properties:
asymmetry
violation of the triangle inequality
“small world” neighborhood structure (Steyvers & Tenenbaum, 2005)
Topics as latent factors to group words
[Figure: the association network over BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER, redrawn with two TOPIC nodes that the words connect to]
Note: there are no direct connections between words
What about collocations?
Why are these words related?
DOW - JONES
BUMBLE - BEE
WHITE - HOUSE
Suggests at least two routes for association:
Semantic
Collocation
Integrate collocations into topic model
Related to paradigmatic/syntagmatic distinction
Collocation Topic Model
[Figure: a TOPIC MIXTURE generates a TOPIC for each position; each WORD also has a switch variable X]
If x=0, sample a word from the topic
If x=1, sample a word from the distribution based on previous word
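The x switch can be sketched as a single sampling step. How x itself is generated is not stated on the slide; the sketch assumes x is drawn with a probability that depends on the previous word. The tables `topic_words`, `colloc_prob`, and `bigram`, and the function name `sample_next_word`, are hypothetical.

```python
import random


def sample_next_word(prev_word, topic_words, colloc_prob, bigram, rng):
    """Return (word, x); x=1 means the word was drawn from the previous-word distribution."""
    # Assumption: P(x = 1) depends only on the previous word.
    x = 1 if rng.random() < colloc_prob.get(prev_word, 0.0) else 0
    dist = bigram[prev_word] if x == 1 else topic_words
    words = list(dist)
    word = rng.choices(words, weights=[dist[w] for w in words])[0]
    return word, x


# Made-up tables for illustration only.
topic_words = {"DOW": 0.4, "RISES": 0.3, "MARKET": 0.3}
colloc_prob = {"DOW": 0.9}           # DOW is usually followed by a collocate
bigram = {"DOW": {"JONES": 1.0}}     # distribution over words following DOW

rng = random.Random(0)
print(sample_next_word("DOW", topic_words, colloc_prob, bigram, rng))
print(sample_next_word("RISES", topic_words, colloc_prob, bigram, rng))
```

With x=0 the word depends only on the topic, so any two positions are exchangeable; with x=1 word order matters, which is what lets the model pick up collocations.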
Collocation Topic Model
Example: “DOW JONES RISES”
[Figure: DOW is sampled from a TOPIC; JONES has X=1; RISES has X=0 and is sampled from a TOPIC]
JONES is more likely explained as a word following DOW than as a word sampled from the topic
Result: DOW_JONES recognized as collocation
Example Topics from New York Times
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
Context dependent collocations
Example: WHITE - HOUSE
In context of government, WHITE-HOUSE is a collocation
In context of colors of houses, WHITE - HOUSE is treated as separate words
V Conclusion
Psychology Computer Science
Models for information retrieval might be helpful in understanding semantic memory
Research on semantic memory can be useful for developing better information retrieval algorithms
Software
Public-domain MATLAB toolbox for topic modeling on the Web:
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Comparing topics from two runs
[Figure: matrix of KL distance between topics from Run 1 (x-axis) and re-ordered topics from Run 2 (y-axis)]

BEST KL = .46
Run 1             Run 2
ACCOUNT .042      ACCOUNT .041
ACCOUNTS .028     ACCOUNTS .026
CASH .025         BALANCE .025
BALANCE .025      CASH .023
AMOUNT .022       AMOUNT .022
COLUMN .021       COLUMN .021
BUSINESS .018     BUSINESS .017
TOTAL .017        CHECK .016
JOURNAL .015      RECORD .016
CHECK .015        TOTAL .016
RECORDED .015     ACCOUNTING .016
CREDIT .015       JOURNAL .015
RECORD .015       PERIOD .015
GENERAL .014      CREDIT .014
PERIOD .013       RECORDED .014

WORST KL = 9.40
Run 1             Run 2
MONEY .094        MONEY .086
GOLD .044         PAY .033
POOR .034         BANK .027
FOUND .023        INCOME .027
RICH .021         INTEREST .022
SILVER .020       TAX .021
HARD .019         PAID .016
DOLLARS .018      TAXES .016
GIVE .016         BANKS .015
WORTH .016        INSURANCE .015
BUY .015          AMOUNT .011
WORKED .014       CREDIT .010
LOST .013         DOLLARS .010
SOON .013         COST .008
PAY .013          FUNDS .008
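A comparison of this kind can be sketched as follows. The slides do not say which form of KL distance was used or how the Run 2 topics were re-ordered; the sketch assumes a symmetrized KL divergence with a small smoothing constant and a greedy one-to-one matching. The names `sym_kl` and `match_topics` and the toy topics are hypothetical.

```python
import math


def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two word distributions given as dicts."""
    total = 0.0
    for w in set(p) | set(q):
        pw = p.get(w, 0.0) + eps  # eps avoids log(0) for words missing in one topic
        qw = q.get(w, 0.0) + eps
        total += 0.5 * (pw * math.log(pw / qw) + qw * math.log(qw / pw))
    return total


def match_topics(run1, run2):
    """Greedily pair each Run 1 topic with its closest unused Run 2 topic."""
    pairs = sorted(
        (sym_kl(p, q), i, j) for i, p in enumerate(run1) for j, q in enumerate(run2)
    )
    used_i, used_j, matches = set(), set(), []
    for dist, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches.append((i, j, dist))
            used_i.add(i)
            used_j.add(j)
    return sorted(matches)


# Toy topics: Run 2 contains the same two topics in the opposite order.
run1 = [{"MONEY": 0.6, "BANK": 0.4}, {"RIVER": 0.5, "STREAM": 0.5}]
run2 = [{"RIVER": 0.55, "STREAM": 0.45}, {"MONEY": 0.5, "BANK": 0.5}]
matches = match_topics(run1, run2)
for i, j, dist in matches:
    print(f"run 1 topic {i} <-> run 2 topic {j}: KL = {dist:.3f}")
```

An optimal assignment (for example the Hungarian algorithm) could replace the greedy matching; greedy is used here only to keep the sketch short.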
Extensions of Topic Models
Combining topics and syntax
Topic hierarchies
Topic segmentation: no need for document boundaries
Modeling authors as well as documents: who wrote this part of the paper?
Words can have high probability in multiple topics
(Based on TASA corpus)
Topic: PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
Topic: PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
Topic: TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
Topic: JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
Topic: HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
Topic: STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Violation of triangle inequality
[Figure: triangle with vertices A, B, C]
Triangle inequality: AC ≤ AB + BC
Can find associations that violate this: SOCCER, FIELD, MAGNETIC
No Triangle Inequality with Topics
[Figure: SOCCER and FIELD linked to TOPIC 1; FIELD and MAGNETIC linked to TOPIC 2]
Topic structure easily explains violations of triangle inequality
Small-World Structure of Associations (Steyvers & Tenenbaum, 2005)
Properties:
1) Short path lengths
2) Clustering
3) Power law degree distributions
Small world graphs arise elsewhere: internet, social relations, biology
[Figure: network linking BASEBALL, BAT, BALL, GAME, PLAY, STAGE, THEATER]
Power law degree distributions in Semantic Networks
Power law degree distribution: some words are “hubs” in a semantic network
[Figure: log-log plots of P( k ) against k for the UNDIRECTED ASSOCIATIVE NETWORK, WORDNET, and ROGET'S THESAURUS]
[Figure: log-log plots of P( k ) against k = #incoming links, including curves for d=50, d=200, d=400]
Creating Association Networks
TOPICS MODEL:
• Connect i to j when P( w=j | w=i ) > threshold
LSA:
For each word, generate K associates by picking K nearest neighbors in semantic space
=-2.05
Associations in free recall
STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS .....
FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process
Reconstruct study list based on the stored “gist”
The gist can be represented by a distribution over topics
Under a single topic assumption:
$P(w_{n+1} \mid \mathbf{w}) = \sum_{z} P(w_{n+1} \mid z)\, P(z \mid \mathbf{w})$
w_{n+1}: retrieved word; w = w_1, ..., w_n: study list
Predictions for the “Sleep” list
[Figure: bar chart of P( w_{n+1} | w ), x-axis from 0 to 0.2]
STUDY LIST: BED REST TIRED AWAKE WAKE NAP DREAM YAWN DROWSY BLANKET SNORE SLUMBER PEACE DOZE
EXTRA LIST (top 8): SLEEP NIGHT ASLEEP MORNING HOURS SLEEPY EYES AWAKENED
Hidden Markov Topics Model
Syntactic dependencies: short-range dependencies
Semantic dependencies: long-range
[Figure: graphical model with a topic z, a word w, and a syntactic state s at each position]
Semantic state: generate words from topic model
Syntactic states: generate words from HMM
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
Transition between semantic state and syntactic states
x = 1 (semantic state): pick a topic z, then a word from that topic
z = 1 (0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
z = 2 (0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
x = 2 (syntactic state): OF 0.6, FOR 0.3, BETWEEN 0.1
x = 3 (syntactic state): THE 0.6, A 0.3, MANY 0.1
Transition probabilities shown on the arrows: 0.9, 0.1, 0.2, 0.8, 0.7, 0.3
Combining topics and syntax
The sentence is generated one word at a time:
THE ………………………………
THE LOVE……………………
THE LOVE OF………………
THE LOVE OF RESEARCH ……
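The step-by-step example can be imitated with a small sampler. The emission tables below are copied from the slide, but the figure's arrows are lost in this transcript, so the `TRANSITIONS` table is a hypothetical arrangement of the six transition probabilities. The function names `draw` and `generate` are also assumptions.

```python
import random

# Emission tables as shown on the slide.
TOPIC_MIX = {1: 0.4, 2: 0.6}
TOPIC_WORDS = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}
CLASS_WORDS = {
    2: {"OF": 0.6, "FOR": 0.3, "BETWEEN": 0.1},
    3: {"THE": 0.6, "A": 0.3, "MANY": 0.1},
}
# Hypothetical: which arrow carries which probability is an assumption.
TRANSITIONS = {
    1: {2: 0.8, 3: 0.2},
    2: {1: 0.7, 3: 0.3},
    3: {1: 0.9, 2: 0.1},
}
SEMANTIC_STATE = 1


def draw(dist, rng):
    """Sample one key of dist in proportion to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(n_words, rng, start_state=3):
    """Generate (word, state) pairs from the composite model."""
    state, out = start_state, []
    for _ in range(n_words):
        if state == SEMANTIC_STATE:
            # Semantic state: choose a topic, then a word from that topic.
            word = draw(TOPIC_WORDS[draw(TOPIC_MIX, rng)], rng)
        else:
            # Syntactic state: choose a word from that state's own distribution.
            word = draw(CLASS_WORDS[state], rng)
        out.append((word, state))
        state = draw(TRANSITIONS[state], rng)
    return out


sentence = generate(4, random.Random(1))
print(" ".join(word for word, _ in sentence))
```

A sequence such as THE LOVE OF RESEARCH corresponds to the state path 3, 1, 2, 1 under this labeling.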
Semantic topics
Topic: FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING FRUITS VEGETABLES WEIGHT FATS NEEDS CARBOHYDRATES VITAMINS CALORIES PROTEIN MINERALS
Topic: MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST AUSTRALIA GLOBE POLES HEMISPHERE LATITUDE PLACES LAND WORLD COMPASS CONTINENTS
Topic: DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE NURSING TREATMENT NURSES PHYSICIAN HOSPITALS DR SICK ASSISTANT EMERGENCY PRACTICE
Topic: BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES GUIDE WORDS MATERIAL ARTICLE ARTICLES WORD FACTS AUTHOR REFERENCE NOTE
Topic: GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM ORE ALUMINUM MINERAL MINE STONE MINERALS POT MINING MINERS TIN
Topic: BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS INDIVIDUALS PSYCHOLOGICAL EXPERIENCES ENVIRONMENT HUMAN RESPONSES BEHAVIORS ATTITUDES PSYCHOLOGY PERSON
Topic: CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING FUNGI MOLD MATERIALS NUCLEUS CELLED STRUCTURES MATERIAL STRUCTURE GREEN MOLDS
Topic: PLANTS PLANT LEAVES SEEDS SOIL ROOTS FLOWERS WATER FOOD GREEN SEED STEMS FLOWER STEM LEAF ANIMALS ROOT POLLEN GROWING GROW
Syntactic classes
Class: GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE * BIG LONG HIGH DIFFERENT SPECIAL OLD STRONG YOUNG COMMON WHITE SINGLE CERTAIN
Class: THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A AN THAT NEW THOSE EACH MR ANY MRS ALL
Class: MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER LONGER FASTER EXACTLY SMALLER SOMETHING BIGGER FEWER LOWER ALMOST
Class: ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON TOWARD UNDER ALONG NEAR BEHIND OFF ABOVE DOWN BEFORE
Class: SAID ASKED THOUGHT TOLD SAYS MEANS CALLED CRIED SHOWS ANSWERED TELLS REPLIED SHOUTED EXPLAINED LAUGHED MEANT WROTE SHOWED BELIEVED WHISPERED
Class: ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY SEVERAL FOUR FIVE BOTH TEN SIX MUCH TWENTY EIGHT
Class: HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS SOMEONE WHO NOBODY ONE SOMETHING ANYONE EVERYBODY SOME THEN
Class: BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP KEEP GIVE LOOK COME WORK MOVE LIVE EAT BECOME
NIPS Syntax
Class: MODEL ALGORITHM SYSTEM CASE PROBLEM NETWORK METHOD APPROACH PAPER PROCESS
Class: IS WAS HAS BECOMES DENOTES BEING REMAINS REPRESENTS EXISTS SEEMS
Class: SEE SHOW NOTE CONSIDER ASSUME PRESENT NEED PROPOSE DESCRIBE SUGGEST
Class: USED TRAINED OBTAINED DESCRIBED GIVEN FOUND PRESENTED DEFINED GENERATED SHOWN
Class: IN WITH FOR ON FROM AT USING INTO OVER WITHIN
Class: HOWEVER ALSO THEN THUS THEREFORE FIRST HERE NOW HENCE FINALLY
Class: # * I X T N - C F P
NIPS Semantics
Topic: EXPERTS EXPERT GATING HME ARCHITECTURE MIXTURE LEARNING MIXTURES FUNCTION GATE
Topic: DATA GAUSSIAN MIXTURE LIKELIHOOD POSTERIOR PRIOR DISTRIBUTION EM BAYESIAN PARAMETERS
Topic: STATE POLICY VALUE FUNCTION ACTION REINFORCEMENT LEARNING CLASSES OPTIMAL *
Topic: MEMBRANE SYNAPTIC CELL * CURRENT DENDRITIC POTENTIAL NEURON CONDUCTANCE CHANNELS
Topic: IMAGE IMAGES OBJECT OBJECTS FEATURE RECOGNITION VIEWS # PIXEL VISUAL
Topic: KERNEL SUPPORT VECTOR SVM KERNELS # SPACE FUNCTION MACHINES SET
Topic: NETWORK NEURAL NETWORKS OUPUT INPUT TRAINING INPUTS WEIGHTS # OUTPUTS
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Topic Hierarchies
In regular topic model, no relations between topics
Alternative: hierarchical topic organization
[Figure: tree with topic 1 at the top, topic 2 and topic 3 below it, and topic 4 to topic 7 at the bottom]
Nested Chinese Restaurant Process: Blei, Griffiths, Jordan, Tenenbaum (2004)
Learn hierarchical structure, as well as topics within structure
Example: Psych Review Abstracts
[Figure: topic hierarchy; the topics are listed below from the top of the tree downward]
Top level: THE OF AND TO IN A IS
Second level:
A MODEL MEMORY FOR MODELS TASK INFORMATION RESULTS ACCOUNT
SELF SOCIAL PSYCHOLOGY RESEARCH RISK STRATEGIES INTERPERSONAL PERSONALITY SAMPLING
MOTION VISUAL SURFACE BINOCULAR RIVALRY CONTOUR DIRECTION CONTOURS SURFACES
DRUG FOOD BRAIN AROUSAL ACTIVATION AFFECTIVE HUNGER EXTINCTION PAIN
Third level:
RESPONSE STIMULUS REINFORCEMENT RECOGNITION STIMULI RECALL CHOICE CONDITIONING
SPEECH READING WORDS MOVEMENT MOTOR VISUAL WORD SEMANTIC
ACTION SOCIAL SELF EXPERIENCE EMOTION GOALS EMOTIONAL THINKING
GROUP IQ INTELLIGENCE SOCIAL RATIONAL INDIVIDUAL GROUPS MEMBERS
SEX EMOTIONS GENDER EMOTION STRESS WOMEN HEALTH HANDEDNESS
REASONING ATTITUDE CONSISTENCY SITUATIONAL INFERENCE JUDGMENT PROBABILITIES STATISTICAL
IMAGE COLOR MONOCULAR LIGHTNESS GIBSON SUBMOVEMENT ORIENTATION HOLOGRAPHIC
CONDITIONIN STRESS EMOTIONAL BEHAVIORAL FEAR STIMULATION TOLERANCE RESPONSES
Generative Process
Markov chain Monte Carlo
Sample from a Markov chain constructed to converge to the target distribution
Allows sampling from unnormalized posterior
Can compute approximate statistics from intractable distributions
Gibbs sampling is one such method: construct the Markov chain from conditional distributions
Enron email: two example topics (T=100)
WORD PROB.
BUSH 0.0227
LAY 0.0193
MR 0.0183
WHITE 0.0153
ENRON 0.0150
HOUSE 0.0148
PRESIDENT 0.0131
ADMINISTRATION 0.0115
COMPANY 0.0090
ENERGY 0.0085
SENDER PROB.
NELSON, KIMBERLY (ETS) 0.3608
PALMER, SARAH 0.0997
DENNE, KAREN 0.0541
HOTTE, STEVE 0.0340
DUPREE, DIANNA 0.0282
ARMSTRONG, JULIE 0.0222
LOKEY, TEB 0.0194
SULLIVAN, LORA 0.0073
VILLARREAL, LILLIAN 0.0040
BAGOT, NANCY 0.0026
TOPIC 10
WORD PROB.
ANDERSEN 0.0241
FIRM 0.0134
ACCOUNTING 0.0119
SEC 0.0065
SETTLEMENT 0.0062
AUDIT 0.0054
CORPORATE 0.0053
FINANCIAL 0.0052
JUSTICE 0.0052
INFORMATION 0.0050
SENDER PROB.
HILTABRAND, LESLIE 0.1359
WELLS, TORI L. 0.0865
DUPREE, DIANNA 0.0825
ARMSTRONG, JULIE 0.0316
DENNE, KAREN 0.0208
SULLIVAN, LORA 0.0072
[email protected] 0.0026
WILSON, DANNY 0.0016
HU, SYLVIA 0.0013
MATHEWS, LEENA 0.0012
TOPIC 32
Enron email: two topics unrelated to work
WORD PROB.
TRAVEL 0.0161
ROUNDTRIP 0.0124
SAVE 0.0118
DEALS 0.0097
HOTEL 0.0095
BOOK 0.0094
SALE 0.0089
FARES 0.0083
TRIP 0.0072
CITIES 0.0070
SENDER PROB.
TRAVELOCITY MEMBER SERVICES 0.0763
BESTFARES.COM HOT DEALS 0.0502
<[email protected]> 0.0315
LISTS.COOLVACATIONS.COM 0.0151
CHEAP TICKETS 0.0111
EXPEDIA FARE TRACKER 0.0106
TRAVELOCITY.COM 0.0096
[email protected] 0.0088
[email protected] 0.0066
LASTMINUTE.COM 0.0051
TOPIC 38
WORD PROB.
NEWS 0.0245
MAIL 0.0182
NYTIMES 0.0149
YORK 0.0128
PAGE 0.0095
TIMES 0.0090
HEADLINES 0.0079
BUSH 0.0077
DELIVERY 0.0070
HTML 0.0068
SENDER PROB.
THE NEW YORK TIMES DIRECT 0.3438
<[email protected]> 0.0104
THE ECONOMIST 0.0029
@TIMES - INSIDE NYTIMES.COM 0.0015
[email protected] 0.0011
AMAZON.COM DELIVERS BESTSELLERS 0.0009
NYTIMES.COM 0.0009
HYATT, JERRY 0.0008
NEWSLETTER_TEXT 0.0008
CHRIS LONG 0.0007
TOPIC 25
Clusters v. Topics
Hidden Markov Models in Molecular Biology: New Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure
Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
[cluster 88]
model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
One Cluster: [cluster 88]
Multiple Topics: [topic 10], [topic 37]
Topic model: C = Φ Θ
C: normalized co-occurrence matrix (words × documents)
Φ: mixture components (words × topics)
Θ: mixture weights (topics × documents)
LSA: C = U D V^T
C: words × documents
U: words × dims
D: dims × dims
V^T: dims × documents
Testing violation of triangle inequality
Look for triplets of associates w1 w2 w3 such that
P( w2 | w1 ) > threshold
P( w3 | w2 ) > threshold
and measure P( w3 | w1 )
Vary threshold
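The triplet search can be sketched as below. The nested-dict layout `cond_prob[w1][w2] = P(w2 | w1)`, the function name `triangle_triplets`, and the probabilities in the example are assumptions made for illustration; only the words SOCCER, FIELD, MAGNETIC come from the earlier slide.

```python
def triangle_triplets(cond_prob, threshold):
    """List (w1, w2, w3, P(w3 | w1)) where both links w1->w2 and w2->w3 exceed threshold."""
    found = []
    for w1, row1 in cond_prob.items():
        for w2, p12 in row1.items():
            if w2 == w1 or p12 <= threshold:
                continue
            for w3, p23 in cond_prob.get(w2, {}).items():
                if w3 in (w1, w2) or p23 <= threshold:
                    continue
                # A small P(w3 | w1) despite two strong links signals a violation.
                found.append((w1, w2, w3, row1.get(w3, 0.0)))
    return found


# Made-up conditional probabilities.
cond_prob = {
    "SOCCER": {"FIELD": 0.30},
    "FIELD": {"MAGNETIC": 0.20, "SOCCER": 0.25},
    "MAGNETIC": {"FIELD": 0.40},
}
triplets = triangle_triplets(cond_prob, threshold=0.1)
for w1, w2, w3, p13 in triplets:
    print(f"{w1} -> {w2} -> {w3}: P({w3} | {w1}) = {p13:.2f}")
```

Repeating the search over a range of thresholds shows how weak the third link can remain while the first two links grow stronger.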
Documents as Topics Mixtures: a Geometric Interpretation
[Figure: simplex with axes P(word1), P(word2), P(word3), each running from 0 to 1, with P(word1)+P(word2)+P(word3) = 1; topic 1 and topic 2 are points on the simplex, and a marker denotes each observed document]
Generating Artificial Data: a visual example
word ↔ pixel, document ↔ image
sample each pixel from a mixture of topics
10 “topics” with 5 x 5 pixels
Interpretable decomposition
• SVD gives a basis for the data, but not an interpretable one
• The true basis is not orthogonal, so rotation does no good
Topic model
SVD
Summary
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
topic assignments to each word P( zi )
likely words in each topic P( w | z )
likely topics in each document (“gist”) P( z | d )
[Figure: similarity matrix between funding programs 1 to 57, with the program labels grouped into Level 1 to Level 4 of the funding hierarchy]
Cancer biology, detection and diagnosis 1
AIDS Research 2
Cancer Research Centers 3
Cancer causation 4
Cancer prevention and control 5
Cancer treatment 6
NCI research manpower development 7
AIDS Research 8
Extramural research 9
Intramural research 10
Archaeology, archeometry, and ... 11
Behavioral and cognitive sciences - Other 12
Child learning and development 13
Cultural anthropology 14
Environmental social and behavioral science 15
Geography and regional science 16
Human cognition and perception 17
Instrumentation 18
Linguistics 19
Physical anthropology 20
Social psychology 21
Africa, Near East, and South Asia 22
Americas 23
Central and Eastern Europe 24
East Asia and Pacific 25
International activities - Other 26
Japan and Korea 27
Western Europe 28
Decision, risk, and management science 29
Methodology, measures, and statistics 30
Economics 31
Ethics and values studies 32
Innovation and organizational change 33
Law and social science 34
Political science 35
Research on science and technology 36
Science and technology studies 37
Social and economic sciences - Other 38
Sociology 39
Transformations to quality organizations 40
Biological infrastructure - Other 41
Human resources 42
Instrumentation 43
Research resources 44
Ecological studies 45
Environmental biology - Other 46
Systematic & population biology 47
Developmental mechanisms 48
Integrative biology and neuroscience - Other 49
Neuroscience 50
Physiology and ethology 51
Biochemical and biomolecular processes 52
Biomolecular structure & function 53
Cell biology 54
Genetics 55
Molecular and cellular biosciences - Other 56
Plant genome research (119) Plant genome research project 57
National Science Foundation (10580)
  Social, Behavioral, and Economic Sciences (SBE) (4584)
    Behavioral and cognitive sciences (BCS) (1469)
    International science and engineering (INT) (formerly International cooperative scientific activities) (1299)
    Social and economic sciences (SES) (1816)
  Biological Sciences (BIO) (5996)
    Biological infrastructure (BIR/DBI) (1061)
    Environmental biology (DEB) (1609)
    Integrative biology and neuroscience (IBN/BNS) (1673)
    Molecular and cellular biosciences (MCB/DCB) (1534)
Dept of Health and Human Services (11609)
  National Institutes of Health (11609)
    National Cancer Institute (7574)
    National Institute of Mental Health (4035)
Similarity between 55 funding programs