Finding scientific topics
Tom Griffiths, Stanford University
Mark Steyvers, UC Irvine
Why map knowledge?
• Quickly grasp important themes in a new field
• Synthesize content of an existing field
• Discover targets for funding and research
INFORMATION OVERLOAD
Apoptosis + Plant Biology
Apoptosis + Medicine
probabilistic generative model
statistical inference
1. A generative model for documents
2. Discovering topics with Gibbs sampling
3. Results
– Topics and classes
– Mapping science
– Topic dynamics
4. Future directions
– Tagging abstracts
A generative model for documents
• Each document a mixture of topics
• Each word chosen from a single topic
• $P(z)$ from parameters $\theta$
• $P(w \mid z)$ from parameters $\phi$
(Blei, Ng, & Jordan, 2003)
A generative model for documents
word          topic 1: P(w|z = 1)   topic 2: P(w|z = 2)
HEART         0.2                   0.0
LOVE          0.2                   0.0
SOUL          0.2                   0.0
TEARS         0.2                   0.0
JOY           0.2                   0.0
SCIENTIFIC    0.0                   0.2
KNOWLEDGE     0.0                   0.2
WORK          0.0                   0.2
RESEARCH      0.0                   0.2
MATHEMATICS   0.0                   0.2
$P(w \mid z = 1) = \phi^{(1)}$, $P(w \mid z = 2) = \phi^{(2)}$
Choose mixture weights for each document, generate “bag of words”
$\theta = \{P(z = 1), P(z = 2)\}$
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
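The generative process on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: it uses the two topics and the five mixture weights shown above, and the function name `generate_document` and the fixed seed are assumptions made here.

```python
import random

# Topic-word distributions from the slide: phi[j][w] = P(w | z = j).
LOVE_WORDS = ["HEART", "LOVE", "SOUL", "TEARS", "JOY"]
SCIENCE_WORDS = ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = {
    1: {w: 0.2 for w in LOVE_WORDS},
    2: {w: 0.2 for w in SCIENCE_WORDS},
}


def generate_document(theta, n_words, rng):
    """Draw a bag of words given theta = [P(z = 1), P(z = 2)] (hypothetical helper)."""
    words = []
    for _ in range(n_words):
        # Choose a topic for this word from the document's mixture weights.
        z = rng.choices([1, 2], weights=theta)[0]
        # Choose the word from that single topic.
        vocab = list(phi[z])
        w = rng.choices(vocab, weights=[phi[z][v] for v in vocab])[0]
        words.append(w)
    return words


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for theta in ([0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1, 0]):
    print(theta, " ".join(generate_document(theta, 10, rng)))
```

With weights {0, 1} every word comes from topic 2, and with {1, 0} every word comes from topic 1, matching the first and last documents on the slide.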
A generative model for documents
• Called Latent Dirichlet Allocation (LDA)
• Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001)
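[Figure: graphical model in which each topic assignment z generates a word w]
[Figure: matrix factorization view.
SVD (Dumais, Landauer): words × documents = U (words × dims) · D (dims × dims) · V (dims × documents)
LDA: P(w) (words × documents) = P(w|z) (words × topics) · P(z) (topics × documents)]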
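The LDA factorization in the figure can be checked numerically. The sketch below multiplies a words × topics matrix by a topics × documents matrix, reusing the two-topic example and its five mixture weights; the variable names and the small `matmul` helper are assumptions made for illustration.

```python
WORDS = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]

# P(w | z): one row per word, one column per topic.
p_w_given_z = [[0.2, 0.0] for _ in range(5)] + [[0.0, 0.2] for _ in range(5)]

# P(z) per document: one row per topic, one column per document.
p_z_given_d = [[0.0, 0.25, 0.5, 0.75, 1.0],
               [1.0, 0.75, 0.5, 0.25, 0.0]]


def matmul(a, b):
    """Plain-Python matrix product (hypothetical helper)."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# P(w) per document: words x documents, each column a distribution over words.
p_w_given_d = matmul(p_w_given_z, p_z_given_d)
for word, row in zip(WORDS, p_w_given_d):
    print(f"{word:<12}", " ".join(f"{p:.2f}" for p in row))
```

Unlike the SVD factors, both matrices here hold probabilities, so every column of the product sums to one.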
Inverting the generative model
• Maximum likelihood estimation (EM)
• Variational EM (Blei, Ng & Jordan, 2003)
• Bayesian inference
Bayesian inference
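$P(\mathbf{z} \mid \mathbf{w}) = \dfrac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}$
• Sum in the denominator over $T^n$ terms
• Full posterior only tractable up to a normalizing constant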
Markov chain Monte Carlo
• Sample from a Markov chain which converges to target distribution
• Allows sampling from an unnormalized posterior distribution
• Can compute approximate statistics from intractable distributions
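To illustrate the second bullet, here is a small sketch of sampling from an unnormalized distribution. Note that it uses a Metropolis sampler with a uniform proposal, not the Gibbs sampler used in the talk, and the target weights are hypothetical; the point is only that the normalizing constant never has to be computed.

```python
import random


def metropolis_sample(unnorm, states, n_steps, rng):
    """Metropolis sampler with a uniform (symmetric) proposal over `states`."""
    current = rng.choice(states)
    samples = []
    for _ in range(n_steps):
        proposal = rng.choice(states)
        # Only a ratio of unnormalized values is needed; the constant cancels.
        if rng.random() < min(1.0, unnorm(proposal) / unnorm(current)):
            current = proposal
        samples.append(current)
    return samples


# Hypothetical unnormalized target; normalized it would be 0.1, 0.3, 0.6.
weights = {"a": 1.0, "b": 3.0, "c": 6.0}
rng = random.Random(0)
samples = metropolis_sample(weights.get, list(weights), 20000, rng)
freq = {s: samples.count(s) / len(samples) for s in weights}
print(freq)
```

The empirical frequencies approximate the normalized target even though the sampler only ever sees the raw weights.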
A visual example: Bars
pixel = word, image = document
sample each pixel from a mixture of topics
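A sketch of how such bar images might be generated follows. The image size (5 × 5), the choice of one topic per horizontal or vertical bar, and the number of pixel draws per image are all assumptions made here; the slide states only that pixels play the role of words and images the role of documents.

```python
import random

SIZE = 5  # assumed image side length; not stated on the slide


def bar_topics(size):
    """One topic per bar; each topic is uniform over the pixels of its bar."""
    topics = []
    for r in range(size):
        topics.append([(r, c) for c in range(size)])  # horizontal bar
    for c in range(size):
        topics.append([(r, c) for r in range(size)])  # vertical bar
    return topics


def sample_image(topics, theta, n_pixels, rng):
    """Each draw is a 'word': pick a topic from theta, then a pixel from that bar."""
    counts = [[0] * SIZE for _ in range(SIZE)]
    for _ in range(n_pixels):
        bar = rng.choices(topics, weights=theta)[0]
        r, c = rng.choice(bar)
        counts[r][c] += 1
    return counts


rng = random.Random(0)
topics = bar_topics(SIZE)
theta = [0.0] * len(topics)
theta[1] = 0.5          # horizontal bar in row 1
theta[SIZE + 3] = 0.5   # vertical bar in column 3
image = sample_image(topics, theta, 100, rng)
for row in image:
    print(" ".join(f"{v:2d}" for v in row))
```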
Interpretable decomposition
• SVD gives a basis for the data, but not an interpretable one
• The true basis is not orthogonal, so rotation does no good
Bayesian model selection
• How many topics do we need?
• A Bayesian would consider the posterior: $P(T \mid \mathbf{w}) \propto P(\mathbf{w} \mid T)\, P(T)$
• Involves summing over assignments z
[Figure: $P(\mathbf{w} \mid T)$ plotted against corpus ($\mathbf{w}$), with curves for T = 10 and T = 100]
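Once log P(w|T) has been estimated for each candidate T, turning it into a posterior over T is a normalization in log space. The sketch below shows that step only. The log-evidence values are hypothetical, the prior is assumed uniform, and the hard part (summing over assignments z to obtain P(w|T)) is not attempted here.

```python
import math


def posterior_over_T(log_evidence, log_prior=None):
    """Return P(T | w) from log P(w | T) and an optional log prior over T."""
    if log_prior is None:
        log_prior = {T: 0.0 for T in log_evidence}  # uniform prior (assumption)
    scores = {T: log_evidence[T] + log_prior[T] for T in log_evidence}
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores.values())
    unnorm = {T: math.exp(s - top) for T, s in scores.items()}
    total = sum(unnorm.values())
    return {T: u / total for T, u in unnorm.items()}


# Hypothetical log P(w | T) values, for illustration only.
log_evidence = {10: -1050.0, 100: -1000.0, 1000: -1020.0}
posterior = posterior_over_T(log_evidence)
best_T = max(posterior, key=posterior.get)
print(best_T, posterior)
```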
Back to the bars
Corpus preprocessing
• Used all D = 28,154 abstracts from 1991-2001
• Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551)
• Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus
• Also, PNAS class designations for 2001 (thanks to Kevin Boyack)
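The vocabulary rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: "any delimiting character" is read here as any non-alphanumeric character, words are upper-cased, and the toy abstracts, stop list, and lowered threshold are hypothetical.

```python
import re
from collections import Counter


def build_vocabulary(abstracts, stop_words, min_docs=5):
    """Keep words that occur in at least `min_docs` abstracts and are not stop words."""
    doc_freq = Counter()
    tokenized = []
    for text in abstracts:
        # Split on any run of non-alphanumeric characters (assumed delimiter rule).
        tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text.upper()) if t]
        tokenized.append(tokens)
        doc_freq.update(set(tokens))  # count each word once per abstract
    vocab = {w for w, df in doc_freq.items()
             if df >= min_docs and w not in stop_words}
    corpus = [[t for t in tokens if t in vocab] for tokens in tokenized]
    return sorted(vocab), corpus


# Toy input; min_docs is lowered to 2 so the tiny example keeps some words.
abstracts = ["Apoptosis of the cell.",
             "The cell cycle; apoptosis!",
             "Climate and the ocean"]
stop_words = {"THE", "OF", "AND"}
vocab, corpus = build_vocabulary(abstracts, stop_words, min_docs=2)
n_tokens = sum(len(doc) for doc in corpus)
print(vocab, corpus, n_tokens)
```

Here W corresponds to `len(vocab)`, D to `len(corpus)`, and n to `n_tokens`.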
Running the algorithm
• Memory requirements linear in T(W+D), runtime proportional to nT
• T = 50, 100, 200, 300, 400, 500, 600, (1000)
• Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100
• All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
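The scaling statements above can be made concrete with the corpus sizes from the preprocessing slide. The sketch counts table entries and per-sweep topic evaluations only; it makes no assumption about bytes per entry or time per evaluation, and the function names are hypothetical.

```python
# Corpus sizes reported on the preprocessing slide.
D, W, n = 28_154, 20_551, 3_026_970


def count_table_entries(T):
    """Memory grows as T(W + D): one count per (word, topic) and (document, topic)."""
    return T * (W + D)


def topic_evaluations_per_sweep(T):
    """Runtime grows as nT: each of n tokens considers each of T topics per sweep."""
    return n * T


for T in (50, 100, 200, 300, 400, 500, 600, 1000):
    print(f"T={T:5d}  entries={count_table_entries(T):>12,}  "
          f"evaluations/sweep={topic_evaluations_per_sweep(T):>14,}")

# 8 chains with 10 samples each give this many samples per value of T.
samples_per_T = 8 * 10
print(samples_per_T)
```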
How many topics?
A selection of topics
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
A selection of topics
TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
A selection of topics
MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
Topics and classes
• PNAS authors provide class designations
– major: Biological, Physical, Social Sciences
– minor: 33 separate disciplines*
• Find topics diagnostic of classes
– validate “reality” of classes
– show topics pick out meaningful structure (classes, and the relations between them)
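One simple way to look for topics diagnostic of a class is to average each document's topic weights within a class and pick the topic that most separates that class from the rest. The sketch below does that. The margin criterion, the class labels, and the topic weights are all hypothetical, and this is not claimed to be the procedure behind the results in the talk.

```python
def mean_theta_by_class(thetas, labels):
    """Average the per-document topic weights within each class."""
    sums, counts = {}, {}
    for theta, label in zip(thetas, labels):
        acc = sums.setdefault(label, [0.0] * len(theta))
        for j, p in enumerate(theta):
            acc[j] += p
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def diagnostic_topic(means, target):
    """Topic whose mean weight in `target` most exceeds its best mean elsewhere.

    Assumes at least two classes are present.
    """
    others = [m for c, m in means.items() if c != target]

    def margin(j):
        return means[target][j] - max(o[j] for o in others)

    return max(range(len(means[target])), key=margin)


# Hypothetical topic weights for four abstracts and three topics.
thetas = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
labels = ["Neurobiology", "Neurobiology", "Ecology", "Ecology"]
means = mean_theta_by_class(thetas, labels)
print(diagnostic_topic(means, "Neurobiology"), diagnostic_topic(means, "Ecology"))
```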
Topic 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS DENDRITES CA1 STIMULATION TERMINALS SYNAPSE
Topic 201: RESISTANCE RESISTANT DRUG DRUGS SENSITIVE MDR MULTIDRUG SUSCEPTIBLE SELECTED GLYCOPROTEIN SENSITIVITY PGP AGENTS CONFERS MDR1 CYTOTOXIC CONFERRED CHEMOTHERAPEUTIC EFFLUX INCREASED
Topic 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY EXPECTED NEUTRAL EVOLVED COMPETITION HISTORY
Topic 222: CORTEX BRAIN SUBJECTS TASK AREAS REGIONS FUNCTIONAL LEFT MEMORY TEMPORAL IMAGING PREFRONTAL CEREBRAL TASKS FRONTAL AREA TOMOGRAPHY EMISSION POSITRON CORTICAL
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE EARTH ECOLOGICAL CHANGE TIME ECOSYSTEM
Topic 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM EQUATION CLASSICAL COMPLEXITY NUMERICAL PROPERTIES
Mapping science
• Topics provide dimensionality reduction
• Some applications require visualization (and even lower dimensionality)
• Low-dimensional representation from methods for analysis of compositional data
Topic dynamics
• We have the distribution over topics for abstracts from 1991 to 2001
• Analysis of dynamics:
– perform linear trend analysis for each topic
– “hot topics” go up, “cold topics” go down
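A linear trend for one topic can be sketched as an ordinary least-squares slope of its mean weight against year. The yearly values below are hypothetical, and labeling a topic by the sign of the slope alone is a simplification made here; it does not test whether the trend is statistically significant.

```python
def linear_trend_slope(years, values):
    """Ordinary least-squares slope of `values` regressed on `years`."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


years = list(range(1991, 2002))  # 1991 to 2001, as on the slide

# Hypothetical mean topic weights per year, for illustration only.
rising = [0.010 + 0.001 * (y - 1991) for y in years]
falling = [0.030 - 0.002 * (y - 1991) for y in years]

for name, series in (("rising", rising), ("falling", falling)):
    slope = linear_trend_slope(years, series)
    label = "hot" if slope > 0 else "cold"
    print(f"{name}: slope={slope:+.4f} per year -> {label}")
```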
Hot topics
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION
Cold topics
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Future directions
• Including different kinds of knowledge
– citations (Hofmann & Cohn, 2001)
– author, title, keywords, other fields
– word order information
• An example: scientific syntax and semantics
Scientific syntax and semantics
[Figure: graphical model with topic assignments z, words w, and syntactic classes x]
Factorization of language based on statistical dependency patterns:
• semantics: probabilistic topics (long-range, document-specific dependencies)
• syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)
Semantic class (x = 2): choose a topic z, then a word from that topic
z = 1 (0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
z = 2 (0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Syntactic classes:
x = 1: THE 0.6, A 0.3, MANY 0.1
x = 3: OF 0.6, FOR 0.3, BETWEEN 0.1
[Figure: transitions among x = 1, x = 2, and x = 3, labeled 0.9, 0.1, 0.2, 0.8, 0.7, 0.3]
Generating a sentence one word at a time:
THE ………………………………
THE LOVE……………………
THE LOVE OF………………
THE LOVE OF RESEARCH ……
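The composite model in this example can be sketched as a walk over classes x, where the semantic class defers to the topic model. The word distributions and topic weights are taken from the slide. The transition table is an assumption: the slide shows the six probabilities 0.9, 0.1, 0.2, 0.8, 0.7, 0.3, but which arrow each belongs to is not recoverable from this transcript, so the assignment below is hypothetical.

```python
import random

TOPICS = {  # each listed word has probability 0.2 within its topic
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}
THETA = {1: 0.4, 2: 0.6}  # document's topic weights, from the slide
CLASS_WORDS = {
    1: {"THE": 0.6, "A": 0.3, "MANY": 0.1},
    3: {"OF": 0.6, "FOR": 0.3, "BETWEEN": 0.1},
}
SEMANTIC_CLASS = 2
# ASSUMED assignment of the slide's six transition probabilities to arrows.
TRANSITIONS = {
    1: {2: 0.9, 1: 0.1},
    2: {3: 0.8, 2: 0.2},
    3: {1: 0.7, 2: 0.3},
}


def draw(dist, rng):
    """Sample a key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(n_words, rng, start_class=1):
    """Emit words by walking the class chain; class 2 emits from a sampled topic."""
    x, out = start_class, []
    for _ in range(n_words):
        if x == SEMANTIC_CLASS:
            z = draw(THETA, rng)
            out.append(rng.choice(TOPICS[z]))  # uniform, i.e. 0.2 per word
        else:
            out.append(draw(CLASS_WORDS[x], rng))
        x = draw(TRANSITIONS[x], rng)
    return out


sentence = generate(4, random.Random(0))
print(" ".join(sentence))
```

Under these assumed transitions, a sequence such as THE LOVE OF RESEARCH corresponds to the class path 1, 2, 3, 2.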
Semantic topics
29           46            51              71          115            125
AGE          SELECTION     LOCI            TUMOR       MALE           MEMORY
LIFE         POPULATION    LOCUS           CANCER      FEMALE         LEARNING
AGING        SPECIES       ALLELES         TUMORS      MALES          BRAIN
OLD          POPULATIONS   ALLELE          BREAST      FEMALES        TASK
YOUNG        GENETIC       GENETIC         HUMAN       SPERM          CORTEX
CRE          EVOLUTION     LINKAGE         CARCINOMA   SEX            SUBJECTS
AGED         SIZE          POLYMORPHISM    PROSTATE    SEXUAL         LEFT
SENESCENCE   NATURAL       CHROMOSOME      MELANOMA    MATING         RIGHT
MORTALITY    VARIATION     MARKERS         CANCERS     REPRODUCTIVE   SONG
AGES         FITNESS       SUSCEPTIBILITY  NORMAL      OFFSPRING      TASKS
CR           MUTATION      ALLELIC         COLON       PHEROMONE      HIPPOCAMPAL
INFANTS      PER           POLYMORPHIC     LUNG        SOCIAL         PERFORMANCE
SPAN         NUCLEOTIDE    POLYMORPHISMS   APC         EGG            SPATIAL
MEN          RATES         RESTRICTION     MAMMARY     BEHAVIOR       PREFRONTAL
WOMEN        RATE          FRAGMENT        CARCINOMAS  EGGS           COGNITIVE
SENESCENT    HYBRID        HAPLOTYPE       MALIGNANT   FERTILIZATION  TRAINING
LOXP         DIVERSITY     GENE            CELL        MATERNAL       TOMOGRAPHY
INDIVIDUALS  SUBSTITUTION  LENGTH          GROWTH      PATERNAL       FRONTAL
CHILDREN     SPECIATION    DISEASE         METASTATIC  FERTILITY      MOTOR
NORMAL       EVOLUTIONARY  MICROSATELLITE  EPITHELIAL  GERM           EMISSION
Syntactic classes
5 8 14 25 26 30 33
IN ARE THE SUGGEST LEVELS RESULTS BEEN
FOR WERE THIS INDICATE NUMBER ANALYSIS MAY
ON WAS ITS SUGGESTING LEVEL DATA CAN
BETWEEN IS THEIR SUGGESTS RATE STUDIES COULD
DURING WHEN AN SHOWED TIME STUDY WELL
AMONG REMAIN EACH REVEALED CONCENTRATIONS FINDINGS DID
FROM REMAINS ONE SHOW VARIETY EXPERIMENTS DOES
UNDER REMAINED ANY DEMONSTRATE RANGE OBSERVATIONS DO
WITHIN PREVIOUSLY INCREASED INDICATING CONCENTRATION HYPOTHESIS MIGHT
THROUGHOUT BECOME EXOGENOUS PROVIDE DOSE ANALYSES SHOULD
THROUGH BECAME OUR SUPPORT FAMILY ASSAYS WILL
TOWARD BEING RECOMBINANT INDICATES SET POSSIBILITY WOULD
INTO BUT ENDOGENOUS PROVIDES FREQUENCY MICROSCOPY MUST
AT GIVE TOTAL INDICATED SERIES PAPER CANNOT
INVOLVING MERE PURIFIED DEMONSTRATED AMOUNTS WORK THEY
AFTER APPEARED TILE SHOWS RATES EVIDENCE ALSO
ACROSS APPEAR FULL SO CLASS FINDING
AGAINST ALLOWED CHRONIC REVEAL VALUES MUTAGENESIS BECOME
WHEN NORMALLY ANOTHER DEMONSTRATES AMOUNT OBSERVATION MAG
ALONG EACH EXCESS SUGGESTED SITES MEASUREMENTS LIKELY
Abstract tagging
• Highlight important words in text, to reduce demands on information users
• Can be done to identify different content:
– words assigned to most prevalent topic reveal important themes (see the paper!)
– with syntactic/semantic factorization, we can highlight words that determine semantic content
(PNAS, 1991, vol. 88, 4874-4876)
A^23 generalized^49 fundamental^11 theorem^20 of^4 natural^46 selection^46 is^32 derived^17 for^5 populations^46 incorporating^22 both^39 genetic^46 and^37 cultural^46 transmission^46. The^14 phenotype^15 is^32 determined^17 by^42 an^23 arbitrary^49 number^26 of^4 multiallelic^52 loci^40 with^22 two^39-factor^148 epistasis^46 and^37 an^23 arbitrary^49 linkage^11 map^20, as^43 well^33 as^43 by^42 cultural^46 transmission^46 from^22 the^14 parents^46. Generations^46 are^8 discrete^49 but^37 partially^19 overlapping^24, and^37 mating^46 may^33 be^44 nonrandom^17 at^9 either^39 the^14 genotypic^46 or^37 the^14 phenotypic^46 level^46 (or^37 both^39). I^12 show^34 that^47 cultural^46 transmission^46 has^18 several^39 important^49 implications^6 for^5 the^14 evolution^46 of^4 population^46 fitness^46, most^36 notably^4 that^47 there^41 is^32 a^23 time^26 lag^7 in^22 the^14 response^28 to^31 selection^46 such^9 that^47 the^14 future^137 evolution^46 depends^29 on^21 the^14 past^24 selection^46 history^46 of^4 the^14 population^46.
(graylevel = “semanticity”, the probability of using LDA over HMM)
(PNAS, 1996, vol. 93, 14628-14631)
The^14 “shape^7” of^4 a^23 female^115 mating^115 preference^125 is^32 the^14 relationship^7 between^4 a^23 male^115 trait^15 and^37 the^14 probability^7 of^4 acceptance^21 as^43 a^23 mating^115 partner^20. The^14 shape^7 of^4 preferences^115 is^32 important^49 in^5 many^39 models^6 of^4 sexual^115 selection^46, mate^115 recognition^125, communication^9, and^37 speciation^46, yet^50 it^41 has^18 rarely^19 been^33 measured^17 precisely^19. Here^12 I^9 examine^34 preference^7 shape^7 for^5 male^115 calling^115 song^125 in^22 a^23 bushcricket*^13 (katydid*^48). Preferences^115 change^46 dramatically^19 between^22 races^46 of^4 a^23 species^15, from^22 strongly^19 directional^11 to^31 broadly^19 stabilizing^45 (but^50 with^21 a^23 net^49 directional^46 effect^46). Preference^115 shape^46 generally^19 matches^10 the^14 distribution^16 of^4 the^14 male^115 trait^15. This^41 is^32 compatible^29 with^21 a^23 coevolutionary^46 model^20 of^4 signal^9-preference^115 evolution^46, although^50 it^41 does^33 not^37 rule^20 out^17 an^23 alternative^11 model^20, sensory^125 exploitation^150. Preference^46 shapes^40 are^8 shown^35 to^31 be^44 genetic^11 in^5 origin^7.
Conclusion
• Probabilistic generative models can reveal the structure of knowledge domains
• We can use these models to
– identify important themes
– synthesize content
– discover targets for funding and research
– reduce the demands on information users
Gibbs sampling
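For variables $\mathbf{z} = z_1, z_2, \ldots, z_n$
Draw $z_i^{(t+1)}$ from $P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})$
$\mathbf{z}_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$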
Gibbs sampling
• Need full conditional distributions for variables
• Since we only sample z we need
number of times word w assigned to topic j
number of times topic j used in document d
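The equation that these two count labels annotate did not survive in this transcript. For a collapsed Gibbs sampler for LDA with symmetric Dirichlet priors, the full conditional takes the standard form below. The hyperparameter names $\alpha$ and $\beta$ are assumptions here (they are not named on the slides); $W$ and $T$ are the vocabulary size and number of topics used elsewhere in the talk, and the subscript $-i$ means the count excludes the current token.

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
  \cdot
  \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```

Here $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic $j$, and $n^{(d_i)}_{-i,j}$ is the number of times topic $j$ is used in document $d_i$; the denominators are the corresponding totals over all words and all topics.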
Gibbs sampling
i     w_i           d_i   z_i (iteration 1)   z_i (iteration 2)   …   z_i (iteration 1000)
1     MATHEMATICS   1     2                   2                       2
2     KNOWLEDGE     1     2                   1                       2
3     RESEARCH      1     1                   1                       2
4     WORK          1     2                   2                       1
5     MATHEMATICS   1     1                   2                       2
6     RESEARCH      1     2                   2                       2
7     WORK          1     2                   2                       2
8     SCIENTIFIC    1     1                   1                       1
9     MATHEMATICS   1     2                   2                       2
10    WORK          1     1                   2                       2
11    SCIENTIFIC    2     1                   1                       2
12    KNOWLEDGE     2     1                   2                       2
...   ...           ...   ...                 ...                     ...
50    JOY           5     2                   1                       1
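The sweep illustrated in the table can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not the authors' implementation: the hyperparameter values `alpha` and `beta` are hypothetical, topics are indexed from 0 internally, and the corpus is the five example documents as they appear earlier in this transcript (where the second document lists nine words).

```python
import random


def gibbs_lda(docs, T, alpha, beta, n_iters, rng):
    """Collapsed Gibbs sampling over topic assignments z (hypothetical sketch)."""
    vocab = sorted({w for doc in docs for w in doc})
    W = len(vocab)
    n_wt = {w: [0] * T for w in vocab}   # times word w assigned to topic j
    n_dt = [[0] * T for _ in docs]       # times topic j used in document d
    n_t = [0] * T                        # total tokens assigned to topic j
    z = []
    for d, doc in enumerate(docs):       # random initial assignments
        z_d = []
        for w in doc:
            j = rng.randrange(T)
            z_d.append(j)
            n_wt[w][j] += 1
            n_dt[d][j] += 1
            n_t[j] += 1
        z.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]              # remove the current token from the counts
                n_wt[w][j] -= 1
                n_dt[d][j] -= 1
                n_t[j] -= 1
                # Unnormalized full conditional; the document-length denominator
                # is the same for every topic, so it is dropped.
                weights = [(n_wt[w][k] + beta) / (n_t[k] + W * beta)
                           * (n_dt[d][k] + alpha) for k in range(T)]
                j = rng.choices(range(T), weights=weights)[0]
                z[d][i] = j
                n_wt[w][j] += 1
                n_dt[d][j] += 1
                n_t[j] += 1
    return z, n_wt, n_dt


docs = [
    "MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK".split(),
    "SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART".split(),
    "MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART".split(),
    "WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL".split(),
    "TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY".split(),
]
z, n_wt, n_dt = gibbs_lda(docs, T=2, alpha=1.0, beta=0.1, n_iters=200,
                          rng=random.Random(0))
for z_d in z:
    print("".join(str(j + 1) for j in z_d))  # print topics as 1 and 2
```

Each pass of the inner loop corresponds to filling in one "?" in the table: the token's current assignment is removed from the counts, a new topic is drawn from the conditional, and the counts are restored.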