
Page 1: Indexing and Searching Timed Media: the Role of Mutual Information Models

Indexing and Searching Timed Media: the Role of Mutual Information Models

Tony Davis (StreamSage/Comcast)

IPAM, UCLA

1 Oct. 2007

Page 2: Indexing and Searching Timed Media: the Role of Mutual Information Models

A bit about what we do…

StreamSage (now a part of Comcast) focuses on indexing and retrieval of “timed” media (video and audio, aka “multimedia” or “broadcast media”).

A variety of applications, but now centered on cable TV.

This is joint work with many members of the research team: Shachi Dave, Abby Elbow, David Houghton, I-Ju Lai, Hemali Majithia, Phil Rennert, Kevin Reschke, Robert Rubinoff, Pinaki Sinha, Goldee Udani.

Page 3: Indexing and Searching Timed Media: the Role of Mutual Information Models

Overview

Theme: use of term association models to address challenges of timed media

Problems addressed:
- Retrieving all and only the relevant portions of timed media for a query
- Lexical semantics (WSD, term expansion, compositionality of multi-word units)
- Ontology enhancement
- Topic detection, clustering, and similarity among documents

Page 4: Indexing and Searching Timed Media: the Role of Mutual Information Models

Automatic indexing and retrieval of streaming media

Streaming media presents particular difficulties:
- Its timed nature makes navigation cumbersome, so the system must extract relevant intervals of documents rather than present a list of documents to the user.
- Speech recognition is error-prone, so the system must compensate for noisy input.

We use a mix of statistical and symbolic NLP. Various modules factor into calculating relevant intervals for each term:
1. Word sense disambiguation
2. Query expansion
3. Anaphor resolution
4. Name recognition
5. Topic segmentation and identification

Page 5: Indexing and Searching Timed Media: the Role of Mutual Information Models

Statistical and NLP Foundations: the COW and YAKS Models

Page 6: Indexing and Searching Timed Media: the Role of Mutual Information Models

Two types of models

COW (Co-Occurring Words): based on proximity of terms, using a sliding window

YAKS (Yet Another Knowledge Source): based on grammatical relationships between terms

Page 7: Indexing and Searching Timed Media: the Role of Mutual Information Models

The COW Model

A large mutual information model of word co-occurrence:

MI(X,Y) = P(X,Y) / (P(X) P(Y))

Thus, COW values greater than 1 indicate correlation (a tendency to co-occur); values less than 1 indicate anticorrelation.

Values are adjusted for centrality (salience).

Two main COW models:
- New York Times, based on 325 million words (about 6 years) of text
- Wikipedia, more recent, roughly the same amount of text

We also have specialized COW models (for medicine, business, and others), as well as models for other languages.
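To make the definition concrete, here is a minimal sketch of building such a table from raw text. The whitespace tokenizer, the 10-token window, and the window-frequency probability estimates are illustrative assumptions, and the centrality (salience) adjustment mentioned above is omitted.

```python
# A minimal sketch of a COW-style table: MI ratios for word pairs co-occurring
# within a sliding window. Probabilities are estimated as window frequencies.
from collections import Counter
from itertools import combinations

def cow_table(sentences, window=10):
    """Return {frozenset({w1, w2}): MI ratio}, with MI = P(x,y) / (P(x) * P(y))."""
    word_counts = Counter()   # number of windows containing each word
    pair_counts = Counter()   # number of windows containing each word pair
    tokens = [w for s in sentences for w in s.lower().split()]
    n_windows = len(tokens)
    for i in range(n_windows):
        win = set(tokens[i:i + window])
        word_counts.update(win)
        pair_counts.update(frozenset(p) for p in combinations(win, 2))
    table = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n_windows
        p_x, p_y = word_counts[x] / n_windows, word_counts[y] / n_windows
        table[pair] = p_xy / (p_x * p_y)   # > 1: correlated, < 1: anticorrelated
    return table
```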

Page 8: Indexing and Searching Timed Media: the Role of Mutual Information Models

The COW (Co-Occurring Words) Model

The COW model is at the core of what we do:
- Relevance interval construction
- Document segmentation and topic identification
- Word sense disambiguation (large-scale and unsupervised, based on clustering the COWs of an ambiguous word)
- Ontology construction
- Determining semantic relatedness of terms
- Determining specificity of a term

Page 9: Indexing and Searching Timed Media: the Role of Mutual Information Models

The COW (Co-Occurring Words) Model An example: top 10 COWs of railroad//N

post//J 165.554

shipper//N 123.568

freight//N 121.375

locomotive//N 119.602

rail//N 73.7922

railway//N 64.6594

commuter//N 63.4978

pickups//N 48.3637

train//N 44.863

burlington//N 41.4952

Page 10: Indexing and Searching Timed Media: the Role of Mutual Information Models

Multi-Word Units (MWUs)

Why do we care about MWUs? Because they act like single words in many cases, but also:

- MWUs are often powerful disambiguators of words within them (see, e.g., Yarowsky (1995), Pedersen (2002) for WSD methods that exploit this): ‘fuel tank’, ‘fish tank’, ‘tank tread’; ‘indoor pool’, ‘labor pool’, ‘pool table’

- Useful in query expansion: ‘Dept. of Agriculture’ ↔ ‘Agriculture Dept.’, ‘hookworm in dogs’ ↔ ‘canine hookworm’

- Provide many terms that can be added to ontologies: ‘commuter railroad’, ‘peanut butter’

Page 11: Indexing and Searching Timed Media: the Role of Mutual Information Models

Multi-Word Units (MWUs) in our models

MWUs in our system:

We extract nominal MWUs, using a simple procedure based on POS-tagging:

1. ({N,J}) ({N,J}) N
2. N Prep (‘the’) ({N,J}) N, where Prep is ‘in’, ‘on’, ‘of’, ‘to’, ‘by’, ‘with’, ‘without’, ‘for’, or ‘against’

For the most common 100,000 or so MWUs in our corpus, we calculate COW values, as we do for words.
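As a rough illustration of pattern 1, the sketch below extracts candidate nominal MWUs from POS-tagged text. NLTK's tagger, the tag collapsing, and the requirement of at least one modifier (so single nouns are excluded) are assumptions of the sketch, not details of the actual system.

```python
# A minimal sketch of pattern 1: ({N,J}) ({N,J}) N over POS-tagged tokens.
import re
from collections import Counter
import nltk

def nominal_mwus(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Collapse Penn Treebank tags to N (noun) or J (adjective); everything else is O.
    simple = "".join("N" if t.startswith("NN") else "J" if t.startswith("JJ") else "O"
                     for _, t in tagged)
    words = [w for w, _ in tagged]
    mwus = Counter()
    # One or two N/J modifiers followed by a head noun; the lookahead allows overlaps.
    for m in re.finditer(r"(?=([NJ]{1,2}N))", simple):
        span, start = m.group(1), m.start()
        mwus[" ".join(words[start:start + len(span)]).lower()] += 1
    return mwus

print(nominal_mwus("The commuter railroad carried peanut butter to the flea market."))
```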

Page 12: Indexing and Searching Timed Media: the Role of Mutual Information Models

COWs of MWUs An example: top ten COWs of ‘commuter railroad’

post//J 1234.47

pickups//N 315.005

rail//N 200.839

sanitation//N 186.99

weekday//N 135.609

transit//N 134.329

commuter//N 119.435

subway//N 86.6837

transportation//N 86.487

railway//N 86.2851

Page 13: Indexing and Searching Timed Media: the Role of Mutual Information Models

COWs of MWUs Another example: top ten COWs of ‘underground railroad’

abolitionist//N 845.075

slave//N 401.732

gourd//N 266.538

runaway//J 226.163

douglass//N 170.459

slavery//N 157.654

harriet//N 131.803

quilt//N 109.241

quaker//N 94.6592

historic//N 86.0395

Page 14: Indexing and Searching Timed Media: the Role of Mutual Information Models

The YAKS model

Motivations:
- COW values reflect simple co-occurrence or association, but no particular relationship beyond that.
- For some purposes, it's useful to measure the association between two terms in a particular syntactic relationship.

Construction:
- Parse a lot of text (the same 325 million words of New York Times used to build our NYT COW model); however, long sentences (>25 words) were discarded, as parsing them was slow and error-prone.
- The parser's output provides information about grammatical relations between words in a clause; to measure the association of a verb (say ‘drink’) and a noun as its object (say ‘beer’), we consider the set of all verb-object pairs and calculate mutual information over that set.
- We also calculate MI for broader semantic classes of terms, e.g., food, substance. Semantic classes were taken from the Cambridge International Dictionary of English (CIDE); there are about 2,000 of them, arranged in a shallow hierarchy.
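A minimal sketch of that construction step, assuming the parser has already produced a list of (verb, object) pairs; the toy example data are illustrative only.

```python
# A YAKS-style table sketch: MI ratios computed over (verb, object) pairs rather
# than over a co-occurrence window.
from collections import Counter

def yaks_obj_table(verb_object_pairs):
    """Return {(verb, obj): P(v,o) / (P(v) * P(o))} over the set of verb-object pairs."""
    verb_counts = Counter(v for v, _ in verb_object_pairs)
    obj_counts = Counter(o for _, o in verb_object_pairs)
    pair_counts = Counter(verb_object_pairs)
    n = len(verb_object_pairs)
    table = {}
    for (v, o), c in pair_counts.items():
        p_vo = c / n
        p_v, p_o = verb_counts[v] / n, obj_counts[o] / n
        table[(v, o)] = p_vo / (p_v * p_o)
    return table

# Example: strongly associated objects of 'eat' get values well above 1.
pairs = [("eat", "hamburger"), ("eat", "pretzel"), ("drink", "beer"),
         ("eat", "hamburger"), ("drink", "coffee"), ("see", "film")]
print(yaks_obj_table(pairs)[("eat", "hamburger")])
```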

Page 15: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS Examples Some objects of ‘eat’

OBJ head=eat

arg1=hamburger 139.249

arg1=pretzel 90.359

arg1=:Food 18.156

arg1=:Substance 7.89

arg1=:Sound 0.324

arg1=:Place 0.448

Page 16: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals

Page 17: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals (RIs)

Each RI is a contiguous segment of audio/video deemed relevant to a term.

1. RIs are calculated for all content words (after lemmatization) and multi-word expressions.
2. The basis of each RI is the sentence containing the term.
3. Each RI is expanded forward and backward to capture relevant material, using techniques including:
   - Topic boundary detection by changes in COW values across sentences
   - Topic boundary detection via discourse markers
   - Synonym-based query expansion
   - Anaphor resolution
4. Nearby RIs for the same term are merged (a sketch of this step follows the list).
5. Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI and on the COW values of other words in the RI with the term.
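A minimal sketch of step 4, merging nearby relevance intervals for the same term; the 15-second gap threshold is an illustrative assumption, not a value from the talk.

```python
# Merge nearby relevance intervals, given as (start, end) times in seconds.
def merge_nearby(intervals, gap=15.0):
    """Merge intervals separated by at most `gap` seconds; return a sorted merged list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_nearby([(10.0, 25.0), (30.0, 55.0), (120.0, 140.0)]))
# -> [(10.0, 55.0), (120.0, 140.0)]
```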

Page 18: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:

Paul Bew is professor of Irish politics at Queens University in Belfast.
In South Africa the government is struggling to contain a growing demand for land from its black citizens.
Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property.
Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.
“Must give us a place…”

We build an RI for squatter around each of these sentences…

Page 19: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:

Paul Bew is professor of Irish politics at Queens University in Belfast.
In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand]
Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days.
NPR’s Kenneth Walker has a report.
Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.
“Must give us a place…” [topic segment boundary]

We build an RI for squatter around each of these sentences…

Page 20: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals: an Example

Index term: squatter

Among the sentences containing this term are these two, near each other:

Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary]
In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand]
Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg.
In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand]
Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals]
NPR’s Kenneth Walker has a report. [merge nearby intervals]
Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa.
“Must give us a place…” [topic segment boundary]

Two occurrences of squatter produce a complete merged interval.

Page 21: Indexing and Searching Timed Media: the Role of Mutual Information Models

Relevance Intervals and Virtual Documents

The set of all the RIs for a term in a document constitutes the virtual document for that term

In effect, the VD for a term is intended to approximate a document that would have been produced had the authors focused solely on that term

A VD is assigned a magnitude equal to the highest magnitude of the RIs in it, with a bonus if more than one RI has a similarly high magnitude

Page 22: Indexing and Searching Timed Media: the Role of Mutual Information Models

Merging RIs for multiple terms

[Figure: timelines of RIs for ‘Iran’ and for ‘Russia’, with occurrences of the original terms marked]

Note that this can only be done at query time, so it needs to be fairly quick and simple.

Page 23: Indexing and Searching Timed Media: the Role of Mutual Information Models

Merging RIs

[Figure: activation spreading between the RIs for ‘Iran’ and ‘Russia’, with occurrences of the original terms marked]

Page 24: Indexing and Searching Timed Media: the Role of Mutual Information Models

Merging RIs

[Figure: merged intervals for ‘Russia and Iran’ built from the individual ‘Iran’ and ‘Russia’ RIs, with occurrences of the original terms marked]

Page 25: Indexing and Searching Timed Media: the Role of Mutual Information Models

Evaluating RIs and VDs

Evaluation of retrieval effectiveness in timed media raises further issues:
- Building a gold standard is painstaking, and potentially more subjective.
- It's necessary to measure how closely the system's RIs match the gold standard's.
- What's a reasonable baseline?

We created a gold standard of about 2,300 VDs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI in a VD on a scale of 1 (highly relevant) to 3 (marginally relevant).

Testing of the system was performed on speech recognizer output.

Page 26: Indexing and Searching Timed Media: the Role of Mutual Information Models

Evaluating RIs and VDs

Measure amounts of extraneous and missed content

[Figure: a system RI overlaid on an ideal RI, showing missed content (parts of the ideal RI outside the system RI) and extraneous content (parts of the system RI outside the ideal RI)]
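A minimal sketch of this measurement for a single (ideal, system) interval pair; expressing both quantities as percentages of the ideal interval's length is an assumption for illustration.

```python
# Given a gold-standard ("ideal") interval and a system interval, measure how
# much content is missed and how much is extraneous.
def missed_and_extraneous(ideal, system):
    """ideal, system: (start, end) in seconds. Returns (missed_pct, extraneous_pct)."""
    overlap = max(0.0, min(ideal[1], system[1]) - max(ideal[0], system[0]))
    ideal_len = ideal[1] - ideal[0]
    system_len = system[1] - system[0]
    missed = ideal_len - overlap          # gold content the system did not return
    extraneous = system_len - overlap     # returned content outside the gold interval
    return 100.0 * missed / ideal_len, 100.0 * extraneous / ideal_len

print(missed_and_extraneous((100.0, 160.0), (110.0, 180.0)))
# -> roughly (16.7, 33.3)
```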

Page 27: Indexing and Searching Timed Media: the Role of Mutual Information Models

Evaluating RIs and VDs

Comparison of median percentages of extraneous and missed content over all queries, between the system using COWs and a baseline using only sentences in which query terms are present:

relevance:        most relevant     relevant        most relevant     marginally
                                                    and relevant      relevant
system            extra   miss      extra   miss    extra   miss      extra   miss
With COWs           9.3   12.7       39.2   27.5     29.9   21.6       61.7   15.0
Without COWs        0.0   57.7        0.3   64.7      0.0   63.4       18.8   60.2

Page 28: Indexing and Searching Timed Media: the Role of Mutual Information Models

MWUs and compositionality

Page 29: Indexing and Searching Timed Media: the Role of Mutual Information Models

MWUs, idioms, and compositionality

Several partially independent factors are in play here (Calzolari, et al. 2002):

1. reduced syntactic and semantic transparency;

2. reduced or lack of compositionality;

3. more or less frozen or fixed status;

4. possible violation of some otherwise general syntactic patterns or rules;

5. a high degree of lexicalization (depending on pragmatic factors);

6. a high degree of conventionality.

Page 30: Indexing and Searching Timed Media: the Role of Mutual Information Models

MWUs, idioms, and compositionality

In addition, there are two kinds of “mixed” cases:
- Ambiguous MWUs, with one meaning compositional and the other not: ‘end of the tunnel’, ‘underground railroad’
- “Normal” use of some component words, but not others: ‘flea market’ (a kind of market), ‘peanut butter’ (a spread made from peanuts)

Page 31: Indexing and Searching Timed Media: the Role of Mutual Information Models

Automatic detection of non-compositionality

Previous work

Lin (1999): “based on the hypothesis that when a phrase is non-compositional, its mutual information differs significantly from the mutual informations [sic] of phrases obtained by substituting one of the words in the phrase with a similar word.”

For instance, the distribution of ‘peanut’ and ‘butter’ should differ from that of ‘peanut’ and ‘margarine’.

Results are not very good yet, because semantically related words often have quite different distributions, and many compositional collocations are “institutionalized”, so that substituting words within them will change distributional statistics.

Page 32: Indexing and Searching Timed Media: the Role of Mutual Information Models

Automatic detection of non-compositionality

Previous work

Baldwin et al. (2002): “use latent semantic analysis to determine the similarity between a multiword expression and its constituent words”; “higher similarities indicate greater decomposability”.

“Our expectation is that for constituent word-MWE pairs with higher LSA similarities, there is a greater likelihood of the MWE being a hyponym of the constituent word.” (for head words of MWEs)

The measure “correlate[s] moderately with WordNet-based hyponymy values.”

Page 33: Indexing and Searching Timed Media: the Role of Mutual Information Models

Automatic detection of non-compositionality

We use the COW model for a related approach to the problem:
- The COWs (and COW values) of an MWU and its component words will be more alike if the MWU is compositional.
- We use a measure of occurrences of a component word near an MWU as another criterion of compositionality: the more often words in the MWU appear near it, but not as part of it, the more likely it is that the MWU is compositional.

Page 34: Indexing and Searching Timed Media: the Role of Mutual Information Models

COW pair sum measure

Get the top n COWs of an MWU, and of one of its component words. For each pair of COWs (one from each of these lists), find their COW value.

railroad       commuter railroad
post//J        post//J
shipper//N     pickups//N
freight//N     rail//N

Page 35: Indexing and Searching Timed Media: the Role of Mutual Information Models

COW pair sum measure

Get the top n COWs of an MWU, and of one of its component words. For each pair of COWs (one from each of these lists), find their COW value. Then sum these values. This provides a measure of how similar the contexts of the MWU and its component word are.

Page 36: Indexing and Searching Timed Media: the Role of Mutual Information Models

Feature overlap measure

Get the top n COWs (and values) of an MWU, and of one of its component words.

For each COW with a value greater than some threshold, treat that COW as a feature of the term.

Then compute the overlap between the two feature sets using the Jaccard coefficient; for two sets of features A and B:

|A ∩ B| / |A ∪ B|

Page 37: Indexing and Searching Timed Media: the Role of Mutual Information Models

Occurrence-based measure

For each occurrence of an MWU, determine whether a given component word occurs in a window around that occurrence, but not as part of the MWU.

Calculate the proportion of occurrences for which this is the case, relative to all occurrences of the MWU.
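A combined sketch of the three measures just described (COW pair sum, feature overlap, and the occurrence-based measure); the COW-table format, the thresholds, and the window size are illustrative assumptions, not the values used in the system.

```python
# Three compositionality measures over COW data for an MWU and one component word.
def cow_pair_sum(cow_table, mwu_top, word_top):
    """Sum the COW values of all pairs (one COW of the MWU, one COW of the word)."""
    return sum(cow_table.get(frozenset((a, b)), 0.0)
               for a in mwu_top for b in word_top if a != b)

def feature_overlap(mwu_cows, word_cows, threshold=5.0):
    """Jaccard coefficient of the sets of COWs whose value exceeds a threshold."""
    a = {t for t, v in mwu_cows.items() if v > threshold}
    b = {t for t, v in word_cows.items() if v > threshold}
    return len(a & b) / len(a | b) if (a | b) else 0.0

def occurrence_measure(tokens, mwu, component, window=20):
    """Fraction of MWU occurrences with the component word nearby but outside the MWU."""
    mwu_len = len(mwu)
    hits = total = 0
    for i in range(len(tokens) - mwu_len + 1):
        if tokens[i:i + mwu_len] == list(mwu):
            total += 1
            context = tokens[max(0, i - window):i] + tokens[i + mwu_len:i + mwu_len + window]
            hits += component in context
    return hits / total if total else 0.0
```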

Page 38: Indexing and Searching Timed Media: the Role of Mutual Information Models

Testing the measures

We extracted all MWUs tagged as idiomatic in the Cambridge International Dictionary of English (about 1000 expressions).

There are about 112 of these that conform to our MWU patterns and occur with sufficient frequency in our corpus that we have calculated COWs for them.

fashion victim

flea market

flip side

Page 39: Indexing and Searching Timed Media: the Role of Mutual Information Models

Testing the measures

We then searched the 100,000 MWUs for which we have COW values, choosing compositional MWUs containing the same terms.

In some cases, this is difficult or impossible, as no appropriate MWUs are present. About 144 MWUs are on the compositional list.

fashion victim fashion designer crime victim

flea market [flea collar] market share

flip side [coin flip] side of the building

Page 40: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: basic statistics

The idiomatic and compositional sets are quite different in aggregate, though there is a large variance:

                         COW pair sum    Feature overlap    Occurrence measure
Non-idiomatic   mean         575.478              0.297                37.877
                s.d.         861.754              0.256                23.470
Idiomatic       mean        -236.92               0.109                16.954
                s.d.         502.436              0.180                16.637

Page 41: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?

COW pair sum

                  negative    positive
Non-idiomatic         75         213
Idiomatic            178          46

Page 42: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?

feature overlap

                   < 0.12     >= 0.12
Non-idiomatic         100         188
Idiomatic             175          49

Page 43: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: discriminating the two sets

How well does each measure discriminate between idioms and non-idioms?

occurrence-based measure

                    < 25%      >= 25%
Non-idiomatic          94         194
Idiomatic             174          50

Page 44: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: discriminating the two sets

Can we do better by combining the measures? We used the decision-tree software C5.0 to check.

Rule: if COW pair sum <= -216.739, or COW pair sum <= 199.215 and occurrence measure < 27.74%, then idiomatic; otherwise non-idiomatic.

                    yes      no
Non-idiomatic        50     238
Idiomatic           184      40

Page 45: Indexing and Searching Timed Media: the Role of Mutual Information Models

Results: discriminating the two sets

Some cases are “split”: classified as idiomatic with respect to one component word but not the other:
- ‘bear hug’ is idiomatic w.r.t. ‘bear’ but not ‘hug’
- ‘flea market’ is idiomatic w.r.t. ‘flea’ but not ‘market’

Other methods to improve performance on this task:
- MWUs often come in semantic “clusters”: ‘almond tree’, ‘peach tree’, ‘blackberry bush’, ‘pepper plant’, etc. Corresponding components in these MWUs can be localized in a small area of WordNet (Barrett, Davis, and Dorr (2001)) or UMLS (Rosario, Hearst, and Fillmore (2002)).
- “Outliers” that don't fit the pattern are potentially idiomatic or non-compositional (‘plane tree’, in contrast to ‘rubber tree’, which is compositional).

Page 46: Indexing and Searching Timed Media: the Role of Mutual Information Models

Clustering and topic detection

Page 47: Indexing and Searching Timed Media: the Role of Mutual Information Models

Clustering by similarities among segments

Content of a segment is represented by its topically salient terms.

The COW model is used to calculate a similarity measure for each pair of segments.

Clustering on the resulting matrix of similarities (using the CLUTO package) yields topically distinct clusters of results.
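A minimal sketch of this pipeline, with SciPy's hierarchical clustering standing in for the CLUTO package and an averaged pairwise COW score standing in for the actual similarity measure; both substitutions are assumptions for illustration.

```python
# Cluster media segments by pairwise similarity of their salient terms.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def segment_similarity(terms_a, terms_b, cow_table):
    """Average COW association between the salient terms of two segments."""
    scores = [cow_table.get(frozenset((a, b)), 0.0)
              for a in terms_a for b in terms_b if a != b]
    return sum(scores) / len(scores) if scores else 0.0

def cluster_segments(segments, cow_table, n_clusters=2):
    """segments: list of sets of topically salient terms. Returns cluster labels."""
    n = len(segments)
    sim = np.array([[segment_similarity(segments[i], segments[j], cow_table)
                     for j in range(n)] for i in range(n)])
    dist = sim.max() - sim                  # convert similarities to distances
    np.fill_diagonal(dist, 0.0)
    return fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=n_clusters, criterion="maxclust")
```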

Page 48: Indexing and Searching Timed Media: the Role of Mutual Information Models

Clustering Results

Example: crane

10 segments form 2 well-defined clusters, one relating to birds, the other to machinery (and cleanup of the WTC debris in particular)

Page 49: Indexing and Searching Timed Media: the Role of Mutual Information Models

Clustering Results Example: crane

Page 50: Indexing and Searching Timed Media: the Role of Mutual Information Models

Cluster Labeling

Topic labels for clusters improve usability.

Candidate cluster labels can be obtained from:
- Topic terms of segments in a cluster
- Multi-word units containing the query term(s)
- Outside sources (taxonomies, Wikipedia, …)

Page 51: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

LDA (Blei, Ng, and Jordan 2003) models each document as a probability distribution over a set of underlying topics, with each topic defined as a probability distribution over terms.

We've generated topic models for the SR output of about 20,000 news programs (from 400 to 1,000 topics).

Nouns, verbs, and MWUs only; all other words discarded.

Impressionistically, the “best” topics seem to be those in which a third or more of the probability mass is in 10-50 terms.
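A minimal sketch of the topic-modeling step, using scikit-learn's LatentDirichletAllocation as a stand-in for whatever toolkit was actually used; the tiny transcript list and the plain bag-of-words features (rather than the noun/verb/MWU vocabulary described above) are illustrative only.

```python
# Fit LDA topics to transcripts and print the top terms per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

transcripts = [
    "stem cell research embryo scientist funding cure",
    "palestinian israeli peace settlement west bank jerusalem",
    "stem cell embryonic research federal funding",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(transcripts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)          # documents as topic distributions

# Top terms per topic, analogous to the topic listings on the next slides.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {' '.join(top)}")
```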

Page 52: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

Are there better ways to gauge topic coherence (and relatedness)? We've tried two:
- COW sum measure (Wikipedia COW table applied to the 20 most probable words in a topic)
- Average “distance” between CIDE semantic domain codes for terms (also top 20, using the similarity measure of Wu and Palmer 1994)

Excerpt of the CIDE hierarchy:
43 Building and Civil Engineering
66 Buildings
68 Buildings: names and types of
754 Houses and homes
755 Public buildings
365 Rooms
194 Furniture and Fittings
834 Bathroom fixtures and fittings
811 Curtains and wallpaper
804 Tables
805 Chairs and seats

Page 53: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics).

COW sum measure:

Rank in WikiCow = 1, rank in CIDE = 232:
return home war talk stress month post-traumatic rack deal think stress$disorder night psychological face understand pt experience combat series mental$health

Rank in WikiCow = 2, rank in CIDE = 773:
research cell embryo stem$cell disease destroy embryonic$stem scientist parkinson researcher federal funding potential cure today human$embryo medical human$life california create

Rank in WikiCow = 3, rank in CIDE = 924:
giuliani new$hampshire run talk gingrich february former$new thompson social mormon massachusetts mayor iowa newt opening york choice rudy$giuliani jim committee

Page 54: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics).

COW sum measure:

Rank in WikiCow = 4, rank in CIDE = 650:
palestinian israeli prime$minister israelis arafat west$bank jerusalem peace settlement gaza$strip government violence israeli$soldier hama leader move force security peace$process downtown

Rank in WikiCow = 5, rank in CIDE = 283:
japan japanese tokyo connecticut lieberman lucy run lose joe$lieberman support lemont disaster war twenty-first$century anti-war define peterson$cbs senator$joe future figure

Rank in WikiCow = 6, rank in CIDE = 710:
california schwarzenegger union break office view arnold$schwarzenegger tuesday politician budget rosie poll agenda o'donnell political battle california$governor help rating maria

Page 55: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics).

CIDE distance measure:

Rank in CIDE = 1, rank in WikiCow = 278:
play show broadway actor stage star performance theatre perform audience act career tony sing character production young role review performer

Rank in CIDE = 2, rank in WikiCow = 476:
war u.n. refugee united$nation international defend support kill u.s. week flee allow chief board cost remain desperate innocent humanitarian conference

Rank in CIDE = 3, rank in WikiCow = 151:
competition win team skate compete figure stand finish performance gold watch champion hard-on score skater talk meter lose fight gold$medal

Page 56: Indexing and Searching Timed Media: the Role of Mutual Information Models

Topics through latent Dirichlet allocation

Results are quite different (these are with 1000 topics).

CIDE distance measure:

Rank in CIDE = 4, rank in WikiCow = 858:
set move show pass finish mind send-up expect outcome address defence person responsibility wish independent system woman salute term pres

Rank in CIDE = 5, rank in WikiCow = 478:
character play think kind film actor mean mind scene act guy script role good$morning real read nice wonderful stuff interesting

Rank in CIDE = 6, rank in WikiCow = 892:
warren purpose rick hunter week drive hundred influence allow poor peace night leader train sign walk training goal team rally

Page 57: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS and Ontology Enhancement

Page 58: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS (Yet Another Knowledge Source)

What is YAKS?
- Like the COW model, it measures how much more (or less) frequently words occur in certain environments than would be expected by chance.
- Unlike the COW model, it does not measure co-occurrence within a fixed window size; it measures how often words stand in one of the syntactic relations tracked by the model.

Example: Subject(eat, dog) would be an entry in the YAKS table recording how many times the verb eat occurs in the corpus with dog as its subject.

Page 59: Indexing and Searching Timed Media: the Role of Mutual Information Models

Syntax and Semantics: YAKS

YAKS infrastructure:
- Corpus: the New York Times corpus (six years), all sentences under 25 words (~60% of sentences, ~50% of words); about 160 million words.
- Good parsing is needed. We used the Xerox Linguistics Environment (XLE) parser from PARC, based on lexical-functional grammar.

Page 60: Indexing and Searching Timed Media: the Role of Mutual Information Models

Syntax and Semantics: YAKS

Sample YAKS relations (YAKS tracks 13 different relations):

SUBJ(V, X): X is subject of verb V. Example: “Dan ate soup”: SUBJ(eat, Dan)

OBJ(V, X): X is direct object of verb V. Example: “Dan ate soup”: OBJ(eat, soup)

COMP(V, X): verb V and complementizer X. Example: “Dan believes that soup is good”: COMP(believe, that)

POBJ(P, X): preposition P and object X. Example: “Eli sat on the ground”: POBJ(on, the ground)

CONJ(C, X, Y): coordinating conjunction C and two conjuncts (either nouns or verbs) X and Y. Example: “Dan ate bread and soup”: CONJ(and, bread, soup)

For ontology extension, we use only the OBJ relation.

Page 61: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

[Figure: seed items ‘hat’, ‘shirt’, ‘dress’, and ‘sock’, items from one neighborhood in the ontology]

Page 62: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

Which verbs are most associated with these nouns, in a large corpus?

[Figure: the seed nouns ‘hat’, ‘shirt’, ‘dress’, and ‘sock’ linked to the verbs ‘wear’, ‘wash’, ‘iron’, and ‘take off’]

Page 63: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

Which other nouns are most associated as objects of those verbs?

[Figure: the verbs ‘wear’, ‘wash’, ‘iron’, and ‘take off’ linked to additional nouns ‘jacket’, ‘sweater’, ‘blouse’, ‘pants’, ‘clothes’, and ‘pajamas’: new candidates for that area of the ontology]

Page 64: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

The YAKS technique: the noun-verb-noun cycle reflects the selectional restrictions of verbs that are strongly associated with the seed nouns. This lets us use the statistical properties of the large corpus to find nouns that are semantically close to the seeds in the ontology. This is why we use OBJ (the object relation) for this YAKS technique. (SUBJ found the same information, but with slightly more noise.)
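A minimal sketch of the noun-verb-noun cycle, assuming a YAKS-style table of OBJ associations keyed by (verb, object); the cutoffs are illustrative.

```python
# From seed nouns, find the verbs most strongly taking them as objects, then find
# other nouns that are strong objects of those verbs.
from collections import defaultdict

def expand_neighborhood(seeds, yaks_obj, top_verbs=5, top_nouns=10):
    # Step 1: verbs most associated with the seed nouns as their objects.
    verb_scores = defaultdict(float)
    for (verb, obj), mi in yaks_obj.items():
        if obj in seeds:
            verb_scores[verb] += mi
    best_verbs = sorted(verb_scores, key=verb_scores.get, reverse=True)[:top_verbs]

    # Step 2: other nouns most associated as objects of those verbs.
    noun_scores = defaultdict(float)
    for (verb, obj), mi in yaks_obj.items():
        if verb in best_verbs and obj not in seeds:
            noun_scores[obj] += mi
    return sorted(noun_scores, key=noun_scores.get, reverse=True)[:top_nouns]

# e.g. expand_neighborhood({"hat", "shirt", "dress", "sock"}, yaks_obj)
# might return candidates like "jacket", "sweater", "blouse", ...
```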

Page 65: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

The YAKS technique finds new items for a neighborhood, but where exactly do they go?

[Figure: a new candidate term X belongs somewhere in this neighborhood of the ontology]

Page 66: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

But where?

[Figure: the candidate term X with question marks at several possible attachment points in the ontology]

Page 67: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

Techniques to precisely locate a candidate term in the ontology:
- Wikipedia check (already described): does the Dog page contain “dog is a mammal”? Does the Mammal page contain “mammal is a dog”?
- Yahoo pattern technique

Page 68: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement

Yahoo check:
1. Use known pattern information to form a Yahoo search. If we suspect that a dog is-a mammal, search Yahoo for “dog is a mammal” and “mammal is a dog”. Also search using other is-a patterns (e.g., from Hearst 1994).
2. Use relative counts (absolute counts are hard to interpret).

Page 69: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

Four levels of is-a: beef, meat, food, substance.

1. Seed pairs were beef-meat, meat-food, and food-substance.
2. YAKS induction was run and siblings were added to produce the candidate pool.
3. The candidate list was then pared using a version of the Wikipedia check:

Page 70: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

Beef, meat, food, substance: the Wikipedia check. Where might candidate term T fit, relative to seed term S?

Count mentions of T on the S page (MT) and mentions of S on the T page (MS), weighting mentions in the first paragraph doubly. Subtract: MT - MS. If T is an S, then MT - MS should be > 0.
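A minimal sketch of that scoring rule; fetching and splitting Wikipedia pages into paragraphs is stubbed out behind an assumed `page_paragraphs` helper, and only the counting and subtraction follow the slide.

```python
# Wikipedia check: MT - MS, with first-paragraph mentions weighted doubly.
def mention_count(paragraphs, term):
    """Count mentions of `term`, weighting the first paragraph doubly."""
    total = 0
    for i, para in enumerate(paragraphs):
        hits = para.lower().count(term.lower())
        total += 2 * hits if i == 0 else hits
    return total

def wikipedia_check(candidate_T, seed_S, page_paragraphs):
    """Return MT - MS; a positive score suggests T is-a S."""
    mt = mention_count(page_paragraphs(seed_S), candidate_T)   # T mentioned on S's page
    ms = mention_count(page_paragraphs(candidate_T), seed_S)   # S mentioned on T's page
    return mt - ms
```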

Page 71: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

Beef, meat, food, substance: results.
- After YAKS induction and expansion via siblings, we had >800 candidate terms to add to the ontology.
- The Wikipedia check pared this to 38 terms, each with its best is-a relationship.
- Of these, on hand inspection, all but two were very high quality.

Page 72: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

YAKS testing: the MILO transportation ontology.

1. In each trial, seed terms were three items from near the bottom of the ontology, linked by is-a; for instance, sedan, car, and vehicle. Selecting terms from the bottom of the ontology decreases the lexicalization problems inherent in MILO content.
2. New candidate terms were induced using the YAKS technique, adding siblings as well.

Page 73: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

3. The candidate list was then pared using a Yahoo is-a pattern check:

A. For a candidate term T, compare it to each of the three seeds: i.e., is it a sedan? Is it a car? Is it a vehicle?

B. Do this via a Yahoo search for Hearst is-a patterns involving T and each of the seeds. This trial used the patterns "Y such as X" and "X and other Y": i.e., search for ". . . sedans such as T . . .", ". . . cars such as T . . .", ". . . vehicles such as T . . .", ". . . T and other sedans . . .", and so on.

Page 74: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

3. Yahoo pattern check, continued:

C. Does the candidate have a significant number of Yahoo hits for one of the seeds with one of the is-a patterns? Is there a significant difference between that highest number of hits and the number of hits with the other seeds? That is, is there a clear indication that the candidate is-a for one particular seed?

D. If not, discard this candidate.

E. If so, then this candidate most likely is-a member of the class named by that seed.
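A minimal sketch of the decision in steps C-E; the web search is stubbed behind an assumed `hit_count` helper, and the naive pluralization, minimum-hit count, and ratio threshold are invented for illustration rather than taken from the talk.

```python
# Pick the seed a candidate most clearly is-a, based on relative Hearst-pattern hit counts.
def is_a_hits(candidate, seed, hit_count):
    """Total hits for Hearst-style patterns linking candidate and seed."""
    patterns = [f'"{seed}s such as {candidate}"',
                f'"{candidate} and other {seed}s"']
    return sum(hit_count(q) for q in patterns)

def best_seed(candidate, seeds, hit_count, min_hits=50, min_ratio=3.0):
    """Return the seed the candidate most clearly is-a, or None if no clear winner."""
    scores = {seed: is_a_hits(candidate, seed, hit_count) for seed in seeds}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_seed, top), (_, second) = ranked[0], ranked[1]
    if top >= min_hits and top >= min_ratio * max(second, 1):
        return top_seed          # e.g. "a truck is a vehicle"
    return None
```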

Page 75: Indexing and Searching Timed Media: the Role of Mutual Information Models

YAKS in ontology enhancement: evaluation

Results for the MILO transportation ontology (Yahoo hits for is-a with each seed):

candidate      sedan    car    vehicle    result
truck              0    532     152000    a truck is a vehicle
van                0     95       5590    a van is a vehicle
automobile         0     63      14700    an automobile is a vehicle
minivan            1     76        561    a minivan is a vehicle
BMW                3    360         66    a BMW is a car
Jeep               0   3600        849    a Jeep is a car
Corolla            6     66          3    a Corolla is a car
convertible        0     25         76    a convertible is a vehicle

Page 76: Indexing and Searching Timed Media: the Role of Mutual Information Models

Conclusions and future work

Page 77: Indexing and Searching Timed Media: the Role of Mutual Information Models

Conclusions and future work

MI models can help compensate for the noisiness in timed media search and retrieval.

MI models can help in building knowledge sources from large text collections.

We’re still looking for better ways to combine handcrafted semantic resources with statistical ones.