Transcript
Page 1: Categorizing Multimedia Documents Using Associated Text

Categorizing Multimedia Documents Using Associated Text

Thesis Proposal

By Carl Sable

Page 2: Categorizing Multimedia Documents Using Associated Text

Indoor vs. Outdoor

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. They are clockwise from the top: Russian President Boris Yeltsin, U.S. President Bill Clinton, French President Jacques Chirac, Canadian Prime Minister Jean Chretien, Italian Prime Minister Romano Prodi, EU President Willem Kok, EC President Jacques Santer, British Prime Minister Tony Blair, Japanese Prime Minister Ryutaro Hashimoto and German Chancellor Helmut Kohl.

Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh. All 89 passengers and crew survived the accident, mostly with minor injuries. Most of the passengers were expatriate Bangladeshis returning home from London.

Page 3: Categorizing Multimedia Documents Using Associated Text

Event Categories

Politics, Struggle, Disaster, Crime, Other

Page 4: Categorizing Multimedia Documents Using Associated Text

Disaster Images

Workers Responding, Affected People, Wreckage, Other

Page 5: Categorizing Multimedia Documents Using Associated Text

Overview of Talk

I. Contributions.

II. Corpora.

III. Two previous systems.

IV. Harder categories.

V. Interactions between categories.

VI. Video.

VII. Schedule.

VIII. Conclusions.

Page 6: Categorizing Multimedia Documents Using Associated Text

Contributions

• Use of text categorization to categorize multimedia documents.

• Introduction of novel techniques.

• Use of NLP to handle tough categories.

• Exploration of interactions between various sets of categories.

Page 7: Categorizing Multimedia Documents Using Associated Text

Manual Categorization Tool

Page 8: Categorizing Multimedia Documents Using Associated Text

Reuters

• Common corpus for comparing methods.

• Over 10,000 articles, 90 topic categories.

• Binary categorization.

[Example topic labels: grain, wheat, corn, barley, oat, sorghum; earn; gold, acq, platinum]

http://www.research.att.com/~lewis/reuters21578.html

Page 9: Categorizing Multimedia Documents Using Associated Text

Lots of Previous Literature

• “Bag of words” with weights.
  – Term frequency (TF).
  – Inverse document frequency (IDF).

• Variety of methods: Rocchio, k-nearest neighbors (KNN), naïve Bayes (NB), support vector machines (SVMs).

• Systems trained with labeled examples.
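
The TF and IDF weights above are typically multiplied into a single term weight. A minimal sketch, assuming raw term frequency and a logarithmic IDF; the function name and toy documents are illustrative, not the proposal's implementation:

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Turn tokenized documents into bag-of-words vectors weighted by TF * IDF."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in documents for term in set(doc))
    # IDF: rarer terms get larger weights; the log keeps the scale manageable.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return vectors

if __name__ == "__main__":
    docs = [["fire", "charred", "decks", "fire"],
            ["fire", "broke", "out"],
            ["leaders", "meeting", "library"]]
    for vector in tf_idf_vectors(docs):
        print(vector)
```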

Page 10: Categorizing Multimedia Documents Using Associated Text

Density Estimation

• Start with advanced Rocchio system.

• For each test document, compute similarity to every category.

• Find all training documents whose category similarity scores are close to the test document’s.

• Use categories of close training documents to predict categories of test documents.
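
A minimal sketch of the steps above. A simple k-nearest-neighbor vote over category-score vectors stands in here for the full density estimate, so the distance function, k, and the vote-based probability are illustrative assumptions:

```python
import math
from collections import Counter

def density_estimate(test_scores, train_scores, train_labels, k=3):
    """Predict a category for a test document from the actual categories of
    training documents whose category-score vectors lie closest to its own."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Rank training documents by how close their score vectors are.
    ranked = sorted(zip(train_scores, train_labels),
                    key=lambda pair: distance(test_scores, pair[0]))
    # Let the k closest training documents vote on the category.
    votes = Counter(label for _, label in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k  # predicted category and a rough confidence

if __name__ == "__main__":
    # Toy score vectors over three categories; values are made up.
    train = [[90, 20, 10], [15, 80, 20], [85, 25, 15]]
    labels = ["Crime", "Struggle", "Crime"]
    print(density_estimate([88, 22, 12], train, labels))  # ('Crime', 0.66...)
```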

Page 11: Categorizing Multimedia Documents Using Associated Text

Example

Category score vectors for training documents
(columns: Struggle, Politics, Disaster, Crime, Other):

85, 35, 25, 95, 20
100, 75, 20, 30, 5
60, 95, 20, 30, 5
90, 25, 50, 110, 25
40, 30, 80, 25, 40
80, 45, 20, 75, 10

Actual categories of the training documents: Crime, Struggle, Disaster, Struggle, Politics, Crime

Category score vector for test document: 100, 40, 30, 90, 10

Distances from the test vector to the training vectors: 20.0, 92.5, 106.4, 27.4, 91.4, 36.7

Predictions: Rocchio: Struggle; DE: Crime (Probability .679)

Page 12: Categorizing Multimedia Documents Using Associated Text

Bin System (AT&T)

• Group words with similar “features” together into a common “bin”.

• Based on training data, empirically estimate a term weight for words in each bin.
  – Smoothing: works well even if there is not enough data for individual words.
  – Doesn’t assume simple relationships between features.

Page 13: Categorizing Multimedia Documents Using Associated Text

Sample Words

Indoor Indicators: “conference”, “bed”

Outdoor Indicators: “airplane”, “earthquake”

Ambiguous: “Gore”, “ceremony”

Page 14: Categorizing Multimedia Documents Using Associated Text

Determine Bins for “airplane”

• Per category bins based on IDF and category counts.

• IDF(“airplane”) = 5.4.

• Examine first half of training data:
  – Appears in 0 indoor documents.
  – Appears in 2 outdoor documents.

Page 15: Categorizing Multimedia Documents Using Associated Text

Lambdas for “airplane”

• Determined at the bin level.

• Examine second half of training data:
  – P(observation | indoor) = 2.11 × 10^-4
  – P(observation | outdoor) = 2.90 × 10^-3

• λ_indoor = log2 P(observation | indoor); λ_outdoor = log2 P(observation | outdoor)

• λ_indoor - λ_outdoor = -3.78
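
A rough sketch tying the two preceding slides together. The exact binning rule is an assumption (here a bin is keyed on rounded IDF plus the per-category counts from the first half of the training data), but the lambda arithmetic reproduces the -3.78 weight reported for “airplane”:

```python
import math

def bin_key(idf, indoor_count, outdoor_count):
    """Hypothetical binning rule: group words by rounded IDF and by how many
    indoor/outdoor documents contained them in the first half of training."""
    return (round(idf), indoor_count, outdoor_count)

def bin_lambdas(p_obs_given_indoor, p_obs_given_outdoor):
    """Bin-level lambdas estimated on the second half of the training data."""
    lambda_indoor = math.log2(p_obs_given_indoor)
    lambda_outdoor = math.log2(p_obs_given_outdoor)
    return lambda_indoor, lambda_outdoor

if __name__ == "__main__":
    # "airplane": IDF 5.4, seen in 0 indoor and 2 outdoor training documents.
    print(bin_key(5.4, indoor_count=0, outdoor_count=2))
    # Probabilities from the slide; the difference is the word's term weight.
    l_in, l_out = bin_lambdas(2.11e-4, 2.90e-3)
    print(round(l_in - l_out, 2))  # -3.78
```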

Page 16: Categorizing Multimedia Documents Using Associated Text

Sample Words with Scores

Indoor Indicators: “conference” +5.91, “bed” +4.58

Outdoor Indicators: “airplane” -3.78, “earthquake” -4.86

Ambiguous: “Gore” +0.74, “ceremony” -0.32

Page 17: Categorizing Multimedia Documents Using Associated Text

Results

• Both systems did OK on Reuters.

• DE performed best for Indoor vs. Outdoor.

• Bin system performed best for Events.

Page 18: Categorizing Multimedia Documents Using Associated Text

Standard Evaluation Metrics (1)

• Per category measures:
  – Simple accuracy or error measures are misleading for binary categorization.
  – Precision and recall.
  – F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs. micro-averaging.
  – Macro treats all categories equal, micro treats all documents equal.
  – Macro usually lower since small categories are hard.

Contingency table:

                 Yes is correct   No is correct
  Assigned YES         a                b
  Assigned NO          c                d

p = a / (a + b)

r = a / (a + c)

F1 = 2 * p * r / (p + r)
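
A minimal sketch of these measures. The contingency counts in the example are invented to show why macro-averaged F1 is usually lower than micro-averaged F1 when a small category is hard:

```python
def prf(a, b, c):
    """Precision, recall and F1 from one category's contingency table:
    a = assigned YES and correct, b = assigned YES but wrong,
    c = assigned NO although YES was correct."""
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro_f1(tables):
    """tables: one (a, b, c) triple per category.
    Macro-F1 averages per-category F1 (all categories count equally);
    micro-F1 pools the counts first (all documents count equally)."""
    macro = sum(prf(*t)[2] for t in tables) / len(tables)
    pooled = tuple(sum(t[i] for t in tables) for i in range(3))
    micro = prf(*pooled)[2]
    return macro, micro

if __name__ == "__main__":
    # One big, easy category and one small, hard category.
    print(macro_micro_f1([(90, 10, 10), (1, 4, 5)]))  # macro ~0.54, micro ~0.86
```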

Page 19: Categorizing Multimedia Documents Using Associated Text

Results for Reuters

Micro-F1:

[Bar chart, axis roughly 0.70 to 0.90, for SVM, KNN, LSF, NNet, NB, Rocchio, Columbia, Combo, and Bin.]

Macro-F1:

[Bar chart, axis roughly 0.35 to 0.55, for SVM, KNN, LSF, NNet, NB, Rocchio, Columbia, Combo, and Bin.]

Page 20: Categorizing Multimedia Documents Using Associated Text

Standard Evaluation Metrics (2)

• Mutually exclusive categories:
  – Each test document has only one correct label.
  – Each test document assigned only one label.

• Performance measured by overall accuracy:

Accuracy = (# correct predictions) / (# total predictions)

Page 21: Categorizing Multimedia Documents Using Associated Text

Results for Indoor vs. Outdoor

[Bar chart: accuracy, axis roughly 80% to 87%, for Bin, Columbia, Rocchio, SVM, and KNN.]

• Columbia system using density estimation shows best performance.

• Even beats SVMs.

• System using bins very respectable.

Page 22: Categorizing Multimedia Documents Using Associated Text

Results for Event Categories

[Bar chart: accuracy, axis roughly 82% to 89%, for Bin, Columbia, Rocchio, and KNN.]

• System using bins shows best performance.

• Columbia system respectable.

Page 23: Categorizing Multimedia Documents Using Associated Text

Improving Bin Method

• Experiment with more advanced binning rules.

• Fall back to single word term weights for frequent words.
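
One way the fallback might look; the frequency threshold and the weights are illustrative assumptions, not the proposal's actual rule:

```python
def term_weight(word, word_counts, single_word_weights, bin_weights, bin_of,
                min_count=10):
    """For frequent words, fall back to the word's own empirically estimated
    weight; for rarer words, keep the smoothed bin-level weight."""
    if word_counts.get(word, 0) >= min_count:
        return single_word_weights[word]
    return bin_weights[bin_of(word)]

if __name__ == "__main__":
    counts = {"conference": 250, "sorghum": 3}      # training occurrences
    single = {"conference": 5.91}                   # per-word weights
    bins = {"rare-words-bin": -1.2}                 # bin-level weights
    which_bin = lambda word: "rare-words-bin"       # toy bin assignment
    print(term_weight("conference", counts, single, bins, which_bin))  # 5.91
    print(term_weight("sorghum", counts, single, bins, which_bin))     # -1.2
```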

Page 24: Categorizing Multimedia Documents Using Associated Text

Why are Disaster Images Hard?

• Small corpus (124 training images, 124 test images).

• Most words not important.

• Important words associated with test images have likely never occurred in the training set.

Page 25: Categorizing Multimedia Documents Using Associated Text

Approach

• Extract important information (e.g. subjects and verbs).

• Compare test words to training words.

• Previously seen words indicate strong evidence.

• Consider large, unsupervised corpus of subject/verb pairs.

• Add evidence to categories for any verb or subject ever paired with new subject or verb.
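
A rough sketch of the evidence-expansion idea above. The scoring weights and the way the unsupervised subject/verb pairs bridge unseen words are illustrative assumptions:

```python
from collections import Counter, defaultdict

def category_evidence(test_subject, test_verb,
                      training_pairs, unsupervised_pairs):
    """Accumulate evidence for each category from one test caption.

    training_pairs: (subject, verb, category) triples from labeled captions.
    unsupervised_pairs: (subject, verb) pairs from a large unlabeled corpus,
        used to bridge words never seen in the small training set."""
    evidence = Counter()

    # Direct evidence: the test subject or verb was seen in training.
    for subj, verb, cat in training_pairs:
        if subj == test_subject or verb == test_verb:
            evidence[cat] += 2.0

    # Indirect evidence: words the unsupervised corpus ever paired with the
    # test words point to categories those partner words appeared with.
    partners_of = defaultdict(set)
    for subj, verb in unsupervised_pairs:
        partners_of[subj].add(verb)
        partners_of[verb].add(subj)

    related = partners_of[test_subject] | partners_of[test_verb]
    for subj, verb, cat in training_pairs:
        if subj in related or verb in related:
            evidence[cat] += 1.0

    return evidence

if __name__ == "__main__":
    training = [("rescue workers", "remove", "Workers Responding"),
                ("fuselage", "can be seen", "Wreckage"),
                ("Moslems", "pray", "Affected People")]
    unsupervised = [("firefighters", "remove"), ("firefighters", "douse")]
    # "firefighters"/"douse" never occur in training, but the unsupervised
    # corpus pairs "firefighters" with "remove", which training links to
    # Workers Responding.
    print(category_evidence("firefighters", "douse", training, unsupervised))
```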

Page 26: Categorizing Multimedia Documents Using Associated Text

Subjects and Verbs

Subject  Verb  Category

“two blocks” “have suffered” Wreckage

“aviation inspectors” “search” Workers Responding

“Moslems” “pray” Affected People

“rescue workers” “remove” Workers Responding

“this house” “is underwater” Wreckage

“passenger jet” “crashed” Other

“volunteers” “help clear” Workers Responding

“fuselage” “can be seen” Wreckage

“President Clinton” “hugs” Affected People

“officer” “tries on” Workers Responding

Page 27: Categorizing Multimedia Documents Using Associated Text

First Try

• Very simple subject/verb extraction.

• Rather small unsupervised corpus.

• Very simple similarity metric.

System Results

Rocchio 57.3%

Density Estimation 58.9%

First Attempt 59.7% (67.9% when there is prior evidence)

Page 28: Categorizing Multimedia Documents Using Associated Text

Results of Simple Extraction

Subject  Verb  Category

“blocks” “suffered” Wreckage

“aviation” “inspectors” Workers Responding

“moslems” “pray” Affected People

“rescue” “remove” Workers Responding

“house” “is” Wreckage

“passenger” “shown” Other

“volunteers” “including” Workers Responding

“fuselage” “be” Wreckage

“” “hugs” Affected People

“” “officer” Workers Responding

Page 29: Categorizing Multimedia Documents Using Associated Text

Doing It Right

• Better subject/verb extraction with parser.

• Much larger unsupervised corpus.

• Better similarity metric.

• Maybe stemming.

Page 30: Categorizing Multimedia Documents Using Associated Text

Previous Interactions

• Combining Pcut and DE for Reuters.

• Combining text system and image feature system for Indoor vs. Outdoor.

• Using number of people information to improve Indoor vs. Outdoor probabilities.

Page 31: Categorizing Multimedia Documents Using Associated Text

Those Results

Pcut + DE for Reuters:

                 Micro-average F1   Macro-average F1
  Pcut                71.0%              50.1%
  DE                  83.0%              40.5%
  Combo               82.2%              51.2%

Text + Image for In/Out:

                 Accuracy
  Image Only      82.4%
  Text Only       83.3%
  Combo           86.2%

Adding # of People for In/Out:

                        Standard Evaluation   Alternate Evaluation
  Without # of people         79.9%                 75.0%
  With # of people            80.2%                 77.2%

Page 32: Categorizing Multimedia Documents Using Associated Text

Future Interactions

• Improve accuracy.

• Determine new information.

• Likely use density estimation or bins.

Page 34: Categorizing Multimedia Documents Using Associated Text

Closed Captioned Video

• Apply image system to video.

• New categories.

• New modality, new challenges.

Sample closed-caption text:

669 726 "HEADLINE NEWS" -- I'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.

750 878 THE FIRE HAS BEEN PUT OUT, BUT SO HAVE THE HOPES OF THOUSANDS OF PEOPLE COUNTING ON A VACATION TO MEXICO THIS WEEK.

930 1087 THE CARNIVAL CRUISE LINER "ECSTACY" WAS ONLY TWO MILES INTO ITS TRIP FROM MIAMI TO COZUMEL WHEN A FIRE BROKE OUT IN A LAUNDRY ROOM.

1134 1291 THE COAST GUARD CAME TO THE RESCUE AND DOUSED THE FLAMES AS HUNDREDS OF PASSENGERS DONNING LIFE JACKETS WATCHED ON.

1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.

...

Page 35: Categorizing Multimedia Documents Using Associated Text

Schedule

Page 36: Categorizing Multimedia Documents Using Associated Text

Conclusions

• Categorization of multimedia data.

• Applying novel techniques.

• Using NLP for hard categories.

• Exploring interactions between systems and categories.

