
Categorizing Multimedia Documents Using Associated Text

Thesis Proposal

By Carl Sable

Indoor vs. Outdoor

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. They are clockwise from the top: Russian President Boris Yeltsin, U.S. President Bill Clinton, French President Jacques Chirac, Canadian Prime Minister Jean Chretien, Italian Prime Minister Romano Prodi, EU President Willem Kok, EC President Jacques Santer, British Prime Minister Tony Blair, Japanese Prime Minister Ryutaro Hashimoto and German Chancellor Helmut Kohl.

Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh. All 89 passengers and crew survived the accident, mostly with minor injuries. Most of the passengers were expatriate Bangladeshis returning home from London.

Event Categories

Politics, Struggle, Disaster, Crime, Other

Disaster Images

Workers Responding, Affected People, Wreckage, Other

Overview of Talk

I. Contributions.

II. Corpora.

III. Two previous systems.

IV. Harder categories.

V. Interactions between categories.

VI. Video.

VII. Schedule.

VIII. Conclusions.

Contributions

• Use of text categorization to categorize multimedia documents.

• Introduction of novel techniques.

• Use of NLP to handle tough categories.

• Exploration of interactions between various sets of categories.

Manual Categorization Tool

Reuters

• Common corpus for comparing methods.

• Over 10,000 articles, 90 topic categories.

• Binary categorization.

[Figure: sample Reuters categories and document counts, e.g. earn, acq, gold, platinum, and grain with subtopics wheat, corn, barley, oat, and sorghum.]

http://www.research.att.com/~lewis/reuters21578.html

Lots of Previous Literature

• “Bag of words” with weights:
  – Term frequency (TF).
  – Inverse document frequency (IDF).

• Variety of methods: Rocchio, k-nearest neighbors (KNN), naïve Bayes (NB), support vector machines (SVMs).

• Systems trained with labeled examples.
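A minimal sketch of the TF-IDF weighting idea mentioned above (not the exact weighting used by these systems; the log base and names are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weighted bag-of-words vectors for tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        # Rare words (high IDF) get boosted; ubiquitous words go to zero.
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [["plane", "crashed", "near", "town"],
        ["leaders", "met", "in", "library"],
        ["plane", "fire", "near", "coast"]]
print(tfidf_vectors(docs)[0])   # e.g. {'plane': 0.405..., 'crashed': 1.098..., ...}
```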

Density Estimation

• Start with advanced Rocchio system.

• For each test document, compute similarity to every category.

• Find all documents from training set with similar category similarities.

• Use categories of close training documents to predict categories of test documents.
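A minimal sketch of the density estimation idea, assuming Euclidean distance in category-score space and an unweighted vote over the k closest training documents (the actual system's distance measure and probability estimate may differ):

```python
import math
from collections import Counter

def de_predict(test_scores, train_scores, train_labels, k=3):
    """Predict a category for a test document from the true labels of
    training documents whose category-score vectors lie closest to its own."""
    neighbors = sorted(
        (math.dist(test_scores, scores), label)
        for scores, label in zip(train_scores, train_labels)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    best, count = votes.most_common(1)[0]
    return best, count / k              # category plus a crude probability
```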

Example

Category score vectors over the five event categories (Struggle, Politics, Disaster, Crime, Other) for six training documents, with each document's distance to the test document and its actual category:

  85,  35, 25,  95, 20   distance  20.0   (Crime)
 100,  75, 20,  30,  5   distance  92.5   (Struggle)
  60,  95, 20,  30,  5   distance  91.4   (Politics)
  90,  25, 50, 110, 25   distance  36.7   (Crime)
  40,  30, 80,  25, 40   distance 106.4   (Disaster)
  80,  45, 20,  75, 10   distance  27.4   (Struggle)

Category score vector for the test document: 100, 40, 30, 90, 10.

Predictions:

Rocchio: Struggle

DE: Crime (probability .679)
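These distances are consistent with plain Euclidean distance between score vectors; a quick check (the vector-to-label pairing follows the reconstruction above and is partly an assumption):

```python
import math

test = [100, 40, 30, 90, 10]
training = [
    ([85, 35, 25, 95, 20], "Crime"),
    ([100, 75, 20, 30, 5], "Struggle"),
    ([60, 95, 20, 30, 5], "Politics"),
    ([90, 25, 50, 110, 25], "Crime"),
    ([40, 30, 80, 25, 40], "Disaster"),
    ([80, 45, 20, 75, 10], "Struggle"),
]
for scores, label in training:
    # Five of the six distances shown above match this computation exactly.
    print(f"{math.dist(test, scores):6.1f}  {label}")
```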

Bin System (AT&T)

• Group words with similar “features” together into a common “bin”.

• Based on training data, empirically estimate a term weight for words in each bin.
  – Smoothing: works well even if there is not enough data for individual words.
  – Doesn't assume simple relationships between features.
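A minimal sketch of the binning step, assuming bins are keyed by rounded IDF together with per-category occurrence counts from the first half of the training data (as the next slides suggest); the exact binning rules are not given here, and the counts below are illustrative:

```python
import math
from collections import defaultdict

def assign_bins(word_stats, num_docs):
    """Group words into bins keyed by (rounded IDF, indoor count, outdoor count).
    word_stats maps word -> (document frequency, # indoor docs, # outdoor docs),
    all counted over the first half of the training data."""
    bins = defaultdict(list)
    for word, (df, indoor, outdoor) in word_stats.items():
        idf = round(math.log2(num_docs / df))   # log base is an assumption
        bins[(idf, indoor, outdoor)].append(word)
    return bins

stats = {"airplane": (24, 0, 2), "conference": (30, 5, 0)}
print(assign_bins(stats, 1000))
# -> {(5, 0, 2): ['airplane'], (5, 5, 0): ['conference']}
```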

Sample Words

Indoor Indicators: “conference”, “bed”

Outdoor Indicators: “airplane”, “earthquake”

Ambiguous: “Gore”, “ceremony”

Determine Bins for “airplane”

• Per category bins based on IDF and category counts.

• IDF(“airplane”) = 5.4.

• Examine first half of training data:
  – Appears in 0 indoor documents.
  – Appears in 2 outdoor documents.

Lambdas for “airplane”

• Determined at the bin level.

• Examine second half of training data:

  P(observation | indoor) = 2.11 × 10^-4
  P(observation | outdoor) = 2.90 × 10^-3

• λ_indoor = log2 P(observation | indoor); λ_outdoor = log2 P(observation | outdoor)

• λ = λ_indoor - λ_outdoor = -3.78
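The λ computation for the bin containing “airplane”, reproducing the slide's numbers:

```python
import math

# Probabilities estimated from the second half of the training data
# (values taken from the slide).
p_indoor = 2.11e-4    # P(observation | indoor)
p_outdoor = 2.90e-3   # P(observation | outdoor)

lam = math.log2(p_indoor) - math.log2(p_outdoor)
print(round(lam, 2))  # -3.78: negative lambda marks an outdoor indicator
```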

Sample Words with Scores

Indoor Indicators: “conference” +5.91, “bed” +4.58

Outdoor Indicators: “airplane” -3.78, “earthquake” -4.86

Ambiguous: “Gore” +0.74, “ceremony” -0.32

Results

• Both systems did OK on Reuters.

• DE performed best for Indoor vs. Outdoor.

• Bin system performed best for Events.

Standard Evaluation Metrics (1)

• Per-category measures:
  – Simple accuracy or error measures are misleading for binary categorization.
  – Precision and recall.
  – F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs. micro-averaging:
  – Macro treats all categories equally; micro treats all documents equally.
  – Macro is usually lower since small categories are hard.

Contingency table:

              Yes is correct   No is correct
Assigned YES        a                b
Assigned NO         c                d

p = a / (a + b)

r = a / (a + c)

F1 = 2 * p * r / (p + r)
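The per-category measures from the contingency table, as a small helper:

```python
def precision_recall_f1(a, b, c):
    """Precision, recall, and F1 from contingency counts:
    a = assigned YES, YES correct;  b = assigned YES, NO correct;
    c = assigned NO, YES correct."""
    p = a / (a + b)
    r = a / (a + c)
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f1(a=8, b=2, c=4))   # (0.8, 0.666..., 0.727...)
```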

Results for Reuters

Micro-F1: [bar chart, axis 0.70–0.90, comparing SVM, KNN, LSF, NNet, NB, Rocchio, Columbia Combo, and Bin.]

Macro-F1: [bar chart, axis 0.35–0.55, comparing the same systems.]

Standard Evaluation Metrics (2)

• Mutually exclusive categories:
  – Each test document has only one correct label.
  – Each test document is assigned only one label.

• Performance measured by overall accuracy:

Accuracy = # correct predictions / # total predictions
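For mutually exclusive categories this reduces to a one-liner:

```python
def accuracy(predicted, actual):
    """Fraction of documents whose single predicted label is correct."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(accuracy(["Indoor", "Outdoor", "Indoor"],
               ["Indoor", "Outdoor", "Outdoor"]))   # 0.666...
```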

Results for Indoor vs. Outdoor

[Bar chart: accuracy, axis 80.0%–87.0%, for Bin, Columbia, Rocchio, SVM, and KNN.]

• Columbia system using density estimation shows best performance.

• Even beats SVMs.

• System using bins is very respectable.

Results for Event Categories

[Bar chart: accuracy, axis 82.0%–89.0%, for Bin, Columbia, Rocchio, and KNN.]

• System using bins shows best performance.

• Columbia system respectable.

Improving Bin Method

• Experiment with more advanced binning rules.

• Fall back to single word term weights for frequent words.

Why are Disaster Images Hard?

• Small corpus (124 training images, 124 test images).

• Most words not important.

• Important words associated with test images have likely never occurred in the training set.

Approach

• Extract important information (e.g., subjects and verbs).

• Compare test words to training words.

• Previously seen words provide strong evidence.

• Consult a large, unsupervised corpus of subject/verb pairs.

• Add evidence to categories for any verb or subject ever paired with a new subject or verb (see the sketch after this list).
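A minimal sketch of this evidence-spreading idea. Everything here is hypothetical: the 0.5 weight, set-based bookkeeping, and exact-match lookup stand in for whatever similarity metric the system actually uses:

```python
from collections import defaultdict

def category_evidence(test_pair, train_pairs, unsup_pairs):
    """Score categories for one test (subject, verb) pair: a subject or verb
    seen in training counts directly; otherwise, credit the categories of any
    training word the new word was ever paired with in the unsupervised corpus."""
    subj_cats, verb_cats = defaultdict(set), defaultdict(set)
    for (s, v), cat in train_pairs:
        subj_cats[s].add(cat)
        verb_cats[v].add(cat)

    s, v = test_pair
    scores = defaultdict(float)
    for cat in subj_cats[s] | verb_cats[v]:   # strong, direct evidence
        scores[cat] += 1.0
    for us, uv in unsup_pairs:                # weaker, indirect evidence
        if us == s:
            for cat in verb_cats[uv]:
                scores[cat] += 0.5            # 0.5 is an arbitrary weight
        if uv == v:
            for cat in subj_cats[us]:
                scores[cat] += 0.5
    return dict(scores)

train = [(("rescue workers", "remove"), "Workers Responding"),
         (("fuselage", "can be seen"), "Wreckage")]
unsup = [("medics", "remove"), ("medics", "treat")]
print(category_evidence(("medics", "treat"), train, unsup))
# -> {'Workers Responding': 0.5}: "medics" is new, but it pairs with "remove"
```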

Subjects and Verbs

Subject / Verb / Category:

“two blocks” “have suffered” Wreckage

“aviation inspectors” “search” Workers Responding

“Moslems” “pray” Affected People

“rescue workers” “remove” Workers Responding

“this house” “is underwater” Wreckage

“passenger jet” “crashed” Other

“volunteers” “help clear” Workers Responding

“fuselage” “can be seen” Wreckage

“President Clinton” “hugs” Affected People

“officer” “tries on” Workers Responding

First Try

• Very simple subject/verb extraction.

• Rather small unsupervised corpus.

• Very simple similarity metric.
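The slides do not spell out the extraction rule; one plausible “very simple” heuristic, entirely hypothetical but producing the same kinds of errors the later slide shows:

```python
import re

AUXILIARIES = {"is", "are", "was", "were", "be", "been", "has", "have", "can"}

def naive_subject_verb(caption):
    """Crude heuristic: the first word is the subject; the first auxiliary
    or word ending in -s/-ed after it is the verb."""
    words = re.findall(r"[a-z]+", caption.lower())
    subject = words[0] if words else ""
    verb = next((w for w in words[1:]
                 if w in AUXILIARIES or re.search(r"(ed|s)$", w)), "")
    return subject, verb

print(naive_subject_verb("Rescue workers remove debris from the site."))
# -> ('rescue', 'workers'): multiword subjects get mangled,
#    much like the "Results of Simple Extraction" slide below
```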

System Results

Rocchio 57.3%

Density Estimation 58.9%

First Attempt 59.7% (67.9% when prior evidence exists)

Results of Simple Extraction

Subject / Verb / Category:

“blocks” “suffered” Wreckage

“aviation” “inspectors” Workers Responding

“moslems” “pray” Affected People

“rescue” “remove” Workers Responding

“house” “is” Wreckage

“passenger” “shown” Other

“volunteers” “including” Workers Responding

“fuselage” “be” Wreckage

“” “hugs” Affected People

“” “officer” Workers Responding

Doing It Right

• Better subject/verb extraction with parser.

• Much larger unsupervised corpus.

• Better similarity metric.

• Maybe stemming.

Previous Interactions

• Combining Pcut and DE for Reuters.

• Combining text system and image feature system for Indoor vs. Outdoor.

• Using number-of-people information to improve Indoor vs. Outdoor probabilities.
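A minimal sketch of one way to combine two systems' per-category probabilities; the slides do not specify the combination rule, so linear interpolation with a tunable weight is purely illustrative:

```python
def combine(p_text, p_image, w=0.6):
    """Interpolate two systems' category probabilities; w weights the text system."""
    cats = p_text.keys() | p_image.keys()
    return {c: w * p_text.get(c, 0.0) + (1 - w) * p_image.get(c, 0.0)
            for c in cats}

p_text = {"Indoor": 0.70, "Outdoor": 0.30}
p_image = {"Indoor": 0.40, "Outdoor": 0.60}
combined = combine(p_text, p_image)
print(max(combined, key=combined.get))   # 'Indoor' (0.58 vs. 0.42)
```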

Those Results

Pcut + DE for Reuters:

           Micro-average F1   Macro-average F1
Pcut            71.0%              50.1%
DE              83.0%              40.5%
Combo           82.2%              51.2%

Text + Image for In/Out:

             Accuracy
Image only    82.4%
Text only     83.3%
Combo         86.2%

Adding # of People for In/Out:

                       Standard evaluation   Alternate evaluation
Without # of people          79.9%                  75.0%
With # of people             80.2%                  77.2%

Future Interactions

• Improve accuracy.

• Determine new information.

• Likely use density estimation or bins.

Closed Captioned Video

• Apply image system to video.

• New categories.

• New modality, new challenges.

Sample closed captions:

669 726 "HEADLINE NEWS" -- I'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.

750 878 THE FIRE HAS BEEN PUT OUT, BUT SO HAVE THE HOPES OF THOUSANDS OF PEOPLE COUNTING ON A VACATION TO MEXICO THIS WEEK.

930 1087 THE CARNIVAL CRUISE LINER "ECSTACY" WAS ONLY TWO MILES INTO ITS TRIP FROM MIAMI TO COZUMEL WHEN A FIRE BROKE OUT IN A LAUNDRY ROOM.

1134 1291 THE COAST GUARD CAME TO THE RESCUE AND DOUSED THE FLAMES AS HUNDREDS OF PASSENGERS DONNING LIFE JACKETS WATCHED ON.

1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.

...
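Each caption line pairs two numbers with the caption text; reading them as start and end frames (or timecodes) is an assumption. A small parser sketch:

```python
import re

def parse_caption(line):
    """Split a caption line into (start, end, text); returns None if the line
    doesn't match the '<number> <number> TEXT' pattern shown above."""
    m = re.match(r"\s*(\d+)\s+(\d+)\s+(.+)", line)
    return (int(m.group(1)), int(m.group(2)), m.group(3)) if m else None

print(parse_caption('1331 1388 THE FIRE CHARRED THREE LOWER DECKS '
                    'BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.'))
```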

Schedule

Conclusions

• Categorization of multimedia data.

• Applying novel techniques.

• Using NLP for hard categories.

• Exploring interactions between systems and categories.