Categorizing Multimedia Documents Using Associated Text Thesis Proposal By Carl Sable

Post on 12-Jan-2016


  • Categorizing Multimedia Documents Using Associated Text

    Thesis Proposal
    By Carl Sable

  • Indoor vs. Outdoor

    Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21. They are clockwise from the top: Russian President Boris Yeltsin, U.S. President Bill Clinton, French President Jacques Chirac, Canadian Prime Minister Jean Chretien, Italian Prime Minister Romano Prodi, EU President Willem Kok, EC President Jacques Santer, British Prime Minister Tony Blair, Japanese Prime Minister Ryutaro Hashimoto and German Chancellor Helmut Kohl.

    Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh. All 89 passengers and crew survived the accident, mostly with minor injuries. Most of the passengers were expatriate Bangladeshis returning home from London.

  • Event Categories

    Politics
    Struggle
    Disaster
    Crime
    Other

  • Disaster Images

    Workers Responding
    Affected People
    Wreckage
    Other

  • Overview of Talk

    Contributions.
    Corpora.
    Two previous systems.
    Harder categories.
    Interactions between categories.
    Video.
    Schedule.
    Conclusions.

  • Contributions

    Use of text categorization to categorize multimedia documents.
    Introduction of novel techniques.
    Use of NLP to handle tough categories.
    Exploration of interactions between various sets of categories.

  • Manual Categorization Tool

  • Reuters

    Common corpus for comparing methods.
    Over 10,000 articles, 90 topic categories.
    Binary categorization.

    Article  Topics
    5        grain, wheat, corn, barley, oat, sorghum
    9        earn
    448      gold, acq, platinum

    http://www.research.att.com/~lewis/reuters21578.html

  • Lots of Previous Literature

    Bag of words with weights.
    Term frequency (TF).
    Inverse document frequency (IDF).
    Variety of methods: Rocchio, k-nearest neighbors (KNN), naive Bayes (NB), support vector machines (SVMs).
    Systems trained with labeled examples.
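The bag-of-words weighting and a Rocchio-style classifier can be sketched as follows. This is a minimal illustration, not the system described in the talk: the IDF variant (natural log of N/df), the centroid-only form of Rocchio (no negative-example term), the toy captions, and all function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def idf_table(docs):
    """IDF per word, assuming the variant log(N / document frequency)."""
    n = len(docs)
    df = Counter(word for tokens in docs for word in set(tokens))
    return {word: math.log(n / count) for word, count in df.items()}

def tfidf(tokens, idf):
    """Bag-of-words vector with TF*IDF weights; unknown words are dropped."""
    return {w: tf * idf[w] for w, tf in Counter(tokens).items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroids(labeled_docs, idf):
    """Average the TF*IDF vectors of each category's training documents."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(label for label, _ in labeled_docs)
    for label, tokens in labeled_docs:
        for w, x in tfidf(tokens, idf).items():
            sums[label][w] += x
    return {label: {w: x / sizes[label] for w, x in vec.items()}
            for label, vec in sums.items()}

def classify(tokens, cents, idf):
    """Pick the category whose centroid is most similar to the document."""
    vec = tfidf(tokens, idf)
    return max(cents, key=lambda label: cosine(vec, cents[label]))

# Hypothetical toy captions, invented for illustration only.
TRAIN = [
    ("indoor", ["leaders", "meet", "library"]),
    ("indoor", ["press", "conference", "room"]),
    ("outdoor", ["villagers", "jet", "crash"]),
    ("outdoor", ["earthquake", "wreckage", "street"]),
]
IDF = idf_table([tokens for _, tokens in TRAIN])
CENTROIDS = centroids(TRAIN, IDF)
PREDICTION = classify(["leaders", "press", "conference"], CENTROIDS, IDF)
print(PREDICTION)
```

The per-category similarities computed by `cosine` are the kind of category scores that a later stage could consume as a vector.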

  • Density Estimation

    Start with advanced Rocchio system.
    For each test document, compute similarity to every category.
    Find all documents from training set with similar category similarities.
    Use categories of close training documents to predict categories of test documents.

  • Example

    Category score vectors for training documents, with their distances to the test document and actual categories:

    Struggle  Politics  Disaster  Crime  Other   Distance  Actual category
    85        35        25        95     20      20.0      Crime
    100       75        20        30     5       92.5      Struggle
    40        30        80        25     40      106.4     Disaster
    80        45        20        75     10      27.4      Struggle
    60        95        20        30     5       91.4      Politics
    90        25        50        110    25      36.7      Crime

    Category score vector for test document:

    Struggle  Politics  Disaster  Crime  Other
    100       40        30        90     10

    Predictions:
    Rocchio: Struggle
    DE: Crime (probability .679)
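The density estimation idea can be sketched on the example's numbers. This is a hedged illustration under several assumptions: the pairing of each training vector with an actual category is inferred from the slide layout, distance is taken to be Euclidean, and neighbors vote with weight 1/distance. The slides do not state the weighting actually used, so the probability printed here is not expected to match the .679 shown on the slide.

```python
import math
from collections import defaultdict

CATEGORIES = ["Struggle", "Politics", "Disaster", "Crime", "Other"]

# (category score vector, actual category); the pairing is an assumption.
TRAINING = [
    ((85, 35, 25, 95, 20), "Crime"),
    ((100, 75, 20, 30, 5), "Struggle"),
    ((40, 30, 80, 25, 40), "Disaster"),
    ((80, 45, 20, 75, 10), "Struggle"),
    ((60, 95, 20, 30, 5), "Politics"),
    ((90, 25, 50, 110, 25), "Crime"),
]
TEST = (100, 40, 30, 90, 10)

def rocchio_prediction(scores):
    """Baseline: pick the category with the highest similarity score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best]

def density_estimate(test, training, eps=1e-9):
    """Distance-weighted vote over training documents in score space.

    The 1/distance weighting is an assumed stand-in for the real scheme.
    """
    votes = defaultdict(float)
    for vector, category in training:
        votes[category] += 1.0 / (math.dist(test, vector) + eps)
    total = sum(votes.values())
    return {category: weight / total for category, weight in votes.items()}

ROCCHIO = rocchio_prediction(TEST)
PROBS = density_estimate(TEST, TRAINING)
DE = max(PROBS, key=PROBS.get)
print("Rocchio:", ROCCHIO)
print("DE:", DE, round(PROBS[DE], 3))
```

Under these assumptions the two predictions disagree in the same way as on the slide: the raw scores favor Struggle, while the nearest training documents in score space are mostly Crime.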

  • Bin System (AT&T)

    Group words with similar features together into a common bin.
    Based on training data, empirically estimate a term weight for words in each bin.
    Smoothing: works well even if there is not enough data for individual words.
    Doesn't assume simple relationships between features.

  • Sample Words

    Indoor Indicators: conference, bed
    Outdoor Indicators: airplane, earthquake
    Ambiguous: Gore, ceremony

  • Determine Bins for "airplane"

    Per category bins based on IDF and category counts.
    IDF(airplane) = 5.4.
    Examine first half of training data:
    Appears in 0 indoor documents.
    Appears in 2 outdoor documents.

  • Lambdas for "airplane"

    Determined at the bin level.
    Examine second half of training data:
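One plausible reading of the two-stage procedure is sketched below. The slides do not give the binning rule or the lambda formula, so everything here is an assumption: the bin key (rounded IDF plus capped per-category document counts from the first half), the smoothed log-odds estimate from the second half, the sign convention (positive favors indoor), and all names and toy documents.

```python
import math
from collections import defaultdict

def doc_counts(docs):
    """Map word -> {category: number of documents containing the word}."""
    counts = defaultdict(lambda: defaultdict(int))
    for category, tokens in docs:
        for word in set(tokens):
            counts[word][category] += 1
    return counts

def bin_key(word, idf, first_half_counts, cap=3):
    """Hypothetical binning rule: rounded IDF and capped category counts."""
    c = first_half_counts.get(word, {})
    return (round(idf.get(word, 0.0)),
            min(c.get("indoor", 0), cap),
            min(c.get("outdoor", 0), cap))

def bin_lambdas(second_half, idf, first_half_counts, alpha=0.5):
    """One smoothed log-odds weight per bin, estimated on held-out data."""
    hits = defaultdict(lambda: {"indoor": 0, "outdoor": 0})
    totals = {"indoor": 0, "outdoor": 0}
    for category, tokens in second_half:
        totals[category] += 1
        for key in {bin_key(w, idf, first_half_counts) for w in tokens}:
            hits[key][category] += 1
    lambdas = {}
    for key, h in hits.items():
        p_in = (h["indoor"] + alpha) / (totals["indoor"] + 2 * alpha)
        p_out = (h["outdoor"] + alpha) / (totals["outdoor"] + 2 * alpha)
        lambdas[key] = math.log(p_in / p_out)  # > 0 favors indoor
    return lambdas

# Toy data; only IDF(airplane) = 5.4 and its 0/2 counts come from the slides.
IDF = {"airplane": 5.4, "jet": 5.3, "conference": 3.1}
FIRST_HALF = [
    ("outdoor", ["airplane", "runway"]),
    ("outdoor", ["airplane", "jet"]),
    ("outdoor", ["jet", "crash"]),
    ("indoor", ["conference"]),
]
SECOND_HALF = [
    ("outdoor", ["jet", "wreckage"]),
    ("outdoor", ["airplane"]),
    ("indoor", ["conference", "table"]),
]
COUNTS = doc_counts(FIRST_HALF)
AIRPLANE_BIN = bin_key("airplane", IDF, COUNTS)
LAMBDAS = bin_lambdas(SECOND_HALF, IDF, COUNTS)
print(AIRPLANE_BIN, round(LAMBDAS[AIRPLANE_BIN], 2))
```

In this toy setup "airplane" and "jet" fall into the same bin, so both contribute to, and both receive, the same lambda. That sharing is what lets rare words get a usable weight.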

  • Sample Words with Scores

    Indoor Indicators:
    conference   +5.91
    bed          +4.58

    Outdoor Indicators:
    airplane     -3.78
    earthquake   -4.86

    Ambiguous:
    Gore         +0.74
    ceremony     -0.32

  • Results

    Both systems did OK on Reuters.
    DE performed best for Indoor vs. Outdoor.
    Bin system performed best for Events.

  • Standard Evaluation Metrics (1)

    Per category measures:
    Simple accuracy or error measures are misleading for binary categorization.
    Precision and recall.
    F-measure, average precision, and break-even point (BEP) combine precision and recall.
    Macro-averaging vs. micro-averaging.
    Macro treats all categories equally, micro treats all documents equally.
    Macro is usually lower since small categories are hard.

    p = a / (a + b)
    r = a / (a + c)

    contingency table:

                           In category  Not in category
    Assigned to category   a            b
    Not assigned           c            d
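The per-category measures and the two averaging schemes can be sketched as follows, using the same a, b, c counts as the formulas above (a: correctly assigned, b: incorrectly assigned, c: missed). The per-category counts in the example are hypothetical and chosen only to show why macro-averaged F1 tends to be lower when small categories are hard.

```python
def precision(a, b):
    """p = a / (a + b); defined as 0 when nothing was assigned."""
    return a / (a + b) if a + b else 0.0

def recall(a, c):
    """r = a / (a + c); defined as 0 when the category is empty."""
    return a / (a + c) if a + c else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(tables):
    """tables: one (a, b, c) tuple per category."""
    # Macro: compute F1 per category, then average (categories weigh equally).
    macro = sum(f1(precision(a, b), recall(a, c)) for a, b, c in tables) / len(tables)
    # Micro: pool the counts first (documents weigh equally).
    total_a = sum(a for a, _, _ in tables)
    total_b = sum(b for _, b, _ in tables)
    total_c = sum(c for _, _, c in tables)
    micro = f1(precision(total_a, total_b), recall(total_a, total_c))
    return micro, macro

# Hypothetical counts: one large, easy category and one small, hard one.
TABLES = [(90, 10, 10), (1, 4, 4)]
MICRO, MACRO = micro_macro_f1(TABLES)
print(round(MICRO, 4), round(MACRO, 4))
```

As a sanity check on the formula, `f1(0.9137, 0.8120)` gives about 0.8599, which agrees with the micro-averaged precision, recall, and F1 reported for SVM on Reuters.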

  • Results for Reuters

    Charts: Micro-F1 by method; Macro-F1 by method.

    (miR, miP, miF1: micro-averaged recall, precision, F1; maF1: macro-averaged F1)

    method    miR     miP     miF1    maF1    error
    SVM       0.8120  0.9137  0.8599  0.5251  0.00365
    KNN       0.8339  0.8807  0.8567  0.5242  0.00385
    LSF       0.8507  0.8489  0.8498  0.5008  0.00414
    NNet      0.7842  0.8785  0.8278  0.3765  0.00447
    NB        0.7688  0.8245  0.7956  0.3886  0.00544
    Rocchio   0.7121  0.7072  0.7096  0.5014  0.00803
    Columbia  0.7893  0.8748  0.8298  0.4052  0.00446
    Combo     0.8048  0.8390  0.8215  0.5118  0.00482
    Bin       0.8053  0.7915  0.7984  0.4561  0.00561

  • Standard Evaluation Metrics (2)

    Mutually exclusive categories:
    Each test document has only one correct label.
    Each test document is assigned only one label.
    Performance is measured by overall accuracy: the fraction of test documents assigned their correct label.

  • Results for Indoor vs. Outdoor

    Columbia system using density estimation shows best performance.
    Even beats SVMs.
    System using bins very respectable.

  • Results for Event Categories

    System using bins shows best performance.
    Columbia system respectable.

  • Improving Bin Method

    Experiment with more advanced binning rules.
    Fall back to single word term weights for frequent words.

  • Why are Disaster Images Hard?

    Small corpus (124 training images, 124 test images).
    Most words not important.
    Important words associated with test images have likely never occurred in training set.

  • Approach

    Extract important information (e.g. subjects and verbs).
    Compare test words to training words.
    Previously seen words indicate strong evidence.
    Consider large, unsupervised corpus of subject/verb pairs.
    Add evidence to categories for any verb or subject ever paired with new subject or verb.
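The evidence-gathering steps above might look like the following sketch. It is an interpretation, not the proposal's implementation: the strong and weak evidence weights, the data structures, and the example subjects and verbs are all hypothetical. Only the category names come from the Disaster Images slide.

```python
from collections import Counter, defaultdict

def build_partner_index(pairs):
    """Index an unsupervised corpus of (subject, verb) pairs both ways."""
    partners = defaultdict(set)
    for subject, verb in pairs:
        partners[("subj", subject)].add(("verb", verb))
        partners[("verb", verb)].add(("subj", subject))
    return partners

def score(test_pairs, training_evidence, partners, strong=1.0, weak=0.25):
    """Accumulate category evidence for a test caption's subject/verb pairs."""
    scores = Counter()
    for subject, verb in test_pairs:
        for item in (("subj", subject), ("verb", verb)):
            if item in training_evidence:
                # Word seen in training: strong, direct evidence.
                for category, n in training_evidence[item].items():
                    scores[category] += strong * n
            else:
                # New word: borrow evidence from words it was ever paired with.
                for partner in partners.get(item, ()):
                    for category, n in training_evidence.get(partner, {}).items():
                        scores[category] += weak * n
    return scores

# Hypothetical training evidence: word -> category counts.
TRAINING_EVIDENCE = {
    ("verb", "search"): Counter({"Workers Responding": 3}),
    ("subj", "survivors"): Counter({"Affected People": 2}),
    ("verb", "wait"): Counter({"Affected People": 1}),
}
# Hypothetical unsupervised subject/verb pairs.
PARTNERS = build_partner_index([
    ("firefighters", "search"),
    ("firefighters", "spray"),
    ("rescuers", "search"),
    ("families", "wait"),
])
SCORES = score([("firefighters", "comb")], TRAINING_EVIDENCE, PARTNERS)
print(SCORES.most_common(1))
```

Here neither "firefighters" nor "comb" occurs in the training evidence, yet the caption still receives evidence for Workers Responding because "firefighters" was paired with the known verb "search" in the unsupervised corpus.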

  • Subjects and Verbs

  • First Try

    Very simple subject/verb extraction.
    Rather small unsupervised corpus.
    Very simple similarity metric.

    System              Results
    Rocchio             57.3%
    Density Estimation  58.9%
    First Attempt       59.7% (67.9% when prior evidence)

  • Results of Simple Extraction

  • Doing It Right

    Better subject/verb extraction with parser.
    Much larger unsupervised corpus.
    Better similarity metric.
    Maybe stemming.

  • Previous Interactions

    Combining Pcut and DE for Reuters.
    Combining text system and image feature system for Indoor vs. Outdoor.
    Using number of people information to improve Indoor vs. Outdoor probabilities.

  • Those Results

    Pcut + DE for Reuters:

           Micro-average F1  Macro-average F1
    Pcut   71.0%             50.1%
    DE     83.0%             40.5%
    Combo  82.2%             51.2%

    Text + Image for In/Out:

                Accuracy
    Image Only  82.4%
    Text Only   83.3%
    Combo       86.2%

    Adding # of People for In/Out:

                         Standard Evaluation  Alternate Evaluation
    Without # of people  79.9%                75.0%
    With # of people     80.2%                77.2%

  • Future Interactions

    Improve accuracy.
    Determine new information.
    Likely use density estimation or bins.

  • For Example

    Indoor + Disaster → Outdoor
    Indoor + Politics → Meeting / Press Conference

  • Closed Captioned Video

    Apply image system to video.
    New categories.
    New modality, new challenges.

    669 726 "HEADLINE NEWS" -- I'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.
    750 878 THE FIRE HAS BEEN PUT OUT, BUT SO HAVE THE HOPES OF THOUSANDS OF PEOPLE COUNTING ON A VACATION TO MEXICO THIS WEEK.
    930 1087 THE CARNIVAL CRUISE LINER "ECSTACY" WAS ONLY TWO MILES INTO ITS TRIP FROM MIAMI TO COZUMEL WHEN A FIRE BROKE OUT IN A LAUNDRY ROOM.
    1134 1291 THE COAST GUARD CAME TO THE RESCUE AND DOUSED THE FLAMES AS HUNDREDS OF PASSENGERS DONNING LIFE JACKETS WATCHED ON.
    1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.
    ...
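Before a text categorizer can be applied to closed captions, the caption stream has to be turned into timed text. The sketch below assumes one caption per line in the form "start end text", as in the sample above. The slides do not say what the two leading numbers measure, so they are kept as plain integers, and the class and function names are hypothetical.

```python
import re
from typing import List, NamedTuple

class Caption(NamedTuple):
    start: int  # first number on the line (unit not stated in the slides)
    end: int    # second number on the line
    text: str

LINE_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(.+?)\s*$")

def parse_captions(lines: List[str]) -> List[Caption]:
    """Parse 'start end text' lines, skipping anything that does not match."""
    captions = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            captions.append(
                Caption(int(match.group(1)), int(match.group(2)), match.group(3))
            )
    return captions

def tokens(captions: List[Caption]) -> List[str]:
    """Lowercased bag-of-words tokens, ready for a text categorizer."""
    words = []
    for caption in captions:
        words.extend(re.findall(r"[a-z']+", caption.text.lower()))
    return words

SAMPLE = [
    '669 726 "HEADLINE NEWS" -- I\'M KIMBERLEY KENNEDY, IN FOR DAVID GOODNOW.',
    "1331 1388 THE FIRE CHARRED THREE LOWER DECKS BEFORE FIREFIGHTERS BROUGHT IT UNDER CONTROL.",
    "...",
]
CAPTIONS = parse_captions(SAMPLE)
WORDS = tokens(CAPTIONS)
print(len(CAPTIONS), WORDS[:3])
```

Keeping the two numbers alongside the text would make it possible to group captions by story segment later, which matters because a single broadcast mixes several categories.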

  • Schedule

  • Conclusions

    Categorization of multimedia data.
    Applying novel techniques.
    Using NLP for hard categories.
    Exploring interactions between systems and categories.