multimedia retrieval
![Page 1: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/1.jpg)
Multimedia Retrieval
![Page 2: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/2.jpg)
Outline
• Audio Retrieval
  • Spoken information
  • Music
• Document Image Analysis and Retrieval
• Video Retrieval
![Page 3: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/3.jpg)
A Taxonomy of Audio
• Sound
  • Music
    • Classical
      • Orchestra
      • String Quartet
      • Choir
      • Piano
    • Country
    • Disco
    • Hip Hop
    • Jazz
    • Rock
  • Speech
    • Sports Announcer
    • Female
    • Male
  • Other?
    • ?
![Page 4: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/4.jpg)
Spoken Document Retrieval
![Page 5: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/5.jpg)
Spoken Document Retrieval
![Page 6: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/6.jpg)
Speech Recognition Knowledge Sources
• Acoustic Modeling: describes the sounds that make up speech
• Lexicon: describes which sequences of speech sounds make up valid words
• Language Model: describes the likelihood of various sequences of words being spoken
![Page 7: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/7.jpg)
Speech Recognition in Brief
Speech → Signal Processing → Phonetic Probability Estimator (Acoustic Model) → Decoder (Language Model) → Words
Also shown: Pronunciation Lexicon, Grammar
![Page 8: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/8.jpg)
Hints For Better Recognition
• Goal: improve the estimate of p(word | acoustic_signal)
• Main idea: condition on additional evidence X
  p(word | acoustic_signal) → p(word | acoustic_signal, X)
• What could be X?
  • Topical information
  • News of the day
  • Image information?
![Page 9: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/9.jpg)
Hints For Better Recognition
• Goal: improve the estimate of p(word | acoustic_signal)
• Main idea: condition on additional evidence X
  p(word | acoustic_signal) → p(word | acoustic_signal, X)
• What could be X?
  • Topical information
  • News of the day
  • Image information
    • Lip reading
    • Video Optical Character Recognition (VOCR)
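The conditioning idea can be sketched with Bayes' rule. This is a minimal illustration, not the method used by any particular recognizer: it assumes the acoustic evidence is independent of X given the word, so that p(word | acoustic_signal, X) is proportional to p(acoustic_signal | word) · p(word | X). The function name `rescore` and all probabilities are hypothetical.

```python
def rescore(acoustic_likelihood, prior_given_x, floor=1e-6):
    """Combine p(acoustic_signal | word) with a context prior p(word | X).

    Returns a normalized posterior over the candidate words. Words missing
    from the prior get a small floor probability (an assumption).
    """
    unnormalized = {
        word: likelihood * prior_given_x.get(word, floor)
        for word, likelihood in acoustic_likelihood.items()
    }
    total = sum(unnormalized.values())
    return {word: score / total for word, score in unnormalized.items()}


# Hypothetical example: two homophones the acoustics cannot separate.
acoustic = {"peace": 0.5, "piece": 0.5}
# X = "today's news is about a treaty", so "peace" is far more likely a priori.
topic_prior = {"peace": 0.9, "piece": 0.1}

posterior = rescore(acoustic, topic_prior)
print(posterior)
```

With equal acoustic scores, the posterior simply follows the context prior, which is the situation where extra evidence X helps most.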
![Page 10: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/10.jpg)
Speech Recognition Accuracy
[Figure: Word Error Rate (axis 0 to 100) for Benchmark, Lab, TV Studio, Dialog, News, Documentary, Commercials]
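Word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that standard definition; the function name and the example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)


wer = word_error_rate("find pictures of harry hertz",
                      "find picture of hairy hertz")
print(f"WER = {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.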
![Page 11: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/11.jpg)
Information Retrieval Precision vs. Speech Accuracy
[Figure: Relative Precision (% of Text IR, axis 30 to 100) plotted against Word Error Rate (axis 0 to 80)]
Indexing and Search of Multimodal Information, Hauptmann, A., Wactlar, H. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), Munich, Germany, April 1997.
Retrieval degrades only slightly when the word error rate is below 30%.
![Page 12: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/12.jpg)
Spoken Document Retrieval
• Segmentation issue
  • Continuous speech data without story boundaries
• Typical segmentation approaches
  • Overlapping windows (30 sec for each segment)
  • Automatic detection of speaker changes
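The overlapping-window approach can be sketched as follows. The 30-second window length comes from the slide; the 15-second step (50% overlap), the `(start_time, word)` input format and the function name are assumptions made for illustration.

```python
def overlapping_windows(timed_words, window_sec=30.0, step_sec=15.0):
    """Cut a time-stamped transcript into overlapping pseudo-documents.

    timed_words: list of (start_time_in_seconds, word) pairs.
    Returns a list of (window_start, text) pairs; empty windows are skipped.
    """
    if not timed_words:
        return []
    last_time = max(time for time, _ in timed_words)
    segments = []
    start = 0.0
    while start <= last_time:
        words = [word for time, word in timed_words
                 if start <= time < start + window_sec]
        if words:
            segments.append((start, " ".join(words)))
        start += step_sec
    return segments


# Hypothetical recognizer output with word start times in seconds.
transcript = [(0.0, "good"), (10.0, "evening"), (20.0, "tonight"),
              (35.0, "weather"), (50.0, "sports")]
segments = overlapping_windows(transcript)
for window_start, text in segments:
    print(window_start, text)
```

Each word lands in more than one window, so a story that straddles a window boundary is still fully contained in some segment.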
![Page 13: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/13.jpg)
Spoken Document Retrieval: Document Expansion
• Motivation: documents are erroneous
• Goal: apply expansion techniques to reduce the impact of recognition errors in spoken documents
• Similar to query expansion
![Page 14: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/14.jpg)
Spoken Document Retrieval: Document Expansion
[Diagram: Speech Recognized Transcript → Clean Doc Collection (web docs) → top ranked docs (doc1, doc2, doc3, doc4) → find common words in top ranked docs]
![Page 15: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/15.jpg)
Spoken Document Retrieval: Document Expansion
• Treat each speech document as a query
• Find clean documents that are relevant to speech documents
• Expand each speech document with the common words in the top ranked clean documents
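The three steps can be sketched in a few lines. This is a simplified illustration, not the published algorithm: it ranks clean documents by cosine similarity over raw term counts and treats a word as "common" when it occurs in at least `min_docs` of the `top_k` ranked documents. Both parameters, the function names and the example texts are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def expand_document(transcript, clean_docs, top_k=3, min_docs=2):
    """Append words shared by the top-ranked clean documents."""
    query = Counter(transcript.lower().split())
    vectors = [Counter(doc.lower().split()) for doc in clean_docs]
    # Steps 1 and 2: use the transcript as a query against the clean collection.
    ranked = sorted(vectors, key=lambda v: cosine(query, v), reverse=True)[:top_k]
    # Step 3: document frequency of each word within the top-ranked documents.
    doc_freq = Counter(term for vector in ranked for term in vector)
    new_words = sorted(term for term, df in doc_freq.items()
                       if df >= min_docs and term not in query)
    return " ".join(transcript.split() + new_words)


# Hypothetical transcript in which "clinton" was misrecognized as "clean ton".
speech_doc = "president clean ton visits china"
clean_collection = [
    "president clinton visits china summit",
    "clinton china summit trade talks",
    "president clinton trade china",
    "football season opens",
]
expanded = expand_document(speech_doc, clean_collection)
print(expanded)
```

The expansion recovers the misrecognized name, so a query for it can now match the spoken document. A practical system would also weight the added terms and filter stop words.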
![Page 16: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/16.jpg)
Document Expansion (Singhal & Pereira, 1999)
![Page 17: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/17.jpg)
![Page 18: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/18.jpg)
Music Information Retrieval
![Page 19: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/19.jpg)
Music Retrieval
• A textual retrieval approach
  • Using metadata: titles, artists, genres, …
• Content-based music retrieval
  • Query by audio
  • Query by score document/segment
![Page 20: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/20.jpg)
Content-based Music Retrieval
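• On-line processing (query): Microphone signal input → Sampling (11 kHz) → Center clipping → Short-term autocorrelation → Note segmentation → Mid-level representation
• Off-line processing (database): Songs database → MIDI message extraction → Mid-level representation
• Similarity comparison of the two representations → Query results (ranked song list)
• Mid-level representation example: MIDI notes 67 64 65 62 60 become the pitch intervals -3 1 -3 -2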
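The mid-level representation in the slide's example is the difference between consecutive MIDI note numbers. A minimal sketch of that step (the function name is an illustrative choice) also shows why it suits query by humming: the intervals do not change when the melody is sung in a different key.

```python
def pitch_intervals(midi_notes):
    """Differences in semitones between consecutive MIDI note numbers."""
    return [later - earlier
            for earlier, later in zip(midi_notes, midi_notes[1:])]


notes = [67, 64, 65, 62, 60]          # MIDI note numbers from the slide
intervals = pitch_intervals(notes)
print(intervals)

# Transposing the melody up five semitones leaves the intervals unchanged.
transposed = [note + 5 for note in notes]
transposed_intervals = pitch_intervals(transposed)
print(transposed_intervals == intervals)
```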
![Page 21: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/21.jpg)
Content-based Music Retrieval
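Two interval sequences:
Sequence 1: 1 1 2 0 -2 0 1 2 0
Sequence 2: -3 1 1 2

• N-gram representation (3-gram counts per sequence)

| 3-gram | Term | Count in sequence 1 | Count in sequence 2 |
|---|---|---|---|
| 1 1 2 | C1 | 1 | 1 |
| 1 2 0 | C2 | 2 | 0 |
| 2 0 -2 | C3 | 1 | 0 |
| 0 -2 0 | C4 | 1 | 0 |
| -3 1 1 | C5 | 0 | 1 |

• A vector representation for each music document
• A typical information retrieval problem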
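Once each melody is a vector of n-gram counts, any vector-space ranking function applies. The sketch below counts every 3-gram in the two sequences (including those not listed in the slide's table) and compares them with cosine similarity; the choice of cosine and the function names are assumptions.

```python
import math
from collections import Counter


def ngram_counts(intervals, n=3):
    """Count every run of n consecutive intervals."""
    return Counter(tuple(intervals[i:i + n])
                   for i in range(len(intervals) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


sequence_1 = [1, 1, 2, 0, -2, 0, 1, 2, 0]
sequence_2 = [-3, 1, 1, 2]

vector_1 = ngram_counts(sequence_1)
vector_2 = ngram_counts(sequence_2)
similarity = cosine_similarity(vector_1, vector_2)

print(vector_1[(1, 2, 0)], vector_2[(1, 2, 0)])  # counts for term C2
print(round(similarity, 4))
```

The two sequences share only the 3-gram `1 1 2`, so their similarity is low but not zero.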
![Page 22: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/22.jpg)
Document Image Analysis and Retrieval
![Page 23: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/23.jpg)
Document Image Analysis
• Recognize text (OCR)
  • convert page images to Unicode
  • machine-printed, handwritten
• Analyze page layout geometry
  • a 2-D problem (unlike speech, text)
  • good ‘language-free’ algorithms
• Capture logical structure
  • output marked-up text (XML, etc.)
  • exploit non-textual clues
![Page 24: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/24.jpg)
Video/Image OCR Block Diagram
Video or Image → Text Area Detection → Text Area Preprocessing → Commercial OCR → UTF8 Text
![Page 25: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/25.jpg)
Text Detection
![Page 26: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/26.jpg)
Video OCR
• Low resolution (as low as 10 pixels of height per character)
  • limited by NTSC (352x248)/PAL/SECAM TV standards
• Complex background
  • Character hue and brightness similar to background
![Page 27: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/27.jpg)
VOCR Preprocessing Problems
![Page 28: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/28.jpg)
[Figure panels: Video Frames (1/2 s intervals), Filtered Frames, AND-ed Frames]
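One plausible reading of the "AND-ed Frames" panel is a pixel-wise AND over several filtered frames: a caption stays in place while the background changes, so only caption pixels survive. The sketch below assumes the filtered frames have already been binarized (1 marks a candidate text pixel); the tiny frames and the function name are hypothetical.

```python
def and_frames(frames):
    """Pixel-wise AND of equally sized binary frames (lists of rows of 0/1)."""
    rows = len(frames[0])
    cols = len(frames[0][0])
    return [
        [int(all(frame[r][c] for frame in frames)) for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical binarized frames taken half a second apart.
frame_a = [[1, 1, 0],
           [0, 1, 1]]
frame_b = [[1, 0, 0],
           [0, 1, 0]]

stable_pixels = and_frames([frame_a, frame_b])
print(stable_pixels)
```

Pixels that are set in only one frame (moving background) drop out, leaving the pixels that are stable across frames.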
![Page 29: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/29.jpg)
![Page 30: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/30.jpg)
OCR Document Retrieval
• Task: find OCR-recognized documents relevant to an information need
• Challenge: the documents contain recognition errors, so retrieval needs to handle word errors
![Page 31: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/31.jpg)
OCR Document Retrieval
• Correction-based approaches
  • Find potential word errors and replace each with the most likely correct one
• Partial matching approaches
  • Word → a set of n-grams
  • Word matches → n-gram matches
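Partial matching can be sketched with character n-grams: a single OCR error destroys an exact word match but leaves most of the word's n-grams intact. The use of bigrams, the `_` boundary padding and the Dice coefficient are illustrative choices, not details given in the slides.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams, with '_' marking the word boundaries."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over the two words' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


query_term = "director"
ocr_word = "dlrector"   # hypothetical OCR output with one misread character

exact_match = query_term == ocr_word
partial_match = ngram_similarity(query_term, ocr_word)
unrelated = ngram_similarity(query_term, "hertz")
print(exact_match, round(partial_match, 3), unrelated)
```

An index built on n-grams instead of whole words can therefore still retrieve the document, at the cost of some spurious matches between similar-looking words.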
![Page 32: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/32.jpg)
Video Retrieval
![Page 33: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/33.jpg)
Video Retrieval - Application of Diverse Technologies
• Speech understanding for automatically derived transcripts
• Image understanding for video “paragraphing”; face, text and other object recognition
• Natural language for query expansion, topic detection and content summarization
• Human-computer interaction for video display, navigation and reuse
• Integration overcomes the limitations of each technology
![Page 34: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/34.jpg)
Introduction to TREC Video Retrieval Track
• NIST TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/
• Video Retrieval Track started in 2001
• Investigation of content-based retrieval from digital video
• Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip
![Page 35: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/35.jpg)
The TRECVID Collections
• 2001 - 11 hours, 74 queries, 8000 shots
• 2002 - 40 hours, 25 queries, 14000 shots
  • Video from the Internet Archive between the '50s and '70s
  • Advertising, educational, industrial and amateur films
  • Common shot boundaries
• 2003 - 56 hours, 25 queries, 32000 shots
  • 1998 Broadcast News (CNN, ABC, C-SPAN)
  • + Common Speech Recognition
  • + Common Annotations
• 2004 - 61 hours, 24 queries, 33000 shots
  • More 1998 Broadcast News
![Page 36: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/36.jpg)
Sample Query and Target
Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST
Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …
OCR: H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director
![Page 37: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/37.jpg)
System Architecture (TREC Video Track 2001)
• Combine video, audio and text retrieval scores
Query → Retrieval Agents (Text, Image, Audio) → Text Score, Image Score, Audio Score → Final Score
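The slides do not say how the agents' scores are merged into the final score. One common option, sketched here purely as an assumption, is to min-max normalize each agent's scores and take a weighted sum. The weights, shot names and scores below are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one agent's scores to [0, 1] so agents are comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {shot: 0.0 for shot in scores}
    return {shot: (score - low) / (high - low)
            for shot, score in scores.items()}


def combine_scores(agent_scores, weights):
    """Weighted sum of normalized per-agent scores, best shot first."""
    final = {}
    for agent, scores in agent_scores.items():
        for shot, score in min_max_normalize(scores).items():
            final[shot] = final.get(shot, 0.0) + weights[agent] * score
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores from the three retrieval agents.
agent_scores = {
    "text":  {"shot1": 2.0, "shot2": 1.0, "shot3": 0.0},
    "image": {"shot1": 0.1, "shot2": 0.9, "shot3": 0.5},
    "audio": {"shot1": 0.2, "shot2": 0.4, "shot3": 0.6},
}
weights = {"text": 0.5, "image": 0.3, "audio": 0.2}   # assumed weights

ranking = combine_scores(agent_scores, weights)
for shot, score in ranking:
    print(shot, round(score, 2))
```

Normalizing first matters because the agents score on different scales; without it, the agent with the largest raw values would dominate the sum.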
![Page 38: Multimedia Retrieval](https://reader033.vdocuments.mx/reader033/viewer/2022061416/56814395550346895db01250/html5/thumbnails/38.jpg)
Results for TREC01

| Method | ARR | Recall |
|---|---|---|
| ASR Transcripts | 1.84% | 13.2% |
| VOCR | 5.93% | 7.52% |
| Image Retrieval | 14.99% | 24.45% |
| Combined | 18.9% | 28.25% |