The Speech Recognition Virtual Kitchen
Florian Metze and Eric Fosler-Lussier
INTERSPEECH 2012
Multimedia Retrieval and Summarization
“Traditional” Multimedia Retrieval and Summarization
Select frames and shots that are most informative
Save user time by avoiding repetitions etc. (BBC Rushes Summarization)
Recent Advances in Natural Language Processing
Replace “extractive” summarization of text with “abstractive” techniques
Use Statistical Machine Translation as a general technique to convert a long “foreign” symbol sequence into concise English text
Would this not apply nicely to multimedia?
Easily have huge amounts of data
“Skimming”, “tagging” with keywords, or “liking” clearly does not do justice to the relevance, complexity, and potential of multimedia
What’s Next?
Generate more detailed synopses, add temporal aspects, properties
Add more modalities (sounds, etc.)
“What is in these videos?”
Text could summarize multiple videos at once
Attract interest to (groups of) videos
“Why is this video relevant? Or different?”
Text can relate a retrieved video to the query
Text can potentially flag false alarms, outliers
Thank You!
Feature Definition
Event name: Changing a vehicle tire
Definition: One or more people work to replace a tire on a vehicle
Explication: A vehicle is any device, motorized or not, used to transport people and/or other items. Tires are ring-shaped inflated objects, usually made of rubber, that fit over the wheel of a vehicle. The process for replacing a tire includes removing the existing tire and installing the new tire onto the wheel of the vehicle. Tires typically are replaced because they are damaged or worn down. If a tire is damaged and loses air pressure as a result, it is called a "flat tire". Generally the driver of the vehicle with a flat tire will stop the vehicle as soon as possible and replace the affected tire with a temporary tire called a "spare tire", which may be stored elsewhere on/in the vehicle. In other cases, the tire may be changed not by the vehicle operator, but by a professional (e.g. a mechanic) who may use dedicated tools and work in a repair shop or similar setting.
Evidential description:
scene: garage, outdoors, street, parking lot
objects/people: tire, lug wrench, hubcap, vehicle (car, bike, lawnmower, etc.), tire jack
activities: removing hubcap, turning lug wrench, unscrewing bolts, pulling rim out of tire
audio: narration of the process; sounds of tools being used; street/traffic noise; background noises from repair shop
Extract candidates for relevant objects from “Event Kit”
Determine salient objects from MED features
Intersect both sets
Use ontologies to resolve synonyms, etc.
Combine data-driven and knowledge-based sources
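The selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table stands in for a real ontology lookup, and the names `SYNONYMS`, `normalize`, `select_objects`, the detector scores, and the `min_score` threshold are all hypothetical, not taken from the system described in the talk.

```python
# Hypothetical synonym table standing in for an ontology lookup.
SYNONYMS = {
    "tyre": "tire",
    "car": "vehicle",
    "automobile": "vehicle",
    "lug wrench": "wrench",
}


def normalize(term):
    """Lowercase a term and map it to its canonical form if one is known."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def select_objects(kit_terms, detected_scores, min_score=0.5):
    """Intersect event-kit candidates with salient detected concepts.

    kit_terms: object names taken from the event kit (knowledge-based).
    detected_scores: concept -> detector score (data-driven, assumed in [0, 1]).
    min_score: assumed salience threshold.
    """
    candidates = {normalize(t) for t in kit_terms}
    salient = {normalize(c) for c, s in detected_scores.items() if s >= min_score}
    return sorted(candidates & salient)


# Made-up inputs for the "Changing a vehicle tire" event.
kit = ["tire", "lug wrench", "hubcap", "vehicle", "tire jack"]
detections = {"tyre": 0.9, "car": 0.8, "dog": 0.7, "hubcap": 0.2}
print(select_objects(kit, detections))
```

Normalizing both sides before intersecting is what lets "tyre" and "car" from the detectors match "tire" and "vehicle" from the kit.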
MER Approach: Feature Extraction
What to mention:
Take visual evidence (for 100s of classes) for video
Re-rank using manually determined “importance”
How to mention:
Present as corroborating or contraindicative evidence
Place additional constraints
Similar for ASR hypotheses
Based on unigrams for now
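A minimal sketch of the "what to mention" and "how to mention" steps, assuming detector scores are weighted by a hand-set importance value and that each concept has been manually assigned to a supporting or contradicting set for the event. The function names, weights, and concept sets are hypothetical and only illustrate the idea.

```python
def rerank(detections, importance, top_k=3):
    """Order concepts by detector score times manually set importance."""
    weighted = {c: s * importance.get(c, 1.0) for c, s in detections.items()}
    return sorted(weighted, key=lambda c: weighted[c], reverse=True)[:top_k]


def describe(concepts, supports, contradicts):
    """Phrase each concept as corroborating or contraindicative evidence.

    Concepts that are in neither set are not mentioned.
    """
    lines = []
    for c in concepts:
        if c in supports:
            lines.append(f"{c}: corroborating evidence")
        elif c in contradicts:
            lines.append(f"{c}: contraindicative evidence")
    return lines


# Made-up detector scores and importance weights for one video.
detections = {"tire": 0.6, "sky": 0.9, "kitchen": 0.5, "wrench": 0.4}
importance = {"tire": 2.0, "wrench": 2.0, "kitchen": 1.5, "sky": 0.1}

top = rerank(detections, importance)
for line in describe(top, supports={"tire", "wrench"}, contradicts={"kitchen"}):
    print(line)
```

The same two functions could be applied to ASR unigrams by passing word scores instead of visual concept scores.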
Move from “hand-engineered” to automatic methods
Now: similar to TF-IDF measure, bag-of-words (BOW) features
Future: Bipartite graph matching to determine “good” concepts
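To illustrate what a TF-IDF-like measure over bag-of-words features could look like here, the sketch below treats each event as a "document" and each detected concept as a "word". The weighting formula is the textbook TF-IDF variant (relative frequency times log inverse document frequency), which is an assumption; the talk does not specify the exact measure. The event bags are made up.

```python
import math
from collections import Counter


def tfidf(bags):
    """Compute TF-IDF weights per event.

    bags: event name -> list of concept labels observed for that event.
    Returns: event name -> {concept: weight}.
    """
    n_events = len(bags)
    # Document frequency: in how many events does each concept occur?
    df = Counter()
    for concepts in bags.values():
        df.update(set(concepts))

    weights = {}
    for event, concepts in bags.items():
        tf = Counter(concepts)
        total = len(concepts)
        weights[event] = {
            c: (count / total) * math.log(n_events / df[c])
            for c, count in tf.items()
        }
    return weights


# Made-up bags of detected concepts for three events.
bags = {
    "tire change": ["tire", "tire", "person", "wrench"],
    "birthday": ["cake", "person", "person"],
    "flash mob": ["person", "crowd", "street"],
}
w = tfidf(bags)
best = max(w["tire change"], key=w["tire change"].get)
print(best)
```

A concept such as "person" that occurs in every event receives weight zero, so it is never selected as characteristic of any single event.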
Events shown: Birthday; Vehicle unstuck; Flash mob; Vehicle Tire
INTERSPEECH AFTERPARTY “Speech Recognition Virtual Kitchen”
Broadway 3 & 4
4:30pm on Thursday, September 13
We want your input to grow this idea further – show your support
Come and see more demos of VMs
Discuss with potential users or content providers from outside the speech community
Present your own ideas in a short presentation(?)
http://www.speechkitchen.org/
http://speechkitchen.cse.ohio-state.edu/