Lecture #32
WWW Search
Review: Data Organization
• Kinds of things to organize
  – Menu items
  – Text
  – Images
  – Sound
  – Videos
  – Records (e.g., a person’s name, address, & phone number, or a car’s year, make, & model)
Review: Data Organization
• Three ways to find things:
  – Lists (in-order search, binary search)
  – Trees (balance number of branches with time to decide which is correct branch)
  – Search
WWW Search
Search issues
• How do we say what we want?
  – I want a story about pigs
  – I want a picture of a rooster
  – How many televisions were sold in Vietnam during 2000?
  – Find a movie like this one
• How does the computer find what we said?
Things to search for
• Records
• Text
• Images
• Audio
• Video
Records
• Car
  – Price
  – Miles
  – Year
  – Make
  – Doors
• Queries
  • Price < 6000 & Miles < 100000
  • Make == Toyota & Year > 1993
Queries
• Make == Toyota & Year >1993
Queries
• Year >1993 or Price < $3,000
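The queries on these slides can be sketched as filters over a list of records. This is a minimal illustration, not how a real database runs queries; the `Car` class, the `query` helper, and the three sample cars are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Car:
    price: int
    miles: int
    year: int
    make: str
    doors: int


# Made-up sample records, for illustration only.
cars = [
    Car(price=5500, miles=90000, year=1992, make="Honda", doors=4),
    Car(price=8000, miles=60000, year=1995, make="Toyota", doors=2),
    Car(price=2500, miles=150000, year=1990, make="Toyota", doors=4),
]


def query(records, predicate):
    """Return every record for which the predicate is true."""
    return [r for r in records if predicate(r)]


# Price < 6000 & Miles < 100000
cheap_low_miles = query(cars, lambda c: c.price < 6000 and c.miles < 100000)
# Make == Toyota & Year > 1993
newer_toyotas = query(cars, lambda c: c.make == "Toyota" and c.year > 1993)
# Year > 1993 or Price < $3,000
newer_or_cheap = query(cars, lambda c: c.year > 1993 or c.price < 3000)
```

An "and" query can only shrink the result as conditions are added; an "or" query can only grow it.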
Databases
• Large collections of records
• Accessed by queries
Things to search for
• Records
• Text
• Images
• Audio
• Video
Text searching
• How do I say what I want?
  – Type some phrase
    • I want a story about pigs
• How will the computer match this?
  – What is text?
    • An array of characters
  – What can a computer do with text?
    • Match characters
Text searching
• People think in words not characters
• How do I convert an array of characters into an array of words?
  – Collect together sequences of letters
  – How do I know if character C is a letter?
    • (C >= “a” & C <= “z”) | (C >= “A” & C <= “Z”)
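A small sketch of turning an array of characters into an array of words, using the letter test from the slide. The function names `is_letter` and `to_words` are chosen for this example; the test only covers unaccented English letters, as the slide's test does.

```python
def is_letter(c):
    """The slide's test: C is between a and z, or between A and Z."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z")


def to_words(text):
    """Collect together runs of letters; anything else ends the word."""
    words = []
    current = []
    for c in text:
        if is_letter(c):
            current.append(c)
        elif current:
            words.append("".join(current))
            current = []
    if current:  # the text ended in the middle of a word
        words.append("".join(current))
    return words


print(to_words("I want a story about pigs."))
```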
Convert to words
• Because people think in words
Every document is an array of words
• I want a story about pigs
• How will I find the right documents?
  – Find all documents that have the word “pigs”
Searching text
• How will I find pigs fast?
  – Create an index of all words
    • With each word store the name or address of each document that contains that word
  – Search the index for “pigs”
    • Return the list of documents
    • Use a binary search on the word list (50,000 words)
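The index idea can be sketched as a sorted word list with a document list stored beside each word, searched with a binary search. The document names and contents below are made up, and `build_index` and `lookup` are names chosen for this example. With 50,000 words a binary search needs about 16 comparisons, since 2^16 = 65,536.

```python
from bisect import bisect_left


def build_index(documents):
    """Map each word to the names of the documents that contain it."""
    postings = {}
    for name, text in documents.items():
        for word in text.split():
            postings.setdefault(word, set()).add(name)
    words = sorted(postings)  # sorted so it can be binary searched
    return words, [sorted(postings[w]) for w in words]


def lookup(index, word):
    """Binary search the word list; return the matching document list."""
    words, doc_lists = index
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return doc_lists[i]
    return []


# Made-up documents, for illustration only.
docs = {
    "a.txt": "three little pigs",
    "b.txt": "pigs can fly",
    "c.txt": "a red rooster",
}
index = build_index(docs)
print(lookup(index, "pigs"))
```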
Problems
• What if a document has the word “Pig” but not “pigs”?
• Normalize
  – Case - make all words lower case
    • Pig -> pig
  – Stemming - remove all suffixes and prefixes before putting a word into the index
    • pigs -> pig
    • piggy -> pig
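A deliberately crude sketch of normalizing, just enough to handle the three examples on the slide. It is not a real stemming algorithm: the two-suffix rule list and the name `normalize` are assumptions made for this example, prefixes are not handled, and many English words would be stemmed wrongly.

```python
def normalize(word):
    """Lower-case the word, then strip one simple suffix (a toy rule set)."""
    w = word.lower()
    for suffix in ("s", "y"):
        # Only strip if at least three letters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            # Undo a doubled final letter left behind: pigg -> pig
            if w[-1] == w[-2]:
                w = w[:-1]
            break
    return w


for example in ("Pig", "pigs", "piggy"):
    print(example, "->", normalize(example))
```

The same function has to be applied to the words of a query, otherwise a search for “Pigs” would miss the index entry “pig”.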
Problems
• I want a story about pigs?
  – How does the computer know to search for pigs?
    • It doesn’t
  – How does the computer know what a story is?
    • It doesn’t
Searching
• I want a story about pigs
• Pick out the important words and search for them
  – Which words are important?
  – D = number of times a word appears in a document
  – A = average number of times a word appears in all documents
  – Importance = D/A
    • Why?
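A sketch of the Importance = D/A measure on three tiny made-up documents. It assumes A is the word's total count across the collection divided by the number of documents, which is one reading of "average number of times a word appears in all documents"; the function name `importance` is chosen for this example.

```python
from collections import Counter


def importance(doc_words, all_docs):
    """Score each word in one document as D / A."""
    scores = {}
    for word, d in Counter(doc_words).items():
        total = sum(doc.count(word) for doc in all_docs)
        a = total / len(all_docs)  # average count per document
        scores[word] = d / a
    return scores


# Made-up documents, already split into normalized words.
docs = [
    ["the", "pig", "ate", "the", "pig", "food"],
    ["the", "rooster", "ate", "the", "corn"],
    ["the", "story", "of", "the", "farm"],
]
scores = importance(docs[0], docs)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 2))
```

This hints at the "Why?": "the" appears about equally often everywhere, so its score stays near 1, while "pig" appears only in the first document and scores higher there.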
How do we create an index of all documents on the Web?
• Try = a list of URLs
• Seen = all URLs you have seen
While (Try is not empty) {
    Page = take a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
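The loop above can be sketched in Python. To keep the example runnable without a network, the "Web" here is a made-up dictionary from URL to (text, links), and `fetch` is whatever function returns that pair for a URL. For simplicity the sketch indexes every word rather than only the "important" ones.

```python
from collections import deque

# A made-up three-page web, for illustration only: URL -> (text, links).
WEB = {
    "http://example.com/a": ("pigs and roosters",
                             ["http://example.com/b", "http://example.com/c"]),
    "http://example.com/b": ("a story about pigs",
                             ["http://example.com/a"]),
    "http://example.com/c": ("rooster pictures",
                             ["http://example.com/b", "http://example.com/missing"]),
}


def crawl(start_urls, fetch):
    to_try = deque(start_urls)  # "Try" on the slide
    seen = set(start_urls)      # "Seen" on the slide
    index = {}                  # word -> set of URLs containing it
    while to_try:
        page = to_try.popleft()
        result = fetch(page)
        if result is None:      # the page could not be fetched; skip it
            continue
        text, links = result
        for word in text.split():
            index.setdefault(word, set()).add(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                to_try.append(link)
    return index


index = crawl(["http://example.com/a"], WEB.get)
print(sorted(index["pigs"]))
```

The Seen set is what stops the loop from running forever when pages link back to each other, as pages a and b do here.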
Other ways to find important words and important documents
• A Document is important if many other documents point to it
• A word is important in document D if that word occurs frequently in documents that link to document D.
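The first idea, that a document is important if many others point to it, can be sketched by counting incoming links. This is only raw link counting on a made-up link table, not the ranking method of any particular search engine; `inlink_counts` is a name chosen for this example.

```python
from collections import Counter


def inlink_counts(links):
    """Count how many different pages point at each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore a page linking to itself
                counts[target] += 1
    return counts


# Made-up link table: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
counts = inlink_counts(links)
print(counts.most_common())
```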
Images
• What will I say when searching for an image?
  – I want a rooster picture
  – Draw a picture of a rooster?
Search by picture?
?
Is this possible? If so, how?
What’s in a picture?
• Computers don’t understand the contents of images
• To a computer an image is a bunch of colored pixels
I want a picture of a rooster
• Label all of the pictures
• How does Google Images do it?
  – File name of the picture “rooster-crossingSt.jpg”
  – Words around the picture in the HTML
• Use “Safe Search” and set filters appropriately (http://www.youtube.com/watch?v=maWx-ApkBCs)
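A sketch of labeling a picture with words taken from its file name and from the text near it. This only illustrates the two clues named on the slide; the function `image_labels` and the nearby sentence are made up, and it is not a description of how any real image search engine is implemented.

```python
import re


def image_labels(file_name, nearby_text=""):
    """Collect lower-case words from the file name and the nearby text."""
    stem = file_name.rsplit(".", 1)[0]  # drop the ".jpg" extension
    # Put a space where a lower-case letter meets a capital: crossingSt -> crossing St
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", stem)
    words = re.findall(r"[A-Za-z]+", spaced + " " + nearby_text)
    return {w.lower() for w in words}


labels = image_labels("rooster-crossingSt.jpg", "A rooster crossing the street")
print(sorted(labels))
```

Once every picture has labels, the pictures can go into the same kind of word index used for text.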
Audio
• Talking
  – Use speech recognition to convert audio to text
  – With each recognized word keep track of where in the audio it was recognized.
• Build an index using the recognized text
  – Normalize based on how words sound rather than how they are spelled.
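A sketch of an audio index: each recognized word maps to the times at which it was heard. The speech recognizer itself is not shown; the (word, seconds) pairs below stand in for its output and are made up. Normalizing by sound is left out, so this sketch only lower-cases the words.

```python
def build_audio_index(recognized):
    """Map each recognized word to the list of times (in seconds) it occurs."""
    index = {}
    for word, seconds in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index


# Made-up recognizer output: (word, time in seconds from the start).
recognized = [
    ("The", 3.0), ("pigs", 3.4), ("ran", 3.9),
    ("the", 41.2), ("Pigs", 41.5), ("slept", 42.0),
]
audio_index = build_audio_index(recognized)
print(audio_index["pigs"])
```

Looking up a word now gives places to jump to in the recording instead of a list of documents.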
Video
• Where in “Casablanca” does Bogart say “Play it again Sam”?
  – He never does, he just says “play it”
• How can the computer find that?
  – Transcribe the audio
  – Speech recognition on the audio
Video
• Does Woody ever kiss Bo Peep?
• Exactly what color is a kiss?
Video
• Does Woody ever kiss Bo Peep?
• Annotate every frame with who is in the frame and search for frames with both Woody and Bo Peep.
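A sketch of searching frame annotations for frames that contain both characters. The frame numbers and annotations are made up, and `frames_with_all` is a name chosen for this example.

```python
def frames_with_all(annotations, names):
    """Return the frame numbers whose annotation includes every name."""
    wanted = set(names)
    return [frame for frame, present in sorted(annotations.items())
            if wanted <= present]


# Made-up annotations: frame number -> who is in the frame.
annotations = {
    0: {"Woody"},
    1: {"Woody", "Bo Peep"},
    2: {"Buzz"},
    3: {"Bo Peep", "Woody", "Buzz"},
}
both = frames_with_all(annotations, ["Woody", "Bo Peep"])
print(both)
```

This finds frames where both appear; it cannot say whether they kiss. A person still has to look at the candidate frames, or the annotation has to record actions as well as names.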
So what’s with this?
Or this?
Is Woody cheating?
Search
• Records
  – Queries
    • < > = And Or
• Text
  – Normalized words (case, stemming, thesaurus)
• Images
  – Add words
• Audio
  – Transcribe or recognize as words
• Video
  – Transcribe
  – Annotate
“Re-Search” Directions in Image Recognition, Search and Retrieval
From R. Szeliski, Computer Vision: Algorithms and Applications, Course Notes CSE 576, U. Washington
Face Detection – Viola & Jones
Face Detection in Commercial Digital Cameras
Train on
  – 1000’s of faces
  – Millions of non-faces
Face Recognition (Eigenfaces [Turk and Pentland 1991])
[Figure: an N × N image written out as a list of N² pixel values]
Project image into higher-dimensional space
“Recognize” by grouping unknown image with closest training example
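The "closest training example" step can be sketched as a nearest-neighbor search over vectors of pixel values. This shows only that last step: the Eigenfaces method compares images after a principal component analysis projection, which is skipped here. The four-pixel "images" and the names are made up.

```python
import math


def nearest_label(unknown, training):
    """Return the label of the training vector closest to the unknown one."""
    best_label, best_dist = None, math.inf
    for label, vector in training:
        dist = math.dist(unknown, vector)  # straight-line distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Made-up training "images", each unrolled into a short list of pixel values.
training = [
    ("alice", [10, 70, 250, 60]),
    ("bob", [200, 40, 120, 50]),
]
print(nearest_label([12, 60, 240, 70], training))
```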
Face Recognition (Picasa - Google)
• Image search/organization
• Automatically finds, crops and groups images of the same person from a collection of photos
• Allows user feedback (trainable) - user can indicate if it found the wrong person.
Create visual “words” from image features.
Face/Object Recognition/Search: Feature-Based Technology
[Figure labels: Object; Extract Features; Bag of “words”*]
*Li Fei-Fei (Princeton)
Do this for multiple objects
Face/Object Recognition/Search: Feature-Based Technology
From R. Szeliski, Computer Vision Algorithms and Applications, p. 605
How to get matching images/documents?:
Use “word” frequencies = n_id / n_d, where
  n_id = # times word i occurs in document d
  n_d = total # words in document d
Then combine word frequency with inverse document frequency weighting to downweight words that occur frequently (D = # of occurrences; A = average # of occurrences)
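A sketch of combining word frequency with inverse document frequency. The word frequency is n_id / n_d as defined above. For the inverse document frequency the sketch assumes the common textbook form log(N / N_i), where N is the number of documents and N_i is the number of documents containing word i; the slide does not spell this form out. The visual "words" are shown as plain strings, and the three "images" are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    n_docs = len(documents)
    doc_freq = Counter()  # N_i: number of documents containing word i
    for words in documents:
        doc_freq.update(set(words))
    weights = []
    for words in documents:
        n_d = len(words)  # total number of words in this document
        weights.append({
            word: (n_id / n_d) * math.log(n_docs / doc_freq[word])
            for word, n_id in Counter(words).items()
        })
    return weights


# Made-up bags of visual words for three images.
documents = [
    ["eye", "nose", "eye", "wheel"],
    ["wheel", "door", "wheel"],
    ["eye", "wheel", "mouth"],
]
weights = tf_idf(documents)
print(weights[0])
```

Here "wheel" occurs in every document, so its weight drops to zero: a word that is everywhere does not help tell documents apart.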
Face/Object Recognition/Search: Bag of Words
Drop word features through a “vocabulary tree” to classify
Face/Object Recognition/Search: Feature-Based Technology