Lecture #32

WWW Search

Review: Data Organization

• Kinds of things to organize
  – Menu items
  – Text
  – Images
  – Sound
  – Videos
  – Records (e.g., a person’s name, address, & phone number, or a car’s year, make, & model)

Review: Data Organization

• Three ways to find things:
  – Lists (in-order search, binary search)
  – Trees (balance number of branches with time to decide which is correct branch)
  – Search

WWW Search

Search issues

• How do we say what we want?
  – I want a story about pigs
  – I want a picture of a rooster
  – How many televisions were sold in Vietnam during 2000?
  – Find a movie like this one

• How does the computer find what we said?

Things to search for

• Records

• Text

• Images

• Audio

• Video

Records

• Car
  – Price
  – Miles
  – Year
  – Make
  – Doors

• Queries
  – Price < 6000 & Miles < 100000
  – Make == Toyota & Year > 1993

Queries

• Make == Toyota & Year >1993

Queries

• Year >1993 or Price < $3,000
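The three queries from these slides can be tried out on a handful of records. This is a minimal sketch: the field names follow the Car slide, while the record values and the `select` helper are made up for illustration.

```python
# Hypothetical car records; field names follow the slide, values are invented.
cars = [
    {"make": "Toyota", "year": 1995, "price": 5500, "miles": 90000, "doors": 4},
    {"make": "Toyota", "year": 1990, "price": 2500, "miles": 150000, "doors": 2},
    {"make": "Ford", "year": 1998, "price": 7000, "miles": 60000, "doors": 4},
    {"make": "Honda", "year": 1992, "price": 4000, "miles": 120000, "doors": 4},
]

def select(records, predicate):
    """Return every record for which the query predicate is true."""
    return [r for r in records if predicate(r)]

# Price < 6000 & Miles < 100000
cheap_low_miles = select(cars, lambda c: c["price"] < 6000 and c["miles"] < 100000)
# Make == Toyota & Year > 1993
newer_toyotas = select(cars, lambda c: c["make"] == "Toyota" and c["year"] > 1993)
# Year > 1993 or Price < $3,000
newer_or_cheap = select(cars, lambda c: c["year"] > 1993 or c["price"] < 3000)
```

"And" narrows the result (a record must pass both tests), while "or" widens it (passing either test is enough).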

Databases

• Large collections of records

• Accessed by queries

Things to search for

• Records

• Text

• Images

• Audio

• Video

Text searching

• How do I say what I want?
  – Type some phrase
    • I want a story about pigs

• How will the computer match this?
  – What is text?
    • An array of characters
  – What can a computer do with text?
    • Match characters

Text searching

• People think in words not characters

• How do I convert an array of characters into an array of words?
  – Collect together sequences of letters
  – How do I know if character C is a letter?
    • (C >= “a” & C <= “z”) | (C >= “A” & C <= “Z”)
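The letter test and the "collect sequences of letters" step might look like this in Python. It is a sketch only; the function names `is_letter` and `to_words` are invented here.

```python
def is_letter(c):
    """The slide's test: c lies between 'a' and 'z' or between 'A' and 'Z'."""
    return ("a" <= c <= "z") or ("A" <= c <= "Z")

def to_words(text):
    """Collect runs of consecutive letters into a list of words."""
    words = []
    current = ""
    for c in text:
        if is_letter(c):
            current += c              # still inside a word
        elif current:
            words.append(current)     # a non-letter ends the current word
            current = ""
    if current:
        words.append(current)         # the text ended in the middle of a word
    return words

words = to_words("I want a story about pigs.")
```

Spaces and punctuation never reach the output; they only mark where one word stops and the next begins.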

Convert to words

• Because people think in words

Every document is an array of words


• How will I find the right documents?
  – Find all documents that have the word “pigs”

Searching text

• How will I find pigs fast?
  – Create an index of all words
    • With each word store the name or address of each document that contains that word
  – Search the index for “pigs”
    • Return the list of documents
    • Use a binary search on the word list (50,000 words)
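A small sketch of such an index, assuming documents are plain strings keyed by made-up file names. The word list is kept sorted so that Python's standard `bisect_left` can binary-search it; with 50,000 words that is at most about 16 comparisons.

```python
from bisect import bisect_left

def build_index(documents):
    """Return a sorted word list and, for each word, the documents containing it."""
    postings = {}
    for name, text in documents.items():
        for word in set(text.split()):
            postings.setdefault(word, set()).add(name)
    words = sorted(postings)                          # sorted for binary search
    doc_lists = [sorted(postings[w]) for w in words]  # parallel to the word list
    return words, doc_lists

def search(words, doc_lists, query):
    """Binary-search the word list and return the matching document list."""
    i = bisect_left(words, query)
    if i < len(words) and words[i] == query:
        return doc_lists[i]
    return []

# Hypothetical documents for illustration.
docs = {
    "farm.txt": "three little pigs built houses",
    "zoo.txt": "lions and tigers",
    "market.txt": "pigs and chickens for sale",
}
words, doc_lists = build_index(docs)
hits = search(words, doc_lists, "pigs")
```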

Problems

• What if a document has the word “Pig” but not “pigs”?

• Normalize
  – Case - make all words lower case
    • Pig -> pig
  – Stemming - remove all suffixes and prefixes before putting a word into the index
    • pigs -> pig
    • piggy -> pig
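The two normalization steps can be sketched as below. The stemming rules are toy rules chosen only to reproduce the slide's three examples; a real stemmer uses a much larger rule set and will not necessarily agree with them.

```python
def normalize(word):
    """Lowercase a word, then apply toy stemming rules (illustrative only)."""
    word = word.lower()
    for suffix in ("s", "y"):                  # toy suffix list, an assumption
        if word.endswith(suffix) and len(word) > 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left by stripping: "pigg" -> "pig"
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

normalized = [normalize(w) for w in ["Pig", "pigs", "piggy"]]
```

The same function has to be applied both when a word goes into the index and when a query word is looked up, otherwise the two will not match.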

Problems

• I want a story about pigs?
  – How does the computer know to search for pigs?
    • It doesn’t
  – How does the computer know what a story is?
    • It doesn’t

Searching


• Pick out the important words and search for them
  – Which words are important?
  – D = number of times a word appears in a document
  – A = average number of times a word appears in all documents
  – Importance = D/A
    • Why?
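One way to see why D/A works is to compute it for a common word and a topical word. The sketch below reads A as the average count per document; the three tiny documents are made up.

```python
def importance(word, document, collection):
    """D / A: count in this document over the average count per document."""
    d = document.count(word)
    a = sum(doc.count(word) for doc in collection) / len(collection)
    return d / a if a else 0.0

# Hypothetical collection; each document is already an array of words.
collection = [
    "the pigs ate the corn and the pigs slept".split(),
    "the farmer sold the corn".split(),
    "the weather was the topic of the day".split(),
]
doc = collection[0]
score_the = importance("the", doc, collection)
score_pigs = importance("pigs", doc, collection)
```

"the" appears often everywhere, so D is close to A and its score stays near 1. "pigs" appears only in the first document, so its D is far above A and it scores higher there.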

How do we create an index of all documents on the Web?

• Try = a list of URLs
• Seen = all URLs you have seen

While (Try is not empty) {
    Page = take a URL from Try
    Words = all the “important” words in Page
    add Page to the index using all of Words
    Links = all URLs in Page
    for every Link that is not in Seen
        add Link to Try and to Seen
}
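The loop above can be run against a tiny in-memory stand-in for the Web. In this sketch the URLs, page contents, and the `fetch` callback are all invented, and every word on a page is treated as "important" to keep the example short.

```python
from collections import deque

# Made-up Web: URL -> (page text, links found on the page).
FAKE_WEB = {
    "a.example": ("pigs on the farm", ["b.example", "c.example"]),
    "b.example": ("roosters and pigs", ["a.example"]),
    "c.example": ("cars for sale", ["b.example", "d.example"]),
    "d.example": ("", []),
}

def crawl(start_urls, fetch):
    """Follow links from start_urls, indexing each reachable page exactly once."""
    to_try = deque(start_urls)
    seen = set(start_urls)
    index = {}                                  # word -> set of URLs
    while to_try:
        page = to_try.popleft()
        text, links = fetch(page)
        for word in text.split():               # stand-in for "important" words
            index.setdefault(word, set()).add(page)
        for link in links:
            if link not in seen:                # never queue a URL twice
                seen.add(link)
                to_try.append(link)
    return index

index = crawl(["a.example"], lambda url: FAKE_WEB[url])
```

Seen is what keeps the loop from running forever: a.example and b.example link to each other, but each is fetched only once.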

Other ways to find important words and important documents

• A Document is important if many other documents point to it

• A word is important in document D if that word occurs frequently in documents that link to document D.
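The first idea, in its simplest reading, is a count of inbound links. The sketch below does only that plain count on a made-up link table; it does not weight links by the importance of the page they come from.

```python
def inbound_counts(links):
    """Count how many other documents point to each document."""
    counts = {doc: 0 for doc in links}
    for source, targets in links.items():
        for target in set(targets):             # count each linking page once
            if target != source:                # ignore links to itself
                counts[target] = counts.get(target, 0) + 1
    return counts

# Hypothetical link table: document -> documents it points to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "d.example": ["c.example", "a.example"],
}
counts = inbound_counts(links)
most_important = max(counts, key=counts.get)
```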

Images

• What will I say when searching for an image?
  – I want a rooster picture
  – Draw a picture of a rooster?

Search by picture?

?

Is this possible? If so, how?

What’s in a picture?

• Computers don’t understand the contents of images

• To a computer an image is a bunch of colored pixels

I want a picture of a rooster

• Label all of the pictures

• How does Google Images do it?
  – File name of the picture “rooster-crossingSt.jpg”
  – Words around the picture in the HTML

• Use “Safe Search” and set filters appropriately (http://www.youtube.com/watch?v=maWx-ApkBCs)
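Labeling a picture from its file name and the nearby words can be sketched as follows. This only illustrates the idea on the slide and makes no claim about how any real image search engine does it; the function name and the nearby text are made up.

```python
import re

def image_labels(file_name, nearby_text):
    """Collect lowercase label words from an image file name and nearby page text."""
    stem = file_name.rsplit(".", 1)[0]                  # drop the extension
    # Split camelCase ("crossingSt" -> "crossing St") before pulling out letter runs.
    stem = re.sub(r"([a-z])([A-Z])", r"\1 \2", stem)
    words = re.findall(r"[A-Za-z]+", stem + " " + nearby_text)
    return {w.lower() for w in words}

labels = image_labels("rooster-crossingSt.jpg", "Why did the rooster cross the road?")
```

Once a picture has labels, it can go into the same word index used for text.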



Audio

• Talking
  – Use speech recognition to convert audio to text
  – With each recognized word keep track of where in the audio it was recognized.

• Build an index using the recognized text
  – Normalize based on how words sound rather than how they are spelled.
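Keeping track of where each word was recognized amounts to an index from words to time offsets. The sketch assumes a speech recognizer has already produced (time, word) pairs; that output is made up here, and sound-based normalization is left out.

```python
def build_audio_index(recognized):
    """Map each recognized word to the times (in seconds) where it was heard."""
    index = {}
    for seconds, word in recognized:
        index.setdefault(word.lower(), []).append(seconds)
    return index

# Hypothetical recognizer output: (time offset in seconds, recognized word).
recognized = [(0.0, "Play"), (0.4, "it"), (0.7, "Sam"), (5.2, "play"), (5.6, "it")]
audio_index = build_audio_index(recognized)
play_times = audio_index["play"]
```

A search result can then jump straight to the moment in the recording where the word was spoken.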

Video

• Where in “Casablanca” does Bogart say “Play it again Sam”?
  – He never does; he just says “play it”

• How can the computer find that?
  – Transcribe the audio
  – Speech recognition on the audio

Video

• Does Woody ever kiss Bo Peep?

• Exactly what color is a kiss?

Video

• Does Woody ever kiss Bo Peep?

• Annotate every frame with who is in the frame and search for frames with both Woody and Bo Peep.
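With per-frame annotations in hand, the search is a set test on each frame. The frame numbers and annotations below are made up for illustration.

```python
# Hypothetical annotations: frame number -> characters visible in that frame.
frames = {
    100: {"Woody", "Buzz"},
    101: {"Woody", "Bo Peep"},
    102: {"Woody", "Bo Peep"},
    103: {"Bo Peep"},
}

def frames_with(frames, *characters):
    """Return the sorted frame numbers in which all the given characters appear."""
    wanted = set(characters)
    return sorted(n for n, present in frames.items() if wanted <= present)

together = frames_with(frames, "Woody", "Bo Peep")
```

This only narrows the search to frames where both appear; it does not say what they are doing in those frames.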

So what’s with this?

Or this?

Is Woody cheating?

Search

• Records
  – Queries
    • < > = And Or

• Text
  – Normalized words (case, stemming, thesaurus)

• Images
  – Add words

• Audio
  – Transcribe or recognize as words

• Video
  – Transcribe
  – Annotate

“Re-Search” Directions in Image Recognition, Search and Retrieval

From R. Szeliski, Computer Vision: Algorithms and Applications, Course Notes CSE 576, U. Washington

Face Detection – Viola & Jones

Face Detection in Commercial Digital Cameras

Train on
– 1000’s of faces
– Millions of non-faces

Face Recognition (Eigenfaces [Turk and Pentland 1991])

[Figure: an N × N image represented as a vector of N² pixel values]

Treat each image as a point in a high-dimensional space (one dimension per pixel), then project it onto a small set of eigenfaces

“Recognize” by grouping unknown image with closest training example
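The "closest training example" step is a nearest-neighbor search. The sketch below compares raw pixel vectors directly and skips the eigenface projection step, so it shows only the grouping idea; the names and pixel values are made up.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(unknown, training):
    """Return the label of the training vector closest to the unknown vector."""
    return min(training, key=lambda item: distance(unknown, item[1]))[0]

# Hypothetical 2 x 2 "images", each flattened to 4 pixel values.
training = [
    ("alice", [0, 70, 250, 60]),
    ("bob", [200, 40, 120, 50]),
]
who = recognize([10, 75, 240, 65], training)
```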

Face Recognition (Picasa - Google)

• Image search/organization

• Automatically finds, crops and groups images of the same person from a collection of photos

• Allows user feedback (trainable) - user can indicate if it found the wrong person.


Create visual “words” from image features.

Face/Object Recognition/Search: Feature-Based Technology

[Figure labels: Object, Extract Features, Bag of “words”*]

*Li Fei-Fei (Princeton)


Do this for multiple objects



From R. Szeliski, Computer Vision: Algorithms and Applications, p. 605

How to get matching images/documents?:

Use “word” frequencies = $n_{id} / n_d$, where
– $n_{id}$ = # times word i occurs in document d
– $n_d$ = total # words in document d

Then combine word frequency with inverse document frequency weighting to downweight words that occur frequently (D = # of occurrences; A = average # of occurrences)
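A sketch of the combined weighting, using the common form: word frequency times log(N / N_i), where N is the number of documents and N_i is the number containing word i. That particular idf form is an assumption here, and the visual "words" below are made up.

```python
import math

def tf_idf(word, doc, docs):
    """Word frequency n_id / n_d times log(N / N_i), a common tf-idf form."""
    tf = doc.count(word) / len(doc)
    containing = sum(1 for d in docs if word in d)   # N_i
    if containing == 0:
        return 0.0
    return tf * math.log(len(docs) / containing)

# Hypothetical bags of visual "words" for three images.
docs = [
    ["beak", "feather", "beak", "edge"],
    ["wheel", "edge", "window"],
    ["eye", "nose", "edge"],
]
weight_edge = tf_idf("edge", docs[0], docs)
weight_beak = tf_idf("beak", docs[0], docs)
```

"edge" occurs in every image, so its weight drops to zero; "beak" occurs in only one, so it is what distinguishes that image.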

Face/Object Recognition/Search: Bag of Words


Drop word features through a “vocabulary tree” to classify


