searching on the www the google phenomena snyder p119-141

23
Searching on the WWW The Google Phenomena Snyder p119-141

Post on 19-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Searching on the WWWThe Google Phenomena

Snyder p119-141

Searching The best place to

look for something is where it’s likely to be found

Key to finding information.

Searching A lot of information

can be found on the WWW

But, that is the flaw of the WWW:

Too much Information No organization No structure

Organizing Information Classification Hierarchy Categories with

sub-categories What are some

problems with Hierarchies?

Problem With Hierarchies I want to find the

requirements for the Minor in IS

Siena College

Academics Financial Aid Athletics

Degree Requirements

Forms

Schools

Science

CS

Arts

IS Minor

Webs or Networks Multiple Paths

Siena College

Academics Financial Aid Athletics

Degree Requirements

Forms

Schools

Science

CS

Arts

IS Minor

Problem With Web Networks Less Important to Get

things in the correct category

Information architects don’t worry too much about

Classification Organization

Problem With Web Networks As different Networks of

Information are connected Excessive redundant links

emerge Different organization

strategies class A mess is created

Search Engines to the Rescue Alternative to searching via navigation http://www.marketwatch.com/newsimages/mi

sc/search_engines_timeline.pdf

How Search Engines Work

1. Web Crawling – program (robot) surfs from hyper-link to hyper-link accumulating pages

2. Web Indexed – each accumulated page is added to a database.

URL of web page is stored Each word, occurrences, and sometimes

position are stored.

How Search Engines Work

3. User Search – actually searches the index database

4. Sophisticated Algorithms are used to retrieve and rank pages that “match” the user search.

Step 1: Web CrawlingHardest task.5 - 10 million new web pages are added

to the Internet every day.Robots need to know where to start

lookingYou need the help and cooperation of

web page creators.

Step 2: Web IndexingEach URL consists of a list of words... URL1 word5 word74 word195 word456 URL2 word7 word82 word135 URL3 word5 word74 word165 word288 URL4 word21 word59 URL5 word25 word74 word188 word432 URL6 word7 word186 word430 URL7 word2 word398 URL8 word34 word39 word84 word193 ...

Step 2: Web IndexingInverted Index: Each word consists of a list of URLs word1 URL19 URL39 URL82 URL91 word2 URL27 URL41 URL66 URL67 word3 URL49 URL75 URL65 word4 URL29 URL89 word5 URL12 URL48 URL66 word6 URL53 URL73 URL123 URL144 word7 URL3 URL41 URL77 ...

Step 3: User Search Searching the index database must be quick. The database is sorted by key words (primary

index) The English language has about 600,000

words Luckily, only about a tenth them are widely

used The database server needs to store the

primary index in memory (RAM).

Step 4: Ranking the results Searches on common words can return

millions of pages. Ordering or ranking becomes more important

as the data increases Intuitive measures

Number of occurances of search words Search word in title, keyword, etc “Importance” of web page User feedback.

Search Engine Issues Logical statements AND, OR, etc. Phrases “Grilled Cheese” Images – Dali Example Dishonesty – XXX Example Differences in Vocabulary - IBM-Issue

Search Engines (Catch 22) Search Engine Companies make money by

placing ads. More searches = bigger audience =

more $$$ from ads

Best Thing: Get as many people to use your search engine as possible

Worst Thing: What if everyone exclusively uses search engines to search the WWW?

How Google became the bestPageRank algorithm

(based on the Clever Algorithm)PageRank is a measure of importance.

Links from important pages improves your PageRank

1.2 1.6 2.4 3.1 2.0 4.6

3.8 12.1

PageRank continued http://www.iprcom.com/papers/pagerank/

Simplistic Explanation: Initially all pages have the same

PageRank An iterative process increases the

page rank of all pages based on direct links first (highly weighted

*1) then, one hop links then, two hop links ... then, ten hop links (very low

weighting *0.001) ...

The algorithm ignores cycles

The algorithm does not reward cliques

Eventually, the page ranks will stabilize (stop increasing) once you’ve considered

Until the page ranks stablize

PageRank intuition ESPN.com is highly ranked because

Several other highly ranked pages point to it Millions of low ranked pages point to it

Any page connected-to or part-of ESPN.com will benefit from this.

Intuitively, ESPN is an information authority on sports.

PageRank intuition Breimer.net is poorly ranked because

Very few pages point to it. None to be exact.

The page is not an authority on “Breimer” until other pages acknowledge its existence via a link.