searching on the www the google phenomena snyder p119-141
Post on 19-Dec-2015
217 views
TRANSCRIPT
Searching The best place to
look for something is where it’s likely to be found
Key to finding information.
Searching A lot of information
can be found on the WWW
But, that is the flaw of the WWW:
Too much Information No organization No structure
Organizing Information Classification Hierarchy Categories with
sub-categories What are some
problems with Hierarchies?
Problem With Hierarchies I want to find the
requirements for the Minor in IS
Siena College
Academics Financial Aid Athletics
Degree Requirements
Forms
Schools
Science
CS
Arts
IS Minor
Webs or Networks Multiple Paths
Siena College
Academics Financial Aid Athletics
Degree Requirements
Forms
Schools
Science
CS
Arts
IS Minor
Problem With Web Networks Less Important to Get
things in the correct category
Information architects don’t worry too much about
Classification Organization
Problem With Web Networks As different Networks of
Information are connected Excessive redundant links
emerge Different organization
strategies class A mess is created
Search Engines to the Rescue Alternative to searching via navigation http://www.marketwatch.com/newsimages/mi
sc/search_engines_timeline.pdf
How Search Engines Work
1. Web Crawling – program (robot) surfs from hyper-link to hyper-link accumulating pages
2. Web Indexed – each accumulated page is added to a database.
URL of web page is stored Each word, occurrences, and sometimes
position are stored.
How Search Engines Work
3. User Search – actually searches the index database
4. Sophisticated Algorithms are used to retrieve and rank pages that “match” the user search.
Step 1: Web CrawlingHardest task.5 - 10 million new web pages are added
to the Internet every day.Robots need to know where to start
lookingYou need the help and cooperation of
web page creators.
Step 2: Web IndexingEach URL consists of a list of words... URL1 word5 word74 word195 word456 URL2 word7 word82 word135 URL3 word5 word74 word165 word288 URL4 word21 word59 URL5 word25 word74 word188 word432 URL6 word7 word186 word430 URL7 word2 word398 URL8 word34 word39 word84 word193 ...
Step 2: Web IndexingInverted Index: Each word consists of a list of URLs word1 URL19 URL39 URL82 URL91 word2 URL27 URL41 URL66 URL67 word3 URL49 URL75 URL65 word4 URL29 URL89 word5 URL12 URL48 URL66 word6 URL53 URL73 URL123 URL144 word7 URL3 URL41 URL77 ...
Step 3: User Search Searching the index database must be quick. The database is sorted by key words (primary
index) The English language has about 600,000
words Luckily, only about a tenth them are widely
used The database server needs to store the
primary index in memory (RAM).
Step 4: Ranking the results Searches on common words can return
millions of pages. Ordering or ranking becomes more important
as the data increases Intuitive measures
Number of occurances of search words Search word in title, keyword, etc “Importance” of web page User feedback.
Search Engine Issues Logical statements AND, OR, etc. Phrases “Grilled Cheese” Images – Dali Example Dishonesty – XXX Example Differences in Vocabulary - IBM-Issue
Search Engines (Catch 22) Search Engine Companies make money by
placing ads. More searches = bigger audience =
more $$$ from ads
Best Thing: Get as many people to use your search engine as possible
Worst Thing: What if everyone exclusively uses search engines to search the WWW?
How Google became the bestPageRank algorithm
(based on the Clever Algorithm)PageRank is a measure of importance.
Links from important pages improves your PageRank
1.2 1.6 2.4 3.1 2.0 4.6
3.8 12.1
PageRank continued http://www.iprcom.com/papers/pagerank/
Simplistic Explanation: Initially all pages have the same
PageRank An iterative process increases the
page rank of all pages based on direct links first (highly weighted
*1) then, one hop links then, two hop links ... then, ten hop links (very low
weighting *0.001) ...
The algorithm ignores cycles
The algorithm does not reward cliques
Eventually, the page ranks will stabilize (stop increasing) once you’ve considered
Until the page ranks stablize
PageRank intuition ESPN.com is highly ranked because
Several other highly ranked pages point to it Millions of low ranked pages point to it
Any page connected-to or part-of ESPN.com will benefit from this.
Intuitively, ESPN is an information authority on sports.