
Web Search Engines and Information Retrieval on the World-Wide Web

Torsten Suel

CIS Department

[email protected]

http://cis.poly.edu/suel

Overview:

• introduction and motivation

• research: improving cluster-based search engines

• research: future peer-to-peer search engine architectures

Web search engines:

1. Introduction and Motivation

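Basic structure of a search engine:

[Figure: the crawler fetches pages and stores them on disks; indexing builds the index; a query such as “computer” entered at Search.com is answered by a look up in the index]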
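The indexing and look-up steps in this structure can be sketched with a tiny inverted index. This is only an illustration under simple assumptions: the pages are held in memory rather than on disk, terms are split on whitespace, and the names `build_index`, `look_up`, and the example URLs are made up for this sketch.

```python
from collections import defaultdict


def build_index(pages):
    """Indexing step: map each term to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def look_up(index, query):
    """Look-up step: return the pages that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical pages standing in for what a crawler stored on disk.
pages = {
    "a.example/hw": "computer hardware reviews",
    "b.example/cs": "computer science courses",
    "c.example/food": "cooking recipes",
}
index = build_index(pages)
print(look_up(index, "computer"))
print(look_up(index, "computer science"))
```

The query touches only the posting sets of its own terms, which is why the index is built once ahead of time instead of scanning the stored pages for every query.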

1. Introduction and Motivation (cont.)

Challenges for search engines:

• coverage (need to cover large part of the web)

  - need to crawl and store massive data sets

• good ranking (in the case of broad queries)

  - smart information retrieval techniques

• freshness (need to update content)

  - frequent recrawling of content

• user load (up to 10000 queries/sec - Google)

  - many queries on massive data

• manipulation (sites want to be listed first)

  - most techniques will be exploited quickly


• more than 3 billion web pages and 10 million web sites

• need to crawl, store, and process terabytes of data

• 10000 queries / second (Google)

• cluster of more than 5000 Linux servers (Google)

• “planetary-scale web service” (Google, Hotmail, Yahoo, AOL web caches, Akamai)

• proprietary code and secret recipes


Other types of web search tools

• Web directories (yahoo, open directory project)

• Specialized search engines (cora, citeseer, achoo, findlaw)

• Local search engines (for one site)

• Meta search engines (dogpile, mamma, search.com)

• Personal search assistants (alexa, google toolbar)

• Image search (ditto, visoo)

• Database search (completeplanet, brightplanet)


http://www.yahoo.com/

http://dmoz.org/

http://citeseer.nj.nec.com/

http://www.achoo.com/

http://www.findlaw.com/

http://www.dogpile.com/

http://www.mamma.com/

http://www.search.com/

http://www.alexa.com/

Data collection, extraction & mining tools

• trademark and copyright enforcement

  - track down mp3 and video files

  - track down images with logos (Cobion)

• comparison shopping and auction bots

• competitive intelligence

• national security: monitoring certain websites

• Example: Whizbang job database:

  - collects job announcements on company web sites

  - focused crawling to track down job announcements

  - sorts job announcements by type, locations, etc.
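Focused crawling, as used in the job database example, can be sketched as a crawl whose frontier is a priority queue ordered by estimated relevance. The sketch below is not Whizbang's method; it assumes a toy in-memory web, a hypothetical keyword list, and a simple heuristic in which each link inherits the relevance score of the page it was found on.

```python
import heapq


def relevance(text, keywords):
    """Fraction of the topic keywords that occur in the page text."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def focused_crawl(web, seeds, keywords, max_pages=10):
    """Visit pages in order of the relevance of the page that linked to them."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web.get(url, ("", []))
        score = relevance(text, keywords)
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Toy web (hypothetical): url -> (page text, outgoing links).
web = {
    "corp.example/": ("welcome to our company",
                      ["corp.example/careers", "corp.example/products"]),
    "corp.example/careers": ("job openings careers apply",
                             ["corp.example/jobs/engineer"]),
    "corp.example/products": ("our products and prices",
                              ["corp.example/products/widget"]),
    "corp.example/jobs/engineer": ("job announcement software engineer apply now", []),
    "corp.example/products/widget": ("widget price list", []),
}
order = focused_crawl(web, ["corp.example/"], ["job", "careers", "apply"], max_pages=4)
print(order)
```

With a budget of four pages, the job announcement found through the careers page is fetched before the product pages, even though the products link was discovered earlier.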


[Figure labels: algorithms, systems, information retrieval, databases, machine learning, natural language processing, AI]


2. Cluster-Based Search Engines

Research Challenges:

• efficiency and scaling with query load

  - per-node performance

  - scaling cluster size

• data size and scaling with the web

  - data acquisition: crawling and refresh

  - index size and performance

  - index updates

• better ranking for improved results

  - link-based ranking

  - topic- and context-specific ranking

Polybot crawler: (with Vlad Shkapenyuk)

• scalable web crawler

• runs on cluster of servers

• 300 pages/sec (and beyond)
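To put the crawl rate in context, a short back-of-the-envelope calculation combines the 300 pages/sec figure with the "more than 3 billion web pages" quoted earlier in the slides. It assumes a constant rate and a single pass that fetches each page exactly once; the 30-day refresh target is a hypothetical value chosen only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

PAGES_ON_WEB = 3_000_000_000  # "more than 3 billion web pages" (from the slides)
CRAWL_RATE = 300              # pages per second (from the slides)


def crawl_days(pages, pages_per_second):
    """Days needed for one full pass at a constant crawl rate."""
    return pages / pages_per_second / SECONDS_PER_DAY


def required_rate(pages, days):
    """Pages per second needed to finish one pass within the given days."""
    return pages / (days * SECONDS_PER_DAY)


days = crawl_days(PAGES_ON_WEB, CRAWL_RATE)
rate_for_30_days = required_rate(PAGES_ON_WEB, 30)  # 30 days is an assumed target
print(f"one pass at {CRAWL_RATE} pages/sec: about {days:.0f} days")
print(f"rate needed for a 30-day pass: about {rate_for_30_days:.0f} pages/sec")
```

Under these assumptions a single pass takes roughly 116 days, which is one way to see why the slides pair freshness with crawler scalability.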

Storage and Indexing: (Alex Okulov and Xiaohui Long)

• storing and indexing terabytes on network of workstations

• fast compression techniques for storage

• index performance and index updates

• index partitioning

[Figure: Linux servers with several disks each, connected by a high-speed LAN or SAN]
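One common way to partition an index over a cluster is by document: each node indexes its own subset of pages, and a query is sent to every node and the partial results are merged. The sketch below illustrates that scheme only; the slides do not say which partitioning the project uses, and the class names, the modulo assignment, and the term-frequency ordering are assumptions made for this example.

```python
from collections import defaultdict


class IndexNode:
    """One server: holds the postings for its own subset of documents."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}

    def add(self, doc_id, text):
        for term in text.lower().split():
            counts = self.postings[term]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term):
        return list(self.postings.get(term.lower(), {}).items())


class PartitionedIndex:
    """Document-partitioned index: broadcast the query, merge the answers."""

    def __init__(self, num_nodes):
        self.nodes = [IndexNode() for _ in range(num_nodes)]

    def add(self, doc_id, text):
        # Assumed placement rule: document id modulo the number of nodes.
        self.nodes[doc_id % len(self.nodes)].add(doc_id, text)

    def search(self, term, k=10):
        hits = []
        for node in self.nodes:
            hits.extend(node.search(term))
        hits.sort(key=lambda hit: (-hit[1], hit[0]))  # highest frequency first
        return hits[:k]


cluster = PartitionedIndex(num_nodes=3)
docs = {
    0: "web search index",
    1: "index index updates",
    2: "crawler",
    3: "index partitioning index index",
    4: "compression",
}
for doc_id, text in docs.items():
    cluster.add(doc_id, text)
print(cluster.search("index"))
```

With document partitioning every node handles every query, so per-node performance and cluster size both affect query throughput.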

Link-based ranking (Yenyu Chen and Qingqing Gan)

• PageRank (Brin & Page/Google)

“significance of a page depends on significance of those referencing it”

• improving link-based ranking

• integration of term- and link-based methods
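The PageRank idea, that a page's significance depends on the significance of the pages referencing it, can be sketched as a simple power iteration. This is a minimal textbook-style version, not the production algorithm; the damping factor of 0.85, the fixed iteration count, the uniform treatment of pages without out-links, and the three-page example graph are all assumptions of the sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its significance on to the pages it references.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without out-links spread their rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both reference c, and c references a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example c ranks highest because two pages reference it, a comes next because it is referenced by the highly ranked c, and b, which nothing references, ranks lowest.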

3. Peer-to-peer Search Engine Architectures

Future Search Engines and Search Tools

• expect powerful user interfaces beyond browser

  - browsing assistants

  - search and navigation tools

• many more search engine accesses

• most access programmatic in nature

• idea: split search engine into upper and lower tier

  - lower tier: crawling, indexing, index queries (dumb, big data)

  - upper tier: ranking, interface, analysis (smart stuff)

• idea: lower layer as highly distributed substrate to support search and navigation tools

  - open and agnostic “let a thousand flowers bloom”

  - scalable “let a million queries fly”
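The split into a lower and an upper tier can be sketched as a narrow interface: the lower tier only stores postings and answers raw index queries, while ranking is implemented separately on top of it. The class and function names below are hypothetical, and the two ranking functions (a TF-IDF style score and a Boolean AND) are just examples of different methods sharing the same substrate.

```python
import math
from collections import Counter, defaultdict


class LowerTier:
    """Indexing substrate: stores postings and answers raw index queries."""

    def __init__(self, documents):
        self.num_docs = len(documents)
        self._postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in documents.items():
            for term, tf in Counter(text.lower().split()).items():
                self._postings[term][doc_id] = tf

    def postings(self, term):
        return dict(self._postings.get(term.lower(), {}))


def rank_tf_idf(lower, query, k=3):
    """Upper-tier method 1: score documents by summed tf * idf."""
    scores = defaultdict(float)
    for term in query.lower().split():
        plist = lower.postings(term)
        if not plist:
            continue
        idf = math.log(1 + lower.num_docs / len(plist))
        for doc_id, tf in plist.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def rank_boolean_and(lower, query):
    """Upper-tier method 2: documents that contain every query term."""
    sets = [set(lower.postings(term)) for term in query.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []


documents = {
    1: "peer to peer search",
    2: "search engine search ranking",
    3: "web crawler",
}
lower = LowerTier(documents)
print(rank_tf_idf(lower, "search ranking"))
print(rank_boolean_and(lower, "search ranking"))
```

Both ranking functions use only `postings` and `num_docs`, so a new upper-tier method can be added without touching the lower tier.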

P2P web search architecture:

• thousands of powerful machines all over the internet

• machines can join or leave

• agnostic: can implement many IR methods on top

[Figure: four boxes labeled “search engine”]
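One possible way to let machines join or leave without reshuffling the whole index is consistent hashing, as used in distributed hash tables. The slides do not state which mechanism the proposed architecture uses, so the sketch below is only an assumption: index terms and peers are hashed onto a ring, and each term is owned by the first peer at or after its position. The peer names and the `TermRing` class are hypothetical.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a point on the hash ring (SHA-1 digest as an integer)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class TermRing:
    """Assigns each index term to the first peer at or after its ring position."""

    def __init__(self, peers=()):
        self._points = []   # sorted ring positions of the peers
        self._peer_at = {}  # ring position -> peer name
        for peer in peers:
            self.join(peer)

    def join(self, peer):
        point = ring_position(peer)
        if point not in self._peer_at:
            bisect.insort(self._points, point)
            self._peer_at[point] = peer

    def leave(self, peer):
        point = ring_position(peer)
        if self._peer_at.get(point) == peer:
            self._points.remove(point)
            del self._peer_at[point]

    def owner(self, term):
        if not self._points:
            raise LookupError("no peers on the ring")
        i = bisect.bisect_left(self._points, ring_position(term))
        return self._peer_at[self._points[i % len(self._points)]]


ring = TermRing(["peer-a", "peer-b", "peer-c"])
terms = ["computer", "search", "engine", "crawler", "index", "ranking"]
before = {term: ring.owner(term) for term in terms}
ring.leave("peer-b")
after = {term: ring.owner(term) for term in terms}
moved = [term for term in terms if before[term] != after[term]]
print("terms reassigned after peer-b left:", moved)
```

Only the terms that were owned by the departing peer change hands; every other term keeps its owner, which keeps the cost of a join or leave proportional to one peer's share of the index.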

West Exploration and Search Technology Lab:

• about 10 grad and undergrad students

• more information: http://cis.poly.edu/westlab

• courses on web search, IR, web protocols

Showcase slides at http://cis.poly.edu/showcase/
