

Page 1: Web Search Engines and Information Retrieval on the World-Wide Web Torsten Suel CIS Department suel@poly.edu  Overview: introduction

Web Search Engines and Information Retrieval on the World-Wide Web

Torsten Suel
CIS Department
suel@poly.edu
http://cis.poly.edu/suel

Overview:

• introduction and motivation

• research: improving cluster-based search engines

• research: future peer-to-peer search engine architectures

Page 2:

Web search engines:

1. Introduction and Motivation

Page 3:

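Basic structure of a search engine:

[Figure: a crawler stores fetched pages on disks; indexing builds an index from them; a query such as “computer”, entered at Search.com, is answered by a look up in the index.]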

1. Introduction and Motivation (cont.)
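
The indexing and look up steps of this basic structure can be illustrated with a toy inverted index. This is only a minimal sketch: the `pages` dictionary stands in for crawled pages on disk, and the names `build_index` and `look_up` are made up for illustration, not taken from the slides.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def look_up(index, query):
    """Return the ids of pages containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical pages standing in for what a crawler has stored on disk.
pages = {
    1: "computer science department",
    2: "buy a new computer",
    3: "web search engines",
}
index = build_index(pages)
```

With these three pages, `look_up(index, "computer")` returns the two pages that mention the term. A real engine would add ranking on top of this boolean matching.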

Page 4:

Challenges for search engines:

• coverage (need to cover large part of the web) → need to crawl and store massive data sets

• good ranking (in the case of broad queries) → smart information retrieval techniques

• freshness (need to update content) → frequent recrawling of content

• user load (up to 10000 queries/sec - Google) → many queries on massive data

• manipulation (sites want to be listed first) → most techniques will be exploited quickly

1. Introduction and Motivation (cont.)

Page 5:

• more than 3 billion web pages and 10 million web sites

• need to crawl, store, and process terabytes of data

• 10000 queries / second (Google)

• cluster of more than 5000 Linux servers (Google)

• “planetary-scale web service” (google, hotmail, yahoo, aol web caches, akamai)

• proprietary code and secret recipes

1. Introduction and Motivation (cont.)
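
A quick back-of-the-envelope calculation shows what these figures imply. The page, query, and server counts come from the slide. The average page size is an assumption made only for illustration, and dividing queries evenly across servers ignores that one query may touch many machines.

```python
PAGES = 3_000_000_000        # from the slide
QUERIES_PER_SEC = 10_000     # from the slide (a peak figure)
SERVERS = 5_000              # from the slide
AVG_PAGE_BYTES = 10 * 1024   # ASSUMPTION: about 10 KB per page, not from the slides

# Raw size of the crawled pages under the assumed page size (roughly 28 TB).
raw_terabytes = PAGES * AVG_PAGE_BYTES / 1024**4

# Upper bound on daily queries if the peak rate were sustained all day.
queries_per_day = QUERIES_PER_SEC * 86_400            # 864,000,000

# Naive average, as if each query were handled by exactly one server.
queries_per_server_per_sec = QUERIES_PER_SEC / SERVERS  # 2.0
```

Even under these rough assumptions the raw data lands in the tens of terabytes, which matches the slide's "terabytes of data".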

Page 6:

Other types of web search tools

• Web directories (yahoo, open directory project)

• Specialized search engines (cora, citeseer, achoo, findlaw)

• Local search engines (for one site)

• Meta search engines (dogpile, mamma, search.com)

• Personal search assistants (alexa, google toolbar)

• Image search (ditto, visoo)

• Database search (completeplanet, brightplanet)

1. Introduction and Motivation (cont.)

Page 7:

Data collection, extraction & mining tools

• trademark and copyright enforcement
  - track down mp3 and video files
  - track down images with logos (Cobion)

• comparison shopping and auction bots

• competitive intelligence

• national security: monitoring certain websites

• Example: Whizbang job database:
  - collects job announcements on company web sites
  - focused crawling to track down job announcements
  - sorts job announcements by type, locations, etc.

1. Introduction and Motivation (cont.)

Page 8:

[Figure: related fields: algorithms, systems, information retrieval, databases, machine learning, natural language processing, AI]

1. Introduction and Motivation (cont.)

Page 9:

Research Challenges:

• efficiency and scaling with query load
  - per-node performance
  - scaling cluster size

• data size and scaling with the web
  - data acquisition: crawling and refresh
  - index size and performance
  - index updates

• better ranking for improved results
  - link-based ranking
  - topic- and context-specific ranking

2. Cluster-Based Search Engines

Page 10:

Polybot crawler: (with Vlad Shkapenyuk)

• scalable web crawler

• runs on cluster of servers

• 300 pages/sec (and beyond)
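
One core piece of any such crawler is the URL frontier, which has to avoid fetching a URL twice and avoid hitting one host too often. The sketch below is a generic illustration and is not Polybot's actual design. The class name `Frontier`, the `delay` parameter, and its default value are assumptions, and time is passed in explicitly so that no network access or real clock is needed.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """URL frontier that deduplicates URLs and spaces out requests per host."""

    def __init__(self, delay=30.0):
        self.delay = delay                  # assumed minimum seconds between hits on one host
        self.seen = set()                   # every URL ever added
        self.queues = defaultdict(deque)    # host -> URLs waiting to be fetched
        self.next_ok = defaultdict(float)   # host -> earliest allowed fetch time
        self.ready = []                     # heap of (time, host); one entry per host with pending URLs

    def add(self, url):
        """Queue a URL; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        if not self.queues[host]:
            # Host currently has no heap entry, so schedule it.
            heapq.heappush(self.ready, (self.next_ok[host], host))
        self.queues[host].append(url)
        return True

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        return url
```

Reaching hundreds of pages per second on a cluster would additionally require partitioning hosts across machines and many parallel downloads, which this single-process sketch leaves out.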

Page 11:

Storage and Indexing: (Alex Okulov and Xiaohui Long)

• storing and indexing terabytes on network of workstations

• fast compression techniques for storage

• index performance and index updates

• index partitioning

[Figure: Linux servers with several disks each, connected by a high-speed LAN or SAN]
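
As one example of a fast compression technique for index data, the sketch below stores a sorted list of document ids as gaps and encodes each gap with variable-byte coding. The slides do not say which scheme the project uses, so this is a commonly used technique chosen for illustration. The function names are made up.

```python
def encode_postings(doc_ids):
    """Encode a sorted list of doc ids as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte, low-order group first;
        # a set high bit means more bytes of this gap follow.
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings and return the original doc ids."""
    doc_ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids
```

Small gaps, which dominate for frequent terms, fit in a single byte each, and decoding needs only shifts and masks.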

Page 12:

Link-based ranking (Yenyu Chen and Qingqing Gan)

• Pagerank (Brin & Page/Google): “significance of a page depends on significance of those referencing it”

• improving link-based ranking

• integration of term- and link-based methods
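
The Pagerank idea quoted on this slide can be sketched as a simple power iteration over the link graph. This is a textbook-style illustration, not the project's code. The damping factor of 0.85, the iteration count, and the handling of pages without out-links are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute Pagerank scores; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its significance on to the pages it references.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this small graph, page c ends up with the highest score because three pages reference it, and page a outranks b and d because it is referenced by the highly ranked c.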

Page 13:

Future Search Engines and Search Tools

• expect powerful user interfaces beyond browser
  - browsing assistants
  - search and navigation tools

• many more search engine accesses

• most access programmatic in nature

• idea: split search engine into upper and lower tier
  - lower tier: crawling, indexing, index queries (dumb, big data)
  - upper tier: ranking, interface, analysis (smart stuff)

• idea: lower layer as highly distributed substrate to support search and navigation tools
  - open and agnostic “let a thousand flowers bloom”
  - scalable “let a million queries fly”

3. Peer-to-peer Search Engine Architectures
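
The proposed split into two tiers can be made concrete with a small sketch. The lower tier only stores postings and answers raw index queries, while the ranking logic lives entirely in the upper tier and could be swapped for another method. The names `LowerTier`, `index_query`, and `rank_by_term_frequency` are hypothetical, and summed term frequency is just one simple ranking chosen for illustration.

```python
from collections import Counter, defaultdict


class LowerTier:
    """'Dumb, big data' tier: stores the index and answers raw index queries."""

    def __init__(self, pages):
        self.postings = defaultdict(dict)   # term -> {page id: term frequency}
        for page_id, text in pages.items():
            for term, freq in Counter(text.lower().split()).items():
                self.postings[term][page_id] = freq

    def index_query(self, term):
        """Return a copy of the postings for one term, without any ranking."""
        return dict(self.postings.get(term, {}))


def rank_by_term_frequency(lower, query):
    """One possible upper tier: order pages by summed term frequency."""
    scores = Counter()
    for term in query.lower().split():
        for page_id, freq in lower.index_query(term).items():
            scores[page_id] += freq
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page_id for page_id, _ in ordered]


# Hypothetical pages standing in for the lower tier's crawled data.
pages = {
    1: "web search web crawler",
    2: "search engine",
    3: "cooking recipes",
}
lower = LowerTier(pages)
```

Because the upper tier sees only `index_query`, a different tool could plug in its own ranking without touching the crawling and indexing layer.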

Page 14:

P2P web search architecture:

• thousands of powerful machines all over the internet

• machines can join or leave

• agnostic: can implement many IR methods on top

[Figure: multiple “search engine” nodes]
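
One way to let machines join or leave without reshuffling the whole index is to assign index terms to peers with consistent hashing. The slides do not specify how the proposed architecture distributes data, so the sketch below is an assumption made for illustration. The class `Ring` and the peer names are hypothetical.

```python
import bisect
import hashlib


def _point(key):
    """Hash a string to a position on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Consistent-hashing ring that maps index terms to peers."""

    def __init__(self):
        self.points = []   # sorted ring positions of the peers
        self.peers = {}    # ring position -> peer name

    def join(self, peer):
        p = _point(peer)
        if p in self.peers:
            return
        bisect.insort(self.points, p)
        self.peers[p] = peer

    def leave(self, peer):
        p = _point(peer)
        self.points.remove(p)
        del self.peers[p]

    def owner(self, term):
        """Peer responsible for the index entries of `term`."""
        i = bisect.bisect_left(self.points, _point(term)) % len(self.points)
        return self.peers[self.points[i]]


ring = Ring()
for name in ["peer-a", "peer-b", "peer-c"]:
    ring.join(name)
```

When a new peer joins, only the terms that now hash to it change owner; all other terms stay where they were, and they return to their previous owners if that peer leaves again.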

Page 15:

Web Exploration and Search Technology Lab:

• about 10 grad and undergrad students

• more information: http://cis.poly.edu/westlab

• courses on web search, IR, web protocols

Showcase slides at http://cis.poly.edu/showcase/