Web Crawling Fall 2011 Dr. Lillian N. Cassel


Page 1:

Web Crawling
Fall 2011
Dr. Lillian N. Cassel

Page 2:

Overview of the class
• Purpose: Course Description
  – How do they do that? Many web applications, from Google to travel sites to resource collections, present results found by crawling the Web to find specific materials of interest to the application theme. Crawling the Web involves technical issues, politeness conventions, characterization of materials, decisions about the breadth and depth of a search, and choices about what to present and how to display results. This course will explore all of these issues. In addition, we will address what happens after you crawl the Web and acquire a collection of pages. You will decide on the questions, but some possibilities might include these: What summer jobs are advertised on web sites in your favorite area? What courses are offered in most (or few) computer science departments? What theatres are showing what movies? And so on. Students will develop a web site built by crawling at least some part of the Web to find appropriate materials, categorize them, and display them effectively. Prerequisites: some programming experience: CSC 1051 or the equivalent.

Page 3:

Tonight’s class
• Review the syllabus and general plan
• Get some information about student experience
• Look at some background about the web and information handling

Note: We will touch on several topics that we will explore in greater depth later in the semester.

Page 4:

Overview – 2
• Note:
  – Students will develop a web site built by crawling at least some part of the web to find appropriate materials, categorize them, and display them effectively.
    • You will need to choose a project, or at least a general area of interest, soon.
  – Prerequisites: some programming experience: CSC 1051 or the equivalent.
    • I need to know what programming experience you have.

Page 5:

Overview – 3
• Web presence
  – www.csc.villanova.edu/~cassel/5930Crawl/593-WebCrawling-Syllabus.html
  – Blackboard site
    • We will use it for the things it is good for
  – Keep in touch with both the web site and the Blackboard site
  – Also, be sure to watch your e-mail. Any announcements or general information will go there, to the address on the class list.

Page 6:

The Web
• Beginnings – Information Webs
  – Vannevar Bush, “As We May Think”
  – Hypertext: Ted Nelson and others
• The World Wide Web
  – Origins and Structure
    • Directed graph
    • Possibility of cycles
  – Characteristics

Page 7:

Web Document Collection
• No design/coordination
• Distributed content creation, linking, democratization of publishing
• Content includes truth, lies, obsolete information, contradictions …
• Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …
• Scale much larger than previous text collections … but corporate records are catching up
• Growth – slowed down from the initial “volume doubling every few months” but still expanding
• Content can be dynamically generated

Page 8:

Web Operation
• Basic client-server model
  – The HTTP protocol
    • HyperText Transfer Protocol
    • A few simple commands that allow communication between the server and an application requesting something from the server – usually a browser, but not always.
  – Server
    • The site where the content resides.
    • Most of the web is served up by Apache and its derivatives.
  – Client
    • The program requesting something from the server.
    • Browsers most often, but also web crawlers and other applications.

Page 9:

HTTP: GET and POST
• GET <path> HTTP/<version>
  – Requests that the server send the specific page at <path> back to the requestor.
  – The version number allows compatible communication.
  – The server sends a header and the requested file (page).
  – Additional requests can follow.
• POST
  – Similar to a GET, but allows additional information to be sent to the server.
  – Useful for purchases or page edits.
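To make the exchange concrete, here is a minimal sketch of issuing a GET request with Python's standard http.client module; the host and path are placeholders, not examples from the course.

    # Sketch: issue a simple HTTP GET with Python's standard library.
    import http.client

    conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
    conn.request("GET", "/index.html")            # sends "GET /index.html HTTP/1.1" plus headers
    response = conn.getresponse()

    print(response.status, response.reason)       # e.g. 200 OK
    print(response.getheader("Content-Type"))     # one of the headers the server sends back
    body = response.read()                        # the requested file (page)
    conn.close()

A POST is issued the same way, except that the request also carries a body with the additional information (for example, form fields).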

Page 10:

HEAD
• HEAD <path> HTTP/<version>
• Useful for checking whether a previously fetched web page has changed.
• The request results in header information, but not the page itself.
• Response:
  – Confirms HTTP version compatibility
  – Date:
  – Server:
  – Last-Modified:

Page 12:

Search
• Search engines, whether general engines like Google or Yahoo or more specialized ones, do not crawl the web looking for results after receiving a query.
  – That would take much too long and provide unacceptable performance.
• Search engines actually search a carefully constructed database, with indices created for efficiently locating content.

Page 13:

Architecture of a Search Engine

Ref: Manning Introduction to Information Retrieval

Page 14:

Crawling in Context

• So, we see that crawling is just one step in a complex process of acquiring information from the Web to use in any application.

• Usually, we will want to sort through the information we found to get the most relevant part for our use. So, the example of a search engine is relevant.

Page 15:

Our goals
• We will crawl the Web.

• We will also store the results of the crawl so that we can retrieve what we want when we want it.

• We will learn about fundamental concepts of information retrieval, including index creation.

Page 16:

Crawling
• A web crawler (aka a spider) is a program that:
  – Starts with one or more URLs – the seed
    • Other URLs will be found in the pages pointed to by the seed URLs. They will be the starting point for further crawling.
  – Uses the standard protocols for requesting a resource from a server
    • Requirements for respecting server policies
    • Politeness
  – Parses the resource obtained
    • Obtains additional URLs from the fetched page
  – Implements policies about duplicate content
  – Recognizes and eliminates duplicate or unwanted URLs
  – Adds found URLs to the queue and continues from the request-to-server step

Page 17:

The Depth of the Web
• A URL gives access to a web page.
• That page may have links to other pages.
• Some pages are generated only when information is provided through a form.
• These pages cannot be discovered just by crawling.
• The surface web is huge.
• The deeper web is unfathomable.

Page 18:

Anatomy of a URL
• http://www.csc.villanova.edu/~cassel
• That is a web page.
• Three parts:
  – http – the protocol to use for retrieving the page
  – www.csc.villanova.edu – the name of the domain
    • csc is a subdomain of the villanova domain
  – ~cassel
    • Abbreviation for index.html in the directory cassel at the machine associated with www.csc.villanova.edu
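A small sketch of pulling those parts out of a URL with Python's urllib.parse; the note about the missing port reflects standard HTTP behavior rather than anything specific to this course.

    # Sketch: split a URL into its parts with the standard library.
    from urllib.parse import urlsplit

    parts = urlsplit("http://www.csc.villanova.edu/~cassel")

    print(parts.scheme)     # 'http'  -> protocol used to retrieve the page
    print(parts.hostname)   # 'www.csc.villanova.edu'  -> the domain name
    print(parts.path)       # '/~cassel'  -> location on that server
    print(parts.port)       # None -> no explicit port, so the default for http (80) applies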

Page 19:

The major domain categories
• Generic categories:
  – .net – Originally restricted to major participants in maintaining the Internet. Now open.
  – .org – Generally non-profit organizations, including professional organizations such as acm.org
  – .com – Commercial organizations such as amazon.com, etc.
  – .edu – Restricted to higher education (post-secondary) institutions. High schools and elementary schools are not allowed to use it.
  – .gov – Government organizations, such as nsf.gov
  – .mil – Military sites
  – .int – International
• Country codes
  – .us
  – .uk
    • Uses second-level domains such as ac.uk or co.uk
  – And other country designations. Who is .tv?
• Newer ones: .biz, .name, etc.
• All regulated by the Internet Assigned Numbers Authority (IANA)

Page 20:

If not http:// then what?
• Other protocols can be specified in the request to a server:
  – file:// – local file on the current host
  – ftp:// – use the FTP protocol to fetch the file
  – Etc.

Page 21:

Domain categories
• The domain categories serve to partition the universe of domain names.
• Domain Name Servers (DNS) do lookups to translate a domain name into an IP address.
• An IP address locates a particular machine and makes a communication path known.
  – Most common still: 32-bit IPv4 addresses
  – Newer: 128-bit IPv6
• A server will typically have many programs running, several listening for network connections.
  – A port number (16 bits) identifies the specific process for the desired connection.
  – Default port for web connections: 80
  – If other than 80, it must be specified in the URL
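A brief sketch of the name-to-address lookup with Python's socket module; the host name is just an example.

    # Sketch: resolve a domain name to an IP address (the job DNS performs).
    import socket

    host = "www.csc.villanova.edu"
    ip = socket.gethostbyname(host)     # IPv4 address, as a dotted-decimal string
    print(host, "->", ip)

    # A URL with no explicit port uses the default:
    #   http://example.com/       -> port 80
    #   http://example.com:8080/  -> port 8080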

Page 22:

Crawler features
• A crawler must be
  – Robust: Survive spider traps – websites that fool a spider into fetching large or limitless numbers of pages within the domain.
    • Some deliberate; some errors in site design
  – Polite: Crawlers can interfere with the normal operation of a web site. Servers have policies, both implicit and explicit, about the allowed frequency of visits by crawlers. Responsible crawlers obey these. Others become recognized and rejected outright.

Ref: Manning Introduction to Information Retrieval

Page 23:

Crawler features
• A crawler should be
  – Distributed: able to execute on multiple systems
  – Scalable: the architecture should allow additional machines to be added as needed
  – Efficient: performance is a significant issue when crawling a large web
  – Useful: quality standards should determine which pages to fetch
  – Fresh: keep the results up to date by crawling pages repeatedly on some organized schedule
  – Extensible: a modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc.

Ref: Manning Introduction to Information Retrieval

Page 24:

Scale
• A one-month crawl of a billion pages requires fetching several hundred pages per second.
• It is easy to lose sight of the numbers when dealing with data sources on the scale of the Web.
  – 30 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 2,592,000 seconds
  – 1,000,000,000 pages / 2,592,000 seconds ≈ 385.8 pages/second
• Note that those numbers assume that the crawling is continuous.

Ref: Manning Introduction to Information Retrieval

Page 25:

Google Search

See http://video.google.com/videoplay?docid=-1243280683715323550&hl=en#

Marissa Mayer of Google on how a search happens at Google.

Page 26:

Basic Crawl Architecture

[Figure: block diagram of the basic crawl architecture, showing the modules WWW, DNS, Fetch, Parse, Content seen? (doc fingerprints), URL filter, robots filters, Dup URL elim, the URL set, and the URL frontier.]

Ref: Manning Introduction to Information Retrieval

Page 27:

Crawler Architecture
• Modules:
  – The URL frontier (the queue of URLs still to be fetched, or to be fetched again)
  – A DNS resolution module (the translation from a URL to the web server to talk to)
  – A fetch module (uses HTTP to retrieve the page)
  – A parsing module to extract text and links from the page
  – A duplicate elimination module to recognize links already seen

Ref: Manning Introduction to Information Retrieval

Page 28:

Crawling threads
• With so much space to explore and so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.
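A minimal sketch of that idea with Python's concurrent.futures; fetch_and_process is a hypothetical stand-in for the per-URL cycle of steps, not code from the course.

    # Sketch: several worker threads, each cycling through fetch -> parse -> enqueue.
    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue, Empty

    frontier = Queue()
    frontier.put("http://www.example.com/")      # seed URL (placeholder)

    def fetch_and_process(url):
        # Stand-in for the real cycle: fetch the page, parse it,
        # filter the extracted links, and return the ones worth keeping.
        print("would crawl", url)
        return []

    def worker():
        while True:
            try:
                url = frontier.get(timeout=2)    # give up once the frontier stays empty
            except Empty:
                return
            for link in fetch_and_process(url):
                frontier.put(link)

    with ThreadPoolExecutor(max_workers=8) as pool:
        for _ in range(8):
            pool.submit(worker)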

Page 29:

Politeness
• Not optional.
• Explicit
  – Specified by the web site owner
  – What portions of the site may be crawled and what portions may not be crawled
    • robots.txt file
• Implicit
  – If no restrictions are specified, still restrict how often you hit a single site.
  – You may have many URLs from the same site. Too much traffic can interfere with the site’s operation. Crawler hits come much faster than ordinary traffic and could overtax the server (effectively a denial-of-service attack). Good web crawlers do not fetch multiple pages from the same server at one time.
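One common way to implement the implicit rule is to remember when each host was last contacted and wait before contacting it again. A rough sketch; the five-second delay is an arbitrary choice, not a value from the course.

    # Sketch: enforce a minimum delay between requests to the same host.
    import time
    from urllib.parse import urlsplit

    last_hit = {}        # host -> time of the last request
    MIN_DELAY = 5.0      # seconds between hits on one host (arbitrary)

    def polite_wait(url):
        host = urlsplit(url).hostname
        elapsed = time.time() - last_hit.get(host, 0.0)
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)      # back off before hitting the host again
        last_hit[host] = time.time()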

Page 30:

Robots.txt
• Protocol nearly as old as the web
  – www.robotstxt.org/wc/norobots.html
  – File: URL/robots.txt
    • Contains the access restrictions
  – Example:

    User-agent: *               (applies to all robots/spiders/crawlers)
    Disallow: /yoursite/temp/

    User-agent: searchengine    (applies only to the robot named searchengine)
    Disallow:                   (nothing disallowed)

Page 31:

Processing robots.txt
• First line:
  – User-agent – identifies to whom the instruction applies. * = everyone; otherwise, a specific crawler name
  – Disallow: or Allow: provides a path to exclude from or include in robot access.
• Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site.
  – Refetching just takes time and uses up hits on the server
  – Cache the robots.txt file for repeated reference
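Python's standard library already knows how to read these rules. A sketch that fetches robots.txt once per host, caches the parsed result, and consults it before each request; the user-agent name is a placeholder.

    # Sketch: fetch, cache, and consult robots.txt with urllib.robotparser.
    import urllib.robotparser
    from urllib.parse import urlsplit

    _robots_cache = {}    # host -> parsed robots.txt

    def allowed(url, agent="MyCrawler"):
        host = urlsplit(url).hostname
        rp = _robots_cache.get(host)
        if rp is None:                                   # first visit to this host
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://" + host + "/robots.txt")
            rp.read()                                    # one fetch, then reuse
            _robots_cache[host] = rp
        return rp.can_fetch(agent, url)

    # allowed("http://www.example.com/some/page.html") -> True or False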

Page 32:

Robots <META> tag
• robots.txt provides information about access to a directory.
• A given file may have an HTML meta tag that directs robot behavior.
• A responsible crawler will check for that tag and obey its direction.
• Ex:
  – <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  – Options: INDEX, NOINDEX, FOLLOW, NOFOLLOW

See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
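A small sketch of spotting that tag while parsing a fetched page, using the standard html.parser module; the sample HTML is made up for illustration.

    # Sketch: detect a robots <META> tag and note NOINDEX / NOFOLLOW directives.
    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name", "").lower() == "robots":
                content = a.get("content", "")
                self.directives |= {d.strip().upper() for d in content.split(",")}

    page = '<html><head><meta name="ROBOTS" content="INDEX, NOFOLLOW"></head></html>'
    p = RobotsMetaParser()
    p.feed(page)
    print("NOFOLLOW" in p.directives)    # True -> do not follow links from this page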

Page 33:

Crawling
• Pick a URL from the frontier (which one? – a policy decision)
• Fetch the document at the URL
• Parse the fetched document
  – Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  – If not, add to the indices
• For each extracted URL
  – Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  – Check if it is already in the frontier (duplicate URL elimination)

Ref: Manning Introduction to Information Retrieval
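Written out as a sketch in Python, the loop looks roughly like this; fetch, extract_links, and url_filter are assumed helper functions, and the content fingerprint here is deliberately crude.

    # Sketch of the basic crawl loop described above.
    from collections import deque

    def crawl(seed_urls, fetch, extract_links, url_filter, max_pages=100):
        frontier = deque(seed_urls)        # URLs still to be fetched
        seen_urls = set(seed_urls)         # duplicate URL elimination
        seen_fingerprints = set()          # "content seen?" check
        pages = {}

        while frontier and len(pages) < max_pages:
            url = frontier.popleft()                  # pick a URL from the frontier
            doc = fetch(url)                          # fetch the document
            if doc is None:
                continue
            fp = hash(doc)                            # crude content fingerprint
            if fp in seen_fingerprints:               # content already seen?
                continue
            seen_fingerprints.add(fp)
            pages[url] = doc                          # add to the index / store
            for link in extract_links(url, doc):      # parse: extract links
                if url_filter(link) and link not in seen_urls:
                    seen_urls.add(link)
                    frontier.append(link)             # back onto the frontier
        return pages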

Page 34:

Basic Crawl Architecture

Ref: Manning Introduction to Information Retrieval

[Figure: the same basic crawl architecture diagram shown on Page 26.]

Page 35:

DNS – Domain Name Server
• Internet service to resolve the host names in URLs into IP addresses
• Distributed servers; some significant latency possible
• OS implementations – DNS lookup is blocking
  – Only one outstanding request at a time
• Solutions
  – DNS caching
  – Batch DNS resolver – collects requests and sends them out together

Ref: Manning Introduction to Information Retrieval
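A tiny sketch of the caching idea, wrapping the standard resolver call with functools.lru_cache.

    # Sketch: cache DNS lookups so each host name is resolved only once.
    import socket
    from functools import lru_cache

    @lru_cache(maxsize=10000)
    def resolve(host):
        return socket.gethostbyname(host)   # blocking OS lookup, done once per host

    print(resolve("www.example.com"))       # hits DNS
    print(resolve("www.example.com"))       # answered from the cache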

Page 36:

Parsing
• Fetched page contains
  – Embedded links to more pages
  – Actual content for use in the application
• Extract the links
  – Relative link? Expand (normalize)
  – Seen before? Discard
  – New?
    • Meets criteria? Append to the URL frontier
    • Does not meet criteria? Discard
• Examine content
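A sketch of the link-extraction and normalization step with html.parser and urljoin; the page content and base URL are made up for illustration.

    # Sketch: pull href links out of a page and expand relative ones.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))  # normalize relative links

    page = '<a href="/courses/index.html">Courses</a> <a href="http://acm.org/">ACM</a>'
    extractor = LinkExtractor("http://www.csc.villanova.edu/~cassel/")
    extractor.feed(page)
    print(extractor.links)
    # ['http://www.csc.villanova.edu/courses/index.html', 'http://acm.org/']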

Page 37:

Content

• Seen before? How to tell?
  – Fingerprints, shingles
  – Documents identical, or similar
  – If already in the index, do not process it again

Ref: Manning Introduction to Information Retrieval
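A rough sketch of both ideas: a hash fingerprint catches exact duplicates, and overlapping word shingles give a similarity score for near-duplicates. The shingle size and the use of MD5 are arbitrary choices for the sketch.

    # Sketch: exact-duplicate fingerprints and near-duplicate shingles.
    import hashlib

    def fingerprint(text):
        # Identical documents hash to the same value.
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def shingles(text, k=4):
        # Overlapping k-word sequences, used to compare similar documents.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def similarity(a, b, k=4):
        sa, sb = shingles(a, k), shingles(b, k)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0    # Jaccard overlap

    doc1 = "web crawling involves politeness conventions and technical issues"
    doc2 = "web crawling involves politeness conventions and many technical issues"
    print(fingerprint(doc1) == fingerprint(doc2))   # False: not byte-for-byte identical
    print(similarity(doc1, doc2))                   # shared shingles: near-duplicates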

Page 38:

Distributed crawler
• For big crawls:
  – Many processes, each doing part of the job
    • Possibly on different nodes
    • Geographically distributed
  – How to distribute
    • Give each node a set of hosts to crawl
    • Use a hashing function to partition the set of hosts
  – How do these nodes communicate?
    • Need to have a common index

Ref: Manning Introduction to Information Retrieval
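The host-partitioning idea in a few lines; the node count and the use of MD5 as the hashing function are arbitrary choices for the sketch.

    # Sketch: assign each host to one crawler node by hashing the host name.
    import hashlib
    from urllib.parse import urlsplit

    NUM_NODES = 4

    def node_for(url):
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_NODES     # the same host always maps to the same node

    print(node_for("http://www.csc.villanova.edu/~cassel"))   # some node id in 0..3
    print(node_for("http://www.acm.org/"))                    # possibly a different node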

Page 39:

Communication between nodes

Ref: Manning Introduction to Information Retrieval

[Figure: the per-node crawl architecture (WWW, DNS, Fetch, Parse, Content seen?, doc fingerprints, URL filter, robots filters, Dup URL elim, URL set, URL frontier), extended with a host splitter that sends URLs to other hosts and receives URLs from other hosts.]

The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes

Page 40:

URL Frontier
• Two requirements
  – Politeness: do not go too often to the same site
  – Freshness: keep pages up to date
    • News sites, for example, change frequently
• Conflicts
  – The two requirements may be directly in conflict with each other.
• Complication
  – Fetching the URLs embedded in a page will yield many URLs located on the same server. Delay fetching those.

Ref: Manning Introduction to Information Retrieval
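A much-reduced sketch of such a frontier: one queue per host plus an earliest-next-fetch time for each host. Real frontiers are considerably more elaborate, and the delay value here is arbitrary.

    # Sketch: a politeness-aware URL frontier with one queue per host.
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    class Frontier:
        def __init__(self, delay=5.0):
            self.delay = delay                    # seconds between hits on one host
            self.queues = defaultdict(deque)      # host -> queue of URLs from that host
            self.next_ok = defaultdict(float)     # host -> earliest time it may be hit again

        def add(self, url):
            self.queues[urlsplit(url).hostname].append(url)

        def next_url(self):
            now = time.time()
            for host, queue in self.queues.items():
                if queue and self.next_ok[host] <= now:   # this host is ready to be visited
                    self.next_ok[host] = now + self.delay
                    return queue.popleft()
            return None                                   # nothing ready yet; try again later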

Page 41:

More …
• We will examine these things more completely.

What will you actually do?
• Goal
  – Write a simple crawler
    • Not distributed, not multi-threaded
    • Use a seed URL, connect with the server, fetch the document, extract links, extract content
  – Explore existing crawlers
    • Evaluate their characteristics
    • Learn to use one to do serious crawling
  – Process the documents fetched to serve some purpose. Create a web site for that purpose.

Ref: Manning Introduction to Information Retrieval

Page 42:

Processing the documents
• Create an index, and store the documents and the index, so that appropriate content can be found when needed.
• Learn the fundamentals of information retrieval as they apply to web services
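As a preview of that indexing step, a toy inverted index: for each term, the set of documents that contain it. This is only a sketch, not the design the course will require.

    # Sketch: a tiny inverted index over fetched documents.
    from collections import defaultdict

    def build_index(pages):
        # pages: dict mapping URL -> extracted text
        index = defaultdict(set)           # term -> URLs containing that term
        for url, text in pages.items():
            for term in text.lower().split():
                index[term].add(url)
        return index

    pages = {
        "http://example.com/a": "web crawling and politeness",
        "http://example.com/b": "information retrieval and indexing",
    }
    index = build_index(pages)
    print(index["and"])         # both URLs
    print(index["crawling"])    # only the first URL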

Page 43:

Next week
• A simple web crawler
• Web crawl visualization
• Criteria for judging web crawlers
• Some web crawlers to compare