Crawling the Web: Presentation


TRANSCRIPT

  • Slide 1/21

  • Slide 2/21

    Outline

    Crawlers within a search engine

    Motivation and taxonomy of crawlers

    Architecture of a web crawler

    Basic crawlers and implementation issues

    Crawler ethics and conflicts


  • Slide 3/21

    Q: How does a search engine know that all these pages contain the query terms?

    A: Because all of those pages have been crawled


  • Slide 4/21

    Organizing the Web

    The Web is big. Really big. Over 3 billion pages, just in the indexable Web.

    The Web is dynamic.

    Problems:

    How to store a database of links?

    How to crawl the web?

    How to recommend pages that match a query?

  • Slide 5/21

    Architecture of a Search Engine

    1. A web crawler gathers a snapshot of the Web
    2. The gathered pages are indexed for easy retrieval
    3. User submits a search query
    4. Search engine ranks pages that match the query and returns an ordered list

  • Slide 6/21

    Search Engine: major components

    Crawlers
    Collect documents by recursively fetching links from a set of starting pages.
    Each crawler has different policies; the pages indexed by the various search engines differ.

    The Indexer
    Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representation differs among search engines. Might also build additional structures (e.g. LSI). A toy inverted index sketch follows below.

    The Query Processor
    Processes user queries and returns matching answers in an order determined by a ranking algorithm.
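    As a toy illustration of the indexer's central data structure, here is a minimal inverted index sketch in Python; the three documents and the whitespace tokenization are made-up examples, not how any particular engine does it.

```python
from collections import defaultdict

# Toy document collection (made-up examples).
docs = {
    1: "web crawlers fetch pages from the web",
    2: "search engines index pages for fast retrieval",
    3: "the query processor ranks matching pages",
}

# Inverted index: term -> set of IDs of documents containing the term.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["pages"]))  # [1, 2, 3]
print(sorted(inverted_index["query"]))  # [3]
```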

  • Slide 7/21

    Motivation for crawlers

    Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)

    Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.

    Business intelligence: keep track of potential competitors, partners

    Monitor Web sites of interest

    Can you think of some others?


  • Slide 8/21

    Crawler: basic idea

    [Diagram: the crawl expands outward from the starting pages (seeds), following links to new pages.]

  • Slide 9/21

    Research on crawlers

    1993: First crawler, Matthew Gray's Wanderer

    1994:

    David Eichmann. The RBSE Spider: Balancing Effective Search Against Web Load. In Proceedings of the First International World Wide Web Conference, 1994.

    Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the First International World Wide Web Conference, 1994.

    Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, 1994.

  • Slide 10/21

    Many names

    Crawler

    Spider

    Robot (or bot)

    Web agent

    Wanderer, worm, ...

    And famous instances: googlebot, scooter, slurp, msnbot, ...


  • Slide 11/21

    A crawler within a search engine

    [Diagram: a crawler (googlebot) fetches pages from the Web into a page repository; text & link analysis builds the text index and PageRank; at query time a ranker combines them to return hits.]

  • Slide 12/21

    Crawler basic algorithm

    1. Remove a URL from the unvisited URL list
    2. Determine the IP address of its host name
    3. Download the corresponding document
    4. Extract any links contained in it
    5. If an extracted URL is new, add it to the list of unvisited URLs
    6. Process the downloaded document
    7. Back to step 1
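    The steps above map almost line-for-line onto a minimal sequential crawler. A sketch in Python using only the standard library; the seed URL is a placeholder, and a real crawler would add politeness delays, robots.txt checks, and far more robust error handling.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier = list(seeds)              # unvisited URL list
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)           # 1. remove a URL from the list
        if url in visited:
            continue
        try:
            # 2-3. urlopen resolves the host name and downloads the document
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                    # skip pages that fail to download
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)               # 4. extract any links contained in it
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:  # 5. add new URLs to the list
                frontier.append(absolute)
        process(url, html)              # 6. process the downloaded document

def process(url, html):
    print(url, len(html), "bytes")      # placeholder for indexing

# crawl(["http://example.com/"])       # hypothetical seed
```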

  • Slide 13/21

    Architecture of a crawler

    [Diagram: URL Frontier -> DNS -> Fetch (www) -> Parse -> Content Seen? (doc fingerprints) -> URL Filter (robots templates) -> Dup URL Elim (URL set) -> back to the URL Frontier]

    URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.

    DNS: domain name service resolution. Look up the IP address for a domain name (a lookup sketch follows below).

    Fetch: generally use the HTTP protocol to fetch the URL.

    Parse: the page is parsed. Text and links are extracted (images, videos, etc.).
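    For the DNS step, a minimal lookup sketch using Python's standard library; the host name is just an example. Since every fetch needs a resolution, crawlers typically cache these results rather than query DNS every time.

```python
import socket

# Resolve a domain name to an IP address before fetching (example host).
host = "www.example.com"
ip = socket.gethostbyname(host)
print(host, "->", ip)
```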

  • Slide 14/21

    Architecture of a crawler (cont'd)

    [Same diagram as the previous slide.]

    Content Seen?: test whether a web page with the same content has already been seen at another URL. Need a way to compute a fingerprint of a web page (a sketch follows below).

    URL Filter:

    Whether the extracted URL should be excluded from the frontier (robots.txt).

    URLs should be normalized (relative links expanded), e.g. a relative link to the disclaimers page on en.wikipedia.org/wiki/Main_Page must be expanded to an absolute URL.

    Dup URL Elim: the URL is checked for duplicate elimination.
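    A minimal sketch of the content-seen test and URL normalization, assuming an MD5 digest of the page body serves as the fingerprint; production crawlers often use shingle-based fingerprints instead, so they can also catch near-duplicates.

```python
import hashlib
from urllib.parse import urljoin, urldefrag

seen_fingerprints = set()

def content_seen(html: str) -> bool:
    """Return True if an identical page body was already crawled."""
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def normalize(base_url: str, link: str) -> str:
    """Expand a relative link against its base URL and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, link))
    return absolute

# e.g. the relative disclaimers link on Wikipedia's main page:
print(normalize("https://en.wikipedia.org/wiki/Main_Page",
                "/wiki/Wikipedia:General_disclaimer"))
```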

  • Slide 15/21

    Basic crawlers

    This is a sequential crawler

    Seeds can be any list of starting URLs

    Order of page visits is determined by the frontier data structure

    Stop criterion can be anything

  • Slide 16/21

    Graph traversal (BFS or DFS?)

    Breadth First Search
    Implemented with a QUEUE (FIFO)
    Finds pages along shortest paths
    If we start with good pages, this keeps us close; maybe other good stuff

    Depth First Search
    Implemented with a STACK (LIFO)
    Wanders away (lost in cyberspace)

    (A frontier sketch for both traversals follows below.)
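    The only difference between the two traversals is which end of the frontier the crawler pops from; a sketch using Python's collections.deque, with placeholder URLs:

```python
from collections import deque

frontier = deque(["http://example.com/a", "http://example.com/b"])

# BFS: treat the frontier as a FIFO queue.
oldest = frontier.popleft()   # visit the URL discovered first

# DFS: treat the same deque as a LIFO stack.
newest = frontier.pop()       # visit the URL discovered last
```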


  • Slide 17/21

    Implementation issues

    Don't want to fetch the same page twice!
    Keep a lookup table (hash) of visited pages
    What if a page is not yet visited but is already in the frontier?

    The frontier grows very fast! May need to prioritize for large crawls

    Fetcher must be robust!
    Don't crash if a download fails
    Timeout mechanism

    We can also conflate synonyms into a single form using a thesaurus (30-50% smaller index)
    Doing this in both pages and queries allows us to retrieve pages about automobile when the user asks for car
    The thesaurus can be implemented as a hash table (a sketch follows below)
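    A minimal sketch of the thesaurus-as-hash-table idea, with a made-up synonym map; both indexed pages and incoming queries would pass through the same conflation step.

```python
# Map each synonym to one canonical form (entries are made-up examples).
thesaurus = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}

def conflate(term: str) -> str:
    """Return the canonical form of a term, or the term itself."""
    return thesaurus.get(term.lower(), term.lower())

print(conflate("Car"))         # automobile
print(conflate("automobile"))  # automobile
print(conflate("crawler"))     # crawler (no thesaurus entry)
```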


  • Slide 18/21

    MORE ABOUT CRAWLERS

    Honor the Robot Exclusion Protocol
    A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named robots.txt placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
    A crawler should always check, parse, and obey this file before sending any requests to the server

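    Python's standard library ships a parser for this protocol; a minimal sketch using urllib.robotparser, where the user-agent string MyCrawler/1.0 is a made-up example:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.indiana.edu/robots.txt")
rp.read()   # fetch and parse the robots.txt file

# Check before sending any request to the server.
if rp.can_fetch("MyCrawler/1.0", "http://www.indiana.edu/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```
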
  • Slide 19/21

  • Slide 20/21

    Gray areas for crawler ethics

    If you write a crawler that unwittingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?

    Is non-compliance with Google's robots.txt in this case equivalent to click fraud?

    If you write a browser extension that performs some useful service, should you comply with robot exclusion?


  • Slide 21/21

    Thank you!