
Search Engine Spiders

http://scienceforseo.blogspot.com

IR tutorial series: Part 2

Spiders are...

...programs which scan the web in a methodical, automated way.

...they copy all the pages they visit and pass them to the search engine for indexing.

...not all spiders have the same job, though: some check links, collect email addresses, or validate code, for example.

...some people call them crawlers, bots, and even ants or worms.

(Spidering means requesting every page on a site.)

A spider's architecture:

A downloader that downloads web pages.

Storage where a copy of each downloaded page is kept.

A queue where URLs wait to be visited.

A scheduler that co-ordinates the processes.
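A minimal sketch, in Python, of how these four parts might fit together (the function and variable names are illustrative, not from the original tutorial; link extraction is left out):

import urllib.request
from collections import deque

frontier = deque()   # the queue: URLs waiting to be visited
storage = {}         # storage: page copies keyed by URL

def download(url):
    # The downloader: fetch one web page and return its raw bytes.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

def coordinate(seeds, page_limit=50):
    # The scheduler: pop a URL, download it, store the copy, and repeat.
    frontier.extend(seeds)
    while frontier and len(storage) < page_limit:
        url = frontier.popleft()
        if url not in storage:
            storage[url] = download(url)
            # ...parse storage[url] for links and append new ones to the frontier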

An example

The crawl list would look like this (although it would be much much bigger than this small sample):

http://www.techcrunch.com/
http://www.crunchgear.com/
http://www.mobilecrunch.com/
http://www.techcrunchit.com/
http://www.crunchbase.com/
http://www.techcrunch.com/#
http://www.inviteshare.com/
http://pitches.techcrunch.com/
http://gillmorgang.techcrunch.com/
http://www.talkcrunch.com/
http://www.techcrunch50.com/
http://uk.techcrunch.com/
http://fr.techcrunch.com/
http://jp.techcrunch.com/

The spider will also save a copy of each page it visits in a database.

The search engine will then index those.

The first URLs given to the spider as a starting point are called seeds.

The list gets bigger and bigger, and in order to make sure that the search engine index is current, the spider will need to re-visit those links often to track any changes.

There are two lists: a list of URLs already visited and a list of URLs still to visit. The latter is known as the crawl frontier.
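One simple way to keep the index current is to remember when each URL was last visited and put it back on the frontier once that copy is old. A sketch, assuming a one-day threshold (the threshold and names are assumptions, not from the tutorial):

import time

REVISIT_AFTER = 24 * 60 * 60                 # assumed: re-crawl anything older than a day

last_visited = {}                            # list 1: URLs visited, with the time we fetched them
frontier = ["http://www.techcrunch.com/"]    # list 2: URLs to visit, starting from a seed

def due_for_visit(url):
    # Never seen, or our copy is old enough that the page may have changed.
    return url not in last_visited or time.time() - last_visited[url] > REVISIT_AFTER

# After fetching a page, record the visit:
# last_visited[url] = time.time()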

Difficulties

The web is enormous: no search engine indexes more than 16% of the web, so the spider downloads only the most relevant pages.

The rate of change is phenomenal: a spider needs to re-visit pages often to check for updates and changes.

Server-side scripts often generate lots of URLs for the spider to visit without returning much unique content, which is a waste of the spider's time.

Solutions

Spiders will use the following policies:

A selection policy that states which pages to download.

A re-visit policy that states when to check for changes to the pages.

A politeness policy that states how to avoid overloading websites (a sketch of this one follows the list).

A parallelization policy that states how to coordinate distributed web crawlers.
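In practice the politeness policy often boils down to waiting a minimum delay between two requests to the same host. A minimal sketch, assuming a one-second delay (the value and function name are arbitrary):

import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0     # assumed minimum gap, in seconds, between requests to one host
last_request = {}     # host -> time of our most recent request to it

def wait_politely(url):
    # Sleep just long enough that we never hit one host faster than CRAWL_DELAY allows.
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_request[host] = time.time()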

Build a spider

You can use any programming language you feel comfortable with, although Java, Perl, and C# spiders are the most popular.

You can also use these tutorials:

Java sun spider - http://tiny.cc/e2KAy
Chilkat in Python - http://tiny.cc/WH7eh
Swish-e in Perl - http://tiny.cc/nNF5Q

Remember that a poorly designed spider can impact overall network and server performance.

OpenSource spiders

You can use one of these for free (some knowledge of programming can help in setting them up):

OpenWebSpider in C# - http://www.openwebspider.org
Arachnid in Java - http://arachnid.sourceforge.net/
Java-web-spider - http://code.google.com/p/java-web-spider/
MOMSpider in Perl - http://tiny.cc/36XQA

Robots.txt

This is a file that lets webmasters give instructions to visiting spiders, which are expected to respect it; it can declare some areas of the site off-limits.

An example robots.txt:

# Disallow every spider from everything...
User-agent: *
Disallow: /

# ...except Googlebot and BackRub, which may access everything but /private...
User-agent: Googlebot
User-agent: BackRub
Disallow: /private

# ...and churl, which may access everything.
User-agent: churl
Disallow:
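Python's standard library can read and apply such a file for you; a small sketch using urllib.robotparser (the spider name and URLs are placeholders):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()   # fetch and parse the site's robots.txt

# Check the rules before requesting a page:
if robots.can_fetch("MySpider", "http://www.example.com/private/report.html"):
    print("allowed to fetch")
else:
    print("off-limits for MySpider")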

Spider ethics

There is a code of conduct for spiders that developers should follow, and you can read it here: http://www.robotstxt.org/guidelines.html

In (very) short:

Are you sure the world needs another spider?

Identify the spider and yourself, and publish your documentation (see the User-Agent sketch after this list).

Test locally

Moderate the speed and frequency of runs to a given host

Only retrieve what you can handle (format & scale)

Monitor your runs

Share your results

List your spider in the database http://www.robotstxt.org/db.html
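Identifying the spider usually means sending a descriptive User-Agent header that names the bot and points to its documentation; a sketch (the name and URLs are placeholders):

import urllib.request

headers = {
    # Name the spider, give it a version, and link to documentation / contact details.
    "User-Agent": "ExampleSpider/0.1 (+http://www.example.com/spider.html)"
}

request = urllib.request.Request("http://www.example.com/", headers=headers)
with urllib.request.urlopen(request, timeout=10) as response:
    page = response.read()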

Spider traps

Intentionally or not, traps sometimes crop up on the spider's path and stop it from functioning properly.

Dynamic pages, deep directories that never end, pages with special links and commands pointing the spider to other directories... anything that can put the spider into an infinite loop is an issue. A few common defences are sketched below.
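A few cheap checks that spiders commonly apply before queueing a URL; the limits below are arbitrary assumptions:

from urllib.parse import urlparse

MAX_DEPTH = 20             # skip absurdly deep paths
MAX_URL_LENGTH = 500       # very long URLs are often trap-generated
MAX_PAGES_PER_HOST = 5000  # stop one site from swallowing the whole crawl

pages_per_host = {}        # host -> pages fetched from it so far

def looks_like_a_trap(url):
    parts = urlparse(url)
    return (len(url) > MAX_URL_LENGTH
            or parts.path.count("/") > MAX_DEPTH
            or pages_per_host.get(parts.netloc, 0) >= MAX_PAGES_PER_HOST)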

You might, however, want to deploy a spider trap yourself, for example if you know a spider is visiting your site without respecting your robots.txt, or because it is a spambot.

Fleiner's spider trap

The trap page greets bots with the line "You are a bad netizen if you are a web bot!" and then offers "some special links" that point back to the very same page, keeping an ill-behaved spider busy indefinitely.