Crawling
• for each url in queue
  • download file
  • parse links from file
  • for each link found
    • add to the end of queue
  • handle file
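The loop above can be sketched in Python. This is a minimal illustration, not a production crawler: a small in-memory site map (a hypothetical `WEB` dict) stands in for real HTTP downloads and HTML parsing.

```python
from collections import deque

# Toy in-memory "web": url -> (content, outgoing links).
# A real crawler would fetch pages over HTTP and parse the HTML instead.
WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["c"]),
    "c": ("page c", []),
}

def crawl(seed):
    """Breadth-first crawl: FIFO queue, new links appended to the end."""
    queue = deque([seed])
    seen = {seed}          # avoid queueing the same url twice
    handled = []
    while queue:
        url = queue.popleft()        # for each url in queue
        content, links = WEB[url]    # download file
        for link in links:           # parse links, for each link found
            if link not in seen:
                seen.add(link)
                queue.append(link)   # add to the end of queue
        handled.append(url)          # handle file
    return handled

print(crawl("a"))  # → ['a', 'b', 'c']
```

Because the queue is FIFO, pages are handled in breadth-first order; swapping the queue for a priority structure gives the crawl orders discussed below.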
Top 50 open source web crawlers for data mining http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
Crawls
• Time
  • Periodic crawls, snapshots
  • Continuous crawls
• Scope
  • Universal crawling
  • Focused crawling
Crawl order
• Breadth-first
• Page importance/relevance
  • Backlink count
  • PageRank
  • Forward link count
  • Location metrics
  • OPIC
  • Larger-sites-first
  • FICA
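One of the simplest importance orderings above, backlink count, can be sketched as a priority queue over the frontier. The link graph (`LINKS`) is a made-up example; real counts would come from the pages crawled so far.

```python
import heapq

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["news", "about"],
    "news": ["about"],
    "blog": ["news", "about"],
}

# Backlink count: how many known pages link to each url.
backlinks = {}
for src, targets in LINKS.items():
    for t in targets:
        backlinks[t] = backlinks.get(t, 0) + 1

def by_importance(frontier):
    """Return frontier urls ordered by descending backlink count.
    heapq is a min-heap, so the count is negated."""
    heap = [(-backlinks.get(url, 0), url) for url in frontier]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

print(by_importance(["home", "news", "about"]))  # → ['about', 'news', 'home']
```

PageRank, OPIC, and the other metrics plug into the same scheme: only the scoring function that ranks the frontier changes.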
Crawling for research data
• Choose a crawler that suits your needs
• Make sure to obey time limits and robots.txt
• Add info on the crawling to your (project's) home page
  • e.g. www.pagename/webmasters
  • if possible, add this address to the crawler's information
• Give your bot a name
• If doing intensive crawling
  • Inform your internet provider / university IT department
  • Inform Funet CERT
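Obeying robots.txt and naming your bot can both be handled with Python's standard library. A minimal sketch, assuming a made-up bot name and a placeholder crawl-info URL; the robots.txt is parsed inline here, whereas a real crawler would fetch it from the site root.

```python
from urllib.robotparser import RobotFileParser

# A named bot with a pointer to the project's crawl-info page
# (placeholder URL), sent as the User-Agent header on every request.
USER_AGENT = "MyResearchBot/1.0 (+http://example.org/webmasters)"

# Example robots.txt; normally fetched from http://site/robots.txt.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check a url before downloading it, and respect the requested delay.
print(rp.can_fetch(USER_AGENT, "http://example.org/data.html"))  # → True
print(rp.can_fetch(USER_AGENT, "http://example.org/private/x"))  # → False
print(rp.crawl_delay(USER_AGENT))                                # → 10
```

Sleeping `crawl_delay` seconds between requests to the same host keeps the load polite even when robots.txt sets no explicit limit on the bot.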