Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework


DESCRIPTION

My presentation slides for a paper presented at the International Conference on Information Science and Applications (ICISA), Seoul, 2010.

TRANSCRIPT

Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas


Outline:
- Introduction
- Implementation Alternatives
- Crawler Architecture
- Implications
- Conclusion


Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework


- Background
- Motivation
- Problem Statement
- Contributions


Web Crawler: Description

- A program that downloads web pages recursively by following links from a seed set of web pages (see the sketch below)
- The backbone of a search engine's data repository
- Competing factors among search engines: coverage of the Internet and throughput of the complete download
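As an illustrative aside (not part of the original slides), the download-and-follow-links loop described above can be sketched in a few lines of Python; the breadth-first frontier, the regex-based link extraction, and the page limit are assumptions of this sketch, not details from the paper:

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a seed set of URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # avoid re-downloading pages
    pages = {}                    # url -> raw HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue              # skip unreachable pages
        pages[url] = html
        # Extract links and push unseen ones onto the frontier.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```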


[ Introduction ]

A web crawler needs a highly optimized system architecture with the ability to:
- download a large number of web pages per second
- be robust against crashes
- be manageable and considerate of resources and web servers

Most existing work focuses on improving the crawling strategy [LLWL08] [SS02].

Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint.


[ Introduction ]

Description: an analysis of web crawling from a systems perspective

Issues:
- Threads vs. events
- Distributed implementation
- Prevention of DDoS attacks
- Web crawler as a feed forward engine for the next phases of the search engine


[ Introduction ]

- First-ever threads vs. events debate from the web crawler's perspective
- A MapReduce architecture for a distributed web crawler implementation
- Implications towards the birth of an operating system for Internet-based applications, e.g. web crawlers


[ Implementation Alternatives ]

Threads vs. Events
Performance Evaluation for Threads vs. Events


Problems with threads:
- Large memory footprint
- Context-switch overhead
- Cache and TLB misses
- Expensive synchronization mechanisms

Problems with events:
- Add to the programmers' difficulty
- Debugging is troublesome

(A sketch contrasting the two styles follows.)
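To make the trade-off concrete, here is a hedged Python sketch of the two styles; the placeholder URL list and the helper names are illustrative assumptions, not the paper's implementation:

```python
import asyncio
import concurrent.futures
import urllib.request

URLS = ["http://example.com/"] * 8   # placeholder workload (assumption)

# Thread-based style: one blocking worker per in-flight request.
# Simple control flow, but every thread carries its own stack
# (memory footprint) and the kernel context-switches between them.
def fetch_blocking(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return url, len(resp.read())

def threaded_crawl(urls, pool_size=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, urls))

# Event-driven style: a single thread multiplexes all connections.
# No per-connection stacks, but the inverted control flow is the
# "programmers' difficulty" the slide refers to.
async def fetch_event(host, path="/"):
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()           # HTTP/1.0: server closes when done
    writer.close()
    await writer.wait_closed()
    return host, len(body)

async def event_crawl(hosts):
    return await asyncio.gather(*(fetch_event(h) for h in hosts))

if __name__ == "__main__":
    print(threaded_crawl(URLS))
    print(asyncio.run(event_crawl(["example.com"] * 8)))
```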


[ Implementation Alternatives ]

Environment:
- CPU: Intel Pentium 4 Core 2 Duo, 3 GHz
- RAM: 3.2 GB
- OS: Linux 2.6.28-11-generic

Experiments (a scripting sketch follows):
- 1st experiment: comparison of crawler throughput with varying pool size
- 2nd experiment: comparison of crawler throughput with varying seed URL size
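A hedged sketch of how the first experiment could be scripted, reusing threaded_crawl and URLS from the sketch above; the specific pool sizes and the timing harness are illustrative assumptions, not the authors' benchmark code:

```python
import time

def measure_throughput(urls, pool_size):
    """Pages downloaded per second for a given thread-pool size."""
    start = time.perf_counter()
    results = threaded_crawl(urls, pool_size=pool_size)
    elapsed = time.perf_counter() - start
    return len(results) / elapsed

# Experiment 1: vary the pool size while the seed set stays fixed.
seed = URLS * 125                       # stand-in for 1000 seed URLs
for pool_size in (50, 100, 200, 400):   # illustrative values
    print(pool_size, measure_throughput(seed, pool_size))
```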




[ Implementation Alternatives ]

(Experiment 1 chart: crawler throughput vs. pool size; the number of seed URLs was kept constant at 1000.)


[ Implementation Alternatives ]

(Experiment 2 chart: crawler throughput vs. number of seed URLs; the pool size was kept constant at 200.)

- High-Level View of MapReduce Usage
- High-Level Distributed Design with MapReduce
- Prevention of DDoS Attack



[ Crawler Architecture ]


The distributed implementation was done with our own version of the MapReduce [DG04] library.
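As a hedged illustration of how one crawl round decomposes into MapReduce phases (this follows the general [DG04] model; the stand-in link graph and function names are assumptions, not the authors' library):

```python
from itertools import chain

# Stand-in for real downloads: which links each page contains.
LINK_GRAPH = {
    "a.com": ["b.com", "c.net"],
    "b.com": ["a.com", "c.com"],
}

def map_fetch(url):
    """Map phase: 'fetch' a page and emit (outlink, 1) for each link."""
    return [(link, 1) for link in LINK_GRAPH.get(url, [])]

def reduce_dedup(pairs):
    """Reduce phase: group by link so each URL enters the next frontier once."""
    counts = {}
    for url, _ in pairs:
        counts[url] = counts.get(url, 0) + 1
    return sorted(counts)

def crawl_round(frontier):
    pairs = chain.from_iterable(map_fetch(u) for u in frontier)
    return reduce_dedup(pairs)

print(crawl_round(["a.com", "b.com"]))  # ['a.com', 'b.com', 'c.com', 'c.net']
```

Mappers can then run on separate crawling machines, with the reduce step merging their discovered links into the next round's frontier.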

[ Crawler Architecture ]

The accidental-DDoS scenario the crawler must avoid:
- Target server: yahoo.com
- The same crawling machines
- Simultaneous and continuing connections

[ Crawler Architecture ]


Push (right-side) order | URL | Pop (left-side) priority
1 | a.com | 1
2 | a.com/a | 7
3 | 1.a.com | 5
4 | b.com | 2
5 | c.net | 3
6 | 1.b.com | 6
7 | c.com | 4
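The table's ordering can be read as: brand-new registered domains are popped first, then new subdomains of already-queued domains, then additional paths on already-queued hosts, FIFO within each tier. Below is a Python sketch of a frontier with that behavior; the tier rule is my reading of the example above, not the paper's stated algorithm:

```python
import heapq
from urllib.parse import urlparse

class PoliteFrontier:
    """URLs are pushed on the right; pops favor hosts not yet queued,
    so no single server absorbs a burst of back-to-back requests."""

    def __init__(self):
        self.heap = []
        self.push_count = 0
        self.seen_domains = set()   # registered domains already queued
        self.seen_hosts = set()     # full hostnames already queued

    def _tier(self, url):
        host = urlparse("http://" + url).hostname
        domain = ".".join(host.split(".")[-2:])   # crude registered domain
        if domain not in self.seen_domains:
            tier = 0                # brand-new site: highest priority
        elif host not in self.seen_hosts:
            tier = 1                # new subdomain of a known site
        else:
            tier = 2                # another page on a known host
        self.seen_domains.add(domain)
        self.seen_hosts.add(host)
        return tier

    def push(self, url):
        self.push_count += 1
        # (tier, arrival order) keeps FIFO order within each tier.
        heapq.heappush(self.heap, (self._tier(url), self.push_count, url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

f = PoliteFrontier()
for u in ["a.com", "a.com/a", "1.a.com", "b.com", "c.net", "1.b.com", "c.com"]:
    f.push(u)
print([f.pop() for _ in range(7)])
# ['a.com', 'b.com', 'c.net', 'c.com', '1.a.com', '1.b.com', 'a.com/a']
```

Because whole new sites always outrank repeat visits to a known host, consecutive requests are spread across servers, which is what keeps the crawler from looking like a DDoS attack.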



IMPLICATIONS


Observations during the implementation of feed forward mechanisms in the web crawler:
- An exokernel-based approach is favorable for a web crawler
- Priority queue control
- The filesystem should not provide consistency guarantees
- Indexing and the dictionary concept should be supported by the file system


SEARCH ENGINE OPERATING SYSTEM


[DG04] Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," In Proc. 6th Int'l Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, USA, 2004, pp. 137-150.

[LLWL08] Lee, H.-T., Leonard, D., Wang, X., and Loguinov, D., "IRLbot: Scaling to 6 Billion Pages and Beyond," In Proc. 17th Int'l Conf. on World Wide Web (WWW), Beijing, China, Apr. 2008.

[SS02] Shkapenyuk, V. and Suel, T., "Design and Implementation of a High-Performance Distributed Web Crawler," In Proc. 18th Int'l Conf. on Data Engineering (ICDE), San Jose, CA, USA, Feb. 2002, pp. 357-368.

