Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework


DESCRIPTION

My presentation slides for a paper presented at the International Conference on Information Science and Applications (ICISA), Seoul, 2010.

TRANSCRIPT

Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas


Outline:
- Introduction
- Implementation Alternatives
- Crawler Architecture
- Implications
- Conclusion


Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework


- Background
- Motivation
- Problem Statement
- Contributions


Web Crawler: Description

- A program that downloads web pages recursively by following links from a seed set of web pages (see the sketch below)
- The backbone of a search engine's data repository
- Competing factors among search engines: coverage of the Internet and throughput of the complete download
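As an illustrative aside (not part of the original slides), the download-and-follow-links loop described above can be sketched in a few lines of Python; the breadth-first frontier, the regex-based link extraction, and the page limit are assumptions of this sketch, not details from the paper:

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a seed set of URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # avoid re-downloading pages
    pages = {}                    # url -> raw HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue              # skip unreachable pages
        pages[url] = html
        # Extract links and push unseen ones onto the frontier.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```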


[ Introduction ]

A web crawler needs a highly optimized system architecture with the ability to:
- download a large number of web pages per second
- be robust against crashes
- be manageable and considerate of resources and web servers

Most existing work focuses on improving the crawling strategy [LLWL08] [SS02].

Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint.


[ Introduction ]

Description: an analysis of web crawling from a systems perspective

Issues:
- Threads vs. events
- Distributed implementation
- Prevention of DDoS attacks
- Web crawler as a feed forward engine for the next phases of the search engine


[ Introduction ]

- First-ever threads vs. events debate from the web crawler's perspective
- A MapReduce architecture for a distributed web crawler implementation
- Implications towards the birth of an operating system for Internet-based applications, e.g. web crawlers


[ Implementation Alternatives ]

Threads vs. Events
Performance Evaluation for Threads vs. Events


Problems with threads:
- Large memory footprint
- Context-switch overhead
- Cache and TLB misses
- Expensive synchronization mechanisms

Problems with events:
- Add to the programmers' difficulty
- Debugging is troublesome

(A sketch contrasting the two styles follows.)
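To make the trade-off concrete, here is a hedged Python sketch of the two styles; the placeholder URL list and the helper names are illustrative assumptions, not the paper's implementation:

```python
import asyncio
import concurrent.futures
import urllib.request

URLS = ["http://example.com/"] * 8   # placeholder workload (assumption)

# Thread-based style: one blocking worker per in-flight request.
# Simple control flow, but every thread carries its own stack
# (memory footprint) and the kernel context-switches between them.
def fetch_blocking(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return url, len(resp.read())

def threaded_crawl(urls, pool_size=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, urls))

# Event-driven style: a single thread multiplexes all connections.
# No per-connection stacks, but the inverted control flow is the
# "programmers' difficulty" the slide refers to.
async def fetch_event(host, path="/"):
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()           # HTTP/1.0: server closes when done
    writer.close()
    await writer.wait_closed()
    return host, len(body)

async def event_crawl(hosts):
    return await asyncio.gather(*(fetch_event(h) for h in hosts))

if __name__ == "__main__":
    print(threaded_crawl(URLS))
    print(asyncio.run(event_crawl(["example.com"] * 8)))
```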


[ Implementation Alternatives ]

Environment:
- CPU: Intel Pentium 4 Core 2 Duo, 3 GHz
- RAM: 3.2 GB
- OS: Linux 2.6.28-11-generic

Experiments (a scripting sketch follows):
- 1st experiment: comparison of crawler throughput with varying pool size
- 2nd experiment: comparison of crawler throughput with varying seed URL size
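A hedged sketch of how the first experiment could be scripted, reusing threaded_crawl and URLS from the sketch above; the specific pool sizes and the timing harness are illustrative assumptions, not the authors' benchmark code:

```python
import time

def measure_throughput(urls, pool_size):
    """Pages downloaded per second for a given thread-pool size."""
    start = time.perf_counter()
    results = threaded_crawl(urls, pool_size=pool_size)
    elapsed = time.perf_counter() - start
    return len(results) / elapsed

# Experiment 1: vary the pool size while the seed set stays fixed.
seed = URLS * 125                       # stand-in for 1000 seed URLs
for pool_size in (50, 100, 200, 400):   # illustrative values
    print(pool_size, measure_throughput(seed, pool_size))
```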




[ Implementation Alternatives ]

(Experiment 1 chart: crawler throughput vs. pool size; the number of seed URLs was kept constant at 1000.)


[ Implementation Alternatives ]

(Experiment 2 chart: crawler throughput vs. number of seed URLs; the pool size was kept constant at 200.)

- High-Level View of MapReduce Usage
- High-Level Distributed Design with MapReduce
- Prevention of DDoS Attack



[ Crawler Architecture ]


The distributed implementation was done with our own version of the MapReduce [DG04] library.
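As a hedged illustration of how one crawl round decomposes into MapReduce phases (this follows the general [DG04] model; the stand-in link graph and function names are assumptions, not the authors' library):

```python
from itertools import chain

# Stand-in for real downloads: which links each page contains.
LINK_GRAPH = {
    "a.com": ["b.com", "c.net"],
    "b.com": ["a.com", "c.com"],
}

def map_fetch(url):
    """Map phase: 'fetch' a page and emit (outlink, 1) for each link."""
    return [(link, 1) for link in LINK_GRAPH.get(url, [])]

def reduce_dedup(pairs):
    """Reduce phase: group by link so each URL enters the next frontier once."""
    counts = {}
    for url, _ in pairs:
        counts[url] = counts.get(url, 0) + 1
    return sorted(counts)

def crawl_round(frontier):
    pairs = chain.from_iterable(map_fetch(u) for u in frontier)
    return reduce_dedup(pairs)

print(crawl_round(["a.com", "b.com"]))  # ['a.com', 'b.com', 'c.com', 'c.net']
```

Mappers can then run on separate crawling machines, with the reduce step merging their discovered links into the next round's frontier.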

[ Crawler Architecture ]

The accidental-DDoS scenario the crawler must avoid:
- Target server: yahoo.com
- The same crawling machines
- Simultaneous and continuing connections

[ Crawler Architecture ]


Push (right-side) order | URL | Pop (left-side) priority
1 | a.com | 1
2 | a.com/a | 7
3 | 1.a.com | 5
4 | b.com | 2
5 | c.net | 3
6 | 1.b.com | 6
7 | c.com | 4
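The table's ordering can be read as: brand-new registered domains are popped first, then new subdomains of already-queued domains, then additional paths on already-queued hosts, FIFO within each tier. Below is a Python sketch of a frontier with that behavior; the tier rule is my reading of the example above, not the paper's stated algorithm:

```python
import heapq
from urllib.parse import urlparse

class PoliteFrontier:
    """URLs are pushed on the right; pops favor hosts not yet queued,
    so no single server absorbs a burst of back-to-back requests."""

    def __init__(self):
        self.heap = []
        self.push_count = 0
        self.seen_domains = set()   # registered domains already queued
        self.seen_hosts = set()     # full hostnames already queued

    def _tier(self, url):
        host = urlparse("http://" + url).hostname
        domain = ".".join(host.split(".")[-2:])   # crude registered domain
        if domain not in self.seen_domains:
            tier = 0                # brand-new site: highest priority
        elif host not in self.seen_hosts:
            tier = 1                # new subdomain of a known site
        else:
            tier = 2                # another page on a known host
        self.seen_domains.add(domain)
        self.seen_hosts.add(host)
        return tier

    def push(self, url):
        self.push_count += 1
        # (tier, arrival order) keeps FIFO order within each tier.
        heapq.heappush(self.heap, (self._tier(url), self.push_count, url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

f = PoliteFrontier()
for u in ["a.com", "a.com/a", "1.a.com", "b.com", "c.net", "1.b.com", "c.com"]:
    f.push(u)
print([f.pop() for _ in range(7)])
# ['a.com', 'b.com', 'c.net', 'c.com', '1.a.com', '1.b.com', 'a.com/a']
```

Because whole new sites always outrank repeat visits to a known host, consecutive requests are spread across servers, which is what keeps the crawler from looking like a DDoS attack.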



IMPLICATIONS


Observations during the implementation of feed forward mechanisms in the web crawler:
- An exokernel-based approach is favorable for a web crawler
- Priority queue control
- The filesystem should not provide consistency guarantees
- Indexing and the dictionary concept should be supported by the file system


SEARCH ENGINE OPERATING SYSTEM


[DG04] Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," In Proc. 6th Int'l Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, USA, 2004, pp. 137-150.

[LLWL08] Lee, H.-T., Leonard, D., Wang, X., and Loguinov, D., "IRLbot: Scaling to 6 Billion Pages and Beyond," In Proc. 17th Int'l Conf. on World Wide Web (WWW), Beijing, China, Apr. 2008.

[SS02] Shkapenyuk, V. and Suel, T., "Design and Implementation of a High-Performance Distributed Web Crawler," In Proc. 18th Int'l Conf. on Data Engineering (ICDE), San Jose, CA, USA, Feb. 2002, pp. 357-368.

