Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework

Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas. International Conference on Information Science and Applications 2010.


DESCRIPTION

My presentation slides for the paper presented at the International Conference on Information Science and Applications (ICISA), Seoul, 2010.

TRANSCRIPT

Page 1: Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework

Authors: Muhammad Atif Qureshi, Arjumand Younus, Francisco Rojas

International Conference on Information Science and Applications 2010

Page 2

- Introduction
- Implementation Alternatives
- Crawler Architecture
- Implications
- Conclusion


Page 3

Analyzing Web Crawler as Feed Forward Engine for Efficient Solution to Search Problem in the Minimum Amount of Time through a Distributed Framework


Page 4

- Background
- Motivation
- Problem Statement
- Contributions


Page 5

Web crawler: description

A program that downloads web pages recursively by following links from a seed set of web pages

The backbone of a search engine's data repository

Competing factors among search engines:
- coverage of the Internet
- throughput of the complete download


[ Introduction ]
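The slides stop at the description, so purely as an illustration (not the authors' crawler), here is a minimal sketch of the recursive download loop in Python; the breadth-first frontier, the LinkParser helper, and the page limit are all assumptions made for the example:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: pop a URL, download it, enqueue its out-links."""
    frontier, seen, pages = deque(seeds), set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

pages = crawl(["http://example.com"])  # illustrative seed set
```

A production crawler replaces the single in-process deque with a distributed frontier and politeness policies, which is what the later slides turn to.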

Page 6

A web crawler needs a highly optimized system architecture with the ability to:
- download a large number of web pages per second
- be robust against crashes
- be manageable and considerate of resources and web servers

Most existing work focuses on improving the crawling strategy [LLWL08] [SS02]

Our focus is to provide a convincing analysis of the web crawler from a systems viewpoint


[ Introduction ]

Page 7

Description: analysis of web crawling from a systems perspective

Issues:
- threads vs. events
- distributed implementation
- prevention of DDoS attacks
- the web crawler as a feed-forward engine for the next phases of a search engine


[ Introduction ]

Page 8

- the first threads-vs.-events debate from the web crawler's perspective

- a MapReduce architecture for a distributed web crawler implementation

- implications toward the birth of an operating system for Internet-based applications, e.g. web crawlers


[ Introduction ]

Page 9

Threads vs. Events

Performance Evaluation for Threads vs. Events


Page 10

Problems with threads:
- large memory footprint
- context-switch overhead
- cache and TLB misses
- expensive synchronization mechanisms

Problems with events:
- add to the programmer's difficulty
- debugging is troublesome

(A code sketch contrasting the two styles follows this slide.)


[ Implementation Alternatives ]
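To ground the debate, here is a sketch of the two styles applied to page fetching. It is illustrative only: the host list and pool size are made up, and Python's standard library stands in for a crawler's real I/O layer.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

HOSTS = ["example.com"] * 8  # made-up workload

# Thread style: each worker blocks inside a synchronous fetch; the kernel
# scheduler interleaves workers (per-thread stacks, context switches).
def fetch_blocking(host):
    with urlopen(f"http://{host}/", timeout=5) as resp:
        return len(resp.read())

def crawl_threads(hosts, pool_size=4):
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(fetch_blocking, hosts))

# Event style: one thread multiplexes many in-flight sockets; every await
# yields to the event loop instead of blocking a thread.
async def fetch_event(host):
    reader, writer = await asyncio.open_connection(host, 80)
    writer.write(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    await writer.drain()
    body = await reader.read()  # HTTP/1.0: server closes when done
    writer.close()
    return len(body)

async def crawl_events(hosts):
    return await asyncio.gather(*(fetch_event(h) for h in hosts))

if __name__ == "__main__":
    print(crawl_threads(HOSTS))
    print(asyncio.run(crawl_events(HOSTS)))
```

The thread version pays per worker for a stack, context switches, and synchronization; the event version keeps a single stack but inverts control flow into awaits, which is exactly the programmability cost listed above.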

Page 11

Environment:
- CPU: Intel Pentium 4 Core 2 Duo, 3 GHz
- RAM: 3.2 GB
- OS: Linux 2.6.28-11-generic

Experiments:
- 1st experiment: comparison of crawler throughput with varying pool size
- 2nd experiment: comparison of crawler throughput with varying seed URL size


[ Implementation Alternatives ]
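The next two slides give the measured results; purely to illustrate the shape of the first experiment (this is not the authors' harness), a toy throughput measurement for varying pool sizes could look like the following, where the sleep stands in for network latency and all sizes are arbitrary:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    time.sleep(0.05)  # stand-in for network latency; real code would GET
    return url

def throughput(pool_size, urls):
    """Pages per second achieved by a worker pool of the given size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fetch, urls))
    return len(urls) / (time.perf_counter() - start)

urls = [f"http://example.com/{i}" for i in range(1000)]  # 1000 seed URLs
for pool_size in (50, 100, 200, 400):
    print(pool_size, round(throughput(pool_size, urls), 1))
```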

Page 12


[ Implementation Alternatives ]

[Figure: crawler throughput vs. pool size; the number of seed URLs was kept constant at 1000]

Page 13


[ Implementation Alternatives ]

[Figure: crawler throughput vs. number of seed URLs; the pool size was kept constant at 200]

Page 14

- High-level view of MapReduce usage
- High-level distributed design with MapReduce
- Prevention of DDoS attack


Page 15


[ Crawler Architecture ]

Page 16


The distributed implementation was done with our own version of a MapReduce [DG04] library.

[ Crawler Architecture ]
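The slides do not show the library itself; as an assumed sketch of how one crawl round maps onto the MapReduce model, the map step fetches a page and emits (out-link, source) pairs, the shuffle groups by link, and the reduce step deduplicates links into the next round's frontier. The regex link extraction and the synchronous in-memory runner are simplifications, not the authors' code:

```python
import re
from collections import defaultdict
from urllib.parse import urljoin
from urllib.request import urlopen

def map_fetch(url):
    """Map: download one URL and emit (out_link, source_url) pairs."""
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
    except OSError:
        return  # unreachable page: emit nothing
    for link in re.findall(r'href="([^"]+)"', html):
        yield urljoin(url, link), url

def reduce_unique(link, sources):
    """Reduce: many mappers may discover the same link; keep it once."""
    return link

def crawl_round(frontier, seen):
    """One round: map over the frontier, shuffle by link, reduce to the
    deduplicated frontier for the next round."""
    shuffled = defaultdict(list)
    for url in frontier:              # each call would be a map task
        for link, source in map_fetch(url):
            shuffled[link].append(source)
    next_frontier = [reduce_unique(link, srcs)
                     for link, srcs in shuffled.items()
                     if link.startswith("http") and link not in seen]
    seen.update(next_frontier)
    return next_frontier

seen = {"http://example.com"}
frontier = crawl_round(["http://example.com"], seen)  # one illustrative round
```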

Page 17

Inadvertent DDoS scenario:
- target server: yahoo.com
- same crawling machines
- simultaneous and continuing connections

[ Crawler Architecture ]


Page 18

Push (right-side) order | URL     | Pop (left-side) priority
1                       | a.com   | 1
2                       | a.com/a | 7
3                       | 1.a.com | 5
4                       | b.com   | 2
5                       | c.net   | 3
6                       | 1.b.com | 6
7                       | c.com   | 4

Reading: URLs are pushed into the frontier in discovery order but popped by a priority that rotates across domains, so consecutive requests never hit the same server; the resulting pop order is a.com, b.com, c.net, c.com, 1.a.com, 1.b.com, a.com/a.

[ Crawler Architecture ]
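One possible reading of the table as code is a frontier that pushes in discovery order but pops round-robin across domains, so a domain just served goes behind every other domain before being served again. This is my illustration, not the authors' data structure, and its tie-breaking differs slightly from the table (it pops a.com/a before 1.a.com):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def domain_key(url):
    """Group URLs by registered domain so subdomains of a just-served site
    are deferred too. Naive last-two-labels heuristic, for illustration."""
    host = urlparse(url if "//" in url else "//" + url).netloc
    return ".".join(host.split(".")[-2:])

class PoliteFrontier:
    """Push in discovery order; pop round-robin across domains."""
    def __init__(self):
        self.queues = defaultdict(deque)  # domain -> FIFO of its URLs
        self.rotation = deque()           # round-robin order of domains

    def push(self, url):
        key = domain_key(url)
        if not self.queues[key] and key not in self.rotation:
            self.rotation.append(key)
        self.queues[key].append(url)

    def pop(self):
        key = self.rotation.popleft()     # raises IndexError when empty
        url = self.queues[key].popleft()
        if self.queues[key]:              # domain still has work:
            self.rotation.append(key)     # requeue it behind the others
        return url

frontier = PoliteFrontier()
for u in ["a.com", "a.com/a", "1.a.com", "b.com",
          "c.net", "1.b.com", "c.com"]:  # push order from the table
    frontier.push(u)
print([frontier.pop() for _ in range(7)])
# -> ['a.com', 'b.com', 'c.net', 'c.com', 'a.com/a', '1.b.com', '1.a.com']
```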


Page 19

IMPLICATIONS

Page 20

SEARCH ENGINE OPERATING SYSTEM

Observations during the implementation of feed-forward mechanisms in the web crawler:
- an exokernel-based approach is favorable for a web crawler
- priority queue control
- the filesystem should not provide consistency guarantees
- indexing and the dictionary concept should be supported by the filesystem

Page 21

REFERENCES

[DG04] Dean, J. and Ghemawat, S., "MapReduce: simplified data processing on large clusters," in Proc. 6th Int'l Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, 2004, pp. 137-150.

[LLWL08] Lee, H.T., Leonard, D., Wang, X., and Loguinov, D., "IRLbot: scaling to 6 billion pages and beyond," in Proc. 17th Int'l Conf. on World Wide Web (WWW), Beijing, China, April 21-25, 2008.

[SS02] Shkapenyuk, V. and Suel, T., "Design and implementation of a high-performance distributed web crawler," in Proc. 18th Int'l Conf. on Data Engineering (ICDE), San Jose, CA, Feb. 2002, pp. 357-368.
