A PEER-TO-PEER BASED PASSIVE WEB CRAWLING SYSTEM

Qing-Cai Chen; Xiao-Hong Yang; Xiao-Long Wang

Machine Learning and Cybernetics (ICMLC), 2011 International Conference on, Year: 2011, Page(s): 1878 – 1883

Speaker: Chang, Kun-Hsiang




Outline

Abstract
P2P based passive web crawling system
Crawler server registering
Content updated notification
Download updated content by P2P network
Website discovering


Abstract

This paper proposes an innovative client/server based web crawling system.

Main benefits:
The capability of timely management of web changes for a crawler.
The saving of website bandwidth resources.
The capability of downloading large files or multimedia content.
The capability of protecting intellectual property while indexing and searching the content.


The basic principle of a Crawler


P2P based passive web crawling system


Responsibilities Assignment for Crawler Server and Crawler Client


Crawler server registering

robots.xml
Port
IP address
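
The slide names only robots.xml, a port, and an IP address, so the following is a minimal sketch of how a crawler server's address might be recorded in such a file. The `<robots>` root, the `<crawler>` element, its `ip` and `port` attributes, and both function names are assumptions made for illustration; the slides do not show the actual file layout.

```python
import xml.etree.ElementTree as ET


def register_crawler(robots_xml: str, ip: str, port: int) -> str:
    """Return robots.xml text with one crawler-server entry added.

    The <crawler ip=... port=...> layout is hypothetical, not taken from the paper.
    """
    root = ET.fromstring(robots_xml)
    # Leave the file unchanged if this server is already registered.
    for entry in root.findall("crawler"):
        if entry.get("ip") == ip and entry.get("port") == str(port):
            return robots_xml
    ET.SubElement(root, "crawler", {"ip": ip, "port": str(port)})
    return ET.tostring(root, encoding="unicode")


def registered_servers(robots_xml: str) -> list[tuple[str, int]]:
    """List the (ip, port) pairs currently recorded in the file."""
    root = ET.fromstring(robots_xml)
    return [(e.get("ip"), int(e.get("port"))) for e in root.findall("crawler")]
```

For example, `register_crawler("<robots/>", "192.0.2.10", 8080)` adds one entry, and registering the same address again returns the text unchanged.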


Content updated notification

A newly registered server has to wait several days or weeks before it is notified to download all of the historical content on this website.
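
The notification flow can be sketched as a small in-memory model: registered servers are told about each new update, while the full history reaches a new server only through a separate, later notification. The `Website` class, its method names, and the idea of tracking a per-server "history sent" flag are assumptions for illustration; the slides do not describe how the website schedules the history notification.

```python
from dataclasses import dataclass, field


@dataclass
class Website:
    """Toy model of a website that notifies registered crawler servers (hypothetical)."""

    history: list[str] = field(default_factory=list)        # every URL published so far
    servers: dict[str, bool] = field(default_factory=dict)  # server -> history sent?

    def register(self, server: str) -> None:
        # A new server starts without the site's history.
        self.servers.setdefault(server, False)

    def publish(self, url: str) -> list[tuple[str, list[str]]]:
        """Record an update and return one (server, urls) notification per server."""
        self.history.append(url)
        return [(server, [url]) for server in self.servers]

    def send_history(self, server: str) -> tuple[str, list[str]]:
        """Deferred notification that lets a server download all history content."""
        self.servers[server] = True
        return (server, list(self.history))
```

In this model a server that registers after some content was published receives only later updates from `publish`, and catches up on earlier content when the site eventually calls `send_history` for it.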


Download updated content by P2P network


Website discovering


END