
CROSSMARC Web Pages Collection: Crawling and Spidering Components

Vangelis Karkaletsis

Institute of Informatics & Telecommunications
NCSR “Demokritos”

Final Project Review

Luxembourg, October 31, 2003


Final Review “Crawling and Spidering” Luxembourg, 31 October 2003


Web Pages Collection: Focused Crawler

• Identifies web sites that are of relevance to a particular domain. It combines:

• a crawler that exploits the topic-based Web site hierarchies used by various search engines

• a crawler that submits to a search engine queries from the domain ontologies and lexicons of CROSSMARC

• a crawler that takes a set of ‘seed’ pages and conducts a ‘similar pages’ search from advanced search engines

– The list of Web sites produced is filtered
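The combination and filtering step above can be sketched as follows. This is only an illustration under stated assumptions, not the CROSSMARC implementation: the three input lists stand in for the output of the three crawler versions, and `site_of`, `combine_and_filter` and `is_relevant` are hypothetical names introduced here.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Reduce a URL to its site (scheme + host) so pages of one site merge."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"


def combine_and_filter(directory_hits, query_hits, similar_hits, is_relevant):
    """Merge the three candidate lists, drop duplicates, keep relevant sites."""
    seen = set()
    merged = []
    for url in [*directory_hits, *query_hits, *similar_hits]:
        site = site_of(url)
        if site not in seen:
            seen.add(site)
            merged.append(site)
    return [site for site in merged if is_relevant(site)]


# Hypothetical example data; the filter here is a trivial stand-in.
sites = combine_and_filter(
    ["http://shop-a.example/laptops/index.html"],
    ["http://shop-a.example/offers", "http://news.example/today"],
    ["http://SHOP-B.example/notebooks"],
    is_relevant=lambda site: "shop" in site,
)
print(sites)
```

Merging at the site level rather than the page level matches the slide's wording ("list of Web sites"); how the real filter decides relevance is not specified here.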


Web Pages Collection: crawler customization

– change of settings of crawler configuration files

– experimentation and evaluation to find the optimal settings for each version as well as their optimal combination

– train the light spidering module that filters the crawler results


Web Pages Collection: Crawler Evaluation

• more than one experimentation cycle may be needed depending on the domain and language

• our evaluation methodology provides a good way of comparing different initial settings of the crawler

Language   1st Domain Precision (%)   2nd Domain Precision (%)
English    45.2                       87.5
Italian    25.6                       41.7
Greek      26.0                       53.2
French     57.1                       30.8
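The slides do not define how precision is computed. Assuming the usual definition (the share of sites returned by the crawler that are judged relevant), two crawler settings could be compared as in this sketch. The judgement lists are invented for illustration only.

```python
def precision(judgements):
    """Share of returned sites judged relevant, as a percentage."""
    if not judgements:
        return 0.0
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical judgements for two crawler settings (True = relevant site).
setting_a = [True, False, True, True, False, False, True, False]
setting_b = [True, True, True, False, True, True, False, True]

for name, judged in [("A", setting_a), ("B", setting_b)]:
    print(f"setting {name}: precision = {precision(judged):.1f}%")
```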


Web Pages Collection: Web sites spider

• Site navigation: traverses a Web site, collecting information from each page visited and forwarding it to the “Page-Filtering” and “Link-Scoring” modules

• Page-filtering decides whether a page is interesting and should be stored

– before storing a page, its language is identified

– the page is also converted to XHTML

• Link-scoring validates the links to be followed. Only links with a score above a certain threshold are followed.
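How the three modules interact can be sketched with a toy traversal. Everything below is an assumption made for illustration: the in-memory `SITE` dictionary replaces real HTTP fetching, and `page_filter` and `link_score` are trivial stand-ins for the actual modules.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, [(link text, target URL)]).
SITE = {
    "/": ("Welcome", [("Products", "/products"), ("About us", "/about")]),
    "/products": ("Product list", [("Laptop offer", "/products/1")]),
    "/products/1": ("Laptop offer: 2.4 GHz, 512 MB", []),
    "/about": ("Company history", [("Careers", "/jobs")]),
    "/jobs": ("Open positions", []),
}


def page_filter(text):
    """Stand-in for Page-Filtering: is the page interesting?"""
    return "offer" in text.lower()


def link_score(link_text):
    """Stand-in for Link-Scoring: score a link from its text alone."""
    words = ("product", "offer")
    return 1.0 if any(w in link_text.lower() for w in words) else 0.0


def spider(start, threshold=0.5):
    """Traverse the site, storing interesting pages and following good links."""
    queue, visited, stored = deque([start]), {start}, []
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        if page_filter(text):
            # The real spider also identifies the language and converts to XHTML.
            stored.append(url)
        for link_text, target in links:
            if target not in visited and link_score(link_text) > threshold:
                visited.add(target)
                queue.append(target)
    return stored


print(spider("/"))
```

Note that "/about" and "/jobs" are never fetched: their links score below the threshold, which is the point of scoring links before visiting them.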


Web Pages Collection: Web sites spider - Navigation

• To discover and extract more URLs from a Web page, the following types of links are supported:

– Frame links, text links, image links, image maps

– JavaScript cases, HTML forms

• Each URL is checked to see whether it

– redirects to another site

– points to a non-HTML file

– is already in the queue of visited URLs
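A minimal sketch of link extraction and URL checking, using Python's standard `html.parser` and `urllib.parse`. It is not the CROSSMARC navigation module: JavaScript cases are left out, the redirect check is reduced to a same-host test (detecting a real redirect needs an HTTP request), and the non-HTML test is an assumed list of file extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Attribute that carries the URL for each supported tag (assumed mapping).
URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
             "iframe": "src", "form": "action"}
NON_HTML = (".pdf", ".zip", ".jpg", ".gif", ".png", ".doc")


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from frames, anchors, image maps and forms."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(urljoin(self.base_url, value))


def should_visit(url, site_host, visited):
    """Apply the three checks: same site, HTML-like target, not yet seen."""
    parts = urlsplit(url)
    if parts.netloc != site_host:
        return False  # leaves the site
    if parts.path.lower().endswith(NON_HTML):
        return False  # points to a non-HTML file
    return url not in visited


html = """
<frameset><frame src="menu.html"></frameset>
<a href="/offers/1.html">Offer</a>
<a href="specs.pdf"><img src="pdf.gif"></a>
<map><area href="http://other.example/x.html"></map>
<form action="search.html"></form>
"""
parser = LinkExtractor("http://shop.example/index.html")
parser.feed(html)
visited = {"http://shop.example/menu.html"}
to_follow = [u for u in parser.urls
             if should_visit(u, "shop.example", visited)]
print(to_follow)
```

An image link is an anchor wrapping an image, so it is picked up through the `a` tag's `href`.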


Web Pages Collection: Web sites spider – Page Filtering

• Two approaches were investigated:

– Machine learning: the WebPageClassifier tool was developed, which

• reads a corpus of positive and negative Web pages,

• translates it into a feature vector format, and

• uses learning algorithms to construct the Web page classifier.

– Heuristics: the heuristics-based filter

• accepts as input the Web page, in the form of a token sequence,

• compares each token to a list of regular expressions from the domain lexicon in use.
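The heuristics-based filter can be sketched as below. The patterns are invented stand-ins for a domain lexicon, and the decision rule (accept a page once `min_hits` tokens match) is an assumption, since the slides do not say how token matches are turned into an accept/reject decision.

```python
import re

# Hypothetical lexicon entries; a real domain lexicon would supply these.
LEXICON_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"laptops?", r"notebooks?", r"\d+(\.\d+)?ghz", r"\d+mb",
)]


def count_lexicon_hits(tokens):
    """Count tokens that fully match at least one lexicon pattern."""
    return sum(
        1 for token in tokens
        if any(p.fullmatch(token) for p in LEXICON_PATTERNS)
    )


def is_interesting(tokens, min_hits=3):
    """Accept the page when enough tokens match; min_hits is an assumed rule."""
    return count_lexicon_hits(tokens) >= min_hits


page = "New notebook with 2.4GHz CPU and 512MB RAM".split()
print(count_lexicon_hits(page), is_interesting(page))
```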


Web Pages Collection: Web sites spider – Link scoring

• Two approaches were investigated:

– Machine learning: the training system for link scoring takes as input

• a collection of domain-specific web sites,

• the positive web pages within these web sites,

• the domain ontology and one or more domain lexicon files

from which it creates the training data set.

– Heuristics: the heuristics-based link scorer

• takes as input the link’s text content as well as its context (left and right),

• parses the three strings looking for domain-relevant information based on a score table,

• combines the scores of the three strings using a weighted function.
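The heuristics-based scorer can be sketched as follows. The score table, the weights, and the choice of "best matching word" as the per-string score are all assumptions made for illustration; the slides only state that three string scores are combined by a weighted function.

```python
# Hypothetical score table: domain-relevant term -> score.
SCORE_TABLE = {"laptop": 1.0, "notebook": 1.0, "products": 0.5, "offers": 0.5}

# Assumed weights for (left context, link text, right context).
WEIGHTS = (0.2, 0.6, 0.2)


def string_score(text):
    """Score one string as the best table score among its words."""
    words = text.lower().split()
    return max((SCORE_TABLE.get(w, 0.0) for w in words), default=0.0)


def link_score(left, anchor, right):
    """Weighted combination of the three string scores."""
    parts = (left, anchor, right)
    return sum(w * string_score(s) for w, s in zip(WEIGHTS, parts))


score = link_score("See all our", "laptop offers", "and accessories")
print(round(score, 2), score > 0.5)
```

Giving the link text the largest weight is a design choice of this sketch; the spider would follow the link only if the combined score exceeds its configured threshold.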


Web Pages Collection: spider customization

– Use the same navigation mechanism

– Use the machine learning based “page filtering”, which requires:

• the domain ontology and lexicons

• the creation of a representative training corpus (CROSSMARC provides the Corpus Formation tool)

• the use of the WebPageClassifier tool to construct the domain-specific classifier

– Use the rule-based approach suggested for link scoring, which requires:

• the specification of new settings in the configuration file of the link scoring module

• experimenting with each specification until the optimal setting is found
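The "experiment until the optimal setting is found" step could look like the sketch below, which tries a few candidate thresholds against hand-labelled links. The data are invented, and plain accuracy is used as the quality measure only to keep the example short; the actual configuration options of the link scoring module are not described in these slides.

```python
def accuracy(predicted, actual):
    """Fraction of links on which the follow/skip decision was correct."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def best_threshold(scores, labels, candidates):
    """Return (threshold, accuracy) of the best-performing candidate."""
    results = [
        (t, accuracy([s > t for s in scores], labels)) for t in candidates
    ]
    return max(results, key=lambda r: r[1])


# Hypothetical link scores and labels (True = link leads to a positive page).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
best = best_threshold(scores, labels, candidates=[0.2, 0.35, 0.5])
print(best)
```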


Web Pages Collection: Web sites spider Evaluation

• Page Filtering

Language   1st Domain F-measure (%)   2nd Domain F-measure (%)
English    96.9                       83.2
Italian    93.7                       73.7
Greek      92.7                       87.9
French     96.9                       82.3

– we are able to identify with a high degree of confidence whether a page is interesting or not according to the domain

– results can be improved further

• so far only ontology-based features are used

• combining them with statistically selected features is a promising research direction
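The slides do not say which F-measure variant was used. Assuming the common balanced form (the harmonic mean of precision and recall), the figures in the table would be computed from page-level decisions roughly as follows. The counts are invented for illustration.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1) in percent from page-level decision counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 pages correctly accepted, 10 wrongly accepted,
# 30 interesting pages missed.
print(round(f_measure(90, 10, 30), 1))
```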


Web Pages Collection: Web sites spider Evaluation

• Link scoring:

– The results of both methods are rather poor

– Combining the two methods could be investigated as a way to improve recall

– In conclusion, the task of scoring links without visiting them

• remains a very challenging one and

• is becoming more important in the general setting of topic-specific search engines and portals


Concluding Remarks

• Crawler

– Applied in both domains of the project

– Customization instructions are provided

– The tool and the corpora used in both domains and four languages will be available for research purposes

• Spider

– Applied in three domains

– Customization methodology and tools are provided

– The corpora collected for page filtering and link scoring will be available for research purposes