web information extraction for the database research domain

WEB INFORMATION EXTRACTION FOR THE DB RESEARCH DOMAIN. Michael Genkin, Liat Kakun. School of Engineering and Computer Science. Advisor: Dr. Sara Cohen

Upload: michael-genkin

Post on 24-Jun-2015


DESCRIPTION

A presentation describing my final project for an engineering degree at the Hebrew University of Jerusalem - a system for extracting information from web sites into instances of an XML schema, utilizing machine learning, structural analysis of documents and a divide & conquer strategy.

TRANSCRIPT

Page 1: Web Information Extraction for the Database Research Domain

WEB INFORMATION EXTRACTION FOR THE DB RESEARCH DOMAIN

Michael Genkin

Liat Kakun

School of Engineering and Computer Science

Advisor: Dr. Sara Cohen

Page 2: Web Information Extraction for the Database Research Domain

Introduction

Wealth of information available online:

Too much for humans to handle effectively.

Mostly inaccessible to computers.

A web information extraction project:

Provide a complete, domain-specific system.

Allow structured queries on top of web information.

Part of a research effort to develop tools that support scientific policy management at the HUJI DB Group.

Advisor: Dr. Sara Cohen

Other groups are creating additional components: web crawler, UI.

Page 3: Web Information Extraction for the Database Research Domain

Introduction

Extract information from DB research projects' web sites:

Domain specific

Divide & conquer

Structural document analysis

Linguistic analysis

Machine learning

The domain is encoded in an XML schema document, which contains processing instructions as well as domain semantics.

The result is an XML-based, queryable database.
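To illustrate what a structured query over the extracted XML could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names (`projects`, `project`, `university`, `member`) and the sample values are hypothetical placeholders; the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction result; the real schema is not shown in the slides.
RESULT = """
<projects>
  <project>
    <name>Project A</name>
    <university>University X</university>
    <member>Alice</member>
    <member>Bob</member>
  </project>
  <project>
    <name>Project B</name>
    <university>University Y</university>
    <member>Carol</member>
  </project>
</projects>
"""


def members_of(root, university):
    """Return the members of every project hosted at the given university."""
    return [
        member.text
        for project in root.findall("project")
        if project.findtext("university") == university
        for member in project.findall("member")
    ]


root = ET.fromstring(RESULT)
print(members_of(root, "University X"))
```

A query like this is not possible against the raw web pages, which is the motivation for mapping them to schema instances first.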

Page 4: Web Information Extraction for the Database Research Domain

Methods – Structural Analysis #1

[Figure: an example document before and after the transformation]

Transform each input document into a structurally valid, monolithic document using industry-standard tools such as HTML Tidy and Readability.
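The idea of structural normalization can be sketched with the Python standard library alone. This is not HTML Tidy or Readability, and it is far less thorough than either; it only shows the core step of turning tag soup into well-formed markup by closing whatever was left open. The class and function names are illustrative.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Normalizer(HTMLParser):
    """Re-emit HTML as well-formed markup, closing any tags left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # output fragments
        self.stack = []   # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append("<{}{}/>".format(tag, rendered))
        else:
            self.out.append("<{}{}>".format(tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</{}>".format(open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Close whatever is still open at the end of the input.
        while self.stack:
            self.out.append("</{}>".format(self.stack.pop()))


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


fixed = normalize("<div><p>One<p>Two<br></div>")
print(fixed)
print(ET.fromstring(fixed).tag)  # parses only because the output is well formed
```

Unlike a real tidying tool, this sketch nests the second `<p>` inside the first instead of closing it implicitly, and it ignores comments and doctypes. It is enough to make later stages safe to run with an XML parser.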

Page 5: Web Information Extraction for the Database Research Domain

Methods – Structural Analysis #2

Vertically segment each document into logical blocks.

Employ stack-based style analysis to identify each of the blocks.
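A minimal sketch of stack-based segmentation follows. It assumes, purely for illustration, that heading tags (`h1` to `h6`) are the style cue that opens a new logical block; the project's actual style analysis is not described in the slides and is likely richer (fonts, emphasis, layout).

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BlockSegmenter(HTMLParser):
    """Split a page into blocks, starting a new block at each heading."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags: the style context of the current text
        self.blocks = []   # each block: {"title": str, "text": [str, ...]}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in HEADING_TAGS:
            self.blocks.append({"title": "", "text": []})

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.blocks:
            # Text before the first heading goes into an untitled block.
            self.blocks.append({"title": "", "text": []})
        if any(tag in HEADING_TAGS for tag in self.stack):
            self.blocks[-1]["title"] += text
        else:
            self.blocks[-1]["text"].append(text)


def segment(html_text):
    parser = BlockSegmenter()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


page = ("<h2>People</h2><p>Alice</p><p>Bob</p>"
        "<h2>Publications</h2><ul><li>Paper A</li></ul>")
for block in segment(page):
    print(block["title"], block["text"])
```

The stack is what lets the segmenter decide, for any piece of text, which styles are in effect at that point.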

Page 6: Web Information Extraction for the Database Research Domain

Methods - Classification

Employ multiclass classification (by vector similarity) to map the logical document blocks to the appropriate schema elements.
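Classification by vector similarity can be sketched as follows: each schema element is represented by a bag-of-words vector built from labeled example blocks, and a new block is assigned to the element with the highest cosine similarity. The element names, the training texts, and the choice of raw term counts as features are assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a real system would use richer features."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the vectors of all examples per label into one class vector."""
    class_vectors = {}
    for label, text in examples:
        class_vectors.setdefault(label, Counter()).update(vectorize(text))
    return class_vectors


def classify(block_text, class_vectors):
    """Return the label whose class vector is most similar to the block."""
    vector = vectorize(block_text)
    return max(class_vectors, key=lambda label: cosine(vector, class_vectors[label]))


# Hypothetical schema elements and training text.
model = train([
    ("people", "professor students phd faculty members"),
    ("publications", "paper proceedings vldb sigmod journal"),
])
print(classify("our phd students and faculty", model))
print(classify("papers in sigmod proceedings", model))
```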

Page 7: Web Information Extraction for the Database Research Domain

Methods – Pattern Recognition

Pattern: .//bibliography/ul/li/*

Mine likely candidate blocks for patterns using the PAT tree algorithm, adjusted to find a maximum-likelihood pattern.
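The sketch below shows what a mined pattern such as `.//bibliography/ul/li/*` looks like in use. It does not implement a PAT tree: as a much simpler stand-in, it counts root-to-element tag paths and picks the most frequently repeated one (preferring deeper paths on ties). The sample block and function names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical candidate block, already normalized to well-formed markup.
BLOCK = """
<block>
  <bibliography>
    <ul>
      <li><a>Paper A</a></li>
      <li><a>Paper B</a></li>
      <li><a>Paper C</a></li>
    </ul>
  </bibliography>
</block>
"""


def tag_paths(elem, prefix=()):
    """Yield the tag path from the root to every element in the tree."""
    path = prefix + (elem.tag,)
    yield path
    for child in elem:
        yield from tag_paths(child, path)


def most_repeated_path(root):
    """Return (pattern, count) for the most frequent tag path below the root."""
    counts = Counter(path for path in tag_paths(root) if len(path) > 1)
    path, count = max(counts.items(), key=lambda item: (item[1], len(item[0])))
    return ".//" + "/".join(path[1:]), count


root = ET.fromstring(BLOCK)
pattern, count = most_repeated_path(root)
print(pattern, count)
print([elem.text for elem in root.findall(pattern)])
```

The pattern string is in the limited XPath dialect that `ElementTree.findall` accepts, so the slide's own pattern, `.//bibliography/ul/li/*`, can be applied to the same tree in the same way.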

Page 8: Web Information Extraction for the Database Research Domain

Methods – Metadata Extraction

Use a conditional random field (CRF) to extract additional metadata where appropriate (e.g. from bibliographic lists).

Page 9: Web Information Extraction for the Database Research Domain

Results – Setting

50 web pages of DB research projects from American and Israeli universities.

Chosen manually to represent a wide variety of web page styles.

All pages were pre-processed by our system (their structure analyzed), then manually tagged for classification, patterns, and metadata.

20% of the dataset is randomly sampled for training purposes.

Repeated 5 times, and averaged.
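The sampling protocol described on this slide can be sketched as a repeated random hold-out. The `evaluate` callback, the fixed seed, and the function name are assumptions; the slides state only the 20% training fraction, the 5 repetitions, and the averaging.

```python
import random
from statistics import mean


def repeated_holdout(items, evaluate, train_fraction=0.2, repeats=5, seed=0):
    """Average evaluate(train, test) over several random train/test splits.

    evaluate is a caller-supplied function that trains on the first list,
    scores on the second, and returns a number.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n_train = max(1, round(len(items) * train_fraction))
    scores = []
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        scores.append(evaluate(train, test))
    return mean(scores)
```

With 50 pages, each repetition trains on 10 pages and evaluates on the remaining 40.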

Page 10: Web Information Extraction for the Database Research Domain

Results – Measures

Standard information extraction measures, adapted:

Accuracy: the number of classifications that were correct, per document.

Recall: content recall and structural recall, weighted, per logical block.

Document recall: [formula not preserved in this transcript]

Precision: defined similarly.
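Since the transcript does not preserve the formulas, the following is only a sketch of how such measures are commonly computed, not the definitions used in the project. The equal weighting of content and structural recall and the use of a plain mean for the document-level figure are assumptions.

```python
def precision_recall(extracted, expected):
    """Standard set-based precision and recall."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def block_recall(content_recall, structural_recall, content_weight=0.5):
    """Weighted recall for one logical block; the 0.5 weight is an assumption."""
    return content_weight * content_recall + (1 - content_weight) * structural_recall


def document_recall(block_recalls):
    """Assumed here to be the mean over a document's logical blocks."""
    return sum(block_recalls) / len(block_recalls) if block_recalls else 0.0


precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall)
print(document_recall([block_recall(1.0, 0.5), block_recall(0.5, 0.5)]))
```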

Page 11: Web Information Extraction for the Database Research Domain

Results

                          Precision   Recall
Pattern Recognition       85%         89.7%

Classification Accuracy   82.5%

Page 12: Web Information Extraction for the Database Research Domain

Conclusions

This is a feasible approach for creating a web information extraction system.

Good results can be achieved with a relatively small sample.

The modular system design allows easy adaptation for additional domains.

Future directions:

Schema generation

Better information integration

Additional modules (e.g. deep linguistic analysis)

Page 13: Web Information Extraction for the Database Research Domain

Questions?

Page 14: Web Information Extraction for the Database Research Domain

Thank You!