Web Information Extraction for the Database Research Domain
DESCRIPTION
A presentation describing my final project for an engineering degree at the Hebrew University of Jerusalem - a system for extracting information from web sites into instances of an XML schema, utilizing machine learning, structural analysis of documents, and a divide & conquer strategy.
TRANSCRIPT
WEB INFORMATION EXTRACTION FOR THE DB RESEARCH DOMAIN
Michael Genkin ([email protected])
Liat Kakun ([email protected])
School of Engineering and Computer Science
Advisor: Dr. Sara Cohen
Introduction
A wealth of information is available online. Too much for it to be handled effectively by humans; mostly inaccessible to computers.
A web information extraction project: provide a complete, domain-specific system; allow structured queries on top of web information.
Part of research on developing tools to support scientific policy management @ HUJI DB Group. Advisor: Dr. Sara Cohen. Other groups are creating additional components: web crawler, UI.
Introduction
Extract information from DB research projects' web sites.
Domain specific. Divide & conquer. Structural document analysis. Linguistic analysis. Machine learning.
The domain is encoded in an XML schema document, which contains processing instructions as well as domain semantics.
The result is an XML-based, queryable database.
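To make the "queryable database" idea concrete, here is a minimal sketch, using a hypothetical and greatly simplified schema (not the project's actual one), of how an extracted instance could be validated and queried with Python and lxml:

```python
# A greatly simplified, hypothetical schema; the project's real schema also
# carries processing instructions and richer domain semantics.
from lxml import etree

SCHEMA_XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="project">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="publication" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

INSTANCE = b"""<project>
  <title>Example DB Project</title>
  <publication>Some paper, 2011</publication>
</project>"""

schema = etree.XMLSchema(etree.fromstring(SCHEMA_XSD))
doc = etree.fromstring(INSTANCE)
print(schema.validate(doc))                      # True: the instance fits the schema
print(doc.xpath("/project/publication/text()"))  # a structured query over the result
```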
Methods – Structural Analysis #1
Transform each input document into a structurally valid, monolithic document, using industry-standard tools such as HTML Tidy and Readability.
(Slide shows before/after views of an example document.)
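As a rough illustration of this step, the sketch below chains the Python wrappers pytidylib and readability-lxml; the project itself may invoke the original HTML Tidy and Readability tools differently:

```python
from tidylib import tidy_document     # pip install pytidylib
from readability import Document      # pip install readability-lxml

def normalize(raw_html: str) -> str:
    """Turn a raw page into a structurally valid, content-only document."""
    # Repair the markup so it parses as well-formed XHTML (HTML Tidy).
    tidied, _errors = tidy_document(raw_html, options={"output-xhtml": 1})
    # Keep the main content, dropping navigation and boilerplate (Readability).
    return Document(tidied).summary()
```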
Methods – Structural Analysis #2
Vertically segment each document into logical blocks.
Employ stack-based style analysis to identify each of the blocks.
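The following is one simplified way such stack-based segmentation could look in Python, treating heading tags as the style markers; it is an illustration, not the project's exact algorithm:

```python
from lxml import html

HEADING_LEVEL = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

def segment(doc_html: str):
    """Split a normalized document into logical blocks at heading boundaries."""
    tree = html.fromstring(doc_html)
    blocks, current, stack = [], [], []   # stack holds the open heading levels
    for el in tree.iter():
        if not isinstance(el.tag, str):   # skip comments/processing instructions
            continue
        level = HEADING_LEVEL.get(el.tag)
        if level is not None:
            # A heading at the same or a shallower level closes the open blocks.
            while stack and stack[-1] >= level:
                stack.pop()
            if current:
                blocks.append(current)
                current = []
            stack.append(level)
        text = (el.text or "").strip()
        if text:
            current.append(text)
    if current:
        blocks.append(current)
    return blocks

print(segment("<div><h2>People</h2><p>Jane Roe</p><h2>Papers</h2><ul><li>P1</li></ul></div>"))
# -> [['People', 'Jane Roe'], ['Papers', 'P1']]
```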
Methods - Classification
Employ multiclass classification (by vector similarity) to map the logical document blocks to the appropriate schema elements.
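A minimal sketch of similarity-based classification, assuming scikit-learn with TF-IDF features and cosine similarity; the project's actual features, similarity measure, and schema element names may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical labelled training blocks: (block text, schema element) pairs.
train_texts = ["John Doe, Jane Roe, PhD students",
               "SIGMOD 2010 paper on XML querying"]
train_labels = ["members", "publications"]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_texts)

def classify(block_text: str) -> str:
    """Map a logical block to the most similar schema element."""
    sims = cosine_similarity(vectorizer.transform([block_text]), train_vecs)
    return train_labels[sims.argmax()]

print(classify("Our VLDB 2011 paper on query optimization"))  # -> "publications"
```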
Methods – Pattern Recognition
Pattern: .//bibliography/ul/li/*
Mine likely candidate blocks for patterns using the PAT tree algorithm, adjusted to find a maximum-likelihood pattern.
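As a rough stand-in for the PAT tree, the sketch below simply counts descendant tag paths within a candidate block and returns the most frequent one; the real algorithm is considerably more involved:

```python
from collections import Counter
from lxml import html

def most_likely_pattern(block_html: str, depth: int = 3) -> str:
    """Return the most frequent descendant tag path of a candidate block."""
    root = html.fromstring(block_html)
    paths = Counter()
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue
        # Build the tag path from the block root down to this element.
        path, node = [], el
        while node is not None and node is not root:
            path.append(node.tag)
            node = node.getparent()
        if 0 < len(path) <= depth:
            paths[".//" + "/".join(reversed(path))] += 1
    pattern, _count = paths.most_common(1)[0]
    return pattern

print(most_likely_pattern(
    "<ul><li><a href='#'>P1</a></li><li><a href='#'>P2</a></li></ul>"))  # -> ".//li"
```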
Methods – Metadata Extraction
Use a conditional random field (CRF) to extract additional metadata where appropriate (e.g. bibliographic lists).
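A minimal sketch of CRF-based metadata tagging, assuming the sklearn-crfsuite package; the feature template and label set (author, title, year) are illustrative guesses, not the project's actual ones:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

# One hand-labelled bibliography entry as toy training data.
tokens = ["J.", "Doe", "Querying", "XML", "2010"]
labels = ["author", "author", "title", "title", "year"]
X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # per-token metadata labels
```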
Results – Setting
50 web pages of DB research projects from American and Israeli universities, chosen manually to represent a wide variety of web page styles.
All pages were pre-processed by our system (their structure analyzed), then manually tagged for classification, patterns, and metadata.
20% of the dataset is sampled randomly for training; this is repeated 5 times and the results averaged.
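The evaluation loop can be sketched as repeated random sub-sampling; train_and_score is a hypothetical placeholder for training the extractor and scoring it on the held-out pages:

```python
import random

def repeated_evaluation(pages, train_and_score, repeats=5, train_fraction=0.2):
    """Average a score over several random 20%/80% train/test splits."""
    scores = []
    for _ in range(repeats):
        shuffled = random.sample(pages, len(pages))
        cut = int(len(pages) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```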
Results – Measures
Standard information extraction measures, adapted.
Accuracy: the proportion of classifications that were correct, per document.
Recall: content recall and structural recall, weighted, per logical block; document recall is derived from the per-block values.
Precision: defined similarly.
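For reference, the standard (unadapted) precision and recall these measures build on; the per-block weighting and the document-level aggregation described above are the project's own adaptations:

```latex
\mathrm{Precision} = \frac{|\text{extracted} \cap \text{correct}|}{|\text{extracted}|},
\qquad
\mathrm{Recall} = \frac{|\text{extracted} \cap \text{correct}|}{|\text{correct}|}
```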
Results
Pattern recognition: 85% precision, 89.7% recall.
Classification accuracy: 82.5%.
Conclusions
This is a feasible approach for creating a web information extraction system.
Good results can be achieved with a relatively small sample.
The modular system design allows easy adaptation for additional domains.
Future directions: schema generation, better information integration, additional modules (e.g. deep linguistic analysis).
Questions?
Thank You!