Web Information Extraction for the Database Research Domain
DESCRIPTION
A presentation describing my final project for an engineering degree at the Hebrew University of Jerusalem - a system for extracting information from web sites into instances of an XML schema, utilizing machine learning, structural analysis of documents, and a divide & conquer strategy.
TRANSCRIPT
WEB INFORMATION EXTRACTION FOR THE DB RESEARCH DOMAIN
Michael Genkin ([email protected])
Liat Kakun ([email protected])
School of Engineering and Computer Science
Advisor: Dr. Sara Cohen
Introduction
A wealth of information is available online. Too much for it to be handled effectively by humans; mostly inaccessible to computers.
A web information extraction project: provide a complete, domain-specific system; allow structured queries on top of web information.
Part of research on developing tools to support scientific policy management @ HUJI DB Group. Advisor: Dr. Sara Cohen. Other groups are creating additional components: web crawler, UI.
Introduction
Extract information from DB research projects' web sites.
Domain specific. Divide & conquer. Structural document analysis. Linguistic analysis. Machine learning.
The domain is encoded in an XML schema document, which contains processing instructions as well as domain semantics.
The result is an XML-based, queryable database.
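To make the "queryable database" idea concrete, here is a minimal sketch, using a hypothetical and greatly simplified schema (not the project's actual one), of how an extracted instance could be validated and queried with Python and lxml:

```python
# A greatly simplified, hypothetical schema; the project's real schema also
# carries processing instructions and richer domain semantics.
from lxml import etree

SCHEMA_XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="project">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="publication" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

INSTANCE = b"""<project>
  <title>Example DB Project</title>
  <publication>Some paper, 2011</publication>
</project>"""

schema = etree.XMLSchema(etree.fromstring(SCHEMA_XSD))
doc = etree.fromstring(INSTANCE)
print(schema.validate(doc))                      # True: the instance fits the schema
print(doc.xpath("/project/publication/text()"))  # a structured query over the result
```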
Methods – Structural Analysis #1
Transform each input document into a structurally valid, monolithic document, using industry-standard tools such as HTML Tidy and Readability.
(Slide shows before/after views of an example document.)
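As a rough illustration of this step, the sketch below chains the Python wrappers pytidylib and readability-lxml; the project itself may invoke the original HTML Tidy and Readability tools differently:

```python
from tidylib import tidy_document     # pip install pytidylib
from readability import Document      # pip install readability-lxml

def normalize(raw_html: str) -> str:
    """Turn a raw page into a structurally valid, content-only document."""
    # Repair the markup so it parses as well-formed XHTML (HTML Tidy).
    tidied, _errors = tidy_document(raw_html, options={"output-xhtml": 1})
    # Keep the main content, dropping navigation and boilerplate (Readability).
    return Document(tidied).summary()
```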
Methods – Structural Analysis #2
Vertically segment each document into logical blocks.
Employ stack-based style analysis to identify each of the blocks.
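The following is one simplified way such stack-based segmentation could look in Python, treating heading tags as the style markers; it is an illustration, not the project's exact algorithm:

```python
from lxml import html

HEADING_LEVEL = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

def segment(doc_html: str):
    """Split a normalized document into logical blocks at heading boundaries."""
    tree = html.fromstring(doc_html)
    blocks, current, stack = [], [], []   # stack holds the open heading levels
    for el in tree.iter():
        if not isinstance(el.tag, str):   # skip comments/processing instructions
            continue
        level = HEADING_LEVEL.get(el.tag)
        if level is not None:
            # A heading at the same or a shallower level closes the open blocks.
            while stack and stack[-1] >= level:
                stack.pop()
            if current:
                blocks.append(current)
                current = []
            stack.append(level)
        text = (el.text or "").strip()
        if text:
            current.append(text)
    if current:
        blocks.append(current)
    return blocks

print(segment("<div><h2>People</h2><p>Jane Roe</p><h2>Papers</h2><ul><li>P1</li></ul></div>"))
# -> [['People', 'Jane Roe'], ['Papers', 'P1']]
```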
Methods - Classification
Employ multiclass classification (by vector similarity) to map the logical document blocks to the appropriate schema elements.
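A minimal sketch of similarity-based classification, assuming scikit-learn with TF-IDF features and cosine similarity; the project's actual features, similarity measure, and schema element names may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical labelled training blocks: (block text, schema element) pairs.
train_texts = ["John Doe, Jane Roe, PhD students",
               "SIGMOD 2010 paper on XML querying"]
train_labels = ["members", "publications"]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_texts)

def classify(block_text: str) -> str:
    """Map a logical block to the most similar schema element."""
    sims = cosine_similarity(vectorizer.transform([block_text]), train_vecs)
    return train_labels[sims.argmax()]

print(classify("Our VLDB 2011 paper on query optimization"))  # -> "publications"
```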
Methods – Pattern Recognition
Pattern: .//bibliography/ul/li/*
Mine likely candidate blocks for patterns using the PAT tree algorithm, adjusted to find a maximum-likelihood pattern.
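As a rough stand-in for the PAT tree, the sketch below simply counts descendant tag paths within a candidate block and returns the most frequent one; the real algorithm is considerably more involved:

```python
from collections import Counter
from lxml import html

def most_likely_pattern(block_html: str, depth: int = 3) -> str:
    """Return the most frequent descendant tag path of a candidate block."""
    root = html.fromstring(block_html)
    paths = Counter()
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue
        # Build the tag path from the block root down to this element.
        path, node = [], el
        while node is not None and node is not root:
            path.append(node.tag)
            node = node.getparent()
        if 0 < len(path) <= depth:
            paths[".//" + "/".join(reversed(path))] += 1
    pattern, _count = paths.most_common(1)[0]
    return pattern

print(most_likely_pattern(
    "<ul><li><a href='#'>P1</a></li><li><a href='#'>P2</a></li></ul>"))  # -> ".//li"
```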
Methods – Metadata Extraction
Use a conditional random field (CRF) to extract additional metadata where appropriate (e.g. bibliographic lists).
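A minimal sketch of CRF-based metadata tagging, assuming the sklearn-crfsuite package; the feature template and label set (author, title, year) are illustrative guesses, not the project's actual ones:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

# One hand-labelled bibliography entry as toy training data.
tokens = ["J.", "Doe", "Querying", "XML", "2010"]
labels = ["author", "author", "title", "title", "year"]
X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # per-token metadata labels
```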
Results – Setting
50 web pages of DB research projects from American and Israeli universities, chosen manually to represent a wide variety of web page styles.
All pages were pre-processed by our system (their structure analyzed), then manually tagged for classification, patterns, and metadata.
20% of the dataset is sampled randomly for training; this is repeated 5 times and the results averaged.
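The evaluation loop can be sketched as repeated random sub-sampling; train_and_score is a hypothetical placeholder for training the extractor and scoring it on the held-out pages:

```python
import random

def repeated_evaluation(pages, train_and_score, repeats=5, train_fraction=0.2):
    """Average a score over several random 20%/80% train/test splits."""
    scores = []
    for _ in range(repeats):
        shuffled = random.sample(pages, len(pages))
        cut = int(len(pages) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```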
Results – Measures
Standard information extraction measures, adapted.
Accuracy: the proportion of classifications that were correct, per document.
Recall: content recall and structural recall, weighted, per logical block; document recall is derived from the per-block values.
Precision: defined similarly.
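For reference, the standard (unadapted) precision and recall these measures build on; the per-block weighting and the document-level aggregation described above are the project's own adaptations:

```latex
\mathrm{Precision} = \frac{|\text{extracted} \cap \text{correct}|}{|\text{extracted}|},
\qquad
\mathrm{Recall} = \frac{|\text{extracted} \cap \text{correct}|}{|\text{correct}|}
```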
Results
Pattern recognition: 85% precision, 89.7% recall.
Classification accuracy: 82.5%.
Conclusions
This is a feasible approach for creating a web information extraction system.
Good results can be achieved with a relatively small sample.
The modular system design allows easy adaptation for additional domains.
Future directions: schema generation, better information integration, additional modules (e.g. deep linguistic analysis).
Questions?
Thank You!