content extraction from news pages using particle swarm optimization on linguistic and structural...

Content Extraction from News Pages Using Particle

Swarm Optimization on Linguistic and Structural

FeaturesCai-Nicolas Ziegler, Michal

SkubaczSiemens AG, Corporate Research & Technologies

2007 IEEE/WIC/ACM International Conference on Web Intelligence

Introduction (1/2)

Actual textual information only represents a minor fraction

Enrichment of pure content with page clutter

Problem online reputation

monitoring platforms

Search engine

Non-related images

Advertisement

copyright

content

Introduction (2/2)

Fully-automated extraction of content from HTML pages

Extract coherent blocks of text from HTML pages

DOM parsing Compute linguistic and structural features

for each block Learn feature thresholds using PSO Decide whether the block at hand is signal

or noise

Related Work Wrapper induction system

Require human labeling of relevant passages for each site template

Wrapper generator Fully-automated Only for web pages containing numerous

recurring entries of like style and format Domain-oriented extraction

Human effort is required for adaptation to novel domains

Structural approach Make use of the organization of HTML pages

Take into account visual cues of Web page rendering

Approach Outline Input

Plain HTML page Output

Text doc. containing the extracted content

Applicable to any type of news pages

No human intervention once feature thresholds have been learned

Merging and Selecting Text Blocks

Build a DOM model of the current document

Aware of the semantics of HTML elements

Apply two operations removal

Only the element occurrence itself is discarded

With or without whitespace insertion

Pruning Respective element and its

subtree are discards

DOM Model of an HTML Fragment

<p>Mr. Bush said in a <a href=“press release.html”>press release</a> yesterday</p>

Web page display

p

aMr. Bush said in a yesterday

press release press release.html

Upon removing and pruning elements, a DOM tree is eventually constructed from the processed document. Tree traversal → only select the merged text nodes Obtain text blocks

Computing Features Linguistic feature

Average sentence length ↑

Number of sentences ↑ Character distribution

↑ Stop-word ratio ↑

Structural feature Anchor ratio ↓ Format tag ratio ↑ List element ratio ↓ Text structuring

ratio ↑

Thresholds for each unique feature Conjunctive

Only when all feature thresholds are satisfied, the text block will become classified as signal.

Benchmark Composition

610 human-labeled documents 5 different languages : English(265),

German(195), Spanish(50), French(50), Italian(50)

From the same source, about 2-4 articles were taken

Hand-labeled by 3 labeler Human labeling would not achieve 100%

agreement Example

the date the article was written → include or not ?

Computing Similarity

simσ (i,j) : compare 2 documents diσ

and djσ

diσ and dj

σ : representing the textual content extracted from an HTML news article page a

σ

Using a variant of cosine similarity

document vector, term length > 1

identicalMax. dissent

On Particle Swarm Optimization (1/2)

Benchmark set 610 labeled documents was subdivided

into a training set and test set Enable the determination of thresholds

that allows good discrimination between true article content and page clutter

Determine optimal threshold values Eight pivot parameters → combinatory

explosion! Particle Swarm Optimization (PSO)

On Particle Swarm Optimization (2/2)

Particle 8 features 的 thresholds set pi in 8 dimensional space

Every pi has Velocity vector wi Memory of its own best state

ρi In each step

Particles change their position in hyperspace

Their own best state ρi Current population Pk’s best

particle pb τ : cognitive component

Degree to which particles move towards their own best state

φ: social component Extent to which they follow the

population’s best particle Both assume values around 2

constant → 1

Fitness functionParticle’s classification accuracy

Training and Empirical Evaluation

Before testing, training set was used to learn near-optimal thresholds via PSO

70% training set vs. 30% test set split on 610 documents

Training the classifier Population size n = 100 Stop the optimization procedure upon

generation of 100 populations

Training the Classifier When cognition is

favored over social behavior, bumps become more pronounced

Due to the increased degree of randomness

Best particle generation 32 Cog. > Soc. 0.827

Try tuning population size n

No significant differences

Converges quickly (after 10 iterations)

Performance Benchmarking Upper bound HL

A sample of 70 documents from the full test set was labeled by two persons in parallel

Measure their mutual agreement → 0.909 Automated content extraction system FE Pruning and removing PR

Without feature matching Accept all blocks

Border-line TB Simply extracts all text nodes within the original HTML

document operating on the DOM tree model

Feature Entropy Evaluation Understand the

utility and impact of the various features

Evaluations on the complete test set

Testing each feature in isolation

Linguistic features exhibit lower, i.e., better, entropy than structural one

Combination 0.77 → 0.857

0.77 → best

2nd best

3rd best

和 list tag ratio, text structuring ratio 差不多→ poor performance

Outlook and Conclusion Content extraction from news articles on the Web

is a non-trivial, indispensable task for many applications

Presented a novel paradigm Fully-automated Classifying text blocks within HTML pages as signal or

noise By means of linguistic and structural features

Accuracy close to human judgment Future work

Include more features, e.g., part-of–speech information Threshold defining techniques from machine learning,

e.g., C4.5 decision trees The system will be incorporated as processing

component into practical reputation monitoring platform

Used by Siemens Corporate Communications

content extraction from news pages using particle swarm optimization on linguistic and structural...

Documents

html news article

html fragmentmr

text blocksbuild

determination of thresholds

content extraction

textual content

blocklearn feature thresholds

organization of html