TRANSCRIPT
Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen
Xixuan (Aaron) Feng, Christopher Ré
Min Wang
One-Slide Summary
• Statistical Information Extraction (IE) is increasingly used.
– For example, MSR Academic Search, Ali Baba (HU Berlin), MPI YAGO, isWiki at HP Labs
• Text corpora evolve!
– An issue: it is difficult to keep IE results up to date
– Current approach: rerun from scratch, which can be too slow
• Our goal: improve statistical IE runtime on evolving corpora by recycling previous IE results.
– We focus on a popular statistical model for IE, conditional random fields (CRFs), and build CRFlex
– Show that a 10x speedup is possible for repeated extractions
Background

Background 1: CRF-based IE Programs
• Pipeline: document → token sequence → trellis graph → label sequence → table
• Example document: "David DeWitt is working at Microsoft."
– Token sequence x: x1 = David, x2 = DeWitt, x3 = is, x4 = working, x5 = at, x6 = Microsoft
– Label sequence y: y1 y2 y3 y4 y5 y6 = P P O O O A
– Labels: P: Person, A: Affiliation, O: Other
• Extracted table:
  Person        | Affiliation
  David DeWitt  | Microsoft
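As a toy illustration (not the paper's code), the example label sequence above can be collapsed into the extracted table row roughly as follows; the helper `spans` is a hypothetical name:

```python
# Toy sketch: collapse a labeled token sequence into an extracted table row.
tokens = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
labels = ["P", "P", "O", "O", "O", "A"]  # P: Person, O: Other, A: Affiliation

def spans(tokens, labels, tag):
    """Join maximal runs of tokens that carry the given label."""
    out, cur = [], []
    for tok, y in zip(tokens, labels):
        if y == tag:
            cur.append(tok)
        elif cur:
            out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out

row = {"Person": spans(tokens, labels, "P"),
       "Affiliation": spans(tokens, labels, "A")}
# row == {"Person": ["David DeWitt"], "Affiliation": ["Microsoft"]}
```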
Background 2: CRF Inference Steps
• Token sequence → label sequence (CRF labeling):
– (I) Computing feature functions (applying rules)
– (II) Constructing trellis graph (dot product)
– (III) Viterbi inference (dynamic programming), a version of the standard shortest-path algorithm
• Example feature functions on x = "David DeWitt is working at Microsoft" (labels P: Person, A: Affiliation, O: Other):
– f(y_{i-1}, y_i, x, i) = [x_i is capitalized] · [y_i = Person]
– g(y_{i-1}, y_i, x, i) = [x_{i-1} is "at"] · [y_i = Affiliation]
• At position i = 6 (x6 = Microsoft): f(O, A, x, 6) = 0, g(O, A, x, 6) = 1
– Feature vector v = (0, 1), model weights λ = (0.5, 0.2)
– Trellis edge weight w = v ∙ λ = 0.2
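The two example feature functions and the dot-product edge weight can be sketched in Python (a toy rendering of the slide's example; the 1-indexed position argument follows the slide's notation):

```python
def f(y_prev, y, x, i):
    # fires when token x_i is capitalized and labeled Person ("P")
    return int(x[i - 1][0].isupper() and y == "P")

def g(y_prev, y, x, i):
    # fires when the previous token x_{i-1} is "at" and x_i is labeled
    # Affiliation ("A")
    return int(i >= 2 and x[i - 2] == "at" and y == "A")

x = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
lam = (0.5, 0.2)                            # model weights for (f, g)
v = (f("O", "A", x, 6), g("O", "A", x, 6))  # feature vector: (0, 1)
w = sum(vi * li for vi, li in zip(v, lam))  # dot product: 0.2
```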
Challenges
• How to do CRF inference incrementally with exactly the same results as a rerun
– No straightforward solution for each step
• How to trade off savings against overhead
– Intermediate results (feature values & trellis graph) are much larger than the input (tokens) & output (labels)
• Pipeline on the example "David DeWitt is working at Microsoft":
  token sequences → (I) computing feature functions (f1, f2, ..., fK) → feature values → (II) computing trellis graph → trellis graph → (III) performing inference → label sequences
Technical Contributions
• (I) Computing feature functions (applying rules)
– (Cyclex) "Efficient Information Extraction over Evolving Text Data," F. Chen et al., ICDE-08
• (II) Constructing trellis graph (dot product)
– At a given position, unchanged features → unchanged trellis
• (III) Viterbi inference (dynamic programming)
– Auxiliary information needed to localize dependencies
– Modified version for recycling

Recycling Each Inference Step
  Step | Input          | Output
  I    | Token sequence | Feature values
  II   | Feature values | Trellis graph
  III  | Trellis graph  | Label sequence
Performance Trade-off
• Materialization decision in each inference step
– A new trade-off, due to the large intermediate representations of statistical methods
– CPU computation varies from task to task

  Keep output? | Pros                                  | Cons
  Yes          | More recycling chances (low CPU time) | High I/O time
  No           | Low I/O time                          | Fewer recycling chances (high CPU time)
Optimization
• Binary choices for 2 intermediate outputs → 2^2 = 4 plans
• More plans are possible
– e.g., with partial materialization within a step
• No plan is always fastest → cost-based optimizer
– CPU time per token and I/O time per token are task-dependent
– Changes between consecutive snapshots are dataset-dependent
– Measured by running on a subset at the first few snapshots
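A minimal sketch of how a cost-based choice among the 2^2 = 4 materialization plans might look. All numbers and the cost formula are illustrative assumptions, not the paper's cost model; in CRFlex the per-token CPU and I/O costs and the change rate are measured on a sample at the first few snapshots:

```python
from itertools import product

def plan_cost(keep_features, keep_trellis, stats):
    """Illustrative cost: recycling a materialized output means we only
    recompute the changed fraction, at the price of extra I/O."""
    n = stats["tokens"]
    changed = stats["changed_fraction"]   # fraction of tokens that changed
    cost = 0.0
    if keep_features:
        cost += n * stats["io_per_token"]            # write + read features
        cost += n * changed * stats["cpu_feature"]   # recompute changed only
    else:
        cost += n * stats["cpu_feature"]             # recompute everything
    if keep_trellis:
        cost += n * stats["io_per_token"]
        cost += n * changed * stats["cpu_trellis"]
    else:
        cost += n * stats["cpu_trellis"]
    return cost

# Made-up per-token measurements (seconds) from a hypothetical sample run.
stats = {"tokens": 1_000_000, "changed_fraction": 0.05,
         "io_per_token": 2e-6, "cpu_feature": 1e-5, "cpu_trellis": 4e-6}
best = min(product([True, False], repeat=2),
           key=lambda p: plan_cost(p[0], p[1], stats))
# with these numbers both intermediates are worth keeping: best == (True, True)
```

With a higher change rate or costlier I/O the optimizer would instead drop one or both intermediates, which is why no single plan is always fastest.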
Experiments
Repeated Extraction Evaluation
• Dataset
– Wikipedia English with the Entertainment tag: 16 snapshots (one every three weeks), 3000+ pages per snapshot on average
• IE task: Named Entity Recognition
• Features
– Cheap: token-based regular expressions
– Expensive: approximate matching over dictionaries
[Figure: Comparison to Baselines. Runtime (s) per snapshot (0–15) for Rerun, Cyclex, and CRFlex; after an initial statistics-collection phase, roughly a 10x speed-up over Rerun.]
Conclusion
• Concerning real-world deployment of statistical IE programs, we:
– Devised a recycling framework with no loss of correctness
– Explored a performance trade-off: CPU vs. I/O
– Demonstrated that up to roughly 10x speed-up is possible on a real-world dataset
• Future directions
– More graphical models and inference algorithms
– Parallel settings
Importance of Optimizer
• Only the fastest 3 (out of 8) plans are plotted
– No plan is always within the top 3
[Figure: four panels (POS, Chunking-Expensive LF, NER, Chunking-Cheap LF), each plotting runtime (s) per snapshot (3–15) for the plans LFVC, VC, NLFVC, AFVC, and NLF-FGVC.]
Per Snapshot Comparisons
• Only the fastest 3 plans and Rerun are plotted
– I/O can be higher in the slower plans
[Figure: four panels (POS, Chunking-Cheap LF, Chunking-Expensive LF, NER), each plotting runtime (s) per snapshot (0–15) for Rerun, Cyclex, and the fastest CRFlex plans.]
Runtime Decomposition
AFVC NLFVC LFVC Rerun0
400000
800000
Chunking-Expensive LF
lNieJIXoq8HcexT1FtZDQG
2699
runtime (s)
LFVC VC AFVC Rerun0
100000
200000NER
lNieJIXoq8HcexT1FtZDQG
1242runtime (s)
Match Extraction IO Other
BgruTHdJ9Do3AWOi6dxYIM
NLFVC NLF-FGVC
VC Rerun0
600000
1200000Chunking-Cheap LF
lNieJIXoq8HcexT1FtZDQG
3299runtime (s)
VC NLFVC LFVC Rerun0
150000
300000POS
lNieJIXoq8HcexT1FtZDQG
1816runtime (s)
Scoping Details
• Per-document IE
– No assumptions are broken within a document
– Repeated crawling using a fixed set of URLs
• Focus on the most popular model in IE
– Linear-chain CRF
– Viterbi inference
• Optimize the inference process with a pre-trained model
• Exactly the same results as a rerun; no approximation
• Recycling granularity is a token (or position)
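For concreteness, Viterbi inference on a linear-chain trellis is a short dynamic program. This is a generic textbook-style sketch with toy node/edge scores, not CRFlex's implementation:

```python
def viterbi(node, edge, labels):
    """node[i][y]: score of label y at position i; edge[yp][y]: transition
    score. Returns the max-scoring label sequence (shortest-path-style DP)."""
    score = {y: node[0][y] for y in labels}
    back = []
    for i in range(1, len(node)):
        prev, score, ptr = score, {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: prev[yp] + edge[yp][y])
            score[y] = prev[best_prev] + edge[best_prev][y] + node[i][y]
            ptr[y] = best_prev
        back.append(ptr)
    last = max(labels, key=lambda y: score[y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy 3-token example with labels P (Person) and O (Other).
labels = ["P", "O"]
node = [{"P": 1.0, "O": 0.0},   # "David"
        {"P": 1.0, "O": 0.2},   # "DeWitt"
        {"P": 0.0, "O": 1.0}]   # "is"
edge = {"P": {"P": 0.5, "O": 0.0},
        "O": {"P": 0.0, "O": 0.5}}
# viterbi(node, edge, labels) == ["P", "P", "O"]
```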
Recycle Each Step

(a) Step I (feature computation)
1. UnixDiff compares the previous token sequence with the new token sequence to produce token match regions.
2. FeatureRecyclers map the token match regions to (a) feature copy regions and (b) feature recompute regions.
3. FeatureCopier copies the previous feature values within the copy regions; the rest are recomputed, yielding the new feature values.

(b) Step II (trellis computation)
1. VectorDiff compares the previous and new feature values to produce vector match regions.
2. FactorRecycler maps the vector match regions to (a) factor copy regions and (b) factor recompute regions.
3. FactorCopier copies the previous factors within the copy regions, yielding the new factors.

(c) Step III (Viterbi inference)
1. FactorDiff compares the previous and new factors to produce factor match regions.
2. InferenceRecycler maps the factor match regions, together with the previous Viterbi context, to (a) inference copy regions and (b) inference recompute regions.
3. LabelCopier copies the previous labels (and Viterbi context) within the copy regions, yielding the new labels (and Viterbi context).

Overall dataflow between snapshots N and N+1:
  X_N, X_{N+1} → (I) Feature Computation (AFC) → V_N, V_{N+1}
  V_N, V_{N+1} → (II) Trellis Computation (ATC) → F_N, F_{N+1}
  F_N, F_{N+1} → (III) Viterbi Inference (AVI) → Y_N, Y_{N+1}
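The diff-and-copy idea behind Step I can be illustrated with Python's `difflib` standing in for UnixDiff (an illustrative sketch, not CRFlex's code). The per-token `feature_fn` here looks only at the token itself; features that inspect neighboring tokens would shrink the copy regions by the feature's scope:

```python
import difflib

def recycle_features(old_tokens, new_tokens, old_feats, feature_fn):
    """Copy feature values inside matched (unchanged) regions of the diff;
    recompute only in the remaining recompute regions."""
    new_feats = [None] * len(new_tokens)
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    copied = 0
    for i, j, size in sm.get_matching_blocks():
        for k in range(size):                  # copy regions
            new_feats[j + k] = old_feats[i + k]
            copied += 1
    for j, fv in enumerate(new_feats):         # recompute regions
        if fv is None:
            new_feats[j] = feature_fn(new_tokens[j])
    return new_feats, copied

feature_fn = lambda tok: tok[0].isupper()      # toy per-token feature
old_tokens = "David DeWitt is working at Microsoft".split()
new_tokens = "David DeWitt is now working at Google".split()
old_feats = [feature_fn(t) for t in old_tokens]
new_feats, copied = recycle_features(old_tokens, new_tokens,
                                     old_feats, feature_fn)
# 5 of the 7 new tokens fall in copy regions; only "now" and "Google"
# are recomputed
```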