TRANSCRIPT
Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen
Xixuan (Aaron) Feng, Christopher Ré
Min Wang
One-Slide Summary
• Statistical Information Extraction (IE) is increasingly used.
– For example, MSR Academic Search, Ali Baba (HU Berlin), MPI YAGO, isWiki at HP Labs
• Text corpora evolve!
– An issue: it is difficult to keep IE results up to date
– Current approach: rerun from scratch, which can be too slow
• Our goal: improve statistical IE runtime on evolving corpora by recycling previous IE results.
– We focus on a popular statistical model for IE, conditional random fields (CRFs), and build CRFlex
– Show that a 10x speedup is possible for repeated extractions
Background

Background 1: CRF-based IE Programs
• Pipeline: document → token sequence → trellis graph → label sequence → table
• Example document: "David DeWitt is working at Microsoft."
– Token sequence x: x1 = David, x2 = DeWitt, x3 = is, x4 = working, x5 = at, x6 = Microsoft
– Label sequence y: y1 y2 y3 y4 y5 y6 = P P O O O A
– Labels: P: Person, A: Affiliation, O: Other
• Extracted table:
  Person        | Affiliation
  David DeWitt  | Microsoft
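As a toy illustration (not the paper's code), the example label sequence above can be collapsed into the extracted table row roughly as follows; the helper `spans` is a hypothetical name:

```python
# Toy sketch: collapse a labeled token sequence into an extracted table row.
tokens = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
labels = ["P", "P", "O", "O", "O", "A"]  # P: Person, O: Other, A: Affiliation

def spans(tokens, labels, tag):
    """Join maximal runs of tokens that carry the given label."""
    out, cur = [], []
    for tok, y in zip(tokens, labels):
        if y == tag:
            cur.append(tok)
        elif cur:
            out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out

row = {"Person": spans(tokens, labels, "P"),
       "Affiliation": spans(tokens, labels, "A")}
# row == {"Person": ["David DeWitt"], "Affiliation": ["Microsoft"]}
```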
Background 2: CRF Inference Steps
• Token sequence → label sequence (CRF labeling):
– (I) Computing feature functions (applying rules)
– (II) Constructing trellis graph (dot product)
– (III) Viterbi inference (dynamic programming), a version of the standard shortest-path algorithm
• Example feature functions on x = "David DeWitt is working at Microsoft" (labels P: Person, A: Affiliation, O: Other):
– f(y_{i-1}, y_i, x, i) = [x_i is capitalized] · [y_i = Person]
– g(y_{i-1}, y_i, x, i) = [x_{i-1} is "at"] · [y_i = Affiliation]
• At position i = 6 (x6 = Microsoft): f(O, A, x, 6) = 0, g(O, A, x, 6) = 1
– Feature vector v = (0, 1), model weights λ = (0.5, 0.2)
– Trellis edge weight w = v ∙ λ = 0.2
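The two example feature functions and the dot-product edge weight can be sketched in Python (a toy rendering of the slide's example; the 1-indexed position argument follows the slide's notation):

```python
def f(y_prev, y, x, i):
    # fires when token x_i is capitalized and labeled Person ("P")
    return int(x[i - 1][0].isupper() and y == "P")

def g(y_prev, y, x, i):
    # fires when the previous token x_{i-1} is "at" and x_i is labeled
    # Affiliation ("A")
    return int(i >= 2 and x[i - 2] == "at" and y == "A")

x = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
lam = (0.5, 0.2)                            # model weights for (f, g)
v = (f("O", "A", x, 6), g("O", "A", x, 6))  # feature vector: (0, 1)
w = sum(vi * li for vi, li in zip(v, lam))  # dot product: 0.2
```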
Challenges
• How to do CRF inference incrementally with exactly the same results as a rerun
– No straightforward solution for each step
• How to trade off savings against overhead
– Intermediate results (feature values & trellis graph) are much larger than the input (tokens) & output (labels)
• Pipeline on the example "David DeWitt is working at Microsoft":
  token sequences → (I) computing feature functions (f1, f2, ..., fK) → feature values → (II) computing trellis graph → trellis graph → (III) performing inference → label sequences
Technical Contributions
• (I) Computing feature functions (applying rules)
– (Cyclex) "Efficient Information Extraction over Evolving Text Data," F. Chen et al., ICDE-08
• (II) Constructing trellis graph (dot product)
– At a given position, unchanged features → unchanged trellis
• (III) Viterbi inference (dynamic programming)
– Auxiliary information needed to localize dependencies
– Modified version for recycling

Recycling Each Inference Step
  Step | Input          | Output
  I    | Token sequence | Feature values
  II   | Feature values | Trellis graph
  III  | Trellis graph  | Label sequence
Performance Trade-off
• Materialization decision in each inference step
– A new trade-off, due to the large intermediate representations of statistical methods
– CPU computation varies from task to task

  Keep output? | Pros                                  | Cons
  Yes          | More recycling chances (low CPU time) | High I/O time
  No           | Low I/O time                          | Fewer recycling chances (high CPU time)
Optimization
• Binary choices for 2 intermediate outputs → 2^2 = 4 plans
• More plans are possible
– e.g., with partial materialization within a step
• No plan is always fastest → cost-based optimizer
– CPU time per token and I/O time per token are task-dependent
– Changes between consecutive snapshots are dataset-dependent
– Measured by running on a subset at the first few snapshots
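A minimal sketch of how a cost-based choice among the 2^2 = 4 materialization plans might look. All numbers and the cost formula are illustrative assumptions, not the paper's cost model; in CRFlex the per-token CPU and I/O costs and the change rate are measured on a sample at the first few snapshots:

```python
from itertools import product

def plan_cost(keep_features, keep_trellis, stats):
    """Illustrative cost: recycling a materialized output means we only
    recompute the changed fraction, at the price of extra I/O."""
    n = stats["tokens"]
    changed = stats["changed_fraction"]   # fraction of tokens that changed
    cost = 0.0
    if keep_features:
        cost += n * stats["io_per_token"]            # write + read features
        cost += n * changed * stats["cpu_feature"]   # recompute changed only
    else:
        cost += n * stats["cpu_feature"]             # recompute everything
    if keep_trellis:
        cost += n * stats["io_per_token"]
        cost += n * changed * stats["cpu_trellis"]
    else:
        cost += n * stats["cpu_trellis"]
    return cost

# Made-up per-token measurements (seconds) from a hypothetical sample run.
stats = {"tokens": 1_000_000, "changed_fraction": 0.05,
         "io_per_token": 2e-6, "cpu_feature": 1e-5, "cpu_trellis": 4e-6}
best = min(product([True, False], repeat=2),
           key=lambda p: plan_cost(p[0], p[1], stats))
# with these numbers both intermediates are worth keeping: best == (True, True)
```

With a higher change rate or costlier I/O the optimizer would instead drop one or both intermediates, which is why no single plan is always fastest.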
Experiments
Repeated Extraction Evaluation
• Dataset
– Wikipedia English with the Entertainment tag: 16 snapshots (one every three weeks), 3000+ pages per snapshot on average
• IE task: Named Entity Recognition
• Features
– Cheap: token-based regular expressions
– Expensive: approximate matching over dictionaries
[Figure: Comparison to Baselines. Runtime (s) per snapshot (0–15) for Rerun, Cyclex, and CRFlex; after an initial statistics-collection phase, roughly a 10x speed-up over Rerun.]
Conclusion
• Concerning real-world deployment of statistical IE programs, we:
– Devised a recycling framework with no loss of correctness
– Explored a performance trade-off: CPU vs. I/O
– Demonstrated that up to roughly 10x speed-up is possible on a real-world dataset
• Future directions
– More graphical models and inference algorithms
– Parallel settings
Importance of Optimizer
• Only the fastest 3 (out of 8) plans are plotted
– No plan is always within the top 3
[Figure: four panels (POS, Chunking-Expensive LF, NER, Chunking-Cheap LF), each plotting runtime (s) per snapshot (3–15) for the plans LFVC, VC, NLFVC, AFVC, and NLF-FGVC.]
Per Snapshot Comparisons
• Only the fastest 3 plans and Rerun are plotted
– I/O can be higher in the slower plans
[Figure: four panels (POS, Chunking-Cheap LF, Chunking-Expensive LF, NER), each plotting runtime (s) per snapshot (0–15) for Rerun, Cyclex, and the fastest CRFlex plans.]
Runtime Decomposition
AFVC NLFVC LFVC Rerun0
400000
800000
Chunking-Expensive LF
lNieJIXoq8HcexT1FtZDQG
2699
runtime (s)
LFVC VC AFVC Rerun0
100000
200000NER
lNieJIXoq8HcexT1FtZDQG
1242runtime (s)
Match Extraction IO Other
BgruTHdJ9Do3AWOi6dxYIM
NLFVC NLF-FGVC
VC Rerun0
600000
1200000Chunking-Cheap LF
lNieJIXoq8HcexT1FtZDQG
3299runtime (s)
VC NLFVC LFVC Rerun0
150000
300000POS
lNieJIXoq8HcexT1FtZDQG
1816runtime (s)
Scoping Details
• Per-document IE
– No assumptions are broken within a document
– Repeated crawling using a fixed set of URLs
• Focus on the most popular model in IE
– Linear-chain CRF
– Viterbi inference
• Optimize the inference process with a pre-trained model
• Exactly the same results as a rerun; no approximation
• Recycling granularity is a token (or position)
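For concreteness, Viterbi inference on a linear-chain trellis is a short dynamic program. This is a generic textbook-style sketch with toy node/edge scores, not CRFlex's implementation:

```python
def viterbi(node, edge, labels):
    """node[i][y]: score of label y at position i; edge[yp][y]: transition
    score. Returns the max-scoring label sequence (shortest-path-style DP)."""
    score = {y: node[0][y] for y in labels}
    back = []
    for i in range(1, len(node)):
        prev, score, ptr = score, {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: prev[yp] + edge[yp][y])
            score[y] = prev[best_prev] + edge[best_prev][y] + node[i][y]
            ptr[y] = best_prev
        back.append(ptr)
    last = max(labels, key=lambda y: score[y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy 3-token example with labels P (Person) and O (Other).
labels = ["P", "O"]
node = [{"P": 1.0, "O": 0.0},   # "David"
        {"P": 1.0, "O": 0.2},   # "DeWitt"
        {"P": 0.0, "O": 1.0}]   # "is"
edge = {"P": {"P": 0.5, "O": 0.0},
        "O": {"P": 0.0, "O": 0.5}}
# viterbi(node, edge, labels) == ["P", "P", "O"]
```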
Recycle Each Step

(a) Step I (feature computation)
1. UnixDiff compares the previous token sequence with the new token sequence to produce token match regions.
2. FeatureRecyclers map the token match regions to (a) feature copy regions and (b) feature recompute regions.
3. FeatureCopier copies the previous feature values within the copy regions; the rest are recomputed, yielding the new feature values.

(b) Step II (trellis computation)
1. VectorDiff compares the previous and new feature values to produce vector match regions.
2. FactorRecycler maps the vector match regions to (a) factor copy regions and (b) factor recompute regions.
3. FactorCopier copies the previous factors within the copy regions, yielding the new factors.

(c) Step III (Viterbi inference)
1. FactorDiff compares the previous and new factors to produce factor match regions.
2. InferenceRecycler maps the factor match regions, together with the previous Viterbi context, to (a) inference copy regions and (b) inference recompute regions.
3. LabelCopier copies the previous labels (and Viterbi context) within the copy regions, yielding the new labels (and Viterbi context).

Overall dataflow between snapshots N and N+1:
  X_N, X_{N+1} → (I) Feature Computation (AFC) → V_N, V_{N+1}
  V_N, V_{N+1} → (II) Trellis Computation (ATC) → F_N, F_{N+1}
  F_N, F_{N+1} → (III) Viterbi Inference (AVI) → Y_N, Y_{N+1}
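The diff-and-copy idea behind Step I can be illustrated with Python's `difflib` standing in for UnixDiff (an illustrative sketch, not CRFlex's code). The per-token `feature_fn` here looks only at the token itself; features that inspect neighboring tokens would shrink the copy regions by the feature's scope:

```python
import difflib

def recycle_features(old_tokens, new_tokens, old_feats, feature_fn):
    """Copy feature values inside matched (unchanged) regions of the diff;
    recompute only in the remaining recompute regions."""
    new_feats = [None] * len(new_tokens)
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    copied = 0
    for i, j, size in sm.get_matching_blocks():
        for k in range(size):                  # copy regions
            new_feats[j + k] = old_feats[i + k]
            copied += 1
    for j, fv in enumerate(new_feats):         # recompute regions
        if fv is None:
            new_feats[j] = feature_fn(new_tokens[j])
    return new_feats, copied

feature_fn = lambda tok: tok[0].isupper()      # toy per-token feature
old_tokens = "David DeWitt is working at Microsoft".split()
new_tokens = "David DeWitt is now working at Google".split()
old_feats = [feature_fn(t) for t in old_tokens]
new_feats, copied = recycle_features(old_tokens, new_tokens,
                                     old_feats, feature_fn)
# 5 of the 7 new tokens fall in copy regions; only "now" and "Google"
# are recomputed
```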