Creating Imprecise Databases from Information Extraction Models
Rahul Gupta (IBM India Research Lab)    Sunita Sarawagi (IIT Bombay)
text → State-of-the-art Extraction Model (CRF)
     → Extraction 1, probability p1
       Extraction 2, probability p2
       Extraction 3, probability p3
       ...
       Extraction k, probability pk
     → Imprecise Database

Q1. Isn't storing the best extraction sufficient?
Q2. Can we store all this efficiently?
CRFs: Probability = Confidence
• If N segmentations are labeled with probability p, then around Np should be correct.
[Figure: fraction correct (y-axis, 0 to 1) vs. probability of the top segmentation (x-axis, 0 to 0.9)]
Is the best extraction sufficient?
[Figure: square error (y-axis, 0 to 1) vs. number of columns in the projection query (1 to 4), comparing "only best extraction" against "all extractions with probabilities"]
Q1. Isn't storing the best extraction sufficient? NO!
How much to store?
• k can be exponential in the worst case
• Retrieval is O(k log k)
• An arbitrary amount of storage is not available
[Figure: histogram of the number of segmentations required to cover 0.9 probability mass (bins: 1, 2, 3, 4-10, 11-20, 21-30, 31-50, 51-200, >200) vs. frequency (0 to 0.4)]
Q1. Isn't storing the best extraction sufficient? NO!
Q2. Can we store all this efficiently? NO!
Our approach

text → CRF → Extraction 1, probability p1
             Extraction 2, probability p2
             Extraction 3, probability p3
             ...
             Extraction k, probability pk
           → Imprecise Database

Example

Input: "52-A Bandra West Bombay 400 062"

HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2
Imprecise Data Models
• Segmentation-per-row model (Exact)
• One-row model
• Multi-row model
Segmentation-per-row model
(Rows: Uncertain; Cols: Exact)

HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2

Exact but impractical. We can have too many segmentations!
One-row Model
Each column is a multinomial distribution
(Row: Exact; Columns: Indep, Uncertain)

HNO        | AREA              | CITY              | PINCODE
52 (0.3)   | Bandra West (0.6) | Bombay (0.6)      | 400 062 (1.0)
52-A (0.7) | Bandra (0.4)      | West Bombay (0.4) |

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 × 0.6 × 0.6 × 1.0 = 0.252

Simple model with a closed-form solution, but crude.
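As a sanity check, the slide's arithmetic can be reproduced with a minimal sketch (dict-based multinomials; the structure is assumed from the table above, not the paper's code):

```python
# One-row model from the table above: each column is an independent multinomial.
one_row = {
    "HNO":     {"52": 0.3, "52-A": 0.7},
    "AREA":    {"Bandra West": 0.6, "Bandra": 0.4},
    "CITY":    {"Bombay": 0.6, "West Bombay": 0.4},
    "PINCODE": {"400 062": 1.0},
}

def one_row_prob(model, seg):
    """P(seg) = product over columns of the column's multinomial probability."""
    p = 1.0
    for col, value in seg.items():
        p *= model[col].get(value, 0.0)
    return p

p = one_row_prob(one_row, {"HNO": "52-A", "AREA": "Bandra West",
                           "CITY": "Bombay", "PINCODE": "400 062"})
# 0.7 * 0.6 * 0.6 * 1.0 = 0.252, matching the slide
```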
Multi-row Model
Segmentation generated by a 'mixture' of rows
(Rows: Uncertain; Columns: Indep, Uncertain)

HNO                      | AREA              | CITY              | PINCODE       | Prob
52 (0.167), 52-A (0.833) | Bandra West (1.0) | Bombay (1.0)      | 400 062 (1.0) | 0.6
52 (0.5), 52-A (0.5)     | Bandra (1.0)      | West Bombay (1.0) | 400 062 (1.0) | 0.4

e.g. P(52-A, Bandra West, Bombay, 400 062)
   = 0.833 × 1.0 × 1.0 × 1.0 × 0.6 + 0.5 × 0.0 × 0.0 × 1.0 × 0.4 ≈ 0.50
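The mixture computation above can likewise be sketched (illustrative representation, not the paper's code):

```python
# Multi-row model from the table above: (row weight, per-column multinomials).
rows = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833},
           "AREA": {"Bandra West": 1.0},
           "CITY": {"Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5},
           "AREA": {"Bandra": 1.0},
           "CITY": {"West Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
]

def multi_row_prob(rows, seg):
    """Mixture: sum over rows of row weight * product of column probabilities."""
    total = 0.0
    for weight, cols in rows:
        p = weight
        for col, value in seg.items():
            p *= cols[col].get(value, 0.0)
        total += p
    return total

p = multi_row_prob(rows, {"HNO": "52-A", "AREA": "Bandra West",
                          "CITY": "Bombay", "PINCODE": "400 062"})
# 0.833*1*1*1*0.6 + 0.5*0*0*1*0.4 = 0.4998 ≈ 0.50, matching the slide
```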
Goal
• Devise algorithms to efficiently populate these imprecise data models
  – For the multi-row model, populate a constant number of rows
  – Avoid enumeration of top-k segmentations
• Is the multi-row model worth all this effort?
Outline
• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions
CRF: DAG View

[DAG over the input "52-A Bandra West Bombay 400 062", with one node per candidate segment: 52, 52-A, Bandra, Bandra West, Bombay, West Bombay, Bandra West Bombay, 400 062]

• Segment ≡ Node
• Segmentation ≡ Path
• Probability ∝ Score = exp(Σ_u F(u) + Σ_e G(e))
Semantic Mismatch
• CRF: Probability factors over segments
• Imprecise DB: Probability factors over labels
We have to approximate one family of distributions by another.
Populating Models: Approximating P(·)
• Metric: KL(P||Q) = Σ_s P(s) log(P(s)/Q(s))
  – Zero only when Q = P, positive otherwise
  – Q tries to match the modes of P
  – Easy to work with
  – Enumeration over s not required
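To make the metric concrete, a minimal sketch assuming dict-based distributions over segmentations (names are illustrative):

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P||Q) = sum over s of P(s) * log(P(s)/Q(s)).

    P and Q map outcomes (e.g. segmentations) to probabilities; eps guards
    against outcomes that P supports but Q assigns zero mass.
    """
    return sum(p * math.log(p / max(Q.get(s, 0.0), eps))
               for s, p in P.items() if p > 0)

# True segmentation distribution from the running address example,
# against a uniform approximation.
P = {"s1": 0.1, "s2": 0.2, "s3": 0.5, "s4": 0.2}
Q = {"s1": 0.25, "s2": 0.25, "s3": 0.25, "s4": 0.25}
```

`kl_divergence(P, P)` is exactly zero, and `kl_divergence(P, Q)` is positive, matching the two properties listed above.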
Populating the One-row Model

min KL(P||Q) = min KL(P || Π_y Q_y)
             = Σ_y min KL(P_y || Q_y)

• Solution: Q_y(t,u) = P((t,u,y)) (the marginal)
  – computed via the forward (α) and backward (β) vectors
CRFs: Marginal Probabilities

Marginal P((t,u,y)) ∝ β_u(y) Σ_{y'} α_{t-1}(y') Score(t,u,y,y')

[DAG figure: the marginal of a highlighted segment (e.g. Bandra West) sums over all paths passing through it]
CRF: Forward Vectors

α_i(y) = score of all paths ending at position i with label y
       = Σ_prefix score of (prefix + edge from prefix)
       = Σ_{y',d} α_{i-d}(y') Score(i-d+1, i, y', y)

[DAG figure over the segments of "52-A Bandra West Bombay 400 062"]
CRF: Backward Vectors

β_i(y) = score of all paths starting at position i+1, with label y at position i
       = Σ_suffix score of (suffix + edge to suffix)
       = Σ_{y',d} β_{i+d}(y') Score(i+1, i+d, y, y')

[DAG figure over the segments of "52-A Bandra West Bombay 400 062"]
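The forward recursion can be sketched directly. The score function here is a toy stand-in (uniform 1.0), not the paper's trained potentials, and all names are illustrative:

```python
def forward(n, labels, max_len, score):
    """alpha[i][y] = total score of all segmentations of x[1..i] whose last
    segment has label y.

    score(t, u, y_prev, y) is the (exponentiated) potential of segment
    x[t..u] with label y following a segment labelled y_prev; y_prev is
    None when the segment starts the sequence.
    """
    alpha = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(max_len, i) + 1):
                t = i - d + 1  # segment covers positions t..i
                if t == 1:
                    total += score(t, i, None, y)
                else:
                    total += sum(alpha[t - 1][yp] * score(t, i, yp, y)
                                 for yp in labels)
            alpha[i][y] = total
    return alpha

# Toy check: with uniform scores, Z counts labelled segmentations.
# For n=3 and 2 labels: 4 segmentations of length 3, giving 8+4+4+2 = 18 paths.
alpha = forward(3, ["AREA", "CITY"], 3, lambda t, u, yp, y: 1.0)
Z = sum(alpha[3].values())  # partition function
```

The backward vectors follow the mirror-image recursion over suffixes.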
Populating the Multi-row Model
• log Q_multi is unwieldy and disallows a closed-form solution
  – But the problem is the same as estimating the parameters of a mixture
• Algorithms
  – Enumeration-based approach
  – Enumeration-less approach
Enumeration-based approach

Segmentations, probabilities → rows

• The edge weight of s → k denotes the 'soft membership' of segmentation s in row k
• It depends on how strongly row k 'generates' s
• Row parameters depend on their member segmentations
Continued...
• EM algorithm
  – Start with random edge weights
  – Repeat
    • Compute new row parameters (M-step)
    • Compute new edge weights (E-step)
• Monotone convergence to a locally optimal solution is guaranteed
  – In practice, global optimality is often achieved by taking multiple starting points
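A minimal sketch of this EM loop over enumerated segmentations, using the running address example. The function names, initialization scheme, and data layout are illustrative, not the paper's code:

```python
import math
import random

def em_fit(segs, probs, cols, k, iters=50, seed=0):
    """Fit k rows (weight pi_k plus one multinomial per column) to an
    enumerated list of segmentations with probabilities, by EM."""
    rng = random.Random(seed)
    # Random soft memberships R[d][j] of segmentation d in row j.
    R = []
    for _ in segs:
        w = [rng.random() + 1e-3 for _ in range(k)]
        tot = sum(w)
        R.append([x / tot for x in w])
    pi, Q = [1.0 / k] * k, []
    for _ in range(iters):
        # M-step: row weights and per-column multinomials from memberships.
        pi = [sum(probs[d] * R[d][j] for d in range(len(segs)))
              for j in range(k)]
        Q = []
        for j in range(k):
            qj = {}
            for c in cols:
                counts = {}
                for d, seg in enumerate(segs):
                    counts[seg[c]] = counts.get(seg[c], 0.0) + probs[d] * R[d][j]
                tot = sum(counts.values()) or 1.0
                qj[c] = {v: n / tot for v, n in counts.items()}
            Q.append(qj)
        # E-step: R(j | s_d) proportional to pi_j * Q_j(s_d).
        for d, seg in enumerate(segs):
            lik = [pi[j] * math.prod(Q[j][c].get(seg[c], 0.0) for c in cols)
                   for j in range(k)]
            z = sum(lik) or 1.0
            R[d] = [l / z for l in lik]
    return pi, Q

segs = [
    {"HNO": "52",   "AREA": "Bandra West", "CITY": "Bombay"},
    {"HNO": "52-A", "AREA": "Bandra",      "CITY": "West Bombay"},
    {"HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay"},
    {"HNO": "52",   "AREA": "Bandra",      "CITY": "West Bombay"},
]
pi, Q = em_fit(segs, [0.1, 0.2, 0.5, 0.2], ["HNO", "AREA", "CITY"], k=2)
```

After fitting, the row weights sum to one and every per-column multinomial is normalized; multiple seeds stand in for the "multiple starting points" mentioned above.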
Enumeration-less approach

Segmentations (not enumerated) → [Grouping mechanism] → Compound segmentations (enumerated) → [EM, similar to the enumeration-based approach] → Multi-row model (2 rows)
Grouping Mechanism
• Divide the DAG paths into distinct sets
  – e.g. by whether they pass through a node or not, or whether they contain a given starting position or not
• We have the following boolean tests
  – A_{tuy} = is the segment (t,u,y) present?
  – B_{ty} = does a segment start at t with label y?
  – C_{uy} = does a segment end at u with label y?
• P(A), P(B), P(C) can be computed easily using the α and β vectors
Grouping Mechanism: Example

HNO  | AREA        | CITY        | PINCODE | Id
52   | Bandra West | Bombay      | 400 062 | s1
52-A | Bandra      | West Bombay | 400 062 | s2
52-A | Bandra West | Bombay      | 400 062 | s3
52   | Bandra      | West Bombay | 400 062 | s4

[Decision tree: test A_{'52-A',HNO}; outcome 0 → {s1, s4}; outcome 1 → test B_{'West',*}; outcome 0 → {s3}; outcome 1 → {s2}]

P({s1,s4}) = P(A=0),
P({s3}) = P(A=1, B=0)

We can work with boolean tests instead of segmentations!
Computed via α and β, like we did for marginals.
Choosing the boolean tests
• At each step, choose a current partition and a variable that 'best' bisects that partition
  – 'Best' is computed by an 'approximation-quality' metric
  – It can be computed using α and β
Experiments: Need for multi-row
• KL is very high at m=1; the one-row model is clearly inadequate.
• Even a two-row model is sufficient in many cases.
Experiments: KL divergence (Address)
• Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.
Related Work
• Imprecise DBs
  – Families of data models ([Fuhr '90], [Sarma et al. '06], [Hung et al. '03])
  – Efficient query processing ([Cheng et al. '03], [Dalvi and Suciu '04], [Andritsos et al. '06])
  – Aggregate queries ([Ross et al. '05])
Related Work (Contd.)
• Approximating intractable distributions
  – With tree distributions ([Chow and Liu '68])
  – Mixture of mean-fields ([Jaakkola and Jordan '99])
  – With variational methods ([Jordan et al. '99])
  – Tree-based reparameterization ([Wainwright et al. '01])
Conclusions
• First approach to quantify the uncertainty of information extraction through imprecise database models
• Algorithms to populate multi-row DBs
  – A competitive enumeration-less algorithm for CRFs
• Future Work
  – Representing the uncertainty of de-duplication
  – Multi-table uncertainties
Questions?
Backup: Expts: Number of partitions?
• Far fewer partitions are generated for merging compared to the ideal enumeration-based approach.
• The final EM runs on substantially fewer (compound) segmentations.
Backup: Expts: Sensitivity to cut-off
• The info-gain cut-off is largely independent of the dataset.
• Approximation quality is not resilient to small changes in the info-gain cut-off.
Experiments: KL divergence (Address, Cora)
• Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.
Backup: Segmentation Models (Semi-CRF)

s: segmentation; (t,u,y): segment

P(s|x) = exp(W^T · Σ_j f(j,x,s)) / Z(x)

e.g. s = [52-A]_HNo [Bandra West]_Area [Bombay]_City [400 062]_Pincode

• f(j,x,s) = feature vector for the jth segment
  – x_j is a valid pincode ∧ y_{j-1} = CITY
  – x_j contains 'East' or 'West' ∧ j ≤ n-1 ∧ y_j = AREA
Backup: EM Details

• E-step: R(k|s_d) ∝ π_k Q_k(s_d)
  – R(k|s_d) = 'soft' assignment of s_d to row k
  – R(k|s_d) favours the row that generates s_d with high probability
• M-step:
  Q_y^k(t,u) ∝ Σ_{s_d: (t,u,y) ∈ s_d} P(s_d) R(k|s_d)
  π_k = Σ_{s_d} P(s_d) R(k|s_d)
  (π_k = weight of a row = its total member count, weighted by their probabilities)
Backup: Enumeration-less approach

Segmentations (not enumerated) → [Grouping Mechanism] → Compound segmentations (hard partitions: enumerated) → Multi-row model (4 rows)
Backup: Issues
• Hard partitioning is not as good as soft partitioning.
• Generally, the number of rows is too small; sufficient partitioning may be infeasible.
• Solution:
  – Generate many more partitions than rows.
  – Apply the EM algorithm to soft-assign the partitions to the rows.
Merging
• Each partition is a 'compound' segmentation (s).
• We can afford to enumerate partitions.
• M-step
  – Needs P(s) and R(k|s); P(s) is the same as P(c).
• E-step
  – Needs Q_k(s) = probability of generating compound segmentation s from the kth row. Tricky!
Backup: Merging (E-step)
• Idea:
  – Decompose the generation task over the labels (they are independent)
  – The probability of generating the P_y(s) multinomial from the Q_y^k multinomial depends on the 'Bregman' divergence = KL distance (for multinomials)

Q_k(s) ∝ π_k exp(-Σ_y KL(P_y(s) || Q_y^k))
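This E-step score can be sketched with dict-based multinomials (names and data are illustrative):

```python
import math

def kl(p, q, eps=1e-12):
    """KL between two multinomials given as value -> probability dicts."""
    return sum(pv * math.log(pv / max(q.get(v, 0.0), eps))
               for v, pv in p.items() if pv > 0)

def row_score(pi_k, P_s, Q_k):
    """Q_k(s) up to normalization: pi_k * exp(-sum over labels y of
    KL(P_y(s) || Q_y^k))."""
    return pi_k * math.exp(-sum(kl(P_s[y], Q_k[y]) for y in P_s))

# A compound segmentation whose per-label multinomials exactly match row k's
# multinomials incurs zero KL, so its score is just the row weight pi_k.
score = row_score(0.6,
                  {"AREA": {"Bandra West": 1.0}},
                  {"AREA": {"Bandra West": 1.0}})
```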
Expts: What about query results?
• KL divergence is highly correlated with ranking inversion errors.
• Far fewer ranking errors for the merging approach.
Approximating segmentation models with mixtures

Approximate P ∝ exp(W^T · F(x)) with
Q = Σ_k π_k Q_k(x) = Σ_k π_k Π_i Q_ki(x)

• Metrics
  – Approximation: KL(P||Q)
  – Data querying: rank inversion, square error
    • Can be computationally intractable / non-decomposable