
Page 1:

Creating Imprecise Databases from Information Extraction Models
Rahul Gupta (IBM India Research Lab)
Sunita Sarawagi (IIT Bombay)

Page 2:

State of the art:

text → Extraction Model (CRF) → Extraction 1 (probability p1), Extraction 2 (probability p2), Extraction 3 (probability p3), ..., Extraction k (probability pk) → Imprecise Database

Q1. Isn't storing the best extraction sufficient?
Q2. Can we store all this efficiently?

Page 3:

CRFs: Probability = Confidence

• If N segmentations are labeled with probability p, then around Np should be correct.

[Plot: fraction of correct extractions vs. probability of the top segmentation]
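A minimal sketch of how such a calibration curve can be computed, assuming a list of (top-segmentation probability, was-it-correct) pairs is available; the binning scheme and names are illustrative, not taken from the paper.

```python
from collections import defaultdict

def calibration_curve(results, num_bins=10):
    """results: list of (top_probability, is_correct) pairs.
    Returns {bin_index: fraction correct}; if the model is well calibrated,
    a bin whose average predicted probability is p should show roughly a
    fraction p of correct extractions."""
    totals, correct = defaultdict(int), defaultdict(int)
    for prob, ok in results:
        b = min(int(prob * num_bins), num_bins - 1)  # clamp prob = 1.0 into the last bin
        totals[b] += 1
        correct[b] += int(ok)
    return {b: correct[b] / totals[b] for b in sorted(totals)}

# Example: three extractions with top probability around 0.7, two of them correct.
print(calibration_curve([(0.72, True), (0.69, True), (0.71, False)]))
```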

Page 4:

Is the best extraction sufficient?

[Plot: square error vs. number of columns in the projection query, comparing "only best extraction" with "all extractions with probabilities"]

Page 5:

State of the art:

text → Extraction Model (CRF) → Extraction 1 (probability p1), Extraction 2 (probability p2), Extraction 3 (probability p3), ..., Extraction k (probability pk) → Imprecise Database

Q1. Isn't storing the best extraction sufficient? NO!
Q2. Can we store all this efficiently?

Page 6:

How much to store?

• k can be exponential in the worst case
• Retrieval is O(k log k)
• Arbitrary amount of storage not available

[Histogram: frequency vs. number of segmentations required to cover 0.9 probability, with buckets 1, 2, 3, 4-10, 11-20, 21-30, 31-50, 51-200, >200]

Page 7:

State of the art:

text → Extraction Model (CRF) → Extraction 1 (probability p1), Extraction 2 (probability p2), Extraction 3 (probability p3), ..., Extraction k (probability pk) → Imprecise Database

Q1. Isn't storing the best extraction sufficient? NO!
Q2. Can we store all this efficiently? NO!

Page 8:

Our approach:

text → CRF → Imprecise Database, without materialising the intermediate list of Extraction 1 (p1), Extraction 2 (p2), ..., Extraction k (pk)

Page 9:

Example

Input: "52-A Bandra West Bombay 400 062"

The CRF produces the following segmentations:

HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2

Page 10:

Imprecise Data Models

• Segmentation-per-row model (Exact)

• One-row model

• Multi-row model

Page 11:

Segmentation-per-row model
(Rows: Uncertain; Cols: Exact)

HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2

Exact but impractical: we can have too many segmentations!

Page 12:

One-row Model
Each column is a multinomial distribution
(Row: Exact; Columns: Indep, Uncertain)

HNO        | AREA              | CITY              | PINCODE
52 (0.3)   | Bandra West (0.6) | Bombay (0.6)      | 400 062 (1.0)
52-A (0.7) | Bandra (0.4)      | West Bombay (0.4) |

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 × 0.6 × 0.6 × 1.0 = 0.252

Simple model, closed form solution, but crude.
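A minimal sketch of how the one-row model scores a segmentation, assuming each column is stored as a dict from segment text to probability; the numbers are the ones from this slide.

```python
one_row = {
    "HNO":     {"52": 0.3, "52-A": 0.7},
    "AREA":    {"Bandra West": 0.6, "Bandra": 0.4},
    "CITY":    {"Bombay": 0.6, "West Bombay": 0.4},
    "PINCODE": {"400 062": 1.0},
}

def one_row_prob(model, segmentation):
    """segmentation: dict mapping column name to the extracted segment text.
    Columns are treated as independent, so the probability is a product of
    per-column multinomial probabilities."""
    p = 1.0
    for col, segment in segmentation.items():
        p *= model[col].get(segment, 0.0)
    return p

# Reproduces the slide: 0.7 * 0.6 * 0.6 * 1.0 = 0.252
print(one_row_prob(one_row, {"HNO": "52-A", "AREA": "Bandra West",
                             "CITY": "Bombay", "PINCODE": "400 062"}))
```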

Page 13:

Multi-row Model
Segmentation generated by a 'mixture' of rows
(Rows: Uncertain; Columns: Indep, Uncertain)

HNO                      | AREA              | CITY              | PINCODE       | Prob
52 (0.167), 52-A (0.833) | Bandra West (1.0) | Bombay (1.0)      | 400 062 (1.0) | 0.6
52 (0.5), 52-A (0.5)     | Bandra (1.0)      | West Bombay (1.0) | 400 062 (1.0) | 0.4

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.833 × 1.0 × 1.0 × 1.0 × 0.6 + 0.5 × 0.0 × 0.0 × 1.0 × 0.4 ≈ 0.50
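A minimal sketch of the corresponding multi-row computation: each row carries a weight and its own per-column multinomials, and the segmentation probability is the weighted sum over rows (numbers again taken from this slide).

```python
multi_row = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833},
           "AREA": {"Bandra West": 1.0},
           "CITY": {"Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5},
           "AREA": {"Bandra": 1.0},
           "CITY": {"West Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
]

def multi_row_prob(rows, segmentation):
    """Mixture of one-row models: sum over rows of (row weight) times the
    product of per-column probabilities within that row."""
    total = 0.0
    for weight, columns in rows:
        p = weight
        for col, segment in segmentation.items():
            p *= columns[col].get(segment, 0.0)
        total += p
    return total

# Reproduces the slide: 0.833*1*1*1*0.6 + 0.5*0*0*1*0.4 ≈ 0.50
print(multi_row_prob(multi_row, {"HNO": "52-A", "AREA": "Bandra West",
                                 "CITY": "Bombay", "PINCODE": "400 062"}))
```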

Page 14:

Goal

• Devise algorithms to efficiently populate these imprecise data models
  – For the multi-row model, populate a constant number of rows
  – Avoid enumeration of top-k segmentations
• Is the multi-row model worth all this effort?

Page 15:

Outline

• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions

Page 16:

CRF: DAG View

[DAG over "52-A Bandra West Bombay 400 062"; nodes are candidate segments: 52, 52-A, Bandra, Bandra West, Bandra West Bombay, Bombay, West Bombay, 400 062]

• Segment ≈ Node
• Segmentation ≈ Path
• Probability ∝ Score = exp(Σ_u F(u) + Σ_e G(e))

Page 17:

Semantic Mismatch

• CRF: Probability factors over segments
• Imprecise DB: Probability factors over labels

We have to approximate one family of distributions by another.

Page 18:

Populating Models: Approximating P()

• Metric: KL(P||Q) = Σ_s P(s) log(P(s)/Q(s))
  – Zero only when Q = P, positive otherwise
  – Q tries to match the modes of P
  – Easy to work with
  – Enumeration over s not required
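For intuition, a minimal sketch of this metric over an explicitly enumerated set of segmentations (the algorithms that follow avoid such enumeration); the approximation Q below is illustrative.

```python
import math

def kl_divergence(P, Q):
    """KL(P||Q) = sum_s P(s) * log(P(s)/Q(s)) over segmentations s.
    P and Q are dicts from segmentation id to probability; terms with
    P(s) = 0 contribute nothing."""
    return sum(p * math.log(p / Q[s]) for s, p in P.items() if p > 0)

P = {"s1": 0.1, "s2": 0.2, "s3": 0.5, "s4": 0.2}           # exact distribution
Q = {"s1": 0.126, "s2": 0.168, "s3": 0.294, "s4": 0.412}   # hypothetical approximation
print(kl_divergence(P, Q))   # > 0; equals 0 only when Q matches P
```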

Page 19:

Outline

• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions

Page 20:

Populating the One-row Model

min KL(P||Q) = min KL(P || Π_y Q_y) = Σ_y min KL(P_y || Q_y)

• Solution: Q_y(t,u) = P(t,u,y) (the segment marginal), computed via the α and β vectors
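A minimal sketch of this population step; for illustration the marginals are obtained by enumerating the running example's segmentations, whereas the paper computes them from α and β without enumeration.

```python
from collections import defaultdict

# Hypothetical enumerated segmentations for the running address example:
# each entry is (probability, {label: segment text}).
segmentations = [
    (0.1, {"HNO": "52",   "AREA": "Bandra West", "CITY": "Bombay",      "PINCODE": "400 062"}),
    (0.2, {"HNO": "52-A", "AREA": "Bandra",      "CITY": "West Bombay", "PINCODE": "400 062"}),
    (0.5, {"HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay",      "PINCODE": "400 062"}),
    (0.2, {"HNO": "52",   "AREA": "Bandra",      "CITY": "West Bombay", "PINCODE": "400 062"}),
]

def one_row_from_marginals(segmentations):
    """Q_y(segment) = marginal probability that label y covers that segment."""
    columns = defaultdict(lambda: defaultdict(float))
    for prob, seg in segmentations:
        for label, text in seg.items():
            columns[label][text] += prob
    return {label: dict(dist) for label, dist in columns.items()}

# Recovers the one-row table from the earlier slide, e.g. HNO: {52: 0.3, 52-A: 0.7}
print(one_row_from_marginals(segmentations))
```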

Page 21:

CRFs: Marginal Probabilities

Marginal P(t,u,y) ∝ β_u(y) Σ_{y'} α_{t-1}(y') Score(t,u,y,y')

[DAG figure, nodes 52, 52-A, Bandra, Bandra West, Bandra West Bombay, Bombay, West Bombay, 400 062; what is the marginal of the highlighted segment?]

Page 22:

CRF: Forward Vectors

[DAG figure, nodes 52, 52-A, Bandra, Bandra West, Bandra West Bombay, Bombay, West Bombay, 400 062]

α_i(y) = score of all paths ending at position i with label y
       = Σ_prefix Score(prefix + edge from prefix)
       = Σ_{y',d} α_{i-d}(y') Score(i-d+1, i, y', y)

Page 23:

CRF: Backward Vectors

[DAG figure, nodes 52, 52-A, Bandra, Bandra West, Bandra West Bombay, Bombay, West Bombay, 400 062]

β_i(y) = score of all paths starting at position i+1, given label y at position i
       = Σ_suffix Score(suffix + edge to suffix)
       = Σ_{y',d} β_{i+d}(y') Score(i+1, i+d, y, y')
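A minimal sketch of the two recursions together with the segment marginal from the earlier slide, working directly with exponentiated scores; the score(t, u, y_prev, y) callback is an assumed stand-in for the CRF's segment potential, and a real implementation would work in log space to avoid overflow.

```python
def forward_backward(n, labels, score, max_seg_len):
    """Semi-CRF forward/backward over token positions 1..n.
    score(t, u, y_prev, y) is the exponentiated potential of the segment
    covering tokens t..u with label y, following a segment labelled y_prev
    (y_prev is None for the first segment of the sequence)."""
    alpha = [{y: 0.0 for y in labels} for _ in range(n + 1)]
    beta  = [{y: 0.0 for y in labels} for _ in range(n + 1)]

    # Forward pass: alpha[i][y] sums over partial segmentations of tokens 1..i
    # whose last segment has label y.
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(max_seg_len, i) + 1):
                t = i - d + 1
                if t == 1:
                    total += score(t, i, None, y)
                else:
                    total += sum(alpha[t - 1][yp] * score(t, i, yp, y) for yp in labels)
            alpha[i][y] = total

    # Backward pass: beta[i][y] sums over completions of tokens i+1..n given
    # label y at position i; the empty suffix at i = n contributes 1.
    for y in labels:
        beta[n][y] = 1.0
    for i in range(n - 1, 0, -1):
        for y in labels:
            beta[i][y] = sum(beta[i + d][yn] * score(i + 1, i + d, y, yn)
                             for d in range(1, min(max_seg_len, n - i) + 1)
                             for yn in labels)

    Z = sum(alpha[n][y] for y in labels)   # partition function
    return alpha, beta, Z

def segment_marginal(t, u, y, alpha, beta, Z, labels, score):
    """P(segment (t,u,y)) = (score of paths into the segment) * (paths out) / Z."""
    if t == 1:
        inflow = score(1, u, None, y)
    else:
        inflow = sum(alpha[t - 1][yp] * score(t, u, yp, y) for yp in labels)
    return inflow * beta[u][y] / Z

# Usage with a hypothetical score function:
#   alpha, beta, Z = forward_backward(n=6, labels=["HNO", "AREA", "CITY", "PINCODE"],
#                                     score=my_score, max_seg_len=3)
```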

Page 24:

Outline

• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions

Page 25:

Populating the Multi-row Model

• log Q_multi is unwieldy and disallows a closed-form solution
  – But the problem is the same as estimating the parameters of a mixture
• Algorithms
  – Enumeration-based approach
  – Enumeration-less approach

Page 26:

Enumeration-based approach

• Edge weight of s → k denotes the 'soft membership' of segmentation s in row k
  – Depends on how strongly row k 'generates' s
• Row parameters depend on the row's member segmentations

[Bipartite graph figure: segmentations with their probabilities on one side, rows on the other]

Page 27:

Continued...

• EM Algorithm
  – Start with random edge weights
  – Repeat:
    • Compute new row parameters (M-step)
    • Compute new edge weights (E-step)
• Monotone convergence to a locally optimal solution is guaranteed
  – In practice, global optimality is often achieved by taking multiple starting points
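A minimal sketch of this EM loop over an enumerated set of segmentations, following the E/M updates spelled out in the backup slides; the initialisation and stopping rule are illustrative choices.

```python
import random
from math import prod
from collections import defaultdict

def fit_multi_row(segmentations, labels, num_rows, iters=50, seed=0):
    """segmentations: list of (P(s), {label: segment text}) pairs, e.g. the
    enumerated top-k segmentations with their CRF probabilities.
    Returns row weights pi[k] and per-row column multinomials Q[k][label][text]."""
    rng = random.Random(seed)
    # Random soft memberships R[d][k] to start; several restarts help in practice.
    R = []
    for _ in segmentations:
        w = [rng.random() for _ in range(num_rows)]
        R.append([v / sum(w) for v in w])

    pi, Q = None, None
    for _ in range(iters):
        # M-step: re-estimate row weights and per-column multinomials.
        pi = [0.0] * num_rows
        Q = [{label: defaultdict(float) for label in labels} for _ in range(num_rows)]
        for (p, seg), resp in zip(segmentations, R):
            for k in range(num_rows):
                pi[k] += p * resp[k]
                for label, text in seg.items():
                    Q[k][label][text] += p * resp[k]
        total = sum(pi)
        pi = [w / total for w in pi]
        for k in range(num_rows):
            for label in labels:
                norm = sum(Q[k][label].values()) or 1.0
                for text in Q[k][label]:
                    Q[k][label][text] /= norm
        # E-step: R(k|s) proportional to pi[k] * Q_k(s), product over columns.
        for d, (p, seg) in enumerate(segmentations):
            scores = [pi[k] * prod(Q[k][label].get(text, 0.0)
                                   for label, text in seg.items())
                      for k in range(num_rows)]
            z = sum(scores) or 1.0
            R[d] = [s / z for s in scores]
    return pi, Q
```

With the four segmentations of the running address example and num_rows=2, a good starting point lets this recover a two-row model like the one shown earlier (KL ≈ 0).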

Page 28:

Outline

• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions

Page 29:

Enumeration-less approach

Segmentations (not enumerated) → [Grouping mechanism] → Compound segmentations (enumerated) → [EM] → Multi-row model (2 rows)

The EM step is similar to the enumeration-based approach.

Page 30:

Grouping Mechanism

• Divide the DAG paths into distinct sets
  – e.g. whether or not they pass through a node, or whether or not they contain a given starting position
• We have the following boolean tests
  – A_tuy = is the segment (t,u,y) present?
  – B_ty = does a segment start at t with label y?
  – C_uy = does a segment end at u with label y?
• P(A), P(B), P(C) can be computed easily using the α and β vectors

Page 31:

Grouping Mechanism: Example

HNO  | AREA        | CITY        | PINCODE | Id
52   | Bandra West | Bombay      | 400 062 | s1
52-A | Bandra      | West Bombay | 400 062 | s2
52-A | Bandra West | Bombay      | 400 062 | s3
52   | Bandra      | West Bombay | 400 062 | s4

[Decision tree: root test A_{'52-A',HNO}; outcome 0 → {s1, s4}; outcome 1 → test B_{'West',*}; outcome 0 → {s3}; outcome 1 → {s2}]

P({s1,s4}) = P(A=0),  P({s3}) = P(A=1, B=0)

We can work with boolean tests instead of segmentations!
Compute via α and β, like we did for marginals.
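A minimal sketch of this partitioning on the enumerated example above; the test functions are simplified stand-ins for A_{'52-A',HNO} and B_{'West',*}, and in the actual algorithm the same probabilities come from α and β without enumeration.

```python
# The four segmentations of the running example as (probability, {column: segment}):
segs = [
    (0.1, {"HNO": "52",   "AREA": "Bandra West", "CITY": "Bombay",      "PINCODE": "400 062"}),  # s1
    (0.2, {"HNO": "52-A", "AREA": "Bandra",      "CITY": "West Bombay", "PINCODE": "400 062"}),  # s2
    (0.5, {"HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay",      "PINCODE": "400 062"}),  # s3
    (0.2, {"HNO": "52",   "AREA": "Bandra",      "CITY": "West Bombay", "PINCODE": "400 062"}),  # s4
]

def test_A(seg):          # is '52-A' present as the HNO segment?
    return seg["HNO"] == "52-A"

def test_B(seg):          # does some segment start at the token 'West'?
    return any(text.startswith("West") for text in seg.values())

def partition_prob(segmentations, **outcomes):
    """Total probability of segmentations matching the given test outcomes,
    e.g. partition_prob(segs, A=True, B=False)."""
    tests = {"A": test_A, "B": test_B}
    return sum(p for p, seg in segmentations
               if all(tests[name](seg) == value for name, value in outcomes.items()))

print(partition_prob(segs, A=False))           # P({s1, s4}) = 0.3
print(partition_prob(segs, A=True, B=False))   # P({s3})     = 0.5
print(partition_prob(segs, A=True, B=True))    # P({s2})     = 0.2
```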

Page 32:

Choosing the boolean tests

• At each step, choose a current partition and the variable that 'best' bisects that partition
  – 'Best' is computed by an 'approximation-quality' metric
  – Can be computed using α and β

Page 33:

Outline

• DAG view of CRFs
• Populating the one-row model
• Populating the multi-row model
  – Enumeration-based approach
  – Enumeration-less approach
• Experiments
• Related work and conclusions

Page 34:

Experiments: Need for multi-row

• KL very high at m = 1. One-row model clearly inadequate.
• Even a two-row model is sufficient in many cases.

Page 35:

Experiments: KL divergence

[Plot: KL divergence on the Address dataset]

• Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.

Page 36:

Related Work

• Imprecise DBs
  – Families of data models ([Fuhr '90], [Sarma et al. '06], [Hung et al. '03])
  – Efficient query processing ([Cheng et al. '03], [Dalvi and Suciu '04], [Andritsos et al. '06])
  – Aggregate queries ([Ross et al. '05])

Page 37:

Related Work (Contd.)

• Approximating intractable distributions
  – With tree distributions ([Chow and Liu '68])
  – Mixture of mean-fields ([Jaakkola and Jordan '99])
  – With variational methods ([Jordan et al. '99])
  – Tree-based reparameterization ([Wainwright et al. '01])

Page 38:

Conclusions

• First approach to quantify the uncertainty of information extraction through imprecise database models
• Algorithms to populate multi-row DBs
  – A competitive enumeration-less algorithm for CRFs
• Future Work
  – Representing the uncertainty of de-duplication
  – Multi-table uncertainties

Page 39:

Questions?

Page 40:

Backup: Expts: Number of partitions?

• Far fewer partitions generated for merging as compared to the ideal enumeration-based approach
• The final EM runs on substantially fewer (compound) segmentations

Page 41:

Backup: Expts: Sensitivity to cut-off

• Info-gain cut-off largely independent of dataset
• Approximation quality not resilient to small changes in info-gain cut-off

Page 42:

Experiments: KL divergence

[Plots: KL divergence on the Address and Cora datasets]

• Accuracy of the enumeration-less approach is almost as good as the ideal enumeration-based approach.

Page 43:

Backup: Segmentation Models (Semi-CRF)

s: segmentation; (t,u,y): segment

P(s|x) = exp(W^T Σ_j f(j, x, s)) / Z(x)

e.g. s = [52-A]_HNo [Bandra West]_Area [Bombay]_City [400 062]_Pincode

• f(j, x, s) = feature vector for the jth segment
  – x_j is a valid pincode ∧ y_{j-1} = CITY
  – x_j contains 'East' or 'West' ∧ j ≤ n-1 ∧ y_j = AREA
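A minimal sketch of the unnormalised semi-CRF score of one segmentation; the two feature functions are illustrative stand-ins for the slide's examples, and Z(x) would come from the forward pass sketched earlier.

```python
import math
import re

def features(tokens, seg, prev_label):
    """seg = (t, u, label) with an inclusive 1-based token span t..u.
    Returns a dict of feature name -> value for this segment."""
    t, u, label = seg
    text = " ".join(tokens[t - 1:u])
    return {
        "valid_pincode_and_prev_CITY":
            float(bool(re.fullmatch(r"\d{3} \d{3}", text)) and prev_label == "CITY"),
        "contains_East_or_West_and_AREA":
            float(("East" in text or "West" in text)
                  and u <= len(tokens) - 1 and label == "AREA"),
    }

def segmentation_score(tokens, segmentation, weights):
    """Unnormalised score exp(W . sum_j f(j, x, s)); P(s|x) = score / Z(x)."""
    total = 0.0
    prev_label = None
    for seg in segmentation:
        for name, value in features(tokens, seg, prev_label).items():
            total += weights.get(name, 0.0) * value
        prev_label = seg[2]
    return math.exp(total)

tokens = "52-A Bandra West Bombay 400 062".split()
segmentation = [(1, 1, "HNO"), (2, 3, "AREA"), (4, 4, "CITY"), (5, 6, "PINCODE")]
weights = {"valid_pincode_and_prev_CITY": 1.5, "contains_East_or_West_and_AREA": 0.8}
print(segmentation_score(tokens, segmentation, weights))   # exp(1.5 + 0.8)
```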

Page 44:

Backup: EM Details

• E-step: R(k|s_d) ∝ π_k Q_k(s_d)
  – R(k|s_d) = 'soft' assignment of s_d to row k
  – R(k|s_d) favours the row that generates s_d with high probability
• M-step: Q_y^k(t,u) ∝ Σ_{s_d : (t,u,y) ∈ s_d} P(s_d) R(k|s_d)
          π_k = Σ_{s_d} P(s_d) R(k|s_d)
  (π_k = weight of a row = its total member count weighted by their probabilities)

Page 45:

Backup: Enumeration-less approach

Segmentations (not enumerated) → [Grouping Mechanism] → Compound segmentations (hard partitions: enumerated) → Multi-row model (4 rows)

Page 46:

Backup: Issues

• Hard partitioning is not as good as soft partitioning
• Generally, the number of rows is too small; sufficient partitioning may be infeasible
• Solution:
  – Generate many more partitions than rows
  – Apply the EM algorithm to soft-assign the partitions to the rows

Page 47:

Merging

• Each partition is a 'compound' segmentation (s)
• We can afford to enumerate the partitions
• M-step
  – Need P(s) and R(k|s). P(s) is the same as the partition's probability P(c).
• E-step
  – Need Q_k(s) = probability of generating compound segmentation s from the kth row. Tricky!

Page 48:

Backup: Merging (E-step)

• Idea:
  – Decompose the generation task over the labels (they are independent)
  – The probability of generating the P_y(s) multinomial from the Q_y^k multinomial depends on the 'Bregman' divergence = KL distance (for multinomials)

Q_k(s) ∝ π_k exp(−Σ_y KL(P_y(s) || Q_y^k))
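A minimal sketch of this E-step quantity for a single compound segmentation; the compound and row numbers reuse the running example, and the zero-probability floor is an illustrative simplification, not something the paper specifies.

```python
import math

def kl(p, q):
    """KL divergence between two multinomials given as {segment text: probability}."""
    return sum(pv * math.log(pv / q.get(text, 1e-12))
               for text, pv in p.items() if pv > 0)

def row_generation_scores(compound, rows):
    """compound: {label: P_y(s)} for one compound segmentation.
    rows: list of (pi_k, {label: Q_y^k}).  Returns the unnormalised
    Q_k(s) = pi_k * exp(-sum_y KL(P_y(s) || Q_y^k)) for every row k."""
    out = []
    for pi_k, columns in rows:
        divergence = sum(kl(compound[label], columns[label]) for label in compound)
        out.append(pi_k * math.exp(-divergence))
    return out

# Compound segmentation for the partition {s3} of the grouping example,
# scored against the two-row model from the multi-row slide:
compound = {"HNO": {"52-A": 1.0}, "AREA": {"Bandra West": 1.0},
            "CITY": {"Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}
rows = [(0.6, {"HNO": {"52": 0.167, "52-A": 0.833}, "AREA": {"Bandra West": 1.0},
               "CITY": {"Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}),
        (0.4, {"HNO": {"52": 0.5, "52-A": 0.5}, "AREA": {"Bandra": 1.0},
               "CITY": {"West Bombay": 1.0}, "PINCODE": {"400 062": 1.0}})]
print(row_generation_scores(compound, rows))   # row 1 dominates: roughly [0.50, ~0]
```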

Page 49:

Expts: What about query results?

• KL divergence is highly correlated with ranking-inversion errors
• Far fewer ranking errors for the merging approach

Page 50:

Approximating segmentation models with mixtures

Approximate P ∝ exp(W^T F(x)) with
Q = Σ_k π_k Q_k(x) = Σ_k π_k Π_i Q_ki(x)

• Metrics
  – Approximation: KL(P||Q)
  – Data querying: Rank-Inversion, Square-Error
• Can be computationally intractable / non-decomposable
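For the rank-inversion metric, a minimal sketch that counts segmentation pairs ordered differently by the exact distribution P and an approximation Q (brute-force pairwise comparison; the numbers below are illustrative).

```python
from itertools import combinations

def rank_inversions(P, Q):
    """Number of segmentation pairs that P and Q rank in opposite order."""
    return sum(1 for a, b in combinations(P, 2)
               if (P[a] - P[b]) * (Q[a] - Q[b]) < 0)

P = {"s1": 0.1, "s2": 0.2, "s3": 0.5, "s4": 0.2}          # exact probabilities
Q = {"s1": 0.15, "s2": 0.30, "s3": 0.25, "s4": 0.30}      # hypothetical approximation
print(rank_inversions(P, Q))   # 2: the pairs (s2,s3) and (s3,s4) are flipped by Q
```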