
Page 1: Information Extraction using HMMs


Information Extraction using HMMs

Sunita Sarawagi

Page 2: Information Extraction using HMMs

IE by text segmentation

Source: a concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.

Address example (element labels: House number, Building, Road, City, State, Zip):
4089 Whispering Pines Nobel Drive San Diego CA 92122

Bibliographic example (element labels: Author, Year, Title, Journal, Volume, Page):
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.

Page 3: Information Extraction using HMMs

Hidden Markov Models

Doubly stochastic models. Efficient dynamic-programming algorithms exist for:
finding Pr(S);
finding the highest-probability path P that maximizes Pr(S,P) (Viterbi);
training the model (Baum-Welch algorithm).

[Figure: a four-state HMM (S1-S4) with transition probabilities (0.9, 0.5, 0.8, 0.2, 0.1, ...) on the edges and, for each state, an emission distribution over the symbols A and C (e.g. A 0.6 / C 0.4, A 0.3 / C 0.7, A 0.5 / C 0.5, A 0.9 / C 0.1).]
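Since the figure is only schematic, here is a minimal sketch of what "finding Pr(S)" means computationally, i.e. the forward algorithm summing over all state paths; the two states and all probabilities below are made up for illustration.

```python
# Toy HMM over the symbols A and C; all numbers are illustrative.
start = {"s1": 0.5, "s2": 0.5}                      # Pr(first state)
trans = {"s1": {"s1": 0.8, "s2": 0.2},              # Pr(next state | state)
         "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"A": 0.6, "C": 0.4},                 # Pr(symbol | state)
        "s2": {"A": 0.1, "C": 0.9}}

def forward_prob(seq):
    """Pr(S): probability of the observation sequence, summed over all paths."""
    alpha = {s: start[s] * emit[s][seq[0]] for s in start}
    for sym in seq[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in alpha) * emit[t][sym]
                 for t in trans}
    return sum(alpha.values())

print(forward_prob("ACA"))
```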

Page 4: Information Extraction using HMMs

Input features

Content of the element (a small feature sketch follows this list):
specific keywords like street, zip, vol, pp;
properties of words such as capitalization, part of speech, whether the token is a number.
Inter-element sequencing
Intra-element sequencing
Element length
External database:
dictionary words;
semantic relationships between words.
Frequency constraints
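As a concrete illustration of the word-level features in this list, here is a small sketch that computes a few of them for a single token; the feature names and the keyword list are assumptions for the example, not the feature set used in the experiments.

```python
import re

KEYWORDS = {"street", "zip", "vol", "pp"}   # example keyword list

def word_features(token):
    return {
        "lower": token.lower(),
        "is_keyword": token.lower().strip(".,") in KEYWORDS,
        "is_capitalized": token[:1].isupper(),
        "is_number": bool(re.fullmatch(r"\d+", token)),
        "length": len(token),
    }

print(word_features("Street"))
print(word_features("92122"))
```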

Page 5: Information Extraction using HMMs

IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM whose states correspond to the elements Author, Year, Title, and Journal. Transition probabilities (e.g. 0.9, 0.5, 0.8, 0.2, 0.1) label the edges; each state has an emission table over words (e.g. A 0.6, B 0.3, C 0.1), and the Year state emits digit patterns dddd (0.8) and dd (0.2).]

Page 6: Information Extraction using HMMs

HMM Structure

Naïve model: one state per element.
Nested model: each element is itself an HMM.

Page 7: Information Extraction using HMMs

Comparing nested models

Naïve: single state per tag.
  Element length distribution: a, a^2, a^3, ... (made precise in the note below).
  Intra-tag sequencing not captured.
Chain:
  Element length distribution: each length gets its own parameter.
  Intra-tag sequencing captured.
  Arbitrary mixing of dictionary entries across positions, e.g. "California York"; Pr(W|L) not modeled well.
Parallel path:
  Element length distribution: each length gets a parameter.
  Separates the vocabulary of different-length elements (a limited bigram model).
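To make the "a, a^2, a^3, ..." point concrete: with a single self-looping state per tag, the implied element-length distribution is geometric. A short note, assuming the self-loop probability is a:

```latex
% Naive single-state model: assuming self-loop probability a (exit 1 - a),
% the implied element-length distribution is geometric, i.e. proportional
% to a, a^2, a^3, ...
\Pr(\text{length} = k) \;=\; a^{\,k-1}\,(1-a), \qquad k = 1, 2, 3, \dots
```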

Page 8: Information Extraction using HMMs

Embedding an HMM in a state

Page 9: Information Extraction using HMMs

Bigram model of Bikel et al.

Each inner model is a detailed bigram model:
First word: conditioned on the state and the previous state.
Subsequent words: conditioned on the previous word and the state.
Special "start" and "end" symbols that can be thought of as additional vocabulary items.
Large number of parameters (training data on the order of ~60,000 words in the smallest experiment).
Backing-off mechanism to simpler "parent" models (lambda parameters control the mixing).
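As a rough illustration of the lambda-controlled mixing, here is a minimal sketch that interpolates a state-conditioned bigram estimate with simpler parent estimates; the three-way interpolation form, the fixed lambda values, and the dictionaries are assumptions made for the example, not Bikel et al.'s exact backoff scheme.

```python
# Sketch of lambda-weighted backoff: mix a detailed state-conditioned bigram
# estimate with simpler "parent" estimates. The interpolation form and the
# fixed lambdas are illustrative only.
def backoff_prob(word, prev_word, state, bigram, unigram, uniform,
                 lam1=0.7, lam2=0.2):
    p_bigram = bigram.get((state, prev_word, word), 0.0)   # P(word | prev_word, state)
    p_unigram = unigram.get((state, word), 0.0)            # P(word | state)
    lam3 = 1.0 - lam1 - lam2
    return lam1 * p_bigram + lam2 * p_unigram + lam3 * uniform

bigram = {("Title", "Protein", "and"): 0.3}    # made-up estimates
unigram = {("Title", "and"): 0.05}
print(backoff_prob("and", "Protein", "Title", bigram, unigram, uniform=1e-4))
```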

Page 10: Information Extraction using HMMs

Separate HMM per tag

Special prefix and suffix states capture the start and end of a tag.

[Figure: two inner HMMs, one for Road name (states S1, S2, S4) and one for Building name (states S1-S4), each preceded by a Prefix state and followed by a Suffix state.]

Page 11: Information Extraction using HMMs

HMM Dictionary

For each word (= feature), associate the probability of emitting that word (multinomial model).
Features of a word, for example: part of speech, capitalized or not, type (number, letter, word, etc.).
Maximum entropy models (McCallum 2000) and other exponential models.
Bikel: <word, feature> pairs.

Page 12: Information Extraction using HMMs

Feature Hierarchy

[Figure: a hierarchy of token features rooted at All, with children Numbers (3-digits: 000..999; 5-digits: 00000..99999; Others: 0..99, 0000..9999, 000000..), Chars (A..z), Multi-letter Words (aa..), and Delimiters (. , / - + ? #).]
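A small sketch of mapping a token to a node in this hierarchy; the node names follow the figure, while the regular expressions and exact class boundaries are assumptions made for the example.

```python
import re

# Map a token to its (coarse -> fine) feature classes in the hierarchy.
DELIMITERS = {".", ",", "/", "-", "+", "?", "#"}

def feature_class(token):
    if re.fullmatch(r"\d+", token):
        if len(token) == 3:
            return ("All", "Numbers", "3-digits")
        if len(token) == 5:
            return ("All", "Numbers", "5-digits")
        return ("All", "Numbers", "Others")
    if re.fullmatch(r"[A-Za-z]", token):
        return ("All", "Chars")
    if re.fullmatch(r"[A-Za-z]+", token):
        return ("All", "Multi-letter Words")
    if token in DELIMITERS:
        return ("All", "Delimiters")
    return ("All",)

print(feature_class("400070"))   # ('All', 'Numbers', 'Others')
print(feature_class("Mumbai"))   # ('All', 'Multi-letter Words')
```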

Page 13: Information Extraction using HMMs

Learning model parameters

When the training data defines a unique path through the HMM:
Transition probabilities: Pr(transition from state i to state j) = (number of transitions from i to j) / (total number of transitions out of state i).
Emission probabilities: Pr(emitting symbol k from state i) = (number of times k is generated from i) / (total number of symbols emitted from state i).

When the training data defines multiple paths: a more general EM-like algorithm (Baum-Welch).
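A minimal sketch of these count ratios, assuming each training record is given as a (state, token) sequence along its unique path; the toy record reuses the "115 Grant street Mumbai 400070" example from a later slide.

```python
from collections import Counter, defaultdict

# Sketch of the count-ratio estimates described above.
def estimate(labeled_records):
    trans_counts = defaultdict(Counter)   # trans_counts[i][j]
    emit_counts = defaultdict(Counter)    # emit_counts[i][token]
    for record in labeled_records:
        for state, token in record:
            emit_counts[state][token] += 1
        for (i, _), (j, _) in zip(record, record[1:]):
            trans_counts[i][j] += 1
    trans = {i: {j: n / sum(c.values()) for j, n in c.items()}
             for i, c in trans_counts.items()}
    emit = {i: {t: n / sum(c.values()) for t, n in c.items()}
            for i, c in emit_counts.items()}
    return trans, emit

record = [("House", "115"), ("Road", "Grant"), ("Road", "street"),
          ("City", "Mumbai"), ("Pin", "400070")]
trans, emit = estimate([record])
print(trans["Road"])   # {'Road': 0.5, 'City': 0.5}
print(emit["Road"])    # {'Grant': 0.5, 'street': 0.5}
```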

Page 14: Information Extraction using HMMs

Smoothing

Two kinds of missing symbols:
Case 1: unknown over the entire dictionary.
Case 2: zero count in some state.

Approaches:
Laplace smoothing: (k_i + 1) / (m + |T|)  (see the sketch below).
Absolute discounting: P(unknown) is proportional to the number of distinct tokens; P(unknown) = k' x (number of distinct symbols) and P(known) = (actual probability) - k', where k' is a small fixed constant (smaller for case 2 than for case 1).
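A minimal sketch of the Laplace rule above, applied to one state's emission counts; the vocabulary and counts are made up for the example.

```python
# Laplace smoothing as on the slide: (k_i + 1) / (m + |T|), where k_i is the
# count of symbol i in this state, m the total count, |T| the vocabulary size.
def laplace(counts, vocab):
    m = sum(counts.values())
    return {w: (counts.get(w, 0) + 1) / (m + len(vocab)) for w in vocab}

vocab = {"street", "road", "avenue", "lane"}
print(laplace({"street": 3, "road": 1}, vocab))
# unseen words ("avenue", "lane") now get probability 1/8 instead of 0
```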

Page 15: Information Extraction using HMMs

Smoothing (cont.)

Smoothing parameters derived from data:
Partition the training data into two parts; train on part 1; use part 2 to map all new tokens to UNK and treat UNK as a new word in the vocabulary.
OK for case 1, not good for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1/(vocabulary size).

Page 16: Information Extraction using HMMs

Using the HMM to segment

Find the highest-probability path through the HMM. Viterbi: a quadratic dynamic-programming algorithm.

[Figure: a trellis over the token sequence "115 Grant street Mumbai 400070", with one column of states (House, Road, City, Pin) per token o_t; segmentation corresponds to the highest-probability path through the trellis.]

Page 17: Information Extraction using HMMs

Most Likely Path for a Given Sequence

The probability that the path \pi_0, \dots, \pi_N is taken and the sequence x_1, \dots, x_L is generated:

\Pr(x_1, \ldots, x_L, \pi_0, \ldots, \pi_N) = a_{0\pi_1} \prod_{i=1}^{L} b_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}

where the a terms are transition probabilities and the b terms are emission probabilities (state 0 is the begin state, state N the end state, and \pi_{L+1} = N).

Page 18: Information Extraction using HMMs

Example

[Figure: a five-state HMM over the alphabet {A, C, G, T}: a begin state 0, emitting states 1-4 each with its own emission table (e.g. A 0.4, C 0.1, G 0.2, T 0.3), an end state 5, and transition probabilities such as 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9 on the edges.]

\Pr(\mathrm{AAC}, \pi) = a_{01}\, b_1(\mathrm{A})\, a_{11}\, b_1(\mathrm{A})\, a_{13}\, b_3(\mathrm{C})\, a_{35} = 0.5 \times 0.4 \times 0.2 \times 0.4 \times 0.8 \times 0.3 \times 0.6
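For completeness, a one-line check of the product above:

```python
# Product of the factors in the example above.
print(0.5 * 0.4 * 0.2 * 0.4 * 0.8 * 0.3 * 0.6)   # ~0.0023
```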

Page 19: Information Extraction using HMMs

Finding the most probable path: the Viterbi algorithm

Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.

We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.

v_k(i) can be defined recursively, so dynamic programming can be used to find it efficiently.

Page 20: Information Extraction using HMMs

Finding the most probable path: the Viterbi algorithm

Initialization: v_0(0) = 1, and v_k(0) = 0 for all other states k.

Page 21: Information Extraction using HMMs

The Viterbi algorithm

Recursion for the emitting states (i = 1 ... L):

v_l(i) = b_l(x_i) \max_k [ v_k(i-1)\, a_{kl} ]
\mathrm{ptr}_i(l) = \arg\max_k [ v_k(i-1)\, a_{kl} ]   (keeps track of the most probable path)

Page 22: Information Extraction using HMMs

The Viterbi algorithm

Termination:

\Pr(x, \pi^\ast) = \max_k [ v_k(L)\, a_{kN} ]
\pi_L^\ast = \arg\max_k [ v_k(L)\, a_{kN} ]

To recover the most probable path, follow the pointers back starting at \pi_L^\ast.
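Putting the recurrence, initialization, and termination from the last few slides together, here is a minimal Python sketch with explicit begin and end transitions; the tag names, toy probabilities, and the three-token address are illustrative only.

```python
# Sketch of the Viterbi recurrence v_l(i) = b_l(x_i) * max_k v_k(i-1) * a_kl,
# with begin transitions a_begin[l] (= a_0l) and end transitions a_end[k] (= a_kN).
def viterbi(x, states, a_begin, a, a_end, b):
    # initialization: v_0(0) = 1 is folded into the first column via a_begin
    v = [{l: a_begin[l] * b[l][x[0]] for l in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        col, back = {}, {}
        for l in states:
            k_best = max(states, key=lambda k: v[-1][k] * a[k][l])
            col[l] = b[l][x[i]] * v[-1][k_best] * a[k_best][l]
            back[l] = k_best
        v.append(col)
        ptr.append(back)
    # termination: Pr(x, path) = max_k v_k(L) * a_kN
    last = max(states, key=lambda k: v[-1][k] * a_end[k])
    prob = v[-1][last] * a_end[last]
    path = [last]
    for back in reversed(ptr[1:]):      # follow the pointers back
        path.append(back[path[-1]])
    return list(reversed(path)), prob

# Toy two-tag model; values are made up for illustration.
states = ["House", "Road"]
a_begin = {"House": 0.9, "Road": 0.1}
a = {"House": {"House": 0.2, "Road": 0.8},
     "Road":  {"House": 0.1, "Road": 0.9}}
a_end = {"House": 0.1, "Road": 0.9}
b = {"House": {"115": 0.7, "Grant": 0.1, "street": 0.2},
     "Road":  {"115": 0.1, "Grant": 0.5, "street": 0.4}}
print(viterbi(["115", "Grant", "street"], states, a_begin, a, a_end, b))
# (['House', 'Road', 'Road'], ~0.0816)
```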

Page 23: Information Extraction using HMMs

Database Integration

Augment the dictionary. Example: a list of cities. Assigning probabilities is a problem.
Exploit functional dependencies. Examples: Santa Barbara -> USA; Piskinov -> Georgia.
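To make the functional-dependency idea concrete, here is a deliberately simple sketch that relabels the token following a known city using a city -> country map; in the talk the dependency would inform the segmentation itself (as the next slide's example shows), and the two-entry map here is only a stand-in for a real external database.

```python
# Two-entry stand-in for an external database with the functional dependency
# city -> country; relabels the token that follows a recognized city.
CITY_COUNTRY = {"Santa Barbara": "USA", "Piskinov": "Georgia"}

def apply_fd(segments):
    """segments: list of (label, text) pairs from a first-pass segmentation."""
    out = list(segments)
    for i, (_, text) in enumerate(out[:-1]):
        if text in CITY_COUNTRY:
            out[i] = ("City", text)
            out[i + 1] = ("Country", out[i + 1][1])
    return out

print(apply_fd([("City", "Piskinov"), ("State", "Georgia")]))
# [('City', 'Piskinov'), ('Country', 'Georgia')]
```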

Page 24: Information Extraction using HMMs

2001 University Avenue, Kendall Sq., Piskinov, Georgia

Segmentation without the database: 2001 = House number, University Avenue = Road Name, Kendall Sq. = Area, Piskinov = City, Georgia = State.

Segmentation using the functional dependency: 2001 = House number, University Avenue = Road Name, Kendall Sq. = Area, Piskinov = City, Georgia = Country.

Page 25: Information Extraction using HMMs

Frequency constraints

Include constraints of the form: the same tag cannot appear in two disconnected segments.
E.g., the Title of a citation cannot appear twice; a street name cannot appear twice.
Not relevant for named-entity-tagging kinds of problems.

Page 26: Information Extraction using HMMs

Constrained Viterbi

Original Viterbi:  v_l(i) = b_l(x_i) \max_k [ v_k(i-1)\, a_{kl} ]

Modified Viterbi: ….
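The modified recurrence itself is elided on the slide, so the following is only one plausible way to enforce the "no tag in two disconnected segments" constraint, not necessarily the formulation used in the talk: carry the set of already-closed tags along in the dynamic-programming state. The parameters follow the same layout as the Viterbi sketch earlier; end-of-sequence transitions are omitted for brevity.

```python
# Hypothetical constrained Viterbi: the DP key is (current tag, frozenset of
# tags whose segments have already ended), so a closed tag can never reopen.
def constrained_viterbi(x, states, a_begin, a, b):
    layer = {(l, frozenset()): (a_begin[l] * b[l][x[0]], [l]) for l in states}
    for sym in x[1:]:
        nxt = {}
        for (k, closed), (p, path) in layer.items():
            for l in states:
                if l == k:
                    new_closed = closed            # same segment continues
                elif l in closed:
                    continue                       # would reopen a closed tag
                else:
                    new_closed = closed | {k}      # leaving k closes its segment
                q = p * a[k][l] * b[l][sym]
                key = (l, new_closed)
                if q > nxt.get(key, (0.0, None))[0]:
                    nxt[key] = (q, path + [l])
        layer = nxt
    return max(layer.values())   # (probability, best constrained path)
```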

Page 27: Information Extraction using HMMs

Comparative Evaluation

Naïve model: one state per element in the HMM.
Independent HMM: one HMM per element.
Rule-learning method: Rapier.
Nested model: each state in the Naïve model replaced by an HMM.

Page 28: Information Extraction using HMMs

Results: Comparative Evaluation

The Nested model does best in all three cases (from Borkar 2001).

Dataset                   Instances   Elements
IITB student addresses         2388         17
Company addresses               769          6
US addresses                    740          6

Page 29: Information Extraction using HMMs


Results: Effect of Feature Hierarchy

Feature Selection showed at least a 3% increase in accuracy

Page 30: Information Extraction using HMMs

Results: Effect of training data size

HMMs are fast learners: accuracy comes very close to the maximum with just 50 to 100 training addresses.

Page 31: Information Extraction using HMMs

HMM approach: summary

Inter-element sequencing  -> outer HMM transitions
Intra-element sequencing  -> inner HMM
Element length            -> multi-state inner HMM
Characteristic words      -> dictionary
Non-overlapping tags      -> global optimization