named entity recognition based on bilingual co-training

22
Named Entity Recognition based on Bilingual Co- training Li Yegang [email protected] School of Computer, BIT

Upload: nia

Post on 16-Jan-2016

34 views

Category:

Documents


0 download

DESCRIPTION

Named Entity Recognition based on Bilingual Co-training. Li Yegang [email protected] School of Computer, BIT. Outline. Introduction and Related Work Bilingual Co-training Corrective NE Projection Annotation Experiment Result. 1.Introduction. Related work Das et al.,2011 - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Named Entity Recognition based on Bilingual Co-training

Named Entity Recognition based on Bilingual Co-training

Li Yegang

[email protected]

School of Computer, BIT

Page 2: Named Entity Recognition based on Bilingual Co-training

Outline

• Introduction and Related Work

• Bilingual Co-training

• Corrective NE Projection Annotation

• Experiment Result

Page 3: Named Entity Recognition based on Bilingual Co-training

1.Introduction

• Related work– Das et al.,2011

• using parallel English-foreign language data, a high-quality NER tagger for English, and projected annotations for the foreign language .

– Burkett et al., 2010• Parallel data has also been used to improve existing

monolingual taggers or other analyzers in both languages.

Page 4: Named Entity Recognition based on Bilingual Co-training

Introduction

• Disadvantages– current NE alignment methods are not accurate

enough, and many noises could be introduced during the word alignment stage.

– manual annotation is usually obtained from a few limited domains, leading to a bad affect on statistical supervised learning methods.

Page 5: Named Entity Recognition based on Bilingual Co-training

Introduction

• Complementaritya) Results from Chinese name tagger : <PER> 金庸新 </PER > 小说 (b) Results from English name tagger:

the new novels of <PER>Jin Yong</PER>

(c) Name tagging after using bilingual co-training:

< PER > 金庸 </PER > 新小说• “金庸新” and “ 金庸” in Chinese can be a PER name,

while its English translation “Jin Yong” indicates that “ 金庸” is more likely to be a PER name than “ 金庸新” .

Page 6: Named Entity Recognition based on Bilingual Co-training

Introduction

• Complementarity(a) Results from English name tagger

The captain of a ferry boat who works on 〈 PER 〉 Lake Constance 〈 /PER 〉 ...

(b) Results from Chinese name tagger:

在〈 LOC 〉康斯坦茨湖〈 /LOC 〉工作的一艘渡船的船长 ...

(c) Name tagging after using bilingual co-training:

The captain of a ferry boat who works on 〈 LOC 〉 LakeConstance 〈 /LOC 〉…

 “Lake” in English can be the suffix word of either a PER or LOC name, while its Chinese translation “康斯坦茨湖” indicates that “Lake Constance” is more likely to be a LOC name.

Page 7: Named Entity Recognition based on Bilingual Co-training

2.Bilingual Co-training

• Co-training– Starting with a set of labeled data, co-training

algorithms attempt to increase the amount of annotated data using large amounts of unlabeled data. The process may continue for several iterations.

Page 8: Named Entity Recognition based on Bilingual Co-training

Bilingual Co-training

• Bilingual Co-training– regarding the parallel Chinese-English sentences

as weaker independent views for NE identity

Page 9: Named Entity Recognition based on Bilingual Co-training

Bilingual Co-training

)()(0

)()(1),(

ji

jikji wtTwsT

wtTwsTwtwsconformity

U

K

k kji wtwsconformityKn

conformity1

),(11

ratio_

Page 10: Named Entity Recognition based on Bilingual Co-training

• Given:– A set Ls of source labeled examples – A set Lt of target labeled examples– A set Us of source unlabeled examples– A set Ut of target unlabeled examples

Algorithm of Bilingual Co-training

Page 11: Named Entity Recognition based on Bilingual Co-training

• 1. Classifiers– Use Ls to train the classifiers Classifier(s)– Use Lt to train the classifiers Classifier(t)

Algorithm of Bilingual Co-training

Page 12: Named Entity Recognition based on Bilingual Co-training

Algorithm of Bilingual Co-training

Page 13: Named Entity Recognition based on Bilingual Co-training

Algorithm of Bilingual Co-training

Page 14: Named Entity Recognition based on Bilingual Co-training

Algorithm of Bilingual Co-training

Page 15: Named Entity Recognition based on Bilingual Co-training

3.Corrective NE Projection Annotation

• Projection NE Candidate

– For each word in the source language NE, we find all the possible projection word in target language through the word alignment. Next, we have all the projection words as the “seed” data. With an open-ended window for each seed, all the possible sequences located within the window are considered as possible candidates for NE projection. Their lengths range from 1 to the empirically determined length of the window. During the best candidate projection NE selection, the NE alignment model discussed as follows is applied to search the best projection NE.

Page 16: Named Entity Recognition based on Bilingual Co-training

3.Corrective NE Projection Annotation

• NE Alignment Model

A

M

m

dc

bakmm

M

m

dc

bakmm

dc

bak

eNNcaf

eNNcaf

eNNcaP

1

1

~,,exp

~,,exp

)~,|(

translation feature, the source NE and target NE’s co-occurrence feature, and length of NE pair feature

Page 17: Named Entity Recognition based on Bilingual Co-training

3.Corrective NE Projection Annotation

• Translation Feature (Brown et al.,1993)

minj

ectn

NeNcPm

j

n

iijm

1,1

)|()1(

1)|(

1 1

))|~(log())~|(log(

~,,ba

dc

dc

ba

dc

bakm

NceNPeNNcP

eNNcaf

Page 18: Named Entity Recognition based on Bilingual Co-training

3.Corrective NE Projection Annotation

• Co-occurrence Feature

)~(*,

)~,(

,*)(

)~,(

~,,

dc

dc

ba

ba

dc

ba

dc

bakm

eNcount

eNNccount

Nccount

eNNccount

eNNcaf

Page 19: Named Entity Recognition based on Bilingual Co-training

3.Corrective NE Projection Annotation

• Length Feature (Church,1993)

2

1

~~,,

~,,

ba

dc

bad

cbakm

dc

bakm

Nc

eNNceNNcaf

eNNcaf

n

j

n

i i

i

j

j

n

i i

i

Nccount

Necount

nNccount

Necount

n

Nccount

Necount

n

1

2

1

2

1

)(

)(1

)(

)(1

,)(

)(1

Page 20: Named Entity Recognition based on Bilingual Co-training

Experimental Results

NE Type Chinese(F) Enlish(F)

PER 89.59 81.22

LOC 88.48 80.43

ORG 84.54 79.69

ALL 87.17 80.37

Baseline F-Measure (%) of NE Tagging

Page 21: Named Entity Recognition based on Bilingual Co-training

Experimental Results

Co-training Model F-Measure (%) of NE Tagging

NE Type Chinese(F) Enlish(F)

PER 90.86 82.31

LOC 89.53 82.01

ORG 85.71 80.42

ALL 88.28 81.76

Page 22: Named Entity Recognition based on Bilingual Co-training

Thank you!