iis2010-mm-mp


  • 8/6/2019 iis2010-mm-mp

    1/15

    Named Entity Recognition

    in the Domain of Polish Stock Exchange Reports

    Michał Marcińczuk and Maciej Piasecki

    Wrocław University of Technology

    7 June 2010

    Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09


    2/15


    Introduction

    Overview:

    the problem of Named Entity Recognition,

    recognition of PERSON and COMPANY annotations,

    two corpora: Stock Exchange Reports from the economic domain and Police Reports from the security domain,

    combination of a machine learning approach with manually created rules,

    application of Hidden Markov Models.

    The corpus was published at http://nlp.pwr.wroc.pl/gpw/download

    Michał Marcińczuk and Maciej Piasecki (PWr.) 7 June 2010 2 / 15


    3/15


    Task Definition

    We defined NEs as language expressions referring to extra-linguistic real or abstract objects of the preselected kinds.

    We limited the Named Entity Recognition task to identifying expressions consisting of proper names referring to PERSON and COMPANY entities.

    Examples of correct and incorrect expressions of the PERSON type:

    correct: R. Dolea, Marek Wiak, Luis Manuel Conceicao do Amaral;

    incorrect, person names being part of a company name: Moore Stephens Trzemżalski, Krynicki i Partnerzy Kancelaria Biegłych Rewidentów Sp. z o.o.;

    incorrect, location: pl. Jana Pawła II.


    4/15


    Corpora of Economic Domain

    Stock Exchange Reports (CSER):

    1215 documents, 282 376 tokens,

    670 PERSON and 3 238 COMPANY annotations,

    source http://gpwinfostrefa.pl,

    Characteristics:

    a set of economic reports published by companies,

    very formal style of writing,

    a lot of expressions starting with an upper-case letter that are not proper names,

    a lot of names of institutions, organizations, companies, people and locations.


    5/15


    Corpora of Security Domain

    Police Reports (CPR)

    12 documents, 29 569 tokens,

    555 PERSON and 121 COMPANY annotations,

    source: [Graliński et al., 2009].

    Characteristics:

    a set of statements produced by witnesses and suspects,

    rather informal style of writing,

    a lot of pseudonyms and common words that are proper names,

    a lot of one-word person names.


    6/15


    Corpus Development

    To annotate the corpora we developed and used the Inforex system.

    System features:

    web-based: does not require installation (only a Firefox browser with JavaScript),

    remote: corpora are stored on a server,

    shared: corpora can be simultaneously annotated by many users.


    7/15


    Baselines

    1. Heuristic: matches a sequence of words starting with an upper-case letter. For COMPANY the name must end with an activity form, i.e.: SA, LLC, Spółka, AG, S.A., Sp., B.V.

    2. Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) or company names (6200 entries) [Piskorski, 2004].
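The heuristic baseline can be sketched roughly as follows (a minimal illustration, not the authors' implementation; the suffix list is abbreviated from the slide and the example sentence is made up):

```python
# Activity-form markers that must terminate a COMPANY candidate (from the slide).
COMPANY_SUFFIXES = {"SA", "S.A.", "LLC", "AG", "Sp.", "B.V.", "Spółka"}

def heuristic_baseline(tokens):
    """Match maximal runs of tokens starting with an upper-case letter;
    label a run COMPANY only when its last token is an activity form,
    otherwise PERSON. Returns (start, end, label) spans."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i][:1].isupper():
            j = i
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            label = "COMPANY" if tokens[j - 1] in COMPANY_SUFFIXES else "PERSON"
            spans.append((i, j, label))
            i = j
        else:
            i += 1
    return spans

print(heuristic_baseline("Pan Jan Nowak pracuje w Nokia SA".split()))
# → [(0, 3, 'PERSON'), (5, 7, 'COMPANY')]
```

The first span also illustrates the baseline's weakness: the capitalised common word "Pan" (Mr.) is swallowed into the PERSON span, which is exactly the low-precision behaviour visible in the table below.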

                    CSER                   CPR
                 PERSON   COMPANY    PERSON   COMPANY

    Heuristic
    Precision    0.89 %    0.76 %   19.35 %    0.19 %
    Recall      42.45 %    4.42 %   93.87 %    4.13 %
    F1-measure   1.75 %    1.29 %   32.09 %    0.36 %

    Gazetteers
    Precision    9.61 %   37.01 %   46.79 %   21.05 %
    Recall      41.19 %   40.54 %    9.02 %    3.31 %
    F1-measure  15.59 %   38.69 %   15.12 %    5.71 %


    8/15


    Recognition Based on HMM

    LingPipe [Alias-i, 2008] implementation of HMM:

    7 hidden states for every annotation type,

    3 additional states (BOS, EOS, middle token),

    Witten-Bell smoothing,

    first-best decoder based on the Viterbi algorithm.

    Pan Jan Nowak został nominowany na stanowisko prezesa.
    (Mr. Jan Nowak was nominated for the chairman position.)

    (BOS) (E-O-PER) (B-PER) (E-PER) (B-O-PER) (W-O) (W-O) (W-O) (W-O) (W-O) (EOS)
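The first-best decoder mentioned above is the standard Viterbi algorithm. A minimal sketch over a toy two-state model (the state names reuse the tag style above, but the probabilities are illustrative, not LingPipe's trained values):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """First-best HMM decoding: per position and state, keep the
    log-probability of the best path ending there plus a backpointer."""
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            row[s] = (V[t - 1][prev][0]
                      + math.log(trans_p[prev][s] * emit_p[s][obs[t]]), prev)
        V.append(row)
    # Backtrace from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return path[::-1]

# Toy model: one PERSON state and one "outside" state.
states = ("B-PER", "W-O")
start_p = {"B-PER": 0.5, "W-O": 0.5}
trans_p = {"B-PER": {"B-PER": 0.4, "W-O": 0.6},
           "W-O":   {"B-PER": 0.3, "W-O": 0.7}}
emit_p  = {"B-PER": {"Jan": 0.8, "pracuje": 0.2},
           "W-O":   {"Jan": 0.1, "pracuje": 0.9}}

print(viterbi(["Jan", "pracuje"], states, start_p, trans_p, emit_p))
# → ['B-PER', 'W-O']
```

LingPipe's chunker works the same way, only over the full state inventory (7 states per annotation type plus BOS, EOS and the middle-token state) with Witten-Bell-smoothed probabilities.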


    9/15


    Single Domain Evaluation (PERSON)

    We performed a 10-fold cross-validation on the Stock Exchange Corpus for PERSON annotations.

                Heuristic   Gazetteers       HMM

    Precision     0.89 %      9.61 %     64.74 %
    Recall       42.45 %     41.19 %     93.73 %
    F1-measure    1.75 %     15.59 %     76.59 %
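Precision, recall and F1-measure here are the standard span-level definitions; for reference, computed from true-positive, false-positive and false-negative counts (the counts below are arbitrary examples, not taken from the tables):

```python
def prf1(tp, fp, fn):
    """Standard precision/recall/F1 from span-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.9 0.75 0.82
```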


    10/15


    Error Analysis

    We have identified 10 types of errors.

    No.  Error type                                           Matches
                                                         full  partial  total
     1   Name of institution, company, etc.                 38       91    129
     2   Name of location (street, place, etc.)             30       16     46
     3   Other proper names                                  2       10     12
     4   Phrases in English                                  -       21     21
     5   Incorrect annotation boundary                       -       35     35
     6   Common word starting from upper case character      -        6      6
     7   Common word starting from lower case character      -       26     26
     8   Single character                                    -        6      6
     9   Common word with a spelling error                   -        3      3
    10   Other                                               -       46     46

    A) errors 1, 2, 3 (incorrect annotation type): recognition of COMPANY and LOCATION,
    B) errors 4, 7, 8, 9 (lower-case and non-alphabetic expressions): rule filtering,
    C) error 5 (incorrect annotation boundary): annotation merging and trimming.


    11/15


    Single Domain Evaluation (PERSON & COMPANY)

    Referring to the A group of errors, we have re-annotated the CSER with COMPANY annotations and repeated the 10-fold CV.

                  PERSON     COMPANY
                               REV      COMB

    Precision    64.74 %*   78.63 %   76.56 %
    Recall       93.73 %*   94.62 %   83.14 %
    F1-measure   76.59 %*   85.89 %   79.71 %

    * results from the previous 10-fold CV


    12/15


    Post-filtering

    Referring to the B and C groups of errors, we have applied two types of post-processing: annotation filtering and merging.
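Such filtering and trimming rules might look roughly like the sketch below (illustrative rules targeting the B and C error groups, not the authors' exact rule set):

```python
def trim(span_tokens):
    """Trim non-name tokens (lower-case words, single characters,
    punctuation) from both ends of a recognised span (error group C)."""
    def keep(tok):
        return len(tok) > 1 and tok[:1].isupper()
    while span_tokens and not keep(span_tokens[0]):
        span_tokens = span_tokens[1:]
    while span_tokens and not keep(span_tokens[-1]):
        span_tokens = span_tokens[:-1]
    return span_tokens

def filter_span(span_tokens):
    """Discard a span entirely if nothing name-like survives trimming
    (error group B: lower-case and non-alphabetic expressions)."""
    return trim(span_tokens) or None

print(trim(["pan", "Jan", "Nowak", ","]))  # → ['Jan', 'Nowak']
print(filter_span(["w", "-", "i"]))        # → None
```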

                      HMM  +filtering  +trimming      +both

    PERSON (REV)
    Precision    64.74 %*    76.27 %    64.85 %    75.82 %
    Recall       93.73 %*    91.64 %    93.88 %    91.77 %
    F1-measure   76.59 %*    83.25 %    76.71 %    83.03 %

    PERSON (COMB)
    Precision    78.63 %*    87.16 %    78.76 %    86.69 %
    Recall       94.62 %*    91.33 %    94.77 %    91.48 %
    F1-measure   85.89 %*    89.20 %    86.02 %    89.02 %

    * results from the previous 10-fold CV


    13/15


    Cross Domain Evaluation

    The system was trained on the Corpus of Stock Exchange Reports and tested on the Corpus of Police Reports.

                      HMM  +filtering  +trimming      +both

    PERSON (REV)
    Precision     7.73 %     62.91 %    32.16 %    53.71 %
    Recall       48.47 %     48.29 %    56.22 %    56.04 %
    F1-measure   35.28 %     54.64 %    40.92 %    54.85 %

    PERSON (COMB)
    Precision    29.81 %     69.75 %    37.13 %    58.33 %
    Recall       39.64 %     39.46 %    49.37 %    49.19 %
    F1-measure   34.03 %     50.40 %    42.38 %    53.37 %

    COMPANY
    Precision    12.30 %          -          -          -
    Recall       56.20 %          -          -          -
    F1-measure   20.18 %          -          -          -


    14/15


    Conclusion & Plans

    Conclusion

    results of the single-domain evaluation are promising,

    simple rule-based post-processing of HMM can improve the final results,

    low domain portability: better utilization of gazetteers and rules is needed.

    Plans

    to extend the annotation schema with LOCATION & ORGANIZATION,

    to collect a new corpus for cross-domain evaluation,

    to develop new rules for post-processing (for example to fix a problemwith sentence segmentation),

    to incorporate other sources of knowledge (rules and gazetteers for majority voting, plWordNet for generalization),

    to try new learning models: HMM including morphology, other ML methods with context features (preceding verbs, prepositions, etc.).


    15/15

    References

    Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)

    Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization. In: Kłopotek, M. A., Przepiórkowski, A., Wierzchoń, S. T., Trojanowski, K. (eds.) Recent Advances in Intelligent Information Systems, pp. 247-260. Academic Publishing House Exit (2009)

    Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313-316. ELRA, Lisbon, Portugal (2004)
