iis2010-mm-mp


  • 8/6/2019 iis2010-mm-mp

    1/15

    Named Entity Recognition

    in the Domain of Polish Stock Exchange Reports

    Michał Marcińczuk and Maciej Piasecki

    Wrocław University of Technology

    7 June 2010

    Project NEKST (Natively Enhanced Knowledge Sharing Technologies), co-financed by the Innovative Economy Programme, project POIG.01.01.02-14-013/09


    2/15


    Introduction

    Overview:

    the problem of Named Entity Recognition,

    recognition of PERSON and COMPANY annotations,

    two corpora: Stock Exchange Reports from the economic domain and Police Reports from the security domain,

    combination of a machine learning approach with manually created rules,

    application of Hidden Markov Models.

    The corpus was published at http://nlp.pwr.wroc.pl/gpw/download

    Michał Marcińczuk and Maciej Piasecki (PWr.) 7 June 2010 2 / 15


    3/15


    Task Definition

    We defined NEs as language expressions referring to extra-linguistic real or abstract objects of the preselected kinds.

    We limited the Named Entity Recognition task to identifying expressions consisting of proper names referring to PERSON and COMPANY entities.

    Examples of correct and incorrect expressions of the PERSON type:

    correct: R. Dolea, Marek Wiak, Luis Manuel Conceicao do Amaral;

    incorrect, person names being part of a company name: Moore Stephens Trzemżalski, Krynicki i Partnerzy Kancelaria Biegłych Rewidentów Sp. z o.o.;

    incorrect, location: pl. Jana Pawła II.


    4/15


    Corpora of Economic Domain

    Stock Exchange Reports (CSER):

    1215 documents, 282 376 tokens,

    670 PERSON and 3 238 COMPANY annotations,

    source http://gpwinfostrefa.pl,

    Characteristics:

    a set of economic reports published by companies,

    very formal style of writing,

    a lot of expressions starting with an upper-case letter that are not proper names,

    a lot of names of institutions, organizations, companies, people and locations.


    5/15


    Corpora of Security Domain

    Police Reports (CPR)

    12 documents, 29 569 tokens,

    555 PERSON and 121 COMPANY annotations,

    source: [Graliński et al., 2009].

    Characteristics:

    a set of statements produced by witnesses and suspects,

    rather informal style of writing,

    a lot of pseudonyms and common words that are proper names,

    a lot of one-word person names.


    6/15


    Corpus Development

    To annotate the corpora we developed and used the Inforex system.

    System features:

    web-based: does not require installation (only a Firefox browser with JavaScript),

    remote: corpora are stored on a server,

    shared: corpora can be simultaneously annotated by many users.


    7/15


    Baselines

    1. Heuristic: matches a sequence of words starting with an upper-case letter. For COMPANY the name must end with an activity form, i.e.: SA, LLC, Spółka, AG, S.A., Sp., B.V.

    2. Gazetteers: matches a sequence of words present in the dictionary of first names and last names (63 555 entries) or company names (6200 entries) [Piskorski, 2004].
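The heuristic baseline can be sketched roughly as follows (a minimal illustration, not the authors' implementation; the suffix list is abbreviated from the slide and the example sentence is made up):

```python
# Activity-form markers that must terminate a COMPANY candidate (from the slide).
COMPANY_SUFFIXES = {"SA", "S.A.", "LLC", "AG", "Sp.", "B.V.", "Spółka"}

def heuristic_baseline(tokens):
    """Match maximal runs of tokens starting with an upper-case letter;
    label a run COMPANY only when its last token is an activity form,
    otherwise PERSON. Returns (start, end, label) spans."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i][:1].isupper():
            j = i
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            label = "COMPANY" if tokens[j - 1] in COMPANY_SUFFIXES else "PERSON"
            spans.append((i, j, label))
            i = j
        else:
            i += 1
    return spans

print(heuristic_baseline("Pan Jan Nowak pracuje w Nokia SA".split()))
# → [(0, 3, 'PERSON'), (5, 7, 'COMPANY')]
```

The first span also illustrates the baseline's weakness: the capitalised common word "Pan" (Mr.) is swallowed into the PERSON span, which is exactly the low-precision behaviour visible in the table below.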

                    CSER                   CPR
                 PERSON   COMPANY    PERSON   COMPANY

    Heuristic
    Precision    0.89 %    0.76 %   19.35 %    0.19 %
    Recall      42.45 %    4.42 %   93.87 %    4.13 %
    F1-measure   1.75 %    1.29 %   32.09 %    0.36 %

    Gazetteers
    Precision    9.61 %   37.01 %   46.79 %   21.05 %
    Recall      41.19 %   40.54 %    9.02 %    3.31 %
    F1-measure  15.59 %   38.69 %   15.12 %    5.71 %


    8/15


    Recognition Based on HMM

    LingPipe [Alias-i, 2008] implementation of HMM:

    7 hidden states for every annotation type,

    3 additional states (BOS, EOS, middle token),

    Witten-Bell smoothing,

    first-best decoder based on the Viterbi algorithm.

    Pan Jan Nowak został nominowany na stanowisko prezesa.
    (Mr. Jan Nowak was nominated for the chairman position.)

    (BOS) (E-O-PER) (B-PER) (E-PER) (B-O-PER) (W-O) (W-O) (W-O) (W-O) (W-O) (EOS)
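The first-best decoder mentioned above is the standard Viterbi algorithm. A minimal sketch over a toy two-state model (the state names reuse the tag style above, but the probabilities are illustrative, not LingPipe's trained values):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """First-best HMM decoding: per position and state, keep the
    log-probability of the best path ending there plus a backpointer."""
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            row[s] = (V[t - 1][prev][0]
                      + math.log(trans_p[prev][s] * emit_p[s][obs[t]]), prev)
        V.append(row)
    # Backtrace from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return path[::-1]

# Toy model: one PERSON state and one "outside" state.
states = ("B-PER", "W-O")
start_p = {"B-PER": 0.5, "W-O": 0.5}
trans_p = {"B-PER": {"B-PER": 0.4, "W-O": 0.6},
           "W-O":   {"B-PER": 0.3, "W-O": 0.7}}
emit_p  = {"B-PER": {"Jan": 0.8, "pracuje": 0.2},
           "W-O":   {"Jan": 0.1, "pracuje": 0.9}}

print(viterbi(["Jan", "pracuje"], states, start_p, trans_p, emit_p))
# → ['B-PER', 'W-O']
```

LingPipe's chunker works the same way, only over the full state inventory (7 states per annotation type plus BOS, EOS and the middle-token state) with Witten-Bell-smoothed probabilities.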


    9/15


    Single Domain Evaluation (PERSON)

    We performed a 10-fold cross-validation on the Stock Exchange Corpus for PERSON annotations.

                Heuristic   Gazetteers       HMM

    Precision     0.89 %      9.61 %     64.74 %
    Recall       42.45 %     41.19 %     93.73 %
    F1-measure    1.75 %     15.59 %     76.59 %
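Precision, recall and F1-measure here are the standard span-level definitions; for reference, computed from true-positive, false-positive and false-negative counts (the counts below are arbitrary examples, not taken from the tables):

```python
def prf1(tp, fp, fn):
    """Standard precision/recall/F1 from span-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.9 0.75 0.82
```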


    10/15


    Error Analysis

    We have identified 10 types of errors.

    No.  Error type                                           Matches
                                                         full  partial  total
     1   Name of institution, company, etc.                 38       91    129
     2   Name of location (street, place, etc.)             30       16     46
     3   Other proper names                                  2       10     12
     4   Phrases in English                                  -       21     21
     5   Incorrect annotation boundary                       -       35     35
     6   Common word starting from upper case character      -        6      6
     7   Common word starting from lower case character      -       26     26
     8   Single character                                    -        6      6
     9   Common word with a spelling error                   -        3      3
    10   Other                                               -       46     46

    A) errors 1, 2, 3 (incorrect annotation type): recognition of COMPANY and LOCATION,
    B) errors 4, 7, 8, 9 (lower-case and non-alphabetic expressions): rule filtering,
    C) error 5 (incorrect annotation boundary): annotation merging and trimming.


    11/15


    Single Domain Evaluation (PERSON & COMPANY)

    Referring to the A group of errors, we have re-annotated the CSER with COMPANY annotations and repeated the 10-fold CV.

                  PERSON     COMPANY
                               REV      COMB

    Precision    64.74 %*   78.63 %   76.56 %
    Recall       93.73 %*   94.62 %   83.14 %
    F1-measure   76.59 %*   85.89 %   79.71 %

    * results from the previous 10-fold CV


    12/15


    Post-filtering

    Referring to the B and C groups of errors, we have applied two types of post-processing: annotation filtering and merging.
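Such filtering and trimming rules might look roughly like the sketch below (illustrative rules targeting the B and C error groups, not the authors' exact rule set):

```python
def trim(span_tokens):
    """Trim non-name tokens (lower-case words, single characters,
    punctuation) from both ends of a recognised span (error group C)."""
    def keep(tok):
        return len(tok) > 1 and tok[:1].isupper()
    while span_tokens and not keep(span_tokens[0]):
        span_tokens = span_tokens[1:]
    while span_tokens and not keep(span_tokens[-1]):
        span_tokens = span_tokens[:-1]
    return span_tokens

def filter_span(span_tokens):
    """Discard a span entirely if nothing name-like survives trimming
    (error group B: lower-case and non-alphabetic expressions)."""
    return trim(span_tokens) or None

print(trim(["pan", "Jan", "Nowak", ","]))  # → ['Jan', 'Nowak']
print(filter_span(["w", "-", "i"]))        # → None
```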

                      HMM  +filtering  +trimming      +both

    PERSON (REV)
    Precision    64.74 %*    76.27 %    64.85 %    75.82 %
    Recall       93.73 %*    91.64 %    93.88 %    91.77 %
    F1-measure   76.59 %*    83.25 %    76.71 %    83.03 %

    PERSON (COMB)
    Precision    78.63 %*    87.16 %    78.76 %    86.69 %
    Recall       94.62 %*    91.33 %    94.77 %    91.48 %
    F1-measure   85.89 %*    89.20 %    86.02 %    89.02 %

    * results from the previous 10-fold CV


    13/15


    Cross Domain Evaluation

    The system was trained on the Corpus of Stock Exchange Reports and tested on the Corpus of Police Reports.

                      HMM  +filtering  +trimming      +both

    PERSON (REV)
    Precision     7.73 %     62.91 %    32.16 %    53.71 %
    Recall       48.47 %     48.29 %    56.22 %    56.04 %
    F1-measure   35.28 %     54.64 %    40.92 %    54.85 %

    PERSON (COMB)
    Precision    29.81 %     69.75 %    37.13 %    58.33 %
    Recall       39.64 %     39.46 %    49.37 %    49.19 %
    F1-measure   34.03 %     50.40 %    42.38 %    53.37 %

    COMPANY
    Precision    12.30 %          -          -          -
    Recall       56.20 %          -          -          -
    F1-measure   20.18 %          -          -          -


    14/15


    Conclusion & Plans

    Conclusion

    results of the single-domain evaluation are promising,

    simple rule-based post-processing of HMM can improve the final results,

    low domain portability: better utilization of gazetteers and rules is needed.

    Plans

    to extend the annotation schema with LOCATION & ORGANIZATION,

    to collect a new corpus for cross-domain evaluation,

    to develop new rules for post-processing (for example to fix a problemwith sentence segmentation),

    to incorporate other sources of knowledge (rules and gazetteers for majority voting, plWordNet for generalization),

    to try new learning models: HMM including morphology, other ML methods with context features (preceding verbs, prepositions, etc.).


    15/15

    References

    Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)

    Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named Entity Recognition in Machine Anonymization. In: Kłopotek, M. A., Przepiórkowski, A., Wierzchoń, S. T., Trojanowski, K. (eds.) Recent Advances in Intelligent Information Systems, pp. 247-260. Academic Publishing House Exit (2009)

    Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313-316. ELRA, Lisbon, Portugal (2004)
