Study on Named Entity Recognition for Polish Based on Hidden Markov Models -- poster


Upload: michal-marcinczuk

Post on 24-Mar-2015





TRANSCRIPT

Page 1: Study on Named Entity Recognition for Polish Based on Hidden Markov Models -- poster

Study on Named Entity Recognition for Polish Based on Hidden Markov Models*

Michał Marcińczuk [email protected]

Maciej [email protected]

Our Task: To evaluate the accuracy of Named Entity Recognition for Polish based on Hidden Markov Models.

We evaluated the recognition of PERSON-type named entities, i.e. linearly continuous expressions referring to a person, composed of a first name, second name, last name, maiden name and/or pseudonym.

Corpora: Stock Exchange Reports – 1215 documents from GPWInfoStrefa.pl, consisting of 10 066 sentences and 282 418 tokens, with 654 PERSON annotations. The corpus is available at http://nlp.pwr.wroc.pl/gpw/download

Police Reports (Graliński et al., 2009) – 11 statements produced by witnesses and suspects, consisting of 1 583 sentences and 29 569 tokens, with 555 PERSON annotations. The corpus was used for cross-domain evaluation.

A web-based application was used to annotate and browse the corpora.

http://nlp.pwr.wroc.pl/gpw

Results:

Baseline: Heuristic – matches a sequence of words such that each word starts with an upper case letter.

Gazetteers – matches a sequence of words present in the dictionary of first names and last names (63 555 entries) (Piskorski, 2004).
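A gazetteer baseline of this kind can be sketched as a lookup that marks maximal runs of dictionary tokens. The dictionary contents and the whitespace tokenization below are illustrative stand-ins, not the 63 555-entry resource used on the poster:

```python
# Illustrative gazetteer matcher: yields maximal runs of consecutive tokens
# that all occur in a first/last-name dictionary. NAMES is a toy stand-in.
NAMES = {"Jan", "Nowak", "Anna", "Kowalska"}

def gazetteer_match(tokens):
    """Yield (start, end) token spans whose tokens are all in the dictionary."""
    start = None
    for i, tok in enumerate(tokens):
        if tok in NAMES:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(tokens))

# "Pan Jan Nowak został prezesem" -> the span covering "Jan Nowak"
spans = list(gazetteer_match("Pan Jan Nowak został prezesem".split()))
print(spans)  # → [(1, 3)]
```

Such a matcher cannot distinguish person names from homonymous common words, which is consistent with the low precision reported below.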

HMM: LingPipe implementation of HMM:
➢ 7 hidden states for every annotation type,
➢ 3 additional states (BOS, EOS, middle token),
➢ Witten-Bell smoothing,
➢ first-best decoder based on the Viterbi algorithm,
➢ rescoring based on the language model.

“Pan Jan Nowak został nominowany na stanowisko prezesa”

(Mr. Jan Nowak was nominated for the chairman position.)

(BOS) → (E-O-PER) → (B-PER) → (E-PER) → (B-O-PER) → (W-O) → (W-O) → (W-O) → (W-O) → (W-O) → (EOS)
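First-best decoding with the Viterbi algorithm, as used above, can be sketched on a toy model. The reduced tag set {O, PER} and all probabilities below are made up for illustration; they are not the trained LingPipe model or its full state set:

```python
import math

# Toy first-best Viterbi decoder over a reduced tag set; all probabilities
# are illustrative, not estimated from the corpora.
states = ["O", "PER"]
start_p = {"O": 0.8, "PER": 0.2}
trans_p = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.4, "PER": 0.6}}
emit_p = {
    "O":   {"Pan": 0.5, "Jan": 0.1, "Nowak": 0.1, "został": 0.3},
    "PER": {"Pan": 0.1, "Jan": 0.5, "Nowak": 0.39, "został": 0.01},
}

def viterbi(tokens):
    """Return the most probable state sequence for the token list."""
    # V[t][s] = (log-prob of the best path ending in state s at t, backpointer)
    V = [{s: (math.log(start_p[s] * emit_p[s][tokens[0]]), None) for s in states}]
    for t in range(1, len(tokens)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev][0]
                       + math.log(trans_p[prev][s] * emit_p[s][tokens[t]]), prev)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["Pan", "Jan", "Nowak", "został"]))  # → ['O', 'PER', 'PER', 'O']
```

The decoder correctly labels "Jan Nowak" as the person span even though each token also has a nonzero O-emission probability, which is the point of decoding over the whole sequence instead of token by token.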

Errors: We have analysed the errors produced by the HMM and divided them into the following groups:
➢ Incorrect proper name category – 187 false positives,
➢ Lowercase and non-alphabetic expressions – 99 false positives,
➢ Incorrect annotation boundaries – 35 partially matched annotations,
➢ Missing annotations – 18 missing annotations,
➢ Common words starting with an uppercase character – 6 false positives.

Post-processing:

To improve the precision of HMM we applied a simple filter that matches only sequences of words starting with an upper case character and optionally separated by a hyphen. The rule was written as the following regular expression:

WORD = "([A-ZĄŻŚŹĘĆŃÓŁ])([a-zążśźęćńół])*";
PATTERN = "/^WORD( WORD)+( - WORD)?( (WORD))?\$"
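The filter can be rendered in Python roughly as follows. The WORD class follows the poster's definition; the `min_words` parameter distinguishing the f1+ and f2+ variants, and the exact translation of the rule-language pattern, are our reading rather than the original implementation:

```python
import re

# Python rendering of the post-processing filter (an approximation of the
# poster's rule, not the original rule-language implementation).
# A "word" is an uppercase Polish letter followed by lowercase Polish letters.
WORD = r"[A-ZĄŻŚŹĘĆŃÓŁ][a-zążśźęćńół]*"

def make_filter(min_words):
    """Build a filter accepting min_words or more capitalized words,
    optionally followed by a hyphenated part (e.g. a maiden name)."""
    pattern = re.compile(rf"^{WORD}( {WORD}){{{min_words - 1},}}( - {WORD})?$")
    return lambda text: pattern.match(text) is not None

f1 = make_filter(1)   # f1+: sequences of one or more tokens
f2 = make_filter(2)   # f2+: sequences of two or more tokens

print(f1("Nowak"))           # → True
print(f2("Nowak"))           # → False
print(f2("Jan Nowak"))       # → True
print(f1("spółka akcyjna"))  # → False (lowercase words are rejected)
```

Rejecting lowercase and non-alphabetic spans directly targets the second-largest error group above, which explains the precision gains in the tables that follow.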

10-fold Cross-Validation on the Stock Exchange Reports / Cross-domain evaluation

             Heuristic  Gazetteers  HMM      HMM + f1+  HMM + f2+ |  HMM      HMM + f1+  HMM + f2+
Precision      0.85 %     9.47 %   61.82 %   74.49 %    89.00 %   | 28.04 %   63.61 %    83.49 %
Recall        41.74 %    41.44 %   92.35 %   90.21 %    89.60 %   | 47.74 %   47.67 %    32.79 %
F1-measure     1.67 %    15.42 %   74.05 %   81.66 %    89.33 %   | 35.33 %   54.43 %    47.09 %

HMM + f denotes the HMM recognizer combined with the filter post-processing, where f1+ accepts sequences of one or more tokens and f2+ sequences of two or more tokens.
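The reported figures are standard precision, recall and F1 over annotation spans. For reference, a minimal computation; the counts in the example are illustrative, not those of the experiments:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from span-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g. 80 correctly recognized spans, 20 spurious, 10 missed:
p, r, f = prf(80, 20, 10)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")  # → P=80.00% R=88.89% F1=84.21%
```

This makes the trade-off in the table concrete: the filters discard spans, so false positives (precision) drop faster than true positives (recall), raising F1 until, as with f2+ cross-domain, the recall loss dominates.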

Model Granularity:

             PERSON               FIRST NAME           LAST NAME
             HMM      HMM + f1+   HMM      HMM + f1+   HMM      HMM + f1+
Precision    61.82 %   74.49 %    76.05 %   89.13 %    73.73 %   83.06 %
Recall       92.35 %   90.21 %    98.54 %   97.22 %    93.70 %   90.48 %
F1-measure   74.05 %   81.66 %    85.84 %   93.00 %    82.53 %   86.62 %

*Work financed by the European Union within the Innovative Economy Programme, project POIG.01.01.02-14-013/09.

f2+ was biased towards the stock exchange corpus, so it was not considered in the evaluation of the model granularity.

User demo
Corpus: poligon
User: test
Password: test

Positive patterns:
<first name>
<last name> <first name>
<first name> <last name>
<initial> <last name>
<first name> <second name> <last name>
<last name>-<maiden name> <first name>
<first name> <last name> (<last name>)
<first name> i <first name> <last name>

Negative patterns:
<last name> & <last name> Company – company name
<first name> <last name> University – institution name
<first name> <last name> Square – location name

Sample:

References:
[1] Alias-i, LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)
[2] Graliński, F., Jassem, K., Marcińczuk, M.: An Environment for Named Entity Recognition and Translation. In: Màrquez, L., Somers, H. (eds.) Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pp. 88–95. Barcelona, Spain (2009)
[3] Piskorski, J.: Extraction of Polish named entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, pp. 313–316. Lisbon, Portugal (2004)
