Study on Named Entity Recognition for Polish Based on Hidden Markov Models -- poster
Study on Named Entity Recognition for Polish Based on Hidden Markov Models*
Michał Marcińczuk, [email protected]
Maciej [email protected]
Our Task: To evaluate the accuracy of Named Entity Recognition for Polish based on Hidden Markov Models.
We evaluated the recognition of PERSON-type named entities, that is, linearly continuous expressions referring to a person, composed of a first name, second name, last name, maiden name and pseudonym.
Corpora: Stock Exchange Reports – 1215 documents from GPWInfoStrefa.pl, consisting of 10 066 sentences and 282 418 tokens, with 654 PERSON annotations. The corpus is available at http://nlp.pwr.wroc.pl/gpw/download
Police Reports (Graliński et al., 2009) – 11 statements produced by witnesses and suspects, consisting of 1 583 sentences and 29 569 tokens, with 555 PERSON annotations. The corpus was used for cross-domain evaluation.
Web-based application used to annotate and browse the corpora.
http://nlp.pwr.wroc.pl/gpw
Results:
Baseline: Heuristic – matches a sequence of words such that each word starts with an upper case letter.
Gazetteers – matches a sequence of words present in the dictionary of first names and last names (63 555 entries) (Piskorski, 2004).
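Neither baseline is spelled out in code on the poster; the following is a minimal sketch of how both could be implemented. The four-entry gazetteer is a toy stand-in for the real 63 555-entry dictionary of first and last names.

```python
# Toy stand-in for the 63 555-entry dictionary of first and last names.
GAZETTEER = {"Jan", "Nowak", "Anna", "Kowalska"}

def heuristic_baseline(tokens):
    """Match maximal runs of tokens starting with an uppercase letter.

    Returns a list of (start, end) token spans, end exclusive."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i][:1].isupper():
            j = i
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans

def gazetteer_baseline(tokens):
    """Match maximal runs of tokens found in the name dictionary."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] in GAZETTEER:
            j = i
            while j < len(tokens) and tokens[j] in GAZETTEER:
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans
```

Note that on "Pan Jan Nowak został ...", the heuristic also swallows the honorific "Pan" (Mr.), which illustrates why its precision is so low in the results below.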
HMM: LingPipe implementation of HMM:
➢ 7 hidden states for every annotation type,
➢ 3 additional states (BOS, EOS, middle token),
➢ Witten-Bell smoothing,
➢ first-best decoder based on Viterbi's algorithm,
➢ rescoring based on the language model.
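The poster names Witten-Bell smoothing among the HMM ingredients. As an illustration only (not LingPipe's actual implementation), here is a minimal Witten-Bell-smoothed unigram estimate, under the simplifying assumption that the escape mass is spread uniformly over unseen vocabulary words:

```python
from collections import Counter

def witten_bell(observed, vocab):
    """Witten-Bell smoothed unigram model: a seen word w keeps
    probability c(w)/(N+T), where N is the token count and T the
    number of distinct seen types; the escape mass T/(N+T) is
    spread uniformly over unseen vocabulary words (a simplistic
    backoff choice made for this sketch)."""
    counts = Counter(observed)
    N, T = sum(counts.values()), len(counts)
    unseen = [w for w in vocab if w not in counts]
    def prob(w):
        if w in counts:
            return counts[w] / (N + T)
        return T / (N + T) / len(unseen) if unseen else 0.0
    return prob
```

By construction the probabilities over the vocabulary sum to one, and unseen words receive a non-zero probability, which is what allows the tagger to emit out-of-vocabulary names.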
“Pan Jan Nowak został nominowany na stanowisko prezesa”
(Mr. Jan Nowak was nominated for the chairman position.)
(BOS) → (E-O-PER) → (B-PER) → (E-PER) → (B-O-PER) → (W-O)→ (W-O) → (W-O) → (W-O) → (W-O) → (EOS)
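The first-best decoder mentioned above can be sketched as a plain Viterbi search. This is a generic textbook version, not the LingPipe code; the transition table and emission scorer are supplied by the caller as log-probabilities:

```python
import math

def viterbi(tokens, states, log_trans, log_emit, start, end):
    """First-best Viterbi decoding: best[i][s] is the log-probability
    of the best tag sequence for tokens[:i+1] that ends in state s."""
    best = [{} for _ in tokens]
    back = [{} for _ in tokens]
    for s in states:
        best[0][s] = log_trans.get((start, s), -math.inf) + log_emit(s, tokens[0])
    for i in range(1, len(tokens)):
        for s in states:
            score, prev = max((best[i - 1][p] + log_trans.get((p, s), -math.inf), p)
                              for p in states)
            best[i][s] = score + log_emit(s, tokens[i])
            back[i][s] = prev
    # Close the sequence with the transition into the EOS state.
    last = max(states, key=lambda s: best[-1][s] + log_trans.get((s, end), -math.inf))
    path = [last]
    for i in range(len(tokens) - 1, 1 - 1, -1):
        if i > 0:
            path.append(back[i][path[-1]])
    return list(reversed(path))
```

With toy transition and emission scores that make PER states prefer uppercase tokens, decoding "Jan Nowak został" recovers a PER PER O tagging, mirroring the state sequence shown above (the real model uses the richer 7-state-per-type inventory).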
Errors: We analysed the errors produced by the HMM and divided them into the following groups:
➢ Incorrect proper name category – 187 False Positives,
➢ Lowercase and non-alphabetic expressions – 99 False Positives,
➢ Incorrect annotation boundaries – 35 partially matched annotations,
➢ Missing annotations – 18 missing annotations,
➢ Common words starting with an uppercase character – 6 False Positives.
Post-processing:
To improve the precision of HMM we applied a simple filter that matches only sequences of words starting with an upper case character and optionally separated by a hyphen. The rule was written as the following regular expression:
WORD = "([A-ZĄŻŚŹĘĆŃÓŁ])([a-zążśźęćńół])*"
PATTERN = "/^WORD( WORD)+( - WORD)?( (WORD))?\$"
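Expanding the WORD macro gives a directly usable Python regex. Two points are my reading of the poster, not stated there: the final optional group is taken to be a literal parenthesized word (e.g. a maiden name in brackets), which requires escaping the parentheses, and the pattern as printed requires at least two words, i.e. the f2+ variant; the f1+ variant presumably relaxes `( WORD)+` to `( WORD)*`.

```python
import re

# One capitalized Polish word: uppercase letter followed by lowercase letters.
WORD = "[A-ZĄŻŚŹĘĆŃÓŁ][a-zążśźęćńół]*"

# The poster's filter with WORD expanded: two or more capitalized words,
# optionally a " - Word" hyphenated part, optionally a "(Word)" in brackets.
PATTERN = re.compile(rf"^{WORD}( {WORD})+( - {WORD})?( \({WORD}\))?$")

def keep_annotation(text):
    """Return True if a candidate PERSON annotation passes the filter."""
    return PATTERN.fullmatch(text) is not None
```

For example, "Jan Nowak", "Jan Kowalska - Nowak" and "Anna Nowak (Kowalska)" pass, while "jan nowak" and the single word "Jan" are rejected by this two-or-more-words variant.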
10-fold Cross-Validation on the Stock Exchange Reports:

|            | Heuristic | Gazetteers | HMM     | HMM + f1+ | HMM + f2+ |
|------------|-----------|------------|---------|-----------|-----------|
| Precision  | 0.85 %    | 9.47 %     | 61.82 % | 74.49 %   | 89.00 %   |
| Recall     | 41.74 %   | 41.44 %    | 92.35 % | 90.21 %   | 89.60 %   |
| F1-measure | 1.67 %    | 15.42 %    | 74.05 % | 81.66 %   | 89.33 %   |

Cross-domain evaluation (Police Reports):

|            | HMM     | HMM + f1+ | HMM + f2+ |
|------------|---------|-----------|-----------|
| Precision  | 28.04 % | 63.61 %   | 83.49 %   |
| Recall     | 47.74 % | 47.67 %   | 32.79 %   |
| F1-measure | 35.33 % | 54.43 %   | 47.09 %   |
HMM + f denotes the HMM recognizer combined with filter post-processing, where f1+ matches sequences of one or more tokens and f2+ matches sequences of two or more tokens.
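The precision, recall and F1 figures in the tables follow the standard definitions; for reference, a minimal helper (the counts in the test are hypothetical, not taken from the poster):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative annotation counts. F1 is the harmonic mean
    of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a sanity check on the tables, the harmonic mean of P = 61.82 % and R = 92.35 % is about 74.06 %, matching the reported HMM F1 of 74.05 % up to rounding.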
Model Granularity:

|            | PERSON: HMM | PERSON: HMM + f1+ | FIRST NAME: HMM | FIRST NAME: HMM + f1+ | LAST NAME: HMM | LAST NAME: HMM + f1+ |
|------------|-------------|-------------------|-----------------|-----------------------|----------------|----------------------|
| Precision  | 61.82 %     | 74.49 %           | 76.05 %         | 89.13 %               | 73.73 %        | 83.06 %              |
| Recall     | 92.35 %     | 90.21 %           | 98.54 %         | 97.22 %               | 93.70 %        | 90.48 %              |
| F1-measure | 74.05 %     | 81.66 %           | 85.84 %         | 93.00 %               | 82.53 %        | 86.62 %              |
*Work financed by the European Union within the Innovative Economy Programme, project POIG.01.01.02-14-013/09
f2+ was biased toward the stock exchange corpus, so it was not considered in the evaluation of model granularity.
User demo
Corpus: poligon
User: test
Password: test
Positive patterns:
➢ <first name>
➢ <first name> <second name> <last name>
➢ <last name> <first name>
➢ <last name>-<maiden name>
➢ <first name> <last name>
➢ <first name> <last name> (<last name>)
➢ <initial> <last name>
➢ <first name> i <first name> <last name> ("i" is Polish for "and")

Negative patterns:
➢ <last name> & <last name> Company – company name
➢ <first name> <last name> University – institution name
➢ <first name> <last name> Square – location name
Sample:
References:
[1] Alias-i: LingPipe 3.9.0. http://alias-i.com/lingpipe (October 1, 2008)
[2] Graliński, F., Jassem, K., Marcińczuk, M.: An Environment for Named Entity Recognition and Translation. In: Màrquez, L., Somers, H. (eds.) Proceedings of the 13th Annual Conference of the European Association for Machine Translation, pp. 88-95. Barcelona, Spain (2009)
[3] Piskorski, J.: Extraction of Polish Named Entities. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pp. 313-316 (2004)