
Native Language Identification

A challenging task description

Barbora Hladká, Martin Holub
{hladka | holub}@ufal.mff.cuni.cz

Charles University, Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics

Prague, 5.12.2016, NPFL114 Deep Learning


The 11 native languages (L1): ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR

Sample essay excerpts, each with the author's proficiency level and essay prompt (spelling errors are the learners' own):

• "to be specialized in one specific subject is also a good thing it will make you a proffional in a specific subject , but when you it comes to socity they will" (high, P2)
• "Changing your habbits is not an easy thing , but one is urged to do it for a number of reasons. A successful teacher should renew his lectures subjects ." (high, P2)
• "This indicates that he is up to date and also refreashes his mind constantly . Also , this gives him a good reputation among other colleagues ." (high, P1)
• "university , I keep getting newer information from different books and resources and include it in my lectures ." (low, P8)
• "It is troublesome , as it seems , but it keeps me freash in information and with a good status among my colleagues ." (high, P7)
• "you are constantly trying to change your ways of making , or synthesizing , chemical compunds , you might come out with less expensive methods ." (medium, P2)
• "The newer method is considered an invention in its own , and it also saves money ." (high, P2)
• "Also , you might think of changing your view , clothes or even your hair cut ." (medium, P2)
• "When I change my glasses color , for example , this would be attractive to my students and colleagues and will make me feel better ." (low, P7)
• "Alternating work and dormancy in your life pace with activities and exercise is of great benefit to the body and the mind ." (medium, P2)
• "Lastly , change is a different decision in the human 's life , but it is important for many good reasons , and gets back on the human with good benefits ." (high, P2)
• "In contrast , many people believe that changing is really important in people 's lives ." (high, P2)


Sample texts

(1) Many people lose thier life , thier job , or maybe thier family . All of them were not learn from other ’s fault . Successful person who has many choses that help him in his life . So , what we lose if we learn one more thing in our life .

(2) Last week I read an article in our daily newspaper about who enjoys life more , female or male . I considered about these piece of paper a very long time and came to the conclusion that it is more worth to distinguish between young and old people , than between female and male .


(1) . . . ARA
(2) . . . GER


Task: predict the L1 of the English essays' authors

Data:

• TOEFL11 is a corpus of non-native English writing
• it consists of essays on eight different topics (prompts P1 to P8)
• written by non-native speakers of three proficiency levels (labels low/medium/high)
• the essays' authors speak 11 different native languages, which are to be predicted
• the corpus contains 1,100 tokenized essays per language, with an average of 348 word tokens per essay
• more information can be found in (Blanchard et al., 2013)
• additionally, all texts have been preprocessed with the Stanford POS tagger (Toutanova et al., 2003)


Existing approaches to NLI
State-of-the-art results

System                          # of feat.   Acc.⋆   Approach
Gebre et al., 2013              73,626       84.6    tf-idf of unigrams and bigrams of words
Jarvis, Bestgen, Pepper, 2013   400K         84.5    {1,2,3}-grams of words, lemmas, POS tags, df ≥ 2
Kríž, Holub, Pecina, 2015       55⋆⋆         82.4    language models using tokens, characters, POS, suffixes

⋆ Acc. – cross-validation results on train+dtest
⋆⋆ traditional n-grams are hidden in the language models
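The tf-idf weighting used by the best system above can be sketched in a few lines. This is a toy, stdlib-only version with function names of my own choosing; the real system works on a far larger feature space and feeds the vectors to a classifier:

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Word {1,2}-grams, matching the unigram+bigram feature space."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf(docs):
    """One tf-idf feature dict per tokenized document:
    tf = raw count in the document, idf = log(N / df)."""
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(g for c in counts for g in c)
    n = len(docs)
    return [{g: tf * math.log(n / df[g]) for g, tf in c.items()} for c in counts]

docs = [["i", "think", "that"], ["i", "believe", "that"]]
vecs = tfidf(docs)
# "i" occurs in both documents, so its idf (and hence its weight) is 0
```

Note how terms shared by all documents are zeroed out, which is exactly why tf-idf favors the class-specific n-grams analyzed on the next slide.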


Analyzing most discriminative n-grams

A sample of extracted word n-grams:

n-gram         fr       df     G max              G max2             G max3
,              114,358  8,701  -2,857.8 TEL-ZHO   2,328.7 JPN-TEL    2,239.8 KOR-TEL
i              6,011    1,939  992 ARA-DEU        957.4 ARA-JPN      735.7 ARA
Japan          356      210    982.1 JPN          380.8 JPN-TUR      379.2 JPN-SPA
think          9,883    4,861  -915.1 HIN-ITA     797.6 ITA-TEL      -767.9 HIN-JPN
think that     3,602    2,322  807.7 ITA-TEL      748.5 ITA          506.2 ITA-ZHO
I think that   1,963    1,375  796.3 ITA          523.6 ITA-ZHO      391.8 ITA-KOR
tour           3,810    694    -725.3 ITA-JPN     -644.7 ITA-KOR     -625.3 DEU-JPN
Indeed         282      222    694.2 FRA          288.5 FRA-HIN      282.4 FRA-TEL
in twenty      1,214    703    73.1 FRA-HIN       -62.8 HIN          NA HIN-JPN

fr . . . absolute n-gram frequency in the corpus
df . . . document frequency (number of examples containing the given n-gram)
G stands for the G-test statistic.
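The G scores above come from the G-test of independence. A minimal sketch of the statistic for a contingency table; the exact table construction behind the slide's per-language-pair scores is not shown, so the example counts here are made up:

```python
import math

def g_test(table):
    """G = 2 * sum(O * ln(O / E)) over a contingency table (list of
    rows), with expected counts E taken from the row/column marginals."""
    total = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs:  # 0 * ln(0/E) contributes nothing
                exp = rows[i] * cols[j] / total
                g += obs * math.log(obs / exp)
    return 2.0 * g

# documents containing "Japan" vs not, JPN essays vs the rest (toy counts)
g = g_test([[210, 890], [45, 10855]])
```

An independent table (observed = expected everywhere) yields G = 0; the more an n-gram's occurrence is skewed toward one class, the larger G grows.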


Our best model (Kríž, Holub, and Pecina, 2015)

Language modeling approach to feature extraction

• We built 11 special language models of English (Mi), each based on the texts with the same L1 available in the training data.
• Then we compare each Mi to a general language model of English (MG).
• The cross-entropy of a text t with empirical n-gram distribution p, given a language model M with distribution q, is

    H(t, M) = − ∑_x p(x) log q(x).

• Normalized cross-entropy scores, used as features:

    D_G(t, M_i) = H(t, M_i) − H(t, M_G) = − ∑_x p(x) log ( q_i(x) / q_G(x) ),

  where the M_i are the special language models with distributions q_i, and M_G is the general language model with distribution q_G.
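The two formulas translate to code directly. A toy unigram-only sketch: the actual models also use character and POS n-grams, with proper smoothing rather than the flat floor probability assumed here:

```python
import math
from collections import Counter

def cross_entropy(tokens, q, floor=1e-6):
    """H(t, M) = -sum_x p(x) log q(x), with p the empirical unigram
    distribution of the text t and q the model's distribution.
    Unseen words get a flat floor probability in place of real smoothing."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log(q.get(w, floor)) for w, c in counts.items())

def dg(tokens, q_i, q_g):
    """Normalized score D_G(t, M_i) = H(t, M_i) - H(t, M_G);
    negative values mean M_i fits the text better than M_G."""
    return cross_entropy(tokens, q_i) - cross_entropy(tokens, q_g)

# a model biased toward JPN-flavored English vs a general model (toy numbers)
q_jpn = {"japan": 0.05, "think": 0.10}
q_gen = {"japan": 0.001, "think": 0.10}
score = dg(["japan", "think"], q_jpn, q_gen)  # negative: q_jpn fits better
```

Computing D_G against all 11 special models yields an 11-dimensional feature vector per essay, which is what keeps the final feature count so small (55 in the table above).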


Our best model – error analysis

Aggregated confusion matrix: sum of the 10 confusion matrices obtained in the 10-fold cross-validation process

     ARA DEU FRA HIN ITA JPN KOR SPA TEL TUR ZHO
ARA  738   8  30  24   5  16   9  30  10  32  15
DEU    7 836  13   4  20   4   6  20   1  16   1
FRA   31   6 767   0  35   6   3  39   1  15   1
HIN   19   1   1 684   0   2   3   3 194   9   4
ITA    6   8  20   1 761   0   1  58   0   3   2
JPN    7   2   3   3   2 709 109   6   1   8  22
KOR   14   1   3   3   1 120 676   6   2  12  50
SPA   30  21  45   6  69   3   4 720   1  17   7
TEL    4   2   0 157   0   1   0   2 685   2   1
TUR   32  15  11  16   6  11  23  11   5 778  16
ZHO   12   0   7   2   1  28  66   5   0   8 781
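The matrix is easy to summarize programmatically. A small sketch that recomputes the overall accuracy and finds the largest off-diagonal cell (the HIN/TEL confusion, which is expected given how much Hindi- and Telugu-speaking writers share educationally):

```python
labels = ["ARA", "DEU", "FRA", "HIN", "ITA", "JPN", "KOR", "SPA", "TEL", "TUR", "ZHO"]
cm = [
    [738, 8, 30, 24, 5, 16, 9, 30, 10, 32, 15],
    [7, 836, 13, 4, 20, 4, 6, 20, 1, 16, 1],
    [31, 6, 767, 0, 35, 6, 3, 39, 1, 15, 1],
    [19, 1, 1, 684, 0, 2, 3, 3, 194, 9, 4],
    [6, 8, 20, 1, 761, 0, 1, 58, 0, 3, 2],
    [7, 2, 3, 3, 2, 709, 109, 6, 1, 8, 22],
    [14, 1, 3, 3, 1, 120, 676, 6, 2, 12, 50],
    [30, 21, 45, 6, 69, 3, 4, 720, 1, 17, 7],
    [4, 2, 0, 157, 0, 1, 0, 2, 685, 2, 1],
    [32, 15, 11, 16, 6, 11, 23, 11, 5, 778, 16],
    [12, 0, 7, 2, 1, 28, 66, 5, 0, 8, 781],
]

def summarize(cm, labels):
    """Overall accuracy and the largest off-diagonal (most confused) cell."""
    total = sum(map(sum, cm))
    correct = sum(cm[i][i] for i in range(len(cm)))
    worst = max((cm[i][j], labels[i], labels[j])
                for i in range(len(cm)) for j in range(len(cm)) if i != j)
    return correct / total, worst

acc, worst = summarize(cm, labels)
# acc ≈ 0.822, worst = (194, 'HIN', 'TEL')
```

The recomputed accuracy matches the 82.4 cross-validation figure from the state-of-the-art table up to rounding, and the HIN/TEL, JPN/KOR, and KOR/ZHO cells account for most of the remaining errors.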


Initial experiments with NN

Model                                                Reference
Paragraph vector model – Distributed Bag-Of-Words    (Le & Mikolov, 2014)
Paragraph vector model – Distributed Memory          (Le & Mikolov, 2014)
Word2Vec Inversion                                   (Taddy, 2015)
Recurrent Neural Network                             (Cho et al., 2014)
Convolutional Neural Network                         (Kim, 2014)
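To give a flavor of the last model: the core of the CNN text classifier is a set of convolution filters slid over the token embedding sequence, each reduced by max-over-time pooling. A dependency-free sketch of one filter (all embedding and filter values below are made up):

```python
def conv_max_pool(embeds, filt, h):
    """One convolution filter of width h over a token embedding
    sequence, followed by max-over-time pooling (as in Kim, 2014).
    Each window of h consecutive embeddings is flattened and
    dotted with the filter weights; the maximum response survives."""
    feats = []
    for i in range(len(embeds) - h + 1):
        window = [v for e in embeds[i:i + h] for v in e]
        feats.append(sum(w * f for w, f in zip(window, filt)))
    return max(feats)

# toy 2-dimensional embeddings for three tokens, one width-2 filter
embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature = conv_max_pool(embeds, [1.0, 0.0, 0.0, 1.0], h=2)  # -> 2.0
```

In the full model, many such filters of several widths produce one feature each, and a softmax layer over the pooled features predicts the L1 label.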


Goals

Build and tune a deep neural model to beat the state of the art, using

• recurrent neural networks

• convolutional neural networks

• . . .

Then the NN model(s) should be compared to the traditional SVM classifiers:

– what is the difference?
– is their performance complementary?
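Complementarity is easy to quantify once both systems' predictions are available. One common measure is the oracle accuracy of the pair (the function name and toy data here are my own):

```python
def oracle_accuracy(preds_a, preds_b, gold):
    """Share of examples that at least one of the two classifiers
    labels correctly: an upper bound on any combination of them.
    If this clearly exceeds both individual accuracies, the systems
    make different errors and are worth ensembling."""
    hits = sum(a == g or b == g for a, b, g in zip(preds_a, preds_b, gold))
    return hits / len(gold)

# toy predictions of two systems over three essays
oracle = oracle_accuracy(["ARA", "DEU", "FRA"],
                         ["ARA", "TUR", "FRA"],
                         ["ARA", "DEU", "TUR"])  # 2 of 3 covered
```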


References I

• Blanchard, Daniel and Tetreault, Joel and Higgins, Derrick and Cahill, Aoife and Chodorow, Martin. TOEFL11: A Corpus of Non-Native English. ETS Research Report, 2013.

• Cho, Kyunghyun, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October. Association for Computational Linguistics.

• Gebre, Binyam Gebrekidan et al. Improving Native Language Identification with TF-IDF Weighting. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 216–223, Atlanta, Georgia, 2013.

• Jarvis, Scott and Bestgen, Yves and Pepper, Steve. Maximizing Classification Accuracy in Native Language Identification. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 111–118, Atlanta, Georgia, 2013.

• Kim, Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on EMNLP, pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

• Kríž, Vincent and Holub, Martin and Pecina, Pavel. Feature Extraction for Native Language Identification Using Language Modeling. In: Proceedings of Recent Advances in Natural Language Processing, pp. 298–306, Hissar, Bulgaria, 2015.


References II

• Le, Quoc and Mikolov, Tomas. Distributed Representations of Sentences and Documents. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014.

• Taddy, Matt. 2015. Document Classification by Inversion of Distributed Language Representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pp. 45–49, Beijing, China, 2015.

• Toutanova, Kristina, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252–259.
