Post on 10-Jul-2020
Lexical Normalization for Dutch Social Media Texts
Rob van der Goot & Gertjan van Noord
Lexical Normalization
nee ! :-D kzal nog es vriendenlijk doen lol
nee ! :-D ik zal nog eens vriendelijk doen lol

tgaat goed , vdg rustig aaan .
Het gaat goed , vandaag rustig aan .

social ppl r anoying
social people are annoying

aaah buenoo esqe digo pa qe madrugara este jajaja
ah bueno es que digo para que madrugara este jajaja

nekomu je sarkazm detektor crknu
nekomu je sarkazem detektor crknil
Performance Per Corpus
[Figure: ERR (y-axis, 0.2 to 0.7) against train size in words (x-axis, 0 to 50,000), one curve per corpus: GhentNorm, TweetNorm, LexNorm1.2, LexNorm2015, Janes-Norm, ReLDI-hr, ReLDI-sr]
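ERR on the y-axis is read here as error reduction rate: the accuracy gain over a baseline that leaves every word unchanged, divided by the room that baseline leaves for improvement. The sketch below works under that assumption; the function name and the toy sentence are illustrative and are not taken from the evaluation behind the plot.

```python
def error_reduction_rate(originals, golds, predictions):
    """ERR = (system accuracy - baseline accuracy) / (1 - baseline accuracy).

    The baseline copies every original word unchanged (assumption).
    """
    if not (len(originals) == len(golds) == len(predictions)):
        raise ValueError("all three sequences must have the same length")
    total = len(golds)
    if total == 0:
        raise ValueError("empty input")
    baseline_correct = sum(o == g for o, g in zip(originals, golds))
    system_correct = sum(p == g for p, g in zip(predictions, golds))
    if baseline_correct == total:
        raise ValueError("nothing to normalize: ERR is undefined")
    # Dividing both accuracies by `total` cancels out, so counts suffice.
    return (system_correct - baseline_correct) / (total - baseline_correct)


originals = "social ppl r anoying".split()
golds = "social people are annoying".split()
predictions = "social people r annoying".split()  # hypothetical system output
err = error_reduction_rate(originals, golds, predictions)
print(round(err, 3))
```

Under this definition a system that changes nothing scores 0, a perfect system scores 1, and a system that breaks more words than it fixes scores below 0.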
Is normalization for Dutch more difficult?
MoNoise
Rob van der Goot and Gertjan van Noord. MoNoise: Modeling Noise Using a Modular Normalization System. In CLIN Journal 2017
Tokenizer
Generation
Module       Original words        Candidates
Orig. Word   scoren #fail          scoren #fail
LookupList   alst hahahaha         als het haha
word2vec     mss dinnetje          mssn dinnie
                                   missch vriendinnetje
                                   misschien dinnetjes
Aspell       grapjee felicteren    grapje feliciteren
                                   grapjes flecteren
                                   greepje fluctueren
Split        kheb datis            k heb dat is
word.*       ech waarsch           echt waarschijnlijk
                                   echo waarschuw
                                   echeltje waarschuwt
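The generation modules can be sketched as small functions that each propose candidates for one word. This is a minimal illustration, not MoNoise's code: the function names, the `min_len` threshold, and the toy `LOOKUP` and `VOCAB` resources are all assumptions, and the word2vec and Aspell modules are left out because they need external models.

```python
def lookup_candidates(word, lookup):
    """LookupList: replacements seen for this word in annotated data."""
    return list(lookup.get(word, []))


def split_candidates(word, vocab):
    """Split: cut the word in two where both halves are known words."""
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]


def prefix_candidates(word, vocab, min_len=3):
    """word.*: known words that start with the typed string."""
    if len(word) < min_len:  # min_len is a hypothetical guard
        return []
    return sorted(w for w in vocab if w.startswith(word) and w != word)


def generate(word, lookup, vocab):
    """Union of all modules, original word first, duplicates removed."""
    candidates = [word]  # the original word is always a candidate
    for cand in (
        lookup_candidates(word, lookup)
        + split_candidates(word, vocab)
        + prefix_candidates(word, vocab)
    ):
        if cand not in candidates:
            candidates.append(cand)
    return candidates


# Toy resources (hypothetical), built from the examples on the slide.
LOOKUP = {"alst": ["als het"], "hahahaha": ["haha"]}
VOCAB = {"k", "heb", "dat", "is", "echt", "echo", "waarschijnlijk", "waarschuw"}

print(generate("kheb", LOOKUP, VOCAB))  # ['kheb', 'k heb']
print(generate("ech", LOOKUP, VOCAB))   # ['ech', 'echo', 'echt']
print(generate("alst", LOOKUP, VOCAB))  # ['alst', 'als het']
```

Generation deliberately over-produces; choosing among the candidates is left to the ranking step.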
Ranking
Features:
• isOrig
• Word2vec dist.
• Aspell dist.
• isSplit
• word.*
• N-grams Wiki
• N-grams Twitter
• dict
• length
• containsAlpha
origFeats
Random Forest Classifier
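A few of the listed features can be computed from the original word and a candidate alone. The sketch below is an assumption about what those feature names mean (for instance, whether length and containsAlpha refer to the original word or to the candidate is a guess); the n-gram, word2vec and Aspell features are omitted because they need external resources, and the classifier itself is not shown.

```python
def candidate_features(original, candidate, vocab):
    """Return a small feature dict for one (original, candidate) pair."""
    return {
        # 1 if the candidate is the unchanged original word
        "isOrig": int(candidate == original),
        # 1 if the candidate is the original with spaces inserted
        "isSplit": int(" " in candidate and candidate.replace(" ", "") == original),
        # 1 if every part of the candidate is a known word
        "dict": int(all(part in vocab for part in candidate.split())),
        # lengths in characters (assumed reading of the "length" feature)
        "length_orig": len(original),
        "length_cand": len(candidate),
        # 1 if the original contains at least one letter
        "containsAlpha": int(any(ch.isalpha() for ch in original)),
    }


vocab = {"k", "heb"}  # toy vocabulary (hypothetical)
feats = candidate_features("kheb", "k heb", vocab)
print(feats)
```

One such vector per candidate, labelled correct or incorrect against the gold normalization, is the kind of input a random forest classifier could be trained on; the candidates for a word would then be ranked by the classifier's confidence.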
New Dataset
• Annotate capitalization consistently
• Annotate tokenization in a separate layer
• Do not include phrasal abbreviations (‘lol’ → ‘laughing out loud’)
• Make publicly available
• No Flemish → Dutch
• Annotate POS dev/test data
• Annotate categories? [a]
• Annotate Universal Dependencies?

[a] Rob van der Goot, Rik van Noord, and Gertjan van Noord. A taxonomy for in-depth evaluation of normalization for user generated content. In Proceedings of LREC 2018
Beneficial for Parsing?
new pix comming tomoroe
[Figure: word lattice over nodes 0 to 4, candidates with their scores]
0→1: new (1.0)
1→2: pix (0.6), pics (0.3), pictures (0.1)
2→3: comming (0.3), coming (0.6), common (0.1)
3→4: tomoroe (0.3), tomorrow (0.5), more (0.2)
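The lattice for "new pix comming tomoroe" can be written down as one list of scored candidates per position. The sketch below treats the scores as independent probabilities, which is an assumption made for illustration, and enumerates full sequences by their product; it is not the integration method of the parsing papers cited on this slide.

```python
from itertools import product
from math import prod

# Candidate scores as shown in the lattice figure.
LATTICE = [
    [("new", 1.0)],
    [("pix", 0.6), ("pics", 0.3), ("pictures", 0.1)],
    [("comming", 0.3), ("coming", 0.6), ("common", 0.1)],
    [("tomoroe", 0.3), ("tomorrow", 0.5), ("more", 0.2)],
]


def best_path(lattice):
    """Pick the highest-scoring candidate at every position."""
    return [max(slot, key=lambda pair: pair[1])[0] for slot in lattice]


def top_k_paths(lattice, k):
    """Enumerate all sequences and keep the k with the highest product."""
    scored = [
        ([word for word, _ in combo], prod(p for _, p in combo))
        for combo in product(*lattice)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


print(best_path(LATTICE))
for words, score in top_k_paths(LATTICE, 3):
    print(" ".join(words), round(score, 3))
```

With these scores the single best path keeps "pix" unnormalized, which illustrates why passing several candidates to the parser can be more useful than committing to the top-1 normalization.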
[Figure: constituency tree over the normalized words "new pictures coming tomorrow", with node labels S, VB, NP, NN, VBG, NNS, JJ]
[Figure: dependency tree over the original words "new pix comming tomoroe", with arc labels root, amod, nsubj, nmod]
Rob van der Goot and Gertjan van Noord. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of ACL 2017
Rob van der Goot and Gertjan van Noord. Modeling Input Uncertainty in A Neural Network Dependency Parser. In Proceedings of EMNLP 2018, Brussels
Try it!
www.let.rug.nl/rob/monoise