


Joint POS Tagging and Text Normalization for Informal Text

Chen Li and Yang Liu
University of Texas at Dallas
Richardson, TX, 75080, USA
{chenli,yangl}@hlt.utdallas.edu

Abstract

Text normalization and part-of-speech (POS) tagging for social media data have been investigated recently; however, prior work has treated them separately. In this paper, we propose a joint Viterbi decoding process to determine each token's POS tag and each non-standard token's correct form at the same time. To evaluate our approach, we create two new data sets annotated with POS tags and non-standard tokens' correct forms; these are the first data sets with such annotation. The experimental results demonstrate the effect of non-standard words on POS tagging, and also show that our proposed methods outperform the state-of-the-art systems in both POS tagging and normalization.

1 Introduction

There has been a rapid increase in social media text in the last few years, including mobile phone text messages (SMS), comments from social media websites such as Facebook and Twitter, and real-time communication platforms like MSN and Gtalk. Unfortunately, traditional natural language processing (NLP) tools sometimes perform poorly when processing this kind of text. One reason is that social text is very informal, and contains many misspelled words, abbreviations, and many other non-standard tokens.

There are several ways to improve language processing performance on social media data. One is to leverage normalization techniques, which automatically convert non-standard tokens into the corresponding standard words. Intuitively, this eases subsequent language processing for social media text that contains non-standard tokens. For example, if '2mr' is converted to 'tomorrow', a text-to-speech system will know how to pronounce it, a POS tagger can label it correctly, and an information extraction system can identify it as a time expression. This task has received increasing attention in social media language processing. Another way is to design special models and data or apply specific linguistic knowledge in this domain [Ritter et al., 2011; Owoputi et al., 2013; Ritter et al., 2010; Foster et al., 2011; Liu et al., 2011; 2012]. For example, the system in [Ritter et al., 2011] reduced the POS tagging prediction error by 41% compared with the Stanford POS Tagger, and by 22% in parsing tweets compared with the OpenNLP chunker tool. [Owoputi et al., 2013] created a new set of POS tags for Twitter data, and showed improved POS tagging performance for such data when using their tag set and word cluster information extracted from a huge Twitter corpus.

In this paper, our objective is to perform POS tagging and text normalization at the same time. Through analysis of previous POS tagging results on social media data, we find that non-standard tokens indeed have a negative impact on POS tagging. They affect not only the accuracy of the POS tags for themselves, but also for their surrounding correct words, because context information is a very important feature for POS tagging. Therefore, we expect that explicitly performing normalization would improve POS tagging accuracy, both for the non-standard words themselves and for their context words. On the other hand, previous work on normalization mostly used word-level information, such as character sequences and pronunciation features. Some work [Yang and Eisenstein, 2013; Li and Liu, 2014] leveraged unsupervised methods to construct the semantic relationship between non-standard tokens and correct words by considering words' context. Deeper linguistic information such as POS tags has not been incorporated for normalization. Motivated by these observations, we propose to jointly perform POS tagging and text normalization, in order to let them benefit each other. Although joint learning and decoding approaches have been widely used in many tasks, such as joint Chinese word segmentation and POS tagging [Zhang and Clark, 2008] and joint sentence compression and summarization [Martins and Smith, 2009], this is the first time joint decoding is applied to normalization and POS tagging in informal text.

Our work is related to POS tagging, a fundamental research problem in NLP with countless applications. Some of the latest research on POS tagging in the social media domain is from [Gimpel et al., 2011; Owoputi et al., 2013]. They built a POS tagger for tweets using 25 coarse-grained tags and also provided two data sets annotated with this special tag set. They then incorporated unsupervised word cluster information into their tagging system and achieved significant improvement in tagging performance. Another line of work closely related to ours is text normalization in the social media domain. Many approaches have been developed for this task, from using edit distance [Damerau, 1964; Levenshtein, 1966], to the noisy channel model [Cook and Stevenson, 2009; Pennell and Liu, 2010; Li and Liu, 2012a] and machine translation methods [Aw et al., 2006; Pennell and Liu, 2011; Li and Liu, 2012b]. Normalization performance on some benchmark data has improved substantially.

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Our contributions in this paper are as follows: (1) To the best of our knowledge, this is the first time an effective joint approach is proposed to combine normalization and POS tagging to improve the performance of these two tasks on English social media data; (2) We created two data sets for this joint task, in which every token is labeled with a POS tag, and with the correct word if it is a non-standard token; (3) We demonstrate the effectiveness of our proposed method. Our results outperform the state-of-the-art POS tagger and normalization systems on two different data sets. Our analysis shows the impact of non-standard words on POS tagging and the effect of various factors in the joint model.

2 Data Set

Since there is no prior work performing joint POS tagging and normalization of non-standard tokens at the same time, we created a data set1 for this joint task by reusing previously widely used data for each of these two tasks.

[Owoputi et al., 2013] released two data sets, called OCT27 and DAILY547 respectively. These two data sets are annotated with their designed POS tag set (see their paper for details). Altogether there are 2,374 tweets. We asked six native English speakers (who are also heavy social network users) to find the non-standard tokens in these tweets and provide the corresponding correct words according to their knowledge. The annotation results showed that 798 tweets contain at least one non-standard token (the remaining 1,576 sentences have no non-standard tokens). We put these 798 tweets in one data set, which has both POS tag labels and non-standard tokens with their correct word forms. This data set is called Test Data Set 1 in the rest of this paper.

[Han and Baldwin, 2011] released a data set including 549 tweets. Each tweet contains at least one non-standard token and the corresponding correct word. In order to label POS tags for these tweets, we first trained a POS tagging model based on the 2,374 tweets mentioned above, using the same features as [Owoputi et al., 2013]. We then applied this model to these 549 tweets and asked one native English speaker to manually correct the wrong tags. After this procedure, these 549 tweets also have POS tags and non-standard token annotations. We call this Test Data Set 2 in the rest of the paper. Please note that every sentence in these two test sets has at least one non-standard token.

3 POS Tagging and Normalization Baseline

3.1 POS Tagging

[Owoputi et al., 2013] used a Maximum Entropy Markov model (MEMM) to implement a POS tagger for the Twitter domain. In addition to the contextual word features, they included other features such as cluster-based features, a word's most frequent POS tag in the Penn TreeBank2, and a token-level name list feature, which fires on words from name lists from several sources. Please refer to [Owoputi et al., 2013] for more details about their POS tagging system and features. We built a POS tagger baseline using CRF models, rather than an MEMM, but kept the same features as used in [Owoputi et al., 2013].

1 http://www.hlt.utdallas.edu/∼chenli/normalization pos/

To help understand our proposed joint POS tagging and normalization in the next section, here we briefly explain the Viterbi decoding process in POS tagging. Figure 1 shows the trellis for part of the test word sequence 'so u should answr ur phone'. Every box with dashed lines represents a hidden state (a possible POS tag) for the corresponding token. Two sources of information are used in decoding. One is the tag transition probability, p(yi|yj), from the trained model, where yi and yj are two POS tags. The other is p(yi|t2, t1t2t3), where yi is the POS label for token t2, t1 is the previous word, and t3 is the following word. The reason we drop the entire sequence from the condition is that all the features are defined over a three-word window.
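As an illustration (this is our sketch, not the authors' implementation), the decoding process described above can be written as a standard Viterbi search over tag sequences. The tables `trans` and `local` are hypothetical stand-ins for the trained model's p(yi|yj) and p(yi|t2, t1t2t3):

```python
import math

def viterbi_pos(tokens, tags, trans, local):
    """Viterbi decoding for POS tagging (sketch).

    trans[(y_prev, y)] -> p(y | y_prev), the tag transition probability.
    local[(i, y)]      -> per-token tag probability for position i,
                          standing in for p(y_i | t2, t1 t2 t3).
    """
    # best[i][y] = best log-score of any tag sequence ending in tag y at i
    best = [{y: math.log(local[(0, y)]) for y in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for y in tags:
            # pick the best previous tag for each current tag
            prev, score = max(
                ((yp, best[i - 1][yp]
                  + math.log(trans[(yp, y)])
                  + math.log(local[(i, y)])) for yp in tags),
                key=lambda x: x[1])
            best[i][y] = score
            back[i][y] = prev
    # backtrack from the best final tag
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for i in range(len(tokens) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return path[::-1]
```

With toy probability tables over two tags, the function returns the highest-scoring tag sequence for a short token list.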

Figure 1: A simple trellis of Viterbi decoding for POS tagging.

3.2 Normalization Model

[Li and Liu, 2014] achieved state-of-the-art normalization performance on Test Data Set 2 by using a reranking technique that combines multiple normalization systems. We use the same normalization method in this study. Briefly, there are three supervised and two unsupervised normalization systems for each non-standard token, resulting in six candidate lists (one system provides two lists). A Maximum Entropy reranking model is then adopted to combine and rerank these candidate lists using a rich set of features. After reranking, for each non-standard token t, the system provides a candidate list and a probability p(si|t) for each candidate si.

Viterbi decoding is used for sentence-level normalization after generating the token-level normalization candidates and scores. Figure 2 shows a decoding trellis for normalization. Here, for the non-standard tokens, the hidden states represent normalization word candidates. For a standard word that does not need normalization, we treat it as a special case and use a state with the word itself (the normalization probability is 1 in this case). The scores used in decoding are: the probability of a normalization candidate word sj given the current token ti, p(sj|ti), which is from the token-level normalization model; and the transition probability from a language model, p(sj|sl), where sl is a candidate word for the previous token ti−1 (for the standard word case, it is just the word itself). Note that in the trellis used for normalization, the number of hidden states varies for different tokens. The trellis shown here is for a first-order Markov model, i.e., a bigram language model is used. This is also what is used in [Li and Liu, 2014]. The states can be expanded to consider a trigram language model, which is what we use later in the joint decoding section.

2 http://www.cis.upenn.edu/∼treebank/
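This sentence-level decoding differs from POS Viterbi mainly in that the number of states varies per token. A minimal sketch (our illustration; the candidate lists and bigram LM table are hypothetical inputs, not the paper's trained models):

```python
import math

def viterbi_normalize(tokens, cands, lm):
    """Sentence-level normalization with a bigram LM (sketch).

    cands[i] -> list of (candidate_word, p(candidate | token_i)) pairs
                for position i; a standard word gets [(word, 1.0)].
    lm[(w_prev, w)] -> bigram probability p(w | w_prev).
    """
    best = [{w: math.log(p) for w, p in cands[0]}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for w, p in cands[i]:
            # each candidate has a variable set of predecessors
            prev, score = max(
                ((wp, best[i - 1][wp] + math.log(lm[(wp, w)]) + math.log(p))
                 for wp in best[i - 1]),
                key=lambda x: x[1])
            best[i][w] = score
            back[i][w] = prev
    w = max(best[-1], key=best[-1].get)
    out = [w]
    for i in range(len(tokens) - 1, 0, -1):
        w = back[i][w]
        out.append(w)
    return out[::-1]
```

For instance, decoding 'u should' with candidates you/your for 'u' picks the candidate sequence that balances the token-level score against the LM score.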

Figure 2: A simple trellis of Viterbi decoding for normalization.

4 Proposed Method

For a given sequence of words, our task is to find the normalized sequence and the POS tags for the words. Rather than using a pipeline method that performs normalization first and then POS tagging on the normalized sequence, we propose to use joint decoding to predict the normalized words and the POS tags together. This avoids committing to a single best, but possibly incorrect, normalization hypothesis before POS tagging, and it also allows POS information to help normalization.

4.1 Joint Decoding for POS and Normalization

Our proposed joint decoding approach combines the above two decoding procedures into one process. In this joint decoding process, for a token in the test sequence, its hidden states consist of a normalization word candidate and its POS tag. Since both POS tagging and normalization use information from the previous and the following word (for extracting features in POS tagging, and for n-gram LM probabilities in normalization), we put these contextual words explicitly in the states.

Figure 3 shows part of the trellis for the example used previously (test sequence 'u should answr ur'). It can be thought of as a combination of Figure 1 and Figure 2. Let us assume that each non-standard token ('u', 'answr' and 'ur') has two normalization candidate words: you and your for u, answer and anser for answr, and ugly and your for ur. A black box with dashed lines in Figure 3 represents a state.

There are some properties of this trellis worth pointing out. In joint decoding, each state is composed of a POS tag and a normalization word. Furthermore, as mentioned earlier, we include the previous word and the following word in the state (these are the normalization word candidates for the previous and next word). For one state (N, should, you, answer), p(N|should, you, answer) means the probability that word should's POS is noun, given that its previous token's (u) normalization is 'you' and its next token's (answr) normalization is 'answer'. The green box in Figure 3 indicates the same three-word sequence with different POS tags (one green box corresponds to the states for a token in Figure 1, where the word and context are fixed and the states differ only in POS tags). In this trellis, not all transitions between the states of two consecutive tokens are valid. Figure 3 shows an example: a path from state (N, should, you, answer) to state (N, anser, should, your) is illegal because they do not share the same word sequence. This is a standard constraint when using a trigram LM in decoding.

Regarding the number of states for each token, it depends on the number of POS tags (k), and the numbers of normalization word candidates of the current token (l), the previous token (m), and the following token (n). For a standard word, there is just one normalization candidate, the word itself. The number of hidden states of a token is l ∗ m ∗ n ∗ k.
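For concreteness, the state construction can be sketched as a Cartesian product (a toy illustration; the function name and representation are ours):

```python
from itertools import product

def joint_states(tags, cands_cur, cands_prev, cands_next):
    """Enumerate the hidden states for one token in the joint trellis.

    Each state pairs a POS tag with the current token's normalization
    candidate plus the neighboring candidates, giving l * m * n * k
    states as described in the text."""
    return list(product(tags, cands_cur, cands_prev, cands_next))
```

For example, with k = 2 tags, l = 2 current candidates, m = 1 previous candidate, and n = 2 next candidates, a token has 8 hidden states.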

Using such defined states, we can perform Viterbi decoding to find the best state sequence, i.e., the best POS tags and normalization results. The following scores need to be considered to compute the forward score for each state:

• p1: the probability of the state's POS tag given the three words in this state;
• p2: the probability of the normalization candidate for the token;
• p3: the transition probability of the POS tag from the last state to that of the current state;
• p4: the trigram language model probability from the previous state to the current state (current word given the previous state's first two words).

The first two probabilities are emission-like probabilities between the hidden state and the observed token, coming from the POS tagger and the normalization model respectively. We use a parameter α when adding them together. Again, for a standard word, its normalization candidate is just itself, and p2 is 1 in this case. The last two probabilities are transition probabilities between two states. We use β when combining the POS tag transition probability with the trigram LM probability.

For a hidden state s of an observed token ti, if its normalization candidate is ct, its previous and following words are cti−1 and cti+1, and its POS tag is yj, we define the forward value of this hidden state as follows:

f(s) = max_{s′∈S} [ f(s′) + α ∗ p1(yj | ct, cti−1 cti+1) + p2(ct | ti) + p3(yj | pos(s′)) + β ∗ p4(ct | bigram(s′)) ]    (1)

in which S is the set of all the hidden states of the previous observed token with valid transitions to the current state; the function pos returns a state's POS tag, and bigram returns a state's current word and its previous context word. We can use backtracking pointers to recover the highest-scoring state sequence.

Figure 3: An example trellis for joint decoding.
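Equation (1) can be sketched directly as a per-state scoring function (our illustration: p1–p4 are hypothetical callables standing in for the trained models, and states are (tag, candidate, previous word, next word) tuples):

```python
def forward_value(state, prev_states, f_prev, alpha, beta, p1, p2, p3, p4):
    """Compute f(s) for one joint state following Eq. (1) (sketch).

    prev_states is S, the previous token's states with valid transitions
    to this state; f_prev[s'] holds their forward values."""
    tag, cand, prev_word, next_word = state
    # emission part: alpha * p1(yj | ct, ct_prev ct_next) + p2(ct | ti)
    emission = alpha * p1(tag, cand, prev_word, next_word) + p2(cand)
    return max(
        f_prev[s] + emission
        + p3(tag, s[0])                  # POS transition p3(yj | pos(s'))
        + beta * p4(cand, (s[2], s[1]))  # trigram LM p4(ct | bigram(s'))
        for s in prev_states)
```

Running this over all states of each token left to right, and keeping a backpointer to the maximizing s′, gives the joint Viterbi search described above.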

4.2 Running Time

Assuming the number of normalization candidates for each non-standard token is N (for some words, there may be fewer candidates if the normalization systems cannot provide N candidates), our proposed algorithm's running time is roughly O(LN⁴K²), where L is the length of the sequence and K is the number of POS tags. This is because there are N³K hidden states for a non-standard token, and there are NK states from the previous token that have valid transitions to each state of the current token. Of course, this is a worst-case scenario. In practice, a sentence has many standard words that do not need normalization, which significantly reduces the number of states to be considered (the N³ factor becomes 1). Furthermore, pruning can be applied to remove candidate states at each position. When using the pipeline approach, the normalization decoding's complexity is O(LN⁴) (for a similar reason as in joint decoding, except that the POS tag is not part of the states when a trigram LM is used), and the complexity of POS tagging is O(LK²).

4.3 Training

In the description above, we assumed that the POS tagger and the normalization model are trained separately, using their own labeled data sets, and that the two models are combined in the joint decoding process. However, it is possible to train the two models jointly, given a fully annotated corpus labeled with both POS tags and non-standard tokens' correct forms. The structured perceptron [Collins, 2002] can be applied to update the weights in each model based on the current joint decoding results. In addition, this training strategy is also applicable to a partially annotated corpus. For example, if a corpus has only normalization labels, we can use it to train a normalization model using the joint normalization and POS decoding results.
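The perceptron strategy referenced above can be sketched as a standard structured-perceptron update (our illustration; `feats` is a hypothetical feature extractor over a full analysis, not the paper's feature set):

```python
def perceptron_update(weights, feats, gold, predicted):
    """One structured-perceptron step (sketch): after joint decoding
    produces `predicted`, promote the gold analysis's features and
    demote the prediction's; weights are unchanged when the decoder
    already agrees with the gold analysis."""
    if predicted == gold:
        return weights
    for f, v in feats(gold).items():
        weights[f] = weights.get(f, 0.0) + v
    for f, v in feats(predicted).items():
        weights[f] = weights.get(f, 0.0) - v
    return weights
```

Iterating this over the partially annotated corpus, with joint decoding supplying `predicted`, trains the normalization model while the POS model stays fixed.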

5 Experiments and Results

5.1 Experiment Setup

To evaluate the impact of normalization on POS tagging, and the benefit of joint decoding for both normalization and POS tagging, we use the following experimental setups.

(a). POS tagging
As mentioned in Section 2, 798 tweets out of 2,374 are selected as Data Set 1. Therefore, when testing on Data Set 1, we used the remaining 1,576 tweets with POS labels as the training data for the CRF POS tagging model, implemented using the Pocket CRF toolkit. When testing on Data Set 2, we use all 2,374 tweets to train the POS tagger.

(b). Normalization
For the normalization model, all the supervised normalization systems are trained using the data released by [Li and Liu, 2014].3 It has 2,333 unique pairs of non-standard tokens and standard words, collected from 2,577 Twitter messages (selected from the Edinburgh Twitter corpus [Petrovic et al., 2010]). This training data has only normalization annotation, no POS information. We first used the Maximum Entropy reranking for token-level normalization, and then a sentence-level decoding process (introduced in Section 3.2) was used to generate the normalization results.4 We tried bigram and trigram language models during sentence-level decoding.

3 http://www.hlt.utdallas.edu/∼chenli/normalization/

(c). Normalization + POS tagging
In this experiment, all the tweets are first normalized using the normalization system in (b) (using only the best hypothesis), followed by POS tagging as described in (a).

(d). Oracle Normalization + POS tagging
Here, for each non-standard word, we use the reference normalized word. POS tagging is then applied to these normalized tweets. This can be considered an oracle performance.

(e). Joint Decoding using Separately Trained Models
The two setups above use a pipeline process: normalization followed by POS tagging. Our first joint decoding experiment uses the POS tagger and the normalization models trained independently. Joint Viterbi decoding is used to combine the information from these models to make the final prediction. The number of each non-standard token's candidates is set to 10, and the parameters α and β are both set to 0.8.

(f). Joint Decoding using a Partially Jointly Trained Model
This setup also uses joint decoding; however, we apply the perceptron training strategy and the joint decoding process to train a normalization model, while keeping the POS model fixed. We could also use this strategy to train a POS tagging model if the 1,576 tweets with POS labels also had non-standard tokens. In addition, we could simultaneously train both models if we had a corpus with both labels. However, data with such annotations is quite limited (they are used as our test sets). Therefore, we leave fully joint training for future research.

5.2 Experiment Results

Table 1 shows the POS tagging and normalization accuracies using different setups. Note that we assume we know which words are non-standard and need normalization, similar to previous work [Han and Baldwin, 2011; Yang and Eisenstein, 2013; Li and Liu, 2014]. We can see from the table that: (1) The POS tagging accuracy of the system without normalization is worse than all the others with normalization. (2) In terms of normalization results, the normalization system with a second-order Markov model performs better than that using a first-order model. (3) Joint decoding yields better accuracy for both normalization and POS tagging than the pipeline system that performs normalization and then POS tagging. (4) The joint decoding system with the normalization model trained by partially joint training with the perceptron strategy outperforms the one with models trained independently. (5) When all the non-standard tokens are correctly normalized (oracle setup), the POS tagging accuracy is the highest, as expected.

4 This normalization result is the state-of-the-art performance on Test Set 2.

System                                      Test Set 1       Test Set 2
                                            Norm    POS      Norm    POS
POS w/o Norm                                0       90.22    0       90.54
Pipeline Norm† + POS                        76.12   90.80    86.91   90.64
Pipeline Norm‡ + POS                        76.96   90.91    87.01   90.68
Norm Oracle + POS                           100     92.05    100     91.17
Joint decoding, Separately Trained Model    77.03   91.04    87.15   90.72
Joint decoding, Partially Joint Trained     77.31   91.21    87.58   90.85

Table 1: Normalization and POS tagging results from different systems on two data sets. All results are accuracy (%). † Using first-order Markov Viterbi decoding in the Norm system. ‡ Using a second-order Markov model in the Norm system.

5.3 Impact of Candidate Number

As mentioned in Section 4.2, the number of normalization candidates is a key variable affecting the running time of our proposed method. Fortunately, a good normalization model can already rank the most likely candidates near the top. Using the token-level normalization reranking results, we find that the top 20 candidates already provide more than 90% precision on Test Set 2, though the top-1 accuracy is far from that; therefore it seems unnecessary to use more than 20 candidates in the joint decoding process. Figure 4 shows the average number of hidden states for each token when varying the maximum number of normalization candidates. The Y-axis uses a relative value, in comparison with the case where the candidate number is set to 1. We can see that on average the growth rate is moderate (the worst case is N³). In addition, a typical tweet rarely has more than 3 consecutive non-standard tokens. Table 2 shows the frequency of different numbers of consecutive non-standard tokens in the two test data sets. Clearly, most runs of consecutive non-standard tokens are shorter than 3 tokens. The average number of consecutive non-standard tokens in a tweet is 1.78 and 2.14 in the two data sets, while the average length of tweets in the two test sets is 16.11 and 19.24 respectively. Therefore, the worst-case complexity discussed earlier rarely happens in practice.

Figure 4: The average number of hidden states for each token in the two test sets when varying the number of normalization candidates.


# of consecutive non-standard tokens    1      2     3    4   5   6
Freq in Test Set 1                      1072   128   24   3   0   2
Freq in Test Set 2                      879    106   15   4   2   1

Table 2: Frequency of different numbers of consecutive non-standard tokens.

Intuitively, decreasing the number of normalization candidates for a non-standard token speeds up the decoding process, but hurts the normalization results and subsequently the POS tagging results. Therefore a study of the trade-off between speed and accuracy is needed. Figure 5 shows the speed and performance as the number of candidates varies. The speed is also a relative value compared to the case where the candidate number is 1. For example, on Test Set 1, when the candidate number is set to 20, the average speed of decoding a tweet is almost 70 times slower than when the candidate number is set to 1. From this figure, we can see that when the candidate number is set to 5, both the normalization and POS tagging accuracy change little compared to using 20 candidates, but decoding is about 4 times faster.

Figure 5: Trade-off between performance and speed. For each test set, POS tagging and normalization accuracy (left axis) and relative decoding speed per tweet (right axis) are plotted against the candidate number (1 to 20).

5.4 Error Analysis

Table 3 shows the POS tagging results separately for non-standard tokens and standard words on Test Set 1. We can see that normalization has a significant impact on the non-standard tokens. As expected, as normalization performance increases, the POS tagging errors on the non-standard tokens decrease. In comparison, the effect of normalization on the standard words is much smaller. In the oracle system, the POS tagging errors for the non-standard tokens are significantly lower than those of our proposed system, suggesting there is still considerable room for improvement in both normalization and POS tagging.

System                        Non-standard    Standard
                              Token Err       Word Err
POS w/o Norm                  25.46           7.82
Pipeline: Norm‡ + POS         19.90           7.65
Pipeline: Oracle              10.74           7.60
Joint decoding
(Partially Joint Trained)     19.00           7.51

Partially Joint Trained

Table 3: POS tagging errors for non-standard tokens and standard words in Test Set 1. ‡ Using a second-order Markov model in the normalization system.

A detailed error analysis further shows what improvements our proposed method makes and what errors it still makes. For example, for the tweet 'I can tell you tht my love ...', the token tht is labeled as a verb when POS tagging is done on the original tweet (i.e., no normalization is performed). When POS tagging is applied to the normalized tweet, 'tht' is normalized as that, and the POS tag is also changed to the correct one (subordinating conjunction). We noticed that joint decoding can solve some complicated cases that are hard for the pipeline system. Take the following tweet as an example: 'they're friiied after party !'. Our joint decoding successfully corrects friiied to fried and thus labels they're as L (nominal + verbal). However, the pipeline system first corrected friiied as friend, and then the POS tagging system labeled they're as D (determiner). In addition, when there are consecutive non-standard tokens, the joint decoding process typically makes better decisions. For example, 'tats crazin' is part of a tweet, and its correct form is 'that's crazy'. The pipeline system first normalizes tats to thats and crazin to crazing, and in the subsequent POS tagging step, 'crazin' is labeled as a noun rather than an adjective. But the joint decoding system correctly normalizes and labels both tokens.

6 Conclusion and Further Work

In this paper we proposed a novel joint decoding approach that labels every token's POS tag and corrects the non-standard tokens in a tweet at the same time. This joint decoding combines information from the POS tagging model, the token-level normalization scores, and the n-gram language model probabilities. Our experimental results demonstrate that normalization has a significant impact on POS tagging, and that our proposed method improves both POS tagging and normalization accuracy, outperforming previous work on both tasks. In addition, to our knowledge, we are the first to provide an English tweet data set that contains both POS annotations and text normalization annotations. Although our proposed method is more computationally expensive than the pipeline approach, with suitable parameters it is applicable in practice without performance loss. In the future, we plan to create a training set that has both POS tag and normalization annotations, allowing us to use a joint training strategy to train the POS tagging model and the normalization model at the same time.
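As a rough illustration of the decoding scheme summarized above, the sketch below runs a joint Viterbi search over (normalized word, POS tag) states, summing a token-level normalization score, a POS score, and a language-model score for each state. The toy tagset, the caller-supplied score functions, and the first-order simplification are all assumptions made for illustration; the paper's model is second-order and uses trained components.

```python
import math


def joint_viterbi(tokens, candidates, pos_score, lm_score):
    """First-order joint Viterbi over (word, POS) pairs.

    `candidates[i]` maps each candidate word for token i to its
    (log) normalization score; `pos_score(prev_tag, tag, word)` and
    `lm_score(prev_word, word)` are caller-supplied log-score
    functions (assumptions for this sketch, not the paper's API).
    """
    TAGS = ["N", "V", "D", "P"]  # toy tagset for illustration
    # chart[i][(word, tag)] = (best score, backpointer)
    chart = [{} for _ in tokens]
    for w, norm in candidates[0].items():
        for t in TAGS:
            s = norm + pos_score(None, t, w) + lm_score(None, w)
            chart[0][(w, t)] = (s, None)
    for i in range(1, len(tokens)):
        for w, norm in candidates[i].items():
            for t in TAGS:
                best, bp = -math.inf, None
                for (pw, pt), (ps, _) in chart[i - 1].items():
                    s = ps + norm + pos_score(pt, t, w) + lm_score(pw, w)
                    if s > best:
                        best, bp = s, (pw, pt)
                chart[i][(w, t)] = (best, bp)
    # Trace back the best joint (word, tag) sequence.
    state = max(chart[-1], key=lambda st: chart[-1][st][0])
    path = [state]
    for i in range(len(tokens) - 1, 0, -1):
        state = chart[i][state][1]
        path.append(state)
    return list(reversed(path))
```

Because normalization, POS, and LM scores are summed inside one search, a candidate that looks weaker in isolation can still win when it fits the tag sequence and the surrounding words, which is exactly the behavior the pipeline system cannot reproduce.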



