

Expert Systems with Applications 38 (2011) 652–658


Multilingual novelty detection

Flora S. Tsai*, Yi Zhang, Agus T. Kwee, Wenyin Tang
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore


Keywords: Novelty detection; Multilingual; Stemming; POS tagging; Malay; Chinese


Abstract

Novelty detection aims at reducing redundant information in a chronologically ordered list of documents or sentences. Most previous studies of novelty detection have been conducted on the English language, and few papers have addressed the problem of multilingual novelty detection. Likewise, research in multilingual information retrieval has rarely been applied to novelty detection. This paper attempts to bridge the two disciplines by first describing the preprocessing steps for English, Malay and Chinese, then applying document- and sentence-level novelty detection for the three languages on APWSJ and TREC 2004 Novelty Track data. Experiments on sentence-level novelty detection show similar results for all three languages, which indicates that our algorithm is suitable for multilingual novelty detection at the sentence level. However, results for document-level novelty detection show a disparity across the different languages, with English and Malay outperforming Chinese. After applying sentence-level novelty detection to detect novel documents, we observe substantial improvements in all three languages. This demonstrates that segmenting documents into sentences improves document-level novelty detection in multiple languages, and has practical benefits for a real-time multilingual novelty detection system.


1. Introduction

Novelty detection (ND) helps users quickly obtain useful information without wading through large amounts of redundant information, which is tedious and time-consuming. The increasing amount of information in news articles, websites, blogs (Chen, Tsai, & Chan, 2008), and social networks (Tsai, Han, Xu, & Chua, 2009) makes novelty detection increasingly important and indispensable.

Prior to the detection of novel information, text documents or sentences are first preprocessed by removing stop words, performing word stemming, and so on. Next, each incoming document or sentence is categorized into the relevant topic bin. Finally, within each topic bin, novelty detection searches through the time sequence of documents or sentences and retrieves only those with enough "novel" information. This paper focuses on applying document- and sentence-level novelty detection to English, Malay and Chinese; the task is to identify novel documents or sentences from a given group of relevant documents or sentences.

Novelty detection has been performed at three different levels: event level, document level and sentence level (Li & Croft, 2005). A variety of methods have been described for each level (Allan, Lavrenko, & Jin, 2000; Allan, Papka, & Lavrenko, 1998; Allan, Wade, & Bolivar, 2003; Brants, Chen, & Farahat, 2003; Islam & Inkpen, 2008; Li & Croft, 2008; Zhang & Tsai, 2009b; Zhang, Callan, & Minka, 2002; Zhang, Sun, Wang, & Bai, 2005).


Previous work on document-level ND focused on finding documents that have not been covered by prior documents. Research on sentence-level novelty detection is related to the TREC Novelty Track, which involves finding novel sentences given a chronologically ordered list of relevant sentences (Harman, 2002; Ng, Tsai, Goh, & Chen, 2007; Soboroff, 2004; Soboroff & Harman, 2003; Zhang & Tsai, 2009b).

This paper is organized as follows. Section 2 gives a brief overview of related work on document- and sentence-level ND for English, Malay, and Chinese. Section 3 introduces the preprocessing steps for the three languages. A general novelty detection algorithm is described in Section 4. In Section 5, we describe an approach called Document-To-Sentence (D2S) for document-level novelty detection and investigate whether document-level ND performance improves if we convert the problem to the sentence level for these languages. Experiments and results are presented and discussed in Section 6. Finally, we conclude the paper and suggest future work in Section 7.

2. Related work

The pioneering work on English document-level ND was contributed by Zhang et al. (2002). In their work, whether a document is novel was predicted based on the distance between the new document and the documents in its history: a document that is very similar to any of its history documents is regarded as redundant. To serve users better, it could be more helpful to further highlight novel information at the sentence level.


Fig. 1. Preprocessing steps on English.


Therefore, later studies focused on sentence-level ND, such as those reported in the TREC 2002–2004 Novelty Tracks (Harman, 2002; Soboroff, 2004; Soboroff & Harman, 2003), those that compared various novelty metrics (Allan et al., 2003; Zhao, Zheng, & Ma, 2006), and those that integrated different natural language techniques (Li & Croft, 2008; Ng et al., 2007; Zhang & Tsai, 2009b).

For the Malay language, sentence-level novelty detection was previously studied by Kwee, Tsai, and Tang (2009). In that study, language-specific preprocessing (Malay stop word removal and Malay word stemming) was integrated into the ND system. Experimental results on TREC 2003 and TREC 2004 Novelty Track data showed that the sentence-level ND algorithm designed for English can also be applied to Malay. ND studies on the Chinese language have been conducted in the context of topic detection and tracking, which identifies and collects relevant stories on certain topics from the information stream. One study proposed an improved relevance model to detect novel information in topic-tracking feedback and modified the topic model based on this novelty information (Zheng, Zhang, Zou, Hong, & Liu, 2007); experimental results on the Chinese datasets TDT4 and TDT2003 demonstrated the effectiveness of topic tracking. Another study put forward a method of applying a semantic domain language model to link detection, based on the structural relations among contents and the semantic distribution in a story (Hong, Zhang, Fan, Liu, & Li, 2008).

However, to the best of our knowledge, research in multilingual information retrieval has rarely been applied to novelty detection. Much of the research in multilingual information retrieval has focused on how to merge a single result list that includes the most relevant documents from different language collections (Lin & Chen, 2003; Powell, French, Callan, Connell, & Viles, 2000). Likewise, few papers have reported on multilingual novelty detection. This is the main focus and contribution of this paper. Our studies provide the foundation for an integrative novelty detection system that can detect novelty in multilingual documents.

3. Preprocessing for different languages

3.1. English

Since the focus of this paper is on novelty detection, we begin with a list of relevant documents or sentences that have already undergone the categorization process.

The first step in English preprocessing is to remove all stop words, such as conjunctions, prepositions, and articles, from the documents or sentences. After stop word removal, the remaining words are stemmed: inflected (and sometimes derived) words are reduced to their root forms. The preprocessing steps for English novelty detection are shown in Fig. 1. We used the Porter stemming algorithm (Porter, 1997) for English word stemming.
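As a concrete illustration of this step, the sketch below removes stop words and stems the remaining tokens in Python; NLTK's stop word list and PorterStemmer are assumed stand-ins for the resources actually used in the paper.

```python
# Minimal sketch of the English preprocessing step (stop word removal +
# Porter stemming). NLTK is used here as an assumed stand-in for the
# paper's own stop word list and stemmer implementation.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download('stopwords'); nltk.download('punkt')  # one-time setup
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_english(text: str) -> list[str]:
    """Tokenize, drop stop words and punctuation, and stem the rest."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess_english("The prices of imported goods were increasing rapidly."))
```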

3.2. Malay

Because Malay shares the same alphabetic characters as English, the preprocessing steps for the Malay language are similar to those for English (Kwee et al., 2009). The differences lie in the stop word list and the stemming algorithm: according to Kwee et al. (2009), both stop words and stemming rules are language-dependent, so every language has its own set of stop words and its own word stemming rules.

For the Malay stop word list, some words were translated directly from English stop words, such as 'each'/'setiap', 'thus'/'maka', and 'before'/'sebelum'. Others were taken from Malay documents, for example 'ayuh', 'amboi', and 'alamak'.

For Malay stemming, this paper used the stemming algorithm from Kwee et al. (2009). In the Malay language, there are two types of affixes: prefixes, which appear at the beginning of root words (e.g. 'memakan' means 'to eat'), and suffixes, which appear at the end of root words (e.g. 'minuman' means 'drinks'). Prefixes and suffixes can also appear together in a word (e.g. 'membelanjakan' means 'to spend'). Therefore, stemming Malay words is more complicated than stemming English words because both prefixes and suffixes need to be removed. Our dictionary consists of 5300 Malay root words taken from Bhanot's Malay–English dictionary (Bhanot, 2008).
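The following is a simplified, dictionary-checked affix-stripping sketch that conveys the idea; the affix lists and the tiny root dictionary are illustrative toys, not the actual rules of Kwee et al. (2009) or the full 5300-word Bhanot dictionary.

```python
# Simplified, illustrative Malay affix stripping: strip candidate prefixes and
# suffixes and accept a result only if it appears in a root-word dictionary.
# PREFIXES, SUFFIXES and ROOTS below are toy examples for demonstration only.
PREFIXES = ["mem", "men", "meng", "me", "ber", "di", "ke", "pe"]
SUFFIXES = ["kan", "an", "i", "nya"]
ROOTS = {"makan", "minum", "belanja"}  # toy dictionary

def stem_malay(word: str) -> str:
    candidates = {word}
    for p in PREFIXES:
        if word.startswith(p):
            candidates.add(word[len(p):])
    for c in list(candidates):
        for s in SUFFIXES:
            if c.endswith(s):
                candidates.add(c[: -len(s)])
    # Prefer the shortest candidate that is a known root; otherwise keep the word.
    for c in sorted(candidates, key=len):
        if c in ROOTS:
            return c
    return word

print(stem_malay("membelanjakan"))  # -> 'belanja'
print(stem_malay("minuman"))        # -> 'minum'
```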

3.3. Chinese

Preprocessing for the Chinese language is more complicated than preprocessing for English and Malay. As the linguistic characteristics of Malay are similar to those of English, in the rest of this section we mainly discuss the preprocessing differences between Chinese and English.

In the Chinese language, the word is the smallest independent meaningful element, but the character is the basic written unit. Since there are no obvious boundaries between Chinese words, Chinese lexical analysis (i.e. word segmentation) is required for Chinese novelty detection. Fig. 2 shows the preprocessing steps for Chinese novelty detection.

Chinese word segmentation is a very challenging problem because it is often difficult to define what constitutes a word (Gao, Li, Wu, & Huang, 2005). Since there are no white spaces between Chinese words or expressions, we need to identify the words from entire strings of Chinese characters. Many ambiguities exist in the Chinese language; for example, ' ' ('certainly') might be segmented as ' ' ('certainly')/' ' ('cut') or as ' ' ('of')/' ' ('certainly'). This ambiguity is a big challenge for Chinese word segmentation. Moreover, there are no obvious inflected or derived words in Chinese; thus, word stemming cannot be applied.

Part-of-Speech (POS) tagging, the process of marking up each word in a text as corresponding to a particular part of speech, is used to reduce the noise created by Chinese word segmentation and to obtain a better word list for each document or sentence. Because the concept of a text depends on a few meaningful words, we can obtain the main information by extracting these meaningful words. The impact of Chinese word segmentation errors on novelty detection can thus be reduced, since only meaningful words are considered.


Fig. 2. Preprocessing steps on Chinese.

Fig. 3. Detection process when applying sentence-level ND on document.


We used ICTCLAS (2008) to perform word segmentation and POS tagging in our experiments because it is an open source project that achieves good precision in Chinese word segmentation and POS tagging (ICTCLAS, 2008). We first apply word segmentation (which includes atom segmentation, N-Shortest Path rough segmentation and unknown word recognition) to the relevant Chinese documents or sentences. Atom segmentation separates the text into minimal units that cannot be split further; an atom can be a Chinese character, a punctuation mark, or a symbol string. Rough segmentation then tries to discover the correct segmentation with the fewest candidates; the N-Shortest Path (NSP) method (Zhang & Liu, 2002) is used for this step. Next, unknown words such as person names and locations are detected to refine the segmentation result. Finally, we POS tag the words and keep only nouns, verbs, adjectives and adverbs in the word list, because this achieves better novelty detection performance than merely removing non-meaningful words such as stop words (Zhang & Tsai, 2009a).
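A minimal sketch of this Chinese preprocessing step is shown below; the jieba library is used as an assumed, readily available stand-in for ICTCLAS, and the POS tag prefixes follow the common n/v/a/d convention for nouns, verbs, adjectives and adverbs.

```python
# Minimal sketch of the Chinese preprocessing step: segment the text into
# words, POS tag them, and keep only nouns, verbs, adjectives and adverbs.
# jieba is used here as an assumed stand-in for ICTCLAS.
import jieba.posseg as pseg

KEEP_PREFIXES = ("n", "v", "a", "d")  # noun, verb, adjective, adverb tags

def preprocess_chinese(text: str) -> list[str]:
    """Segment and POS tag `text`, returning only content words."""
    return [word for word, flag in pseg.cut(text)
            if flag.startswith(KEEP_PREFIXES)]

print(preprocess_chinese("新闻检索系统需要快速发现新颖的信息"))
```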

4. Novelty detection

The preprocessing steps for English, Malay, and Chinese segment a piece of text into bags of English, Malay and Chinese words. The corresponding term-document matrix (TDM) or term-sentence matrix (TSM) can be constructed by counting the term frequency (TF) of each word. Each document or sentence can therefore be represented by a vector in which the TF value of each retained word is one feature. The ND system evaluates each incoming document or sentence by comparing it with its history documents or sentences in this vector space. Therefore, given a Malay or Chinese TDM or TSM, the ND system designed for English can also be applied to Malay and Chinese.
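As an illustration, the sketch below builds such term-frequency vectors from preprocessed word lists; collections.Counter is used here as an assumed sparse representation rather than the paper's actual data structure.

```python
# Minimal sketch of building term-frequency (TF) vectors from preprocessed
# word lists; each unique word is one dimension of the vector space.
from collections import Counter

def tf_vector(words: list[str]) -> Counter:
    """Map each term to its frequency in one document or sentence."""
    return Counter(words)

doc_a = tf_vector(["price", "rise", "import", "price"])
doc_b = tf_vector(["price", "fall", "export"])
print(doc_a, doc_b)
```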

Although there are several different geometric distance measures, such as Manhattan distance and cosine similarity, we used cosine similarity because it performed well on both document- and sentence-level ND (Allan et al., 2003; Zhang et al., 2002).

Cosine similarity, a symmetric measure related to the angle between two vectors, represents a document or sentence d as a vector d = [w_1(d), w_2(d), \ldots, w_n(d)]^T, as shown in Eq. (1):

\cos(d_t, d_i) = \frac{\sum_{k=1}^{n} w_k(d_t)\, w_k(d_i)}{\|d_t\| \cdot \|d_i\|}    (1)

where w_k(d) is the weight of the kth element in the document or sentence weight vector d.

In order to measure the degree of novelty directly, we convert the cosine similarity score into a novelty score simply as (1 − cosine similarity score) (see Eq. (2)). The cosine similarity metric compares the current document or sentence with each of its history documents or sentences separately, and the minimum resulting value is chosen as the novelty score of the current document or sentence. If the novelty score of the document or sentence is above the novelty score threshold nt, the document or sentence is considered novel.

\text{Novelty score}(d_t) = \min_{1 \le i \le t-1} \left[ 1 - \cos(d_t, d_i) \right]    (2)

where d_i represents one of the novel documents or sentences most recently delivered by the system before the incoming document or sentence d_t. Each unique word is treated as one dimension and term frequency (TF) is the weight for each dimension.
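The sketch below illustrates Eqs. (1) and (2) on the sparse TF vectors from the previous sketch; the example vectors and the threshold value are illustrative only.

```python
# Minimal sketch of Eqs. (1) and (2): cosine similarity between TF vectors and
# the novelty score of an incoming item against its delivered history.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse TF vectors (Eq. (1))."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def novelty_score(incoming: Counter, history: list[Counter]) -> float:
    """Eq. (2): minimum (1 - cosine) against all history items."""
    if not history:
        return 1.0  # nothing seen before, so maximally novel
    return min(1.0 - cosine(incoming, h) for h in history)

# Usage: deliver the item only if its novelty score exceeds the threshold nt.
nt = 0.05
history = [Counter({"price": 2, "rise": 1})]
incoming = Counter({"price": 1, "fall": 1, "export": 1})
print(novelty_score(incoming, history) > nt)
```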

5. Sentence-level ND on documents

We also evaluated a document-level ND method called Document-To-Sentence (D2S) (Tsai & Zhang, 2010) and investigated whether document-level ND performance improves if we convert the document-level problem to the sentence level.

The novelty detection process of D2S is shown in Fig. 3. D2S first segments a document into sentences. It then applies sentence-level ND to these sentences and determines the novelty of each sentence. Based on the predictions for the individual sentences, a NovelRate is calculated for the document. Finally, given a threshold, the document is predicted as novel if its NovelRate is larger than the threshold. NovelRate is defined as:

\text{NovelRate} = \frac{num(novel\_sentences)}{num(all\_sentences)}    (3)

where num(all_sentences) is the number of sentences in the document and num(novel_sentences) is the number of sentences in that document predicted as novel by sentence-level ND.
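A minimal sketch of this procedure is given below, reusing the tf_vector and novelty_score helpers from the earlier sketches; the regex sentence splitter and both thresholds are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of D2S: segment a document into sentences, run sentence-level
# ND on each, compute NovelRate (Eq. (3)), and flag the document as novel when
# NovelRate exceeds a threshold.
import re

def d2s_is_novel(document: str, sentence_history: list,
                 nt: float = 0.05, novel_rate_threshold: float = 0.2) -> bool:
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    novel = 0
    for s in sentences:
        vec = tf_vector(s.lower().split())          # sentence TF vector
        if novelty_score(vec, sentence_history) > nt:
            novel += 1
            sentence_history.append(vec)            # deliver the novel sentence
    novel_rate = novel / len(sentences) if sentences else 0.0
    return novel_rate > novel_rate_threshold
```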



Fig. 4. R–PR curves of document-level ND on English, Malay and Chinese.

Table 1
Statistics of experimental data.

Dataset      Novel             Non-novel
APWSJ        10,839 (91.10%)   1,057 (8.90%)
TREC 2004    3,454 (41.40%)    4,889 (58.60%)

Table 2
Categories for evaluation.

                 Non-novel   Novel
Delivered        R+          N+
Not delivered    R−          N−


Fig. 5. PR curves of sentence-level ND on English, Malay and Chinese.

1 http://utmk.cs.usm.my:8080/ebmt-controller
2 http://code.google.com/p/google-api-translate-java


6. Experiments and results

6.1. Datasets

Two public datasets, APWSJ (Zhang et al., 2002) and the TREC 2004 Novelty Track data (Soboroff, 2004), were selected as our experimental datasets for document-level and sentence-level ND, respectively. APWSJ consists of news articles from the Associated Press (AP) and the Wall Street Journal (WSJ). There are 50 topics, Q101 to Q150, in APWSJ; five topics (Q131, Q142, Q145, Q147, Q150) are excluded from the experiments because they lack non-novel documents (Zhao et al., 2006). The assessors provide two degrees of judgment for non-novel documents: absolute redundant and somewhat redundant. In this experiment, we adopt the strict definition used in Zhang et al. (2002), where only absolute redundant documents are regarded as non-novel. The TREC 2004 Novelty Track data is developed from the AQUAINT collection; both relevant and novel sentences are selected by TREC's assessors. The statistics of the two datasets are summarized in Table 1.

6.2. Evaluation measures

In our evaluations, redundancy precision (RP), redundancy recall (RR) and redundancy F-score (RF) are used to evaluate the performance of document-level ND (Zhang et al., 2002). Precision (P), recall (R) and F-score (F) are used to evaluate the performance of sentence-level ND (Allan et al., 2003). Moreover, we also draw the redundancy precision–recall (R–PR) and precision–recall (PR) curves; the larger the area under the PR curve/R–PR curve, the better the algorithm. Redundancy precision, redundancy recall, precision and recall for a given topic are defined as:

RP = \frac{R^-}{R^- + N^-}, \qquad RR = \frac{R^-}{R^- + R^+}    (4)

P = \frac{N^+}{N^+ + R^+}, \qquad R = \frac{N^+}{N^+ + N^-}    (5)

where R+, R−, N+, N− correspond to the number of documents/sentences that fall into each category (see Table 2).

Based on the RP/P and RR/R of all topics, we obtain the average RP/P and average RR/R by taking the arithmetic mean of these scores over all topics. The average redundancy F-score (RF)/F-score (F) is then obtained as the harmonic mean of the average RP/P and the average RR/R.
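For concreteness, the sketch below computes these measures from the per-topic counts of Table 2; the counts in the usage lines are made up for illustration.

```python
# Minimal sketch of the evaluation measures in Eqs. (4) and (5), computed from
# the per-topic counts of Table 2 (R+, R-, N+, N-), plus the harmonic-mean
# F-scores used to summarize them.
def redundancy_pr(r_plus: int, r_minus: int, n_minus: int) -> tuple[float, float]:
    rp = r_minus / (r_minus + n_minus) if (r_minus + n_minus) else 0.0
    rr = r_minus / (r_minus + r_plus) if (r_minus + r_plus) else 0.0
    return rp, rr

def novelty_pr(n_plus: int, r_plus: int, n_minus: int) -> tuple[float, float]:
    p = n_plus / (n_plus + r_plus) if (n_plus + r_plus) else 0.0
    r = n_plus / (n_plus + n_minus) if (n_plus + n_minus) else 0.0
    return p, r

def f_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

rp, rr = redundancy_pr(r_plus=20, r_minus=60, n_minus=40)  # illustrative counts
print(rp, rr, f_score(rp, rr))
```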

6.3. Experimental results

We applied document- and sentence-level ND to English, Malay, and Chinese separately. In this experimental study, the focus was novelty detection rather than relevant document/sentence categorization. Therefore, our experiments started with all given relevant documents/sentences, from which the novel documents/sentences were to be identified.

Since the datasets we used for document-level and sentence-level ND were both written in English, we first translated them into Malay and Chinese. During this process, we investigated both machine translation and manually corrected translation.

For Malay, we conducted tests on a subset of topics translated using two methods: first, we used only machine translation with Example-Based Machine Translation (EBMT)1; then, we manually corrected the machine translation. We found that the average results (F-score, precision and recall) differed only slightly (<1%) between the machine and the manually corrected translations. Therefore, machine translation was used for the remaining documents.

For Chinese, we compared the automatically translated 107 texts in the TREC 2004 Novelty Track, translated with the Google Translate API2, against the manually corrected translations and found a negligible difference (<2%) in precision and F-score. Thus, machine translation was also used for the remaining Chinese documents. From these experiments, we conclude that the noise in machine translation for both Malay and Chinese had little impact on our results.

Then, on the English and Malay datasets, we applied the preprocessing steps discussed in Sections 3.1 and 3.2, including stop word removal and word stemming.



Fig. 6. R–PR curves for D-ND and D2S.


For the Chinese datasets, we segmented the documents/sentences into words and then performed POS tagging to obtain the candidate words for the vector space. In the following sections, we discuss the performance of document-level and sentence-level ND.

6.3.1. Document-level ND performance

Based on the vectors of the English, Malay and Chinese documents, we calculated the similarities between documents and predicted the novelty of each document in each language dataset. In this experiment, the history document set contained 10 previous documents: an incoming document is compared with the 10 novel documents most recently delivered by the system. If the novelty score is above the novelty score threshold nt, the document is considered novel.
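A minimal sketch of this document-level loop is shown below, reusing the tf_vector and novelty_score helpers from the sketches in Section 4; the 10-document window follows the text, while the input format and threshold are illustrative assumptions.

```python
# Minimal sketch of the document-level ND loop: keep a sliding history of the
# 10 most recently delivered novel documents and deliver an incoming document
# only when its novelty score exceeds the threshold nt.
from collections import deque

def run_document_nd(doc_word_lists, nt=0.05, history_size=10):
    history = deque(maxlen=history_size)    # 10 most recent novel documents
    decisions = []
    for words in doc_word_lists:
        vec = tf_vector(words)
        is_novel = novelty_score(vec, list(history)) > nt
        decisions.append(is_novel)
        if is_novel:
            history.append(vec)             # only novel documents are delivered
    return decisions
```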

We evaluated the performance of document-level ND for the three languages by setting a series of novelty score thresholds (nt) and drawing the R–PR curves. The thresholds ranged from 0.05 to 0.65 with a step of 0.10. The R–PR curves can be seen in Fig. 4; the grey dashed lines show contours at intervals of 0.1 points of RF-score.

From Fig. 4, we make the following observations and interpretations of the experimental results.

(1) The document-level ND algorithm designed for English can also be applied to Malay and Chinese. For each language, the text is preprocessed into a bag of words and the novelty score of each document is computed. Finally, the novelty of each document is predicted given a novelty score threshold.



Fig. 7. R–PR curves for English D-ND and D2S on English, Malay and Chinese.


(2) Apart from the different stop word lists and word stemming algorithms (which are language specific), the ND performance on Malay is fairly similar to that on English. One main reason is that the English and Malay languages share the same alphabetic characters.

(3) The performance of document-level ND on Chinese is slightly lower than that on English and Malay. This is probably due to the different linguistic characteristics of each language, so the influence of the preprocessing steps differs across languages. Furthermore, the preprocessing quality for Chinese documents is not as good as for English and Malay: even ignoring translation noise, the precision of Chinese word segmentation is lower, and errors in word segmentation in turn affect the POS tagging results. These issues make tokenizing and POS tagging considerably more difficult for Chinese text.

6.3.2. Sentence-level ND performance

We also performed sentence-level ND on the datasets of the three languages. Whether an incoming sentence is novel is predicted by comparing it with the 1000 novel sentences most recently delivered by the system. Fig. 5 shows the PR curves of sentence-level ND for the three languages given a series of novelty score thresholds, varying from 0.05 to 0.95 with an equal step of 0.10. The grey dashed lines show contours at intervals of 0.1 points of F-score.

As seen from Fig. 5, the following observations and interpretations can be made.

(1) As with document-level ND, the sentence-level ND designed for English can also be applied to Malay and Chinese. Moreover, the ND performance on Malay is almost the same as that on English.

(2) Interestingly, the performance of sentence-level ND on Chinese is almost the same as that on English and Malay. The reason is that ND performance at the sentence level is not as sensitive to the preprocessing steps as it is at the document level. When the similarity computation is based on sentences, word segmentation and POS tagging errors do not have a large influence on the results. Moreover, one document usually contains several sentences, so the preprocessing errors of individual sentences accumulate at the document level, causing ND performance to decrease. Furthermore, with the same machine translator, sentences sharing many common words are more likely to be translated in the same way than documents sharing only some common words. Additionally, compared with sentence-level novelty detection, document-level novelty detection is more difficult because nearly every document contains some new information.

6.3.3. Document-level ND using D2S

As demonstrated above, the ND algorithm tuned for the English language can also be applied to Malay and Chinese. However, the linguistic differences among English, Malay and Chinese result in similar document-level ND performance for English and Malay but lower performance for Chinese, whereas sentence-level ND obtains similar results for all three languages. In order to reduce the negative influence of lower preprocessing precision on Chinese document-level ND, we applied the Document-To-Sentence (D2S) method to the APWSJ datasets of the three languages and compared its performance with the original document-level ND (D-ND). For D2S, we set a fixed novelty score threshold nt = 0.05 for the sentence-level ND. Then, for different NovelRate thresholds, we plot the R–PR curves.

From Fig. 6, we notice that D2S improves the performance over D-ND at several NovelRate thresholds for all three languages.

The ND performance on Chinese improved significantly when using D2S instead of D-ND. Fig. 7 shows that after applying D2S, document-level ND for all three languages achieves better performance than English D-ND. This indicates that the influence of imprecise preprocessing is indeed restrained at the sentence level, and that ND performance can be improved if we first segment the document into sentences, apply sentence-level ND, and use NovelRate to judge the novelty of the document. Moreover, predicting the novelty of a document appears to be more precise when the cosine similarity is calculated at the sentence level. The experimental results provide good evidence that document-level ND in different languages achieves better performance if we first convert the problem to the sentence level and then use sentence-level techniques to solve it.

7. Conclusions

This paper studied issues relating to multilingual novelty detection, which, to our knowledge, has not been sufficiently addressed in previous studies. We first described the preprocessing steps for English, Malay and Chinese, then performed document- and sentence-level novelty detection for the different languages.

We conducted experiments on APWSJ and TREC 2004 Novelty Track data to evaluate novelty detection performance at the document and sentence levels for each language. This paper showed that our novelty detection algorithm, originally developed and tested on English documents, can also be applied to Malay and Chinese documents. Similar results were observed for sentence-level novelty detection in all three languages, which indicates that our algorithm is suitable for multilingual novelty detection at the sentence level. However, results for document-level novelty detection showed a disparity across the different languages, with English and Malay outperforming Chinese. This is probably because the sentence-level preprocessing was more accurate than the document-level preprocessing for Chinese.

Therefore, we applied sentence-level novelty detection to detect novel documents and obtained overall improvements in all three languages. This indicates that the negative influence of preprocessing errors can be effectively restrained at the sentence level; thus, it is generally more effective to use sentence-level novelty detection to predict the novelty of documents.

Our future work will integrate the algorithms into a real-time multilingual novelty detection system that can detect novelty in different languages. This is especially beneficial for countries in Southeast Asia, where documents may contain a mixture of all three languages.



References

Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In SIGIR 1998, Melbourne, Australia (pp. 37–45).

Allan, J., Lavrenko, V., & Jin, H. (2000). First story detection in TDT is hard. In CIKM 2000, McLean, VA, USA (pp. 374–381).

Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In SIGIR 2003, Toronto, Canada (pp. 314–321). ACM.

Bhanot, D. K. (2008). The first online Malay–English dictionary. <http://dictionary.bhanot.net/>.

Brants, T., Chen, F., & Farahat, A. (2003). A system for new event detection. In SIGIR 2003, Toronto, Canada (pp. 330–337).

Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.

Gao, J., Li, M., Wu, A., & Huang, C.-N. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 31(4), 531–574.

Harman, D. (2002). Overview of the TREC 2002 Novelty Track. In TREC 2002 – The 11th Text Retrieval Conference (pp. 46–55).

Hong, Y., Zhang, Y., Fan, J., Liu, T., & Li, S. (2008). Chinese topic link detection based on semantic domain language model. Journal of Software, 19(9), 2265–2275.

ICTCLAS (2008). <http://ictclas.org/index.html>.

Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2).

Kwee, A. T., Tsai, F. S., & Tang, W. (2009). Sentence-level novelty detection in English and Malay. In Lecture Notes in Computer Science (LNCS) (Vol. 5476, pp. 40–51).

Li, X., & Croft, W. B. (2005). Novelty detection based on sentence level patterns. In CIKM 2005 (pp. 744–751).

Li, X., & Croft, W. B. (2008). An information-pattern-based approach to novelty detection. Information Processing and Management: An International Journal, 44(3), 1159–1188.

Lin, W.-C., & Chen, H.-H. (2003). Merging mechanisms in multilingual information retrieval. In Lecture Notes in Computer Science (pp. 175–186).

Ng, K. W., Tsai, F. S., Goh, K. C., & Chen, L. (2007). Novelty detection for text documents using named entity recognition. In 2007 6th International Conference on Information, Communications and Signal Processing (ICICS) (pp. 1–5).

Porter, M. (1997). An algorithm for suffix stripping. Readings in Information Retrieval, 313–316.

Powell, A. L., French, J. C., Callan, J., Connell, M., & Viles, C. L. (2000). The impact of database selection on distributed searching. In SIGIR 2000, Athens, Greece (pp. 232–239).

Soboroff, I. (2004). Overview of the TREC 2004 Novelty Track. In TREC 2004 – The 13th Text Retrieval Conference.

Soboroff, I., & Harman, D. (2003). Overview of the TREC 2003 Novelty Track. In TREC 2003 – The 12th Text Retrieval Conference.

Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.

Tsai, F. S., & Zhang, Y. (2010). D2S: Document-to-sentence framework for novelty detection. Knowledge and Information Systems.

Zhang, Y., & Tsai, F. S. (2009a). Chinese novelty mining. In EMNLP '09: Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Zhang, Y., & Tsai, F. S. (2009b). Combining named entities and tags for novel sentence detection. In Proceedings of the WSDM '09 ACM Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2009) (pp. 30–34).

Zhang, Y., Callan, J., & Minka, T. (2002). Novelty and redundancy detection in adaptive filtering. In ACM SIGIR 2002, Tampere, Finland (pp. 81–88).

Zhang, H., & Liu, Q. (2002). Model of Chinese words rough segmentation based on N-shortest paths method. Journal of Chinese Information Processing, 15, 1–7.

Zhang, H.-P., Sun, J., Wang, B., & Bai, S. (2005). Computation on sentence semantic distance for novelty detection. Journal of Computer Science and Technology, 20(3), 331–337.

Zhao, L., Zheng, M., & Ma, S. (2006). The nature of novelty detection. Information Retrieval, 9, 527–541.

Zheng, W., Zhang, Y., Zou, B., Hong, Y., & Liu, T. (2007). Research of Chinese topic tracking based on relevance model. In Proceedings of the 9th Chinese National Conference on Computational Linguistics.