

Expert Systems with Applications 38 (2011) 14094–14101


Experiments in term weighting for novelty mining

Flora S. Tsai, Agus T. Kwee
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore

Keywords: Novelty mining; Novelty detection; Term weighting; Binary; Term frequency; Inverse document frequency; Threshold; Novelty dataset


Abstract

Obtaining new information in a short time is becoming crucial in today's economy. A large amount of information, both offline and online, is easily acquired, exacerbating the problem of information overload. Novelty mining detects documents/sentences that contain novel or new information and presents those results directly to users (Tang, Tsai, & Chen, 2010). Many methods and algorithms for novelty mining have previously been studied, but none have compared and discussed the impact of term weighting on the evaluation measures. This paper performs experiments to recommend the best term weighting function for both document-level and sentence-level novelty mining.


1. Introduction

Rapid advances in information technology have shifted the dissemination of information to online content such as blogs (Tsai, Chen, & Chan, 2007), social networks (Tsai, Han, Xu, & Chua, 2009), mobile information (Tsai, Etoh, Xie, Lee, & Yang, 2010), Web services (Yee, Tiong, Tsai, & Kanagasabai, 2009), security threats (Tsai, 2009), etc. With this huge amount of online information, there is an increasing need to identify novel and relevant information among the mass of incoming documents (Ng, Tsai, Goh, & Chen, 2007).

Novelty mining (NM), or novelty detection, is the process of filtering out repeated or redundant information and presenting documents/sentences that contain novel information, based on a given threshold (Tang & Tsai, 2009). Novelty mining therefore helps users obtain useful information quickly without going through large amounts of redundant information, which is usually tedious and time-consuming (Zhang & Tsai, 2009b).

Previous work on document-level and sentence-level novelty mining aims to find documents or sentences whose content has not been covered by prior documents or sentences, and tends to apply promising content-oriented techniques (Kwee, Tsai, & Tang, 2009; Ng et al., 2007; Ong, Kwee, & Tsai, 2009; Tang, Kwee, & Tsai, 2009; Tsai & Chan, 2010; Tsai, Kwee, Tang, & Chan, 2010; Zhang & Tsai, 2009a). Sentence-level novelty mining is related to the TREC novelty track, which finds novel sentences given a chronologically ordered list of relevant sentences (Harman, 2002; Soboroff, 2004; Soboroff & Harman, 2003).

The overall process of novelty mining is as follows. Prior to the detection of novel information, text documents or sentences are first preprocessed by removing stop words, performing word stemming, etc. (Zhang & Tsai, 2009a). Next, each incoming document or sentence is categorized into the relevant topic bin (Zhang, Tsai, & Kwee, 2011). Finally, within each topic bin, novelty mining searches through the time sequence of documents or sentences and retrieves only those with enough "novel" information (Tsai & Chan, 2011). To predict novelty, each incoming document or sentence is compared with the previous documents or sentences, and the similarity of the current document or sentence is calculated (Kwee & Tsai, 2009). We therefore need a similarity metric to compute the novelty of the current document or sentence. Similarity metrics that can be used include word overlap, cosine similarity, new word count, etc. (Tsai, Tang, & Chan, 2010). Other works utilize ontological knowledge, especially taxonomies, such as WordNet (Ng et al., 2007; Zhang & Tsai, 2009a), synonym dictionaries (Franz, Ittycheriah, McCarley, & Ward, 2001), and HowNet (Eichmann & Srinivasan, 2002). In this paper, we compare several term weighting functions and evaluate their performance on document-level and sentence-level novelty mining.
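To make the three steps above concrete, the following is a minimal sketch of the pipeline (preprocess, assign to topic bin, compare against history). The function names, the toy stop-word list, and the threshold value are illustrative assumptions, not details from the paper.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are"}  # toy list

def preprocess(text):
    """Lowercase, tokenize, and remove stop words (stemming omitted)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def novelty_mine(stream, in_topic, similarity, threshold=0.7):
    """Yield documents whose similarity to every earlier document in the
    topic bin stays below the threshold (i.e., whose novelty is high)."""
    history = []  # previously seen documents in this topic bin
    for doc in stream:
        terms = preprocess(doc)
        if not in_topic(terms):      # categorization into the topic bin
            continue
        if all(similarity(terms, h) < threshold for h in history):
            yield doc                # enough "novel" information: report it
        history.append(terms)
```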

The rest of the paper is organized as follows. Related work on term weighting functions is presented in Section 2. Sections 3–6 give the details of the datasets, metrics, term weighting functions, evaluation measures, and percentages of novel documents used in this paper. In Section 7, the experiments and results are discussed. Finally, we conclude the paper with a discussion and future work in Section 8.

2. Literature review

Papers from the TREC novelty track (Harman, 2002; Soboroff, 2004; Soboroff & Harman, 2003) proposed novelty metrics that can be broadly grouped into two categories: metrics for sentences represented by language models and metrics for sentences represented by vector space models.


Table 2
Global term weighting.

  Global term weighting ($g_i$)
  None                               $1$
  Inverse document frequency (IDF)   $\log_2\left[n / \sum_j \mathrm{Binary}(f_{ij})\right]$
  GFIDF                              $\sum_j f_{ij} / \sum_j \mathrm{Binary}(f_{ij})$
  Normal                             $1 / \sqrt{\sum_j f_{ij}^2}$
  Probabilistic inverse              $\log_2\left[\left(n - \sum_j \mathrm{Binary}(f_{ij})\right) / \sum_j \mathrm{Binary}(f_{ij})\right]$

Table 3
Normalization function.

  Normalization ($n_{ij}$)
  None     $1$
  Cosine   $1 / \sqrt{\sum_i (l_{ij}\, g_i)^2}$

Table 4
Statistics of TIPSTER dataset.

  Topic: 50 (Q101–Q150)*
  Total: 316,907

               Relevant          Non-relevant
  Novel        10,839 (91.11%)   0
  Non-novel    1,057 (8.89%)     305,011

  * Five topics (Q131, Q142, Q145, Q147, and Q150) are excluded.


are represented by vector space models. In the language model, each sentence is represented by a probability distribution over words, and the information-theoretic Kullback–Leibler divergence (Allan, Wade, & Bolivar, 2003; Zhang, Callan, & Minka, 2002) is usually used to measure the dissimilarity between two probability distributions. In the vector space model, each sentence is represented by a weighted vector, and cosine similarity is a typical metric for calculating the similarity between two vectors. The novelty metrics can also be grouped by whether they are symmetric. Symmetric novelty metrics such as cosine similarity and Jaccard similarity evaluate the similarity between sentences without considering their order, while nonsymmetric novelty metrics such as new word count, set difference, and the overlapping metric can take the order information into account (Tsai, Tang, et al., 2010). Zhang et al. reported good performance of cosine similarity for novelty mining at the document level (Zhang et al., 2002). Therefore, in this paper we use the cosine similarity metric.

In text mining, algorithms and kernels typically operate on data presented in the form of a large sparse term-document matrix (TDM) (Chen, Tsai, & Chan, 2008). A TDM is a two-dimensional matrix where each row represents a term/word and each column represents a document. Note that the "document" here can be a document, a sentence, or even a title (such as a book title), depending on the specific task. According to Chen et al. (2008), the TDM can be represented as in Eq. (1):

$$\mathrm{TDM}:\quad A = [a_{ij}]_{m \times n} =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}_{m \times n} \qquad (1)$$

where each element of the TDM is expressed as the product of three factors:

$$a_{ij} = l_{ij} \cdot g_i \cdot n_{ij} \qquad (2)$$

where $l_{ij}$ is the local weight factor of term $i$ in document $j$; $g_i$ is the global weight factor of term $i$; $n_{ij}$ is the normalization factor, which is used to moderate bias toward longer documents; $m$ is the number of terms in the dictionary; and $n$ is the number of documents.

Various local weights, global weights, and normalization factors can be chosen (Chen et al., 2008), as shown in Tables 1–3.
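As an illustration of Eq. (2) and Tables 1–3, the sketch below builds a sparse TDM using TF as the local weight, IDF (Table 2) as the global weight, and cosine normalization (Table 3). The function name and data layout are our own choices, not the paper's.

```python
import math
from collections import Counter

def build_tdm(docs):
    """Sparse TDM with entries a_ij = l_ij * g_i * n_ij (Eq. (2)),
    using TF local weights, IDF global weights, cosine normalization."""
    n = len(docs)                                # number of documents
    counts = [Counter(d) for d in docs]          # f_ij for each document j
    vocab = set().union(*counts)
    # g_i = log2(n / sum_j Binary(f_ij)) -- the IDF row of Table 2
    g = {t: math.log2(n / sum(1 for c in counts if c[t] > 0)) for t in vocab}
    tdm = {}
    for j, c in enumerate(counts):
        # n_ij = 1 / sqrt(sum_i (l_ij * g_i)^2) -- the cosine row of Table 3
        norm = math.sqrt(sum((f * g[t]) ** 2 for t, f in c.items())) or 1.0
        for t, f in c.items():
            tdm[(t, j)] = f * g[t] / norm        # a_ij = l_ij * g_i * n_ij
    return tdm

# Example: three short "documents" already preprocessed into term lists.
docs = [["novelty", "mining"], ["term", "weighting"], ["novelty", "term"]]
print(build_tdm(docs))
```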

Table 5
Statistics of TREC 2003 dataset.

  Topic: 50 (N1–N50)
  Total: 39,820

               Relevant          Non-relevant
  Novel        10,226 (65.73%)   0
  Non-novel    5,331 (34.27%)    24,263

3. Datasets

3.1. TIPSTER

This dataset, also called the APWSJ dataset (Zhang et al., 2002), consists of news articles from the Associated Press (AP) and the Wall Street Journal (WSJ). It is a document-level dataset with 316,907 documents on 50 topics (Q101–Q150). Similar to a previous study (Zhang & Tsai, 2009b), five topics (Q131, Q142, Q145, Q147, and Q150) were excluded from the experiments because they lacked human redundancy assessments (all of their documents were novel). The statistics of TIPSTER are shown in Table 4.

Table 1
Local term weighting.

  Local term weighting ($l_{ij}$)
  Term frequency (TF)   $f_{ij}$
  Binary                $\mathrm{Binary}(f_{ij})$
  Logarithmic           $\log_2(1 + f_{ij})$
  Alternate log         $\mathrm{Binary}(f_{ij}) \cdot \log_2(1 + f_{ij})$

3.2. TREC 2003 novelty track

Fifty new topics were created by NIST assessors for the TREC 2003 novelty track (Soboroff & Harman, 2003). Twenty-five topics focused on events and the other twenty-five focused on opinions about controversial issues. For each topic, a statement of information need was created by the assessor and was used to query documents from the collection using the NIST PRISE search engine. The document collection used was the AQUAINT Corpus of English News Text, assembled for the TREC 2002 question answering track. This corpus comprises documents from three sources: the AP newswire from 1998 to 2000, the New York Times newswire from 1998 to 2000, and the (English portion of the) Xinhua News Agency from 1996 to 2000.

Table 6
Statistics of TREC 2004 dataset.

  Topic: 50 (N51–N100)
  Total: 52,447

               Relevant         Non-relevant
  Novel        3,454 (41.40%)   0
  Non-novel    4,889 (58.60%)   44,104


Table 7
Percentage of novel documents in TREC 2003d.

  Topic: 50 (N1–N50)
  Total: 1,250

  Document novelty threshold
              0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
  Novel       1042     1011     954      896      841      734      614      515      404
              83.36%   80.88%   76.32%   71.68%   67.28%   58.72%   49.12%   41.20%   32.32%
  Non-novel   208      239      296      354      409      516      636      735      845
              16.64%   19.12%   23.68%   28.32%   32.72%   41.28%   50.88%   58.80%   67.68%

Table 8
Percentage of novel documents in TREC 2004d.

  Topic: 50 (N51–N100)
  Total: 1,808

  Document novelty threshold
              0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9
  Novel       850      781      666      572      501      357      274      214      162
              47.01%   43.20%   36.84%   31.64%   27.71%   19.75%   15.15%   11.84%   8.96%
  Non-novel   958      1027     1142     1236     1307     1451     1534     1594     1646
              52.99%   56.80%   63.16%   68.36%   72.29%   80.25%   84.85%   88.16%   91.04%

Table 9
Topic information of business blog dataset.

  topic_id   topic_title   topic_description
  1          product       product
  2          company       company
  3          marketing     marketing
  4          finance       finance

Table 10
Statistics of business blog dataset.

  Topic: 4 (1–4)
  Total: 1,269

               Relevant         Non-relevant
  Novel        1,211 (95.43%)   0
  Non-novel    58 (4.57%)       0

Table 11
Term weighting functions to be compared.

  Term weighting function                                 Expression
  TF (term frequency)                                     $f_{ij}$
  Binary                                                  $\mathrm{Binary}(f_{ij})$
  TF.IDF (term frequency / inverse document frequency)    $f_{ij}\,\log_2(n + 1) / \log_2(df_{ij} + 0.5)$

where $f_{ij}$ is the term frequency of word $i$ in document $j$ (i.e., $d_j$); $\mathrm{Binary}(x)$ is the binary function, with $\mathrm{Binary}(x) = 1$ when $x > 0$ and $\mathrm{Binary}(x) = 0$ otherwise; $\mathrm{Length}(d)$ is the total number of terms in document $d$ (after stop word removal); and $n$ is the number of documents in the collection.

Note: in detecting the current document, the collection refers to the documents on the history list.
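The three weighting functions of Table 11 can be sketched as follows, with $n$ and $df$ computed from the history list as the note above specifies. The function signature is illustrative, and the TF.IDF expression follows Table 11 as printed.

```python
import math
from collections import Counter

def term_weight(term, doc, history, scheme):
    """Weight of `term` in `doc` (a Counter of term frequencies),
    given the history list (a list of Counters), per Table 11."""
    f = doc[term]                                   # f_ij: term frequency
    if scheme == "TF":
        return float(f)
    if scheme == "Binary":
        return 1.0 if f > 0 else 0.0
    if scheme == "TF.IDF":
        n = len(history)                            # documents on the history list
        df = sum(1 for h in history if h[term] > 0)
        # As printed in Table 11. Note df = 0 (a term unseen in the
        # history) makes log2(df + 0.5) negative; a real implementation
        # may need to guard this case.
        return f * math.log2(n + 1) / math.log2(df + 0.5)
    raise ValueError(f"unknown scheme: {scheme}")
```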

Table 12
Percentage of novel documents category.

                           Low     Medium    High
  Approximate percentage   <33%    33–66%    >66%


There are approximately 1,033,000 documents and 3 gigabytes of text in the collection. The choice of collection was motivated by a desire to increase the amount of redundancy in the relevant set compared to the previous year's track. The 25 relevant documents for each topic were ordered chronologically for system processing, which is easily accomplished for a newswire collection. Several sentences in each document were labeled as relevant to the topic, and some of these relevant sentences were labeled as new. After data analysis, there are 39,820 sentences, of which 15,557 are relevant. Table 5 shows the statistics of this dataset.

3.3. TREC 2004 novelty track

The NIST assessors created fifty new topics for the 2004 novelty track (Soboroff, 2004). As in the TREC 2003 novelty track, the topics were of two types: twenty-five topics concerned events, such as the Lady Diana car accident in 1997 and the Japan nuclear accident, and twenty-five topics focused on opinions, such as opinions on the abortion pill RU-486 and on the banning of gay boy scouts.

Table 13
Percentage of novel documents of all datasets.

  Dataset                    Percentage of novel documents   Dataset novelty
  TIPSTER                    91.11%                          High
  TREC 2003 novelty track    65.73%                          High
  TREC 2004 novelty track    41.40%                          Medium
  TREC 2003d novelty track   Variable                        Variable
  TREC 2004d novelty track   Variable                        Variable
  Business blog              95.43%                          High

Table 14
Dataset novelty for TREC 2003d and TREC 2004d novelty track.

  Dataset novelty   TREC 2003d      TREC 2004d
  Low               32.32% (0.9)*   15.15% (0.7)*
  Medium            49.12% (0.7)*   47.01% (0.1)*
  High              80.88% (0.2)*   –

  * Percentage of novel documents (document novelty threshold).


A total of 52,447 sentences was chosen for the TREC 2004 novelty track, of which 8,343 are treated as relevant sentences. The statistics of the TREC 2004 novelty track are shown in Table 6.

3.4. TREC 2003d novelty track

As the original dataset for the TREC 2003 novelty track is at the sentence level, we constructed a document-level dataset, which we will call TREC 2003d. As each topic consists of 25 documents, a total of 1,250 documents was created. Like the other datasets, this dataset needs an assessment to determine the novelty of each document. This paper proposes to use the available sentence assessments to determine the novelty of each newly created document. For each document, we compute

Fig. 1. Low percentage of novel documents on TREC 2004d. [Figure: PRF curve (precision vs. recall with F score contours) for document-level NM, percentage of novel documents = 20.47%; curves for TF, Binary, and TF.IDF.]

Fig. 2. Medium percentage of novel documents on TREC 2003d. [Figure: PRF curve for document-level NM, percentage of novel documents = 49.12%; curves for TF, Binary, and TF.IDF.]

the ratio between the number of novel sentences and the number of relevant sentences (Eq. (3)):

$$R_{d_i} = \frac{n_{\mathrm{novel}}}{n_{\mathrm{relevant}}} \qquad (3)$$

where $R_{d_i}$ denotes the ratio for document $d_i$, which is later regarded as the novelty of the document; $n_{\mathrm{novel}}$ denotes the number of novel sentences in document $d_i$; and $n_{\mathrm{relevant}}$ denotes the number of relevant sentences in document $d_i$.

For example, suppose document 1 consists of 20 sentences, fifteen of these sentences are relevant to the topic, and only 10 sentences


are marked as novel. Substituting these values into Eq. (3), we get $R_{d_i} = 10/15 = 0.67$, meaning that document 1 contains 67% novel information. After obtaining the value of $R_{d_i}$, we compare it against a threshold to determine whether the document should be marked as novel. Documents whose $R_{d_i}$ is equal to or greater than the threshold are treated as novel documents (Tang & Tsai, 2010). The statistics for this dataset are shown in Table 7.
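As a concrete check of Eq. (3) on the worked example above (20 sentences, 15 relevant, 10 novel), here is a small sketch; the function name is illustrative.

```python
def is_novel_document(n_novel, n_relevant, threshold):
    """Eq. (3): R_di = n_novel / n_relevant; novel iff R_di >= threshold."""
    if n_relevant == 0:
        return False                 # no relevant sentences to judge
    return (n_novel / n_relevant) >= threshold

# Worked example: R_di = 10 / 15 = 0.67.
print(is_novel_document(10, 15, threshold=0.5))  # True  (0.67 >= 0.5)
print(is_novel_document(10, 15, threshold=0.7))  # False (0.67 <  0.7)
```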

3.5. TREC 2004d novelty track

As with TREC 2003, the original dataset for the TREC 2004 novelty track is at the sentence level. Documents were constructed from the available sentences, and a total of 1,808 documents was created.

Fig. 3. High percentage of novel documents on TREC 2003d. [Figure: PRF curve for document-level NM, percentage of novel documents = 80.88%; curves for TF, Binary, and TF.IDF.]

Fig. 4. PRF curve on TIPSTER dataset. [Figure: document-level NM, percentage of novel documents = 91.11%; curves for TF, Binary, and TF.IDF.]

This newly created dataset is called TREC 2004d. The novelty of each document was determined using the same procedure as for TREC 2003d (Section 3.4). Complete statistics for the TREC 2004d novelty track are given in Table 8.

3.6. Business blog

This dataset consists of 4 topics, which are listed in Table 9.

We also created our own assessment by asking an external assessor to manually judge the entire document collection. The results are regarded as the ground truth for our evaluation. The summary statistics are shown in Table 10.


4. Metrics and term weighting functions

Cosine similarity, the most popular metric, was used in this paper to compute the similarity between the current document or sentence and each of its history documents or sentences, respectively; this in turn determines the novelty score $N$ of the current document or sentence, as shown in Eq. (4):

$$N_{\cos}(d_t) = \min_{1 \le i \le t-1}\left[\,1 - \cos(d_t, d_i)\,\right] \qquad (4)$$

$$\cos(d_t, d_i) = \frac{\sum_{k=1}^{n} w_k(d_t) \cdot w_k(d_i)}{\lVert d_t \rVert \cdot \lVert d_i \rVert}$$

where $N_{\cos}(d)$ denotes the novelty score of document $d$ based on cosine similarity, and $w_k(d)$ is the weight of the $k$th element in the weighted vector of document $d$. The term weighting functions are shown in Table 11.
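The novelty score of Eq. (4) can be sketched as follows. Document vectors are plain dicts from term to weight (produced by any function in Table 11); this representation is our own illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def novelty_score(current, history):
    """Eq. (4): N_cos(d_t) = min over i < t of (1 - cos(d_t, d_i))."""
    if not history:
        return 1.0    # first document in a topic bin is maximally novel
    return min(1.0 - cosine(current, d) for d in history)

# A document identical to one in the history gets score 0 (not novel).
print(novelty_score({"novel": 1.0, "mining": 2.0},
                    [{"novel": 1.0, "mining": 2.0}, {"term": 1.0}]))  # 0.0
```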

Fig. 5. PRF curve on business blog dataset. [Figure: document-level NM on business blogs, percentage of novel documents = 95.48%; curves for TF, Binary, and TF.IDF.]

Fig. 6. PRF curve on TREC 2003 dataset. [Figure: sentence-level NM on TREC 2003; curves for TF, Binary, and TF.IDF.]

5. Evaluation measures

Precision (P), recall (R), F score (F), and the precision–recall–F score (PRF) curve are used in this paper to evaluate the performance of the novelty mining process. The larger the area under the PRF curve, the better the performance (Zhang & Tsai, 2009b). Precision, recall, and F score on a given topic are defined in Eqs. (5) and (6):

$$P = \frac{M}{S}, \qquad R = \frac{M}{A} \qquad (5)$$

$$F\ \mathrm{score} = \frac{2 \cdot P \cdot R}{P + R} \qquad (6)$$




Fig. 7. PRF curve on TREC 2004 dataset. [Figure: sentence-level NM on TREC 2004; curves for TF, Binary, and TF.IDF.]


where $M$ denotes the number of novel documents retrieved by both the system and the ground truth (matched); $S$ denotes the number of novel documents retrieved by the system; and $A$ denotes the number of novel documents identified by the assessors.
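Using the $M$, $S$, and $A$ counts just defined, the evaluation measures of Eqs. (5) and (6) can be sketched as follows; the set-based interface is an illustrative assumption.

```python
def prf(system_novel, truth_novel):
    """Eqs. (5)-(6): precision, recall, F score from document-ID sets."""
    m = len(system_novel & truth_novel)   # M: matched novel documents
    s = len(system_novel)                 # S: novel documents from the system
    a = len(truth_novel)                  # A: novel documents from assessors
    p = m / s if s else 0.0
    r = m / a if a else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

print(prf({1, 2, 3, 4}, {2, 3, 5}))  # (0.5, 0.667, 0.571)
```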

6. Percentage of novel documents

This percentage can be classified into three categories: low, medium, and high percentage of novel documents. Table 12 gives the approximate range for each category.

A dataset is considered low novelty if the percentage of novel documents is below 33%, medium novelty if the percentage is between 33% and 66%, and high novelty otherwise. Most of our datasets have a fixed percentage of novel documents, except TREC 2003d/TREC 2004d, where the percentage of novel documents changes according to the threshold. Table 13 summarizes the datasets' novelty.

Table 15
Summary of the best and the worst term weighting function.

  Novelty dataset            Best     Worst
  TREC 2003d novelty track   Binary   TF.IDF
  TREC 2004d novelty track   TF       TF.IDF
  TIPSTER                    Binary   TF.IDF
  Business blog              TF       Binary
  TREC 2003                  TF.IDF   TF
  TREC 2004                  TF.IDF   TF

7. Experiments

In these experiments, we used cosine similarity as our metric and compared the term weighting functions listed in Table 11, at the document level using the TIPSTER, TREC 2003d/2004d novelty track, and business blog (BizBlogs07) novelty data (Liang, Tsai, & Kwee, 2009), and at the sentence level using the TREC 2003/2004 novelty track data. For datasets with a fixed percentage of novel documents, we performed novelty mining using cosine similarity as the metric and TF, TF.IDF, and binary as the term weighting functions, with thresholds between 0 and 1. After obtaining the results, we calculated the precision, recall, and F score of each topic and averaged them over the topics. Finally, we plotted the PRF curve, in which the grey dashed lines show contours at intervals of 0.1 points of F score. For datasets with a variable percentage of novel documents (i.e., TREC 2003d/2004d), we selected thresholds that yield low, medium, and high percentages of novel documents, as shown in Table 14.

The PRF curves for TREC 2003d and TREC 2004d are given inFigs. 1–3.

As seen in the graphs, at a low percentage of novel documents, TF outperforms the binary term weighting function. However, as the percentage of novel documents increases, the binary term weighting function works better. TF.IDF performs worst for data with a low percentage of novel documents. On data with a medium percentage of novel documents, TF.IDF outperforms TF in high-recall cases, and for data with a high percentage of novel documents, TF.IDF outperforms TF in high-precision cases.

Figs. 4 and 5 show the results of document-level novelty mining on the TIPSTER data and the business blog data.

From the figures, we can see that the binary term weighting function performs slightly better on TIPSTER, while TF performs slightly better on the business blog data. We also ran novelty mining at the sentence level on both the TREC 2003 and TREC 2004 novelty track datasets; the performance is shown in Figs. 6 and 7.

From these figures, we can conclude that for sentence-level novelty mining, the TF.IDF term weighting function outperforms both the TF and binary term weighting functions. The results for all datasets are summarized in Table 15.

The table shows that for document-level novelty mining, the best term weighting function is binary or TF and the worst is TF.IDF, whereas for sentence-level novelty mining the best term weighting function is TF.IDF and the worst is TF. This summary is provided in Table 16.


Table 16
The best and the worst term weighting function for document-level and sentence-level novelty mining.

                      Best         Worst
  Document-level NM   Binary, TF   TF.IDF
  Sentence-level NM   TF.IDF       TF


8. Conclusions and future work

This paper performed a thorough comparative study to recommend the best term weighting function (TF, TF.IDF, or binary) for both document-level and sentence-level novelty mining. Although different methods and algorithms for novelty mining have previously been studied, none have compared and discussed the impact of term weighting on novelty mining performance. Four datasets were used for document-level novelty mining: TIPSTER, business blogs (BizBlogs07), and the TREC 2003 and TREC 2004 novelty track datasets; for the TREC 2003 and TREC 2004 novelty tracks, document-level data with different percentages of novel documents were created. For sentence-level novelty mining, the TREC 2003 and TREC 2004 novelty track datasets were used. The experimental studies indicate that, overall, binary was the best term weighting function for document-level novelty mining and TF.IDF was the best term weighting function for sentence-level novelty mining. For datasets with a low percentage of novel documents, TF outperformed the binary term weighting function, and for data with a high percentage of novel documents, TF.IDF outperformed TF in high-precision cases. These results can serve as guidelines for choosing the best term weighting function for novelty mining across a broad range of data.

References

Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In SIGIR 2003: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (pp. 314–321).

Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business blog search and mining. Expert Systems with Applications, 35(3), 581–590.

Eichmann, D., & Srinivasan, P. (2002). Novel results and some answers. In Proceedings of TREC 2002 – the 11th text retrieval conference (pp. 1–7).

Franz, M., Ittycheriah, A., McCarley, J., & Ward, T. (2001). First story detection: Combining similarity and novelty based approach. In Topic detection and tracking workshop (pp. 1–11).

Harman, D. (2002). Overview of the TREC 2002 novelty track. In Proceedings of TREC 2002 – the 11th text retrieval conference (pp. 46–55).

Kwee, A. T., & Tsai, F. S. (2009). Mobile novelty mining. International Journal of Advanced Pervasive and Ubiquitous Computing, 1(4), 43–68.

Kwee, A. T., Tsai, F. S., & Tang, W. (2009). Sentence-level novelty detection in English and Malay. Lecture Notes in Computer Science (LNCS), 5476, 40–51.

Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. In ICICS 2009: Proceedings of the 7th IEEE international conference on information, communications and signal processing (pp. 1–5).

Ng, K. W., Tsai, F. S., Goh, K. C., & Chen, L. (2007). Novelty detection for text documents using named entity recognition. In ICICS 2007: 6th international conference on information, communications and signal processing (pp. 1–5).

Ong, C. L., Kwee, A. T., & Tsai, F. S. (2009). Database optimization for novelty detection. In ICICS 2009: Proceedings of the 7th IEEE international conference on information, communications and signal processing (pp. 1–5).

Soboroff, I. (2004). Overview of the TREC 2004 novelty track. In Proceedings of TREC 2004 – the 13th text retrieval conference (pp. 1–16).

Soboroff, I., & Harman, D. (2003). Overview of the TREC 2003 novelty track. In Proceedings of TREC 2003 – the 12th text retrieval conference (pp. 38–53).

Tang, W., & Tsai, F. S. (2009). Threshold setting and performance monitoring for novel text mining. In Society for Industrial and Applied Mathematics – 9th SIAM international conference on data mining, proceedings in applied mathematics (Vol. 3, pp. 1310–1319).

Tang, W., Kwee, A. T., & Tsai, F. S. (2009). Accessing contextual information for interactive novelty detection. In European conference on information retrieval (ECIR) workshop on contextual information access, seeking and retrieval evaluation (pp. 1–4).

Tang, W., & Tsai, F. S. (2010). Adaptive threshold setting for novelty mining. In Text mining: Applications and theory (pp. 129–148). Wiley.

Tang, W., Tsai, F. S., & Chen, L. (2010). Blended metrics for novel sentence mining. Expert Systems with Applications, 37(7), 5172–5177.

Tsai, F. S. (2009). Network intrusion detection using association rules. International Journal of Recent Trends in Engineering, 2(2), 202–204.

Tsai, F. S., & Chan, K. L. (2011). An intelligent system for sentence retrieval and novelty mining. International Journal of Knowledge Engineering and Data Mining, 1(3), 235–253.

Tsai, F. S., & Chan, K. L. (2010). Redundancy and novelty mining in the business blogosphere. The Learning Organization, 17(6), 490–499.

Tsai, F. S., Chen, Y., & Chan, K. L. (2007). Probabilistic techniques for corporate blog mining. Lecture Notes in Computer Science (LNCS), 4819, 35–44.

Tsai, F. S., Etoh, M., Xie, X., Lee, W.-C., & Yang, Q. (2010). Introduction to mobile information retrieval. IEEE Intelligent Systems, 25(1), 11–15.

Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile peer-to-peer social networking application. Expert Systems with Applications, 36(8), 11077–11087.

Tsai, F. S., Kwee, A. T., Tang, W., & Chan, K. L. (2010). Adaptable services for novelty mining. International Journal of Systems and Service-Oriented Engineering, 1(2), 69–85.

Tsai, F. S., Tang, W., & Chan, K. L. (2010). Evaluation of metrics for sentence-level novelty mining. Information Sciences, 180(12), 2359–2374.

Yee, K. Y., Tiong, A. W., Tsai, F. S., & Kanagasabai, R. (2009). OntoMobiLe: A generic ontology-centric service-oriented architecture for mobile learning. In Tenth international conference on mobile data management: Systems, services and middleware, workshop on mobile media retrieval (pp. 631–636). IEEE.

Zhang, Y., & Tsai, F. S. (2009a). Combining named entities and tags for novel sentence detection. In ESAIR 2009: Proceedings of the WSDM workshop on exploiting semantic annotations in information retrieval (pp. 30–34).

Zhang, Y., & Tsai, F. S. (2009b). Chinese novelty mining. In EMNLP 2009: Proceedings of the conference on empirical methods in natural language processing (pp. 1561–1570).

Zhang, Y., Callan, J., & Minka, T. (2002). Novelty and redundancy detection in adaptive filtering. In SIGIR 2002: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 81–88).

Zhang, Y., Tsai, F. S., & Kwee, A. T. (2011). Multilingual sentence categorization and novelty mining. Information Processing and Management: An International Journal, 1–19.