
Information Retrieval: Applications to English and Arabic Documents

by

Fadoua Ataa Allah

Dissertation submitted to the Faculty of Science - Rabat of the

University of Mohamed V - Agdal in fulfillment

of the requirements for the degree of

Doctor of Philosophy

2008


Abstract

Arabic information retrieval has become a focus of research and commercial development due to the vital necessity of such tools in the electronic age. The number of Arabic-speaking Internet users is expected to reach 43 million this year¹; yet few full-fledged search engines are available to Arabic-speaking users. This dissertation focuses on three naturally related areas of research: information retrieval, document clustering, and dimensionality reduction.

In information retrieval, we propose an Arabic information retrieval system based on light stemming in the pre-processing phase, and on the Okapi BM-25 weighting scheme and the latent semantic analysis model in the processing phase. This system was arrived at after performing and analyzing many experiments dealing with Arabic natural language processing and the different weighting schemes found in the literature. Moreover, it has been compared with another proposed system based on noun phrase indexation.

In clustering, we propose to use the diffusion map space based on the cosine kernel and the singular value decomposition (which we denote the cosine diffusion map space) for clustering documents. Using the k-means clustering algorithm, we illustrate experimentally the robustness of document indexation in this space compared to Salton's space. We discuss the problems of determining the reduced dimension for the singular value decomposition method and of choosing the number of clusters, and we provide some solutions to these issues. We provide some statistical results and discuss how the k-means algorithm performs better in the latent semantic analysis space than in the cosine diffusion map space in the case of two clusters, but not in the multi-cluster case. We also propose a new approach for on-line clustering, based on the cosine diffusion map and the SVD-updating method.

Concerning dimensionality reduction, we use the singular value decomposition technique for feature transformation, and we propose to supplement this reduction with a generic term extraction algorithm for feature selection in the context of information retrieval.

1 http://www.abc.net.au/science/news/stories/s1623945.htm, Retrieved on 10-05-2007.


Dedication


Acknowledgements


Table of Contents

List of Tables ................................................................................................................V

List of Figures............................................................................................................ VII

List of Abbreviations ...................................................................................................IX

Chapter 1 Introduction ...................................................................................................1

1. 1. Research Contributions..........................................................................................2

1. 2. Thesis Layout & Brief Overview of Chapters .......................................................3

Chapter 2 Literature Review..........................................................................................5

2. 1. Introduction............................................................................................................5

2. 2. Document Retrieval ...............................................................................................5

2.2.1. DOCUMENT RETRIEVAL MODELS.................................................................................... 5

2.2.1.1. Set-theoretic Models..................................................................................................... 6

2.2.1.2. Algebraic Models ......................................................................................................... 7

2.2.1.3. Probabilistic Models..................................................................................................... 7

2.2.1.4. Hybrid Models.............................................................................................................. 8

2.2.2. INTRODUCTION TO VECTOR SPACE MODELS.................................................................. 8

2. 3. Document Clustering ...........................................................................................10

2.3.1. DEFINITION .................................................................................................................... 11

2.3.2. CLUSTERING DOCUMENT IN THE CONTEXT OF DOCUMENT RETRIEVAL ...................... 11

2.3.2.1. Cluster Generation...................................................................................................... 11

2.3.2.2. Cluster Search............................................................................................................. 12

2.3.3. CLUSTERING METHODS’ TAXONOMY ............................................................................ 12

2.3.3.1. Hierarchical Clustering............................................................................................... 14

2.3.3.2. Partitional Clustering.................................................................................................. 14

2.3.3.3. Graph-Theoretic Clustering........................................................................................ 15

2.3.3.4. Incremental Clustering ............................................................................................... 15

2.3.4. DOCUMENT CLUSTERING METHODS USED FOR IR........................................................ 16

2. 4. Dimensionality Reduction ...................................................................................16

2.4.1. TERM TRANSFORMATION.............................................................................................. 17

2.4.2. TERM SELECTION........................................................................................................... 18

2.4.2.1. Definition.................................................................................................................... 18

2.4.2.2. Feature Selection Methods ......................................................................................... 18

2. 5. Studied Languages ...............................................................................................20

2.5.1. ENGLISH LANGUAGE ..................................................................................................... 20

2.5.2. ARABIC LANGUAGE....................................................................................................... 21

2.5.3. ARABIC FORMS.............................................................................................................. 21

2.5.4. ARABIC LANGUAGE CHARACTERISTICS........................................................................ 22


2.5.4.1. Arabic Morphology .................................................................................................... 24

2.5.4.2. Word-form Structures................................................................................................. 25

2.5.5. ANOMALIES ................................................................................................................... 27

2.5.5.1. Agglutination.............................................................................................................. 27

2.5.5.2. The Vowelless Nature of the Arabic Language.......................................................... 27

2.5.6. EARLY WORK ................................................................................................................ 28

2.5.6.1. Full-form-based IR ..................................................................................................... 28

2.5.6.2. Morphology-based IR................................................................................................. 29

2.5.6.3. Statistical Stemmers ................................................................................................... 30

2. 6. Arabic Corpus ......................................................................................................31

2.6.1. AFP CORPUS.................................................................................................................. 31

2.6.2. AL-HAYAT NEWSPAPER................................................................................................ 31

2.6.3. ARABIC GIGAWORD....................................................................................................... 32

2.6.4. TREEBANKS ................................................................................................................... 32

2.6.5. OTHER EFFORTS............................................................................................................. 33

2. 7. Summary..............................................................................................................33

Chapter 3 Latent Semantic Model ...............................................................................34

3. 1. Introduction..........................................................................................................34

3. 2. Model Description ...............................................................................................34

3.2.1. TERM-DOCUMENT REPRESENTATION............................................................................ 35

3.2.2. WEIGHTING.................................................................................................................... 35

3.2.3. COMPUTING THE SVD ...................................................................................................39

3.2.4. QUERY PROJECTION AND MATCHING............................................................................ 41

3. 3. Applications and Results......................................................................................43

3.3.1. DATA .............................................................................................................................. 43

3.3.2. EXPERIMENTS................................................................................................................ 44

3.3.2.1. Weighting Schemes Impact........................................................................................ 44

3.3.2.2. Reduced Dimension k................................................................................................. 46

3.3.2.3. Latent Semantic Model Effectiveness ........................................................................ 47

3. 4. Summary..............................................................................................................48

Chapter 4 Document Clustering based on Diffusion Map...........................................49

4. 1. Introduction..........................................................................................................49

4. 2. Construction of the Diffusion Map......................................................................49

4.2.1. DIFFUSION SPACE.......................................................................................................... 49

4.2.2. DIFFUSION KERNELS......................................................................................................51

4.2.3. DIMENSIONALITY REDUCTION ...................................................................................... 51

4.2.3.1. Singular Value Decomposition................................................................................... 52

4.2.3.2. SVD-Updating............................................................................................................ 54


4. 3. Clustering Algorithms..........................................................................................56

4.3.1. K-MEANS ALGORITHM ................................................................................................... 56

4.3.2. SINGLE-PASS CLUSTERING ALGORITHM ....................................................................... 57

4.3.3. THE OSPDM ALGORITHM ............................................................................................. 58

4. 4. Experiments and Results......................................................................................59

4.4.1. CLASSICAL CLUSTERING............................................................................................... 59

4.4.2. ON-LINE CLUSTERING.................................................................................................... 80

4. 5. Summary..............................................................................................................81

Chapter 5 Term Selection ............................................................................................83

5. 1. Introduction..........................................................................................................83

5. 2. Generic Terms Definition ....................................................................................83

5. 3. Generic Terms Extraction....................................................................................83

5.3.1. SPHERICAL K-MEANS ..................................................................................................... 87

5.3.2. GENERIC TERM EXTRACTING ALGORITHM ................................................................... 87

5. 4. Experiments and Results......................................................................................89

5. 5. The GTE Algorithm Advantage and Limitation..................................................92

5. 6. Summary..............................................................................................................93

Chapter 6 Information Retrieval in Arabic Language .................................................94

6. 1. Introduction..........................................................................................................94

6. 2. Creating the Test Set............................................................................................94

6.2.1. MOTIVATION .................................................................................................................. 94

6.2.2. REFERENCE CORPUS......................................................................................................95

6.2.2.1. Description ................................................................................................................. 95

6.2.2.2. Corpus Assessments ................................................................................................... 97

6.2.3. ANALYSIS CORPUS........................................................................................................ 99

6. 3. Experimental Protocol .......................................................................................100

6.3.1. CORPUS PROCESSING................................................................................................... 100

6.3.1.1. Arabic Corpus Pre-processing.................................................................................. 100

6.3.1.2. Processing Stage....................................................................................................... 103

6.3.2. EVALUATIONS .............................................................................................................. 103

6.3.2.1. Weighting Schemes’ Impact..................................................................................... 103

6.3.2.2. Basic Language Processing Usefulness.................................................................... 104

6.3.2.3. The LSA Model Benefit ........................................................................................... 106

6.3.2.4. The Impact of Weighting Query............................................................................... 107

6.3.2.5. Non Phrase Indexation ............................................................................................. 108

6. 4. Summary............................................................................................................111

Chapter 7 Conclusion and Future Work ....................................................................113

7. 1. Conclusion .........................................................................................................113


7. 2. Limitations .........................................................................................................113

7. 3. Prospects ............................................................................................................114

Appendix A Natural Language Processing................................................................115

A.1. Introduction........................................................................................................115

A.2. Basic Techniques ...............................................................................................115

A.2.1. N-GRAMS .................................................................................................................... 115

A.2.2. TOKENIZATION............................................................................................................ 115

A.2.3. TRANSLITERATION......................................................................................................116

A.2.4. STEMMING .................................................................................................................. 117

A.2.5. STOP WORDS............................................................................................................... 118

A.3. Advanced Techniques........................................................................................119

A.3.1. ROOT........................................................................................................................... 119

A.3.2. POS TAGGING............................................................................................................. 120

A.3.3. CHUNKING .................................................................................................................. 120

A.3.4. NOUN PHRASE EXTRACTION....................................................................................... 121

Appendix B Weighting Schemes’ Notations .............................................................122

Appendix C Evaluation Metrics.................................................................................124

C.1. Introduction ........................................................................................................124

C.2. IR Evaluation Metrics ........................................................................................124

C.2.1. PRECISION................................................................................................................... 124

C.2.2. RECALL ....................................................................................................................... 125

C.2.3. INTERPOLATED RECALL-PRECISION CURVE............................................................... 126

C.3. Clustering Evaluation.........................................................................................127

C.3.1. ACCURACY.................................................................................................................. 127

C.3.2. MUTUAL INFORMATION.............................................................................................. 128

Appendix D Principal Angles ....................................................................................129

References..................................................................................................................130


List of Tables

Table 2.1. Arabic letters...............................................................................................22

Table 2.2. Different shapes of the letter “غ” ‘gh’ (Ghayn). ........................................22

Table 2.3. Ambiguity caused by the absence of vowels in the words “كتب” ‘ktb’ and

“مدرسة” ‘mdrsp’. ..................................................................................................23

Table 2.4. Some templates generated from roots with examples from the root (‘كتب’

“ktb”). ..................................................................................................................24

Table 2.5. Derivations from a borrowed word............................................................25

Table 3.1. Comparison between Different Versions of the Standard Query Method..42

Table 3.2. Size of collections........................................................................................43

Table 3.3. Result of weighting schemes in increasing order for Cisi corpus..............44

Table 3.4. Result of weighting schemes in increasing order for Cran corpus.............45

Table 3.5. Result of weighting schemes in increasing order for Med corpus..............45

Table 3.6. Result of weighting schemes in increasing order for Cisi-Med corpus......46

Table 3.7. The best reduced dimension for each weighting scheme in the case of four

corpuses...............................................................................................................47

Table 4.1. Performance of different embedding representations using k-means for the

set Cisi and Med...................................................................................................61

Table 4.2. The process running time for the cosine and the Gaussian kernels...........61

Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the

set Cisi and Med...................................................................................................64

Table 4.4. Measure of the difference between the approximated and the histogram

distributions.........................................................................................................66

Table 4.5. Performances of different embedding representations using k-means for the

set Cran, Cisi and Med........................................................................................67

Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the

set Cran, Cisi and Med........................................................................................68

Table 4.7. Measure of the difference between the approximated and the histogram

distributions.........................................................................................................70

Table 4.8. Performance of different embedding cosine diffusion and LSA

representations using k-means for the set Cran, Cisi, Med and Reuters_1.........72

Table 4.9. Performance of k-means in Cosine diffusion, Salton and LSA spaces for the

set Cran, Cisi, Med and Reuters_1......................................................................72

Table 4.10. The confusion matrix for the set Cran-Cisi-Med-Reuters_1....................73

Table 4.11. The confusion matrix for the set S in 2-dimension cosine diffusion space.

..............................................................................................................................74


Table 4.12. The resultant confusion matrix.................................................................74

Table 4.13. Mutual information of different embedding cosine diffusion

representations using k-means to exclude the cluster C2 from the set Cran, Cisi,

Med and Reuters_1..............................................................................................75

Table 4.14. Performance of different embedded cosine diffusion representations using

k-means for the set S............................................................................................75

Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into

4 clusters in the 4-dimension cosine diffusion space...........................................75

Table 4.16. Performance of different embedding cosine diffusion and LSA

representations using k-means for the set Cran, Cisi, Med and Reuters_2.........76

Table 4.17. Performance of k-means in Cosine diffusion, Salton and LSA spaces for

the set Cran, Cisi, Med and Reuters_2................................................................77

Table 4.18. Performance of different embedding cosine diffusion and LSA

representations using k-means for Reuters..........................................................77

Table 4.19. Performance of k-means in Cosine diffusion, Salton and LSA spaces for

Reuters.................................................................................................................77

Table 4.20. The statistical results for the performance of k-means algorithm in cosine

diffusion and LSA spaces.....................................................................................80

Table 4.21. Performances of the single-pass clustering..............................................81

Table 5.1. Index size in the native and Noun phrase spaces........................................90

Table 5.2. The MIAP measure for the collection Cisi in different indexes..................90

Table 5.3. The MIAP measure for the collection Cran in different indexes................91

Table 5.4. The MIAP measure for the collection Med in different indexes.................91

Table 5.5. LSA performance in the native and Noun phrase spaces...........................92

Table 6.1. [AR-ENV] Corpus Statistics.......................................................................96

Table 6.2. An example illustrating the typical approach to query term selection.......96

Table 6.3. Token-to-type ratios for fragments of different lengths, from various

corpora.................................................................................................................98

Table A.1. Buckwalter Transliteration......................................................................117

Table A.2. Prefixes and suffixes list...........................................................................118

Table B.1. List of term weighting components..........................................................123


List of Figures

Figure 2.1. A taxonomy of clustering approaches.......................................................13

Figure 3.1. A pictorial representation of the SVD. The shaded areas of U and V, as

well as the diagonal line in S, represent Ak, the reduced representation of the

original term-document matrix A.........................................................................40

Figure 3.2. The interpolated recall-precision curves of the LSA and the VSM models.

..............................................................................................................................48

Figure 4.1. Average cosine of the principal angles between 64 concept subspace and

various singular subspaces for the CLASSIC data set.........................................53

Figure 4.2. Average cosine of the principal angles between 64 concept subspace and

various singular subspaces for the NSF data set.................................................53

Figure 4.3. Representation of our data set in various diffusion spaces.......................60

Figure 4.4. Representation of our data set in Cosine and Gaussian diffusion spaces

for various t time iterations..................................................................................63

Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map

on the set Cisi and Med........................................................................................64

Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-

document matrix...................................................................................................65

Figure 4.7. Histogram representation of the cluster C1 documents............................66

Figure 4.8. Histogram representation of the cluster C2 documents............................66

Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map

on the cluster C1..................................................................................................67

Figure 4.10. Representation of the first 100 singular values of the cosine diffusion

map on the cluster C2..........................................................................................67

Figure 4.11. Representation of the first 100 singular values of the cosine diffusion

space on the set Cran, Cisi and Med...................................................................68

Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med

term-document matrix..........................................................................................68

Figure 4.13. Histogram representation of the cluster C1 documents..........................69

Figure 4.14. Histogram representation of the cluster C2 documents..........................69

Figure 4.15. Histogram representation of the cluster C3 documents..........................70

Figure 4.16. Representation of the first 100 singular values of the cosine diffusion

map on cluster C1................................................................................................70

Figure 4.17. Representation of the first 100 singular values of the cosine diffusion

map on cluster C2................................................................................................71

Figure 4.18. Representation of the first 100 singular values of the cosine diffusion

map on cluster C3................................................................................................71


Figure 4.19. Representation of the first 100 singular values of the cosine diffusion

map on the set Cran, Cisi, Med and Reuters_1...................................................72

Figure 4.20. Representation of the first clusters of the hierarchical clustering..........73

Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map

on the data set S...................................................................................................73

Figure 4.22. Representation of the Set S clusters.........................................................74

Figure 4.23. Representation of the first 100 singular values of the cosine diffusion

map on the set Cran, Cisi, Med and Reuters_2...................................................76

Figure 4.24. Representation of the first 100 singular values of the cosine diffusion

map on Reuters....................................................................................................77

Figure 4.25. The LSA and Diffusion Map processes....................................................79

Figure 5.1. Top-Level Flowchart of GTE Algorithm...................................................89

Figure 6.1. Zipf’s law and word frequency versus rank in the [AR-ENV] collection...98

Figure 6.2. Token-to-type ratios (TTR) for the [AR-ENV] collection..........................99

Figure 6.3. A standardized information retrieval system...........................................100

Figure 6.4. An information retrieval system for Arabic language.............................101

Figure 6.5. Comparison between the performances of the LSA model for five

weighting schemes.............................................................................................104

Figure 6.6. Language processing benefit...................................................................105

Figure 6.7. A new information retrieval system suggested for Arabic language.......106

Figure 6.8. A comparison between the performances of the VSM and the LSA models.

............................................................................................................................107

Figure 6.9. Weighting queries’ impact.......................................................................108

Figure 6.10. Arabic Information Retrieval System based on NP Extraction.............109

Figure 6.11. Influence of the NP and the single-term indexations on the IRS

performance.......................................................................................................110

Figure C.1. The computation of Recall and Precision...............................................124

Figure C.2. The Precision Recall trade-off................................................................125

Figure C.3. Interpolated Recall Precision Curve......................................................127


List of Abbreviations

Acc: Accuracy

AFN: Affinity Set

AFP: Agence France Presse

AIR: Arabic Information Retrieval

AIRS: Arabic Information Retrieval System

AP: Average Precision

BNS: Bi-Normal Separation

CCA: Corpus of Contemporary Arabic

CHI: χ²-test

CQ: Characteristic Quotient

DF: Document Frequency

DM: Diffusion Map

ELRA: European Language Resources Association

GPLVM: Gaussian Process Latent Variable Model

GTE: Generic Term Extracting

HPSG: Head-driven Phrase Structure Grammar

ICA: Independent Component Analysis

ICA’: International Corpus of Arabic

ICE: International Corpus of English

IG: Information Gain

IR: Information Retrieval

IRP: Interpolated Recall-Precision

IRS: Information Retrieval System

ISOMAPS: ISOmetric MAPS

LLE: Locally Linear Embedding

LSA: Latent Semantic Analysis

LTSA: Local Tangent Space Alignment

MDS: Multidimensional Scaling

MI: Mutual Information

MIAP: Mean Interpolated Average Precision

NLP: Natural Language Processing


nonrel: non-relevant

NP: Noun Phrase

OSPDM: On-line Single-Pass Clustering based on Diffusion Map

P2P: Peer-To-Peer

PCA: Principal Component Analysis

POS: Part Of Speech

Pr: Probability

R&D: Research and Development

rel: relevant

RSV: Retrieval Status Value

SOM: Self-Organizing Maps

SVD: Singular Value Decomposition

SVM: Support Vector Machine

TREC: Text REtrieval Conference

TS: Term Strength

TTR: Token-to-Type Ratio

TDT: Topic Detection and Tracking

VSM: Vector-Space Model


Chapter 1 Introduction

The advent of the World Wide Web has increased the importance of information retrieval. Instead of

going to the local library to look for information, people search the Web. Thus, the relative number of

manual versus computer-assisted searches for information has shifted dramatically in the past few years.

This has accentuated the need for automated information retrieval for extremely large document

collections, in order to help in reading, understanding, indexing and tracking the available literature. For

this reason, researchers in document retrieval, computational linguistics and textual data mining are

working on the development of methods to process these data and present them in a usable and suitable

format for many written languages where Arabic is one.

Known as the second² most widely spoken language in the world, Arabic has seen a sharp increase in the number of Arabic-speaking Internet users: about 4.4 million in 2002 [ACS04] and 16 million in 2004, while research commissioned from the Dubai-based Internet researcher Madar shows that this number could jump to 43 million in 2008³. However, relatively few standard Arabic search engines are known at present. Even those that are available are not, according to Hermann Havermann (managing director of the German Internet technology firm Seekport and a founding member of the SAWAFI Arabic search engine project), considered “full” Arabic engines. As reported in a Reuters news article⁴, Havermann confirmed that “There is no [full] Arabic internet search engine on the market. You find so-called search engines, but they involve a directory search, not a local search”.

The fact that any improved access to Arabic text will have profound implications for cross-cultural communication, economic development, and international security encourages us to take a particular interest in this language.

The limited body of research in Arabic document retrieval over the past 20 years, which began with the Arabization of the MINISIS system [Alg87] and then the development of the Micro-AIRS system [Alka91], has been dominated by the use of statistical methods to automatically match natural language user queries against records. There has been interest in using natural language processing to enhance term matching through roots, stems, and n-grams, as highlighted in the Text REtrieval Conference TREC-2001 [GeO01]. However, up to 2005, the effect of stemming upon stopwords had not been studied; the Latent Semantic Analysis (LSA) model, developed in the early 1990s [DDF90] and known for its high capacity to resolve the synonymy and polysemy problems, had not been utilized; nor had indexation by phrases been used.

2 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft ® Encarta ® 2006, Retrieved on 10-05-2007.
3 http://www.abc.net.au/science/news/stories/s1623945.htm, Retrieved on 10-05-2007.
4 ‘Arabic search engine may boost content’, by Andrew Hammond, in Reuters, on April 26th, 2006. Retrieved on 10-05-2007.

We are motivated by the fact that the LSA model, in its attempt to discover hidden structure and implicit meanings, may meet the challenges posed by the wide use of synonyms in Arabic. The employment of several weighting schemes, taking into account term importance within both documents and the query, and the use of Arabic natural language processing, based on spelling mutation, stemming, stopword removal and noun phrase extraction, make the study all the more interesting.

The first objective of our study is to improve the computation of similarity scores between documents and a query for Arabic documents; however, the study has been extended to consider other aspects. Many studies have shown that clustering is an important tool in information retrieval for constructing a taxonomy of a document collection by forming groups of closely related documents [FrB92, FaO95, HeP96, Leu01]. Based on the Cluster Hypothesis, “closely associated documents tend to be relevant to the same requests” [Van79], clustering is used to accelerate query processing by considering only a small number of cluster representatives rather than the entire corpus, as sketched below. We also believe that reducing the corpus dimension by using feature selection methods may help a user find relevant information more quickly. Thus, we have been interested in developing new clustering methods for both the off-line and on-line cases, and in extending the generic term extraction method to reduce the storage capacity required for the retrieval task.
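To make the idea of cluster-based retrieval concrete, the following minimal sketch (an illustration only; the toy clusters, the bag-of-words vectors, and the cosine and centroid helpers are assumptions, not code from the thesis) matches a query against cluster representatives first and then ranks only the documents of the closest cluster:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    # Mean term-frequency vector of a cluster, used as its representative.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

# Toy corpus already partitioned into two clusters of term-frequency vectors.
clusters = {
    "C1": [Counter("house ownership loan".split()), Counter("home ownership mortgage".split())],
    "C2": [Counter("web home page design".split()), Counter("web site page".split())],
}
query = Counter("home ownership".split())

# Step 1: match the query against the cluster representatives only.
best = max(clusters, key=lambda c: cosine(query, centroid(clusters[c])))
# Step 2: rank only the documents of the selected cluster.
ranking = sorted(clusters[best], key=lambda d: cosine(query, d), reverse=True)
print(best, ranking)  # the query is routed to C1, whose documents are then ranked
```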

1. 1. Research Contributions

With the objective of improving the performance and reducing the complexity of document retrieval systems, ten major contributions are proposed in this thesis:

- Studying the Weighting Schemes found in the current text retrieval literature to discover the best one when the Latent Semantic model is used.

- Utilizing the Diffusion map for off-line document clustering, and improving its performance by

using the Cosine distance.

- Comparing the k-means algorithm performance in the Salton, LSA and cosine diffusion spaces.

- Proposing two postulates indicating the appropriate reduced dimension to use for clustering,

and the optimal number of clusters.

- Developing a new method for on-line clustering, based on Diffusion map and updating singular

value decomposition.

- Analyzing the benefit of extracting Generic Terms in decreasing the Data Storage capacity required for document retrieval.


- Creating an Arabic Retrieval Test Collection, whose documents belong to a scientific field specialized in the environment, and whose queries are structured into two categories so as to examine the performance difference between short queries (“2 or 3 words”) and long queries (a “sentence”).

- Applying the Latent Semantic model to the Arabic language in an attempt to meet the challenges of the wide use of synonyms offered by this language.

- Analyzing the influence of the Weighting Schemes on the use of some Arabic language processing.

- Studying the effect of representing Arabic document content by Noun Phrases on the improvement of the proposed automatic document retrieval system, which builds on the two previous contributions.

1. 2. Thesis Layout & Brief Overview of Chapters This thesis comprises seven chapters and four appendixes, briefly described as follows:

Chapter 2 reviews document retrieval and document clustering. It surveys prior research on dimensionality reduction techniques, especially feature selection methods. It focuses on Arabic language characteristics, earlier vector space retrieval models, and corpora used for this language.

Chapter 3 describes the latent semantic analysis model by outlining the term-document representation and analyzing the weighting schemes found in the current text retrieval literature. It explains the singular value decomposition method and reviews the three standard LSA query methods. It introduces the English test collections used in this work and evaluates the different weighting schemes presented before. It compares the performance of the LSA and standard vector space models.

Chapter 4 presents the diffusion map approach and shows its efficiency on the off-line document clustering task when a cosine kernel is used. It validates two postulates indicating the appropriate reduced dimension to use for clustering, as well as the optimal number of clusters to use in that dimension. Furthermore, it proposes a new single-pass approach for on-line document clustering, based on the diffusion map and the SVD-updating method.

Chapter 5 introduces the generic term extraction method and analyzes its impact in reducing the storage capacity in the case of document retrieval.

Chapter 6 describes the development of Arabic retrieval test collections. It studies the existing Arabic natural language processing techniques and implements them in a new Arabic document retrieval system based on the latent semantic analysis model. It examines and discusses the


effectiveness of different index terms on these collections.

Chapter 7 summarizes the research and concludes with its major achievements and possible

directions that could be considered for future research.

Appendix A presents all the natural language processing techniques used and mentioned in this work.

Appendix B reviews the weighting schemes’ notations.

Appendix C outlines the evaluation metrics commonly used in retrieval and clustering evaluation tasks, more specifically those used in this thesis.

Appendix D recalls the quantities known as principal angles, used to measure the closeness of

subspaces.


Chapter 2 Literature Review

2. 1. Introduction

In an attempt to build an Arabic document retrieval system, we have been interested in studying

some specific and elementary tools and tasks contributing to the development of the system components.

These tools include document retrieval models, document clustering algorithms, and dimension

reduction techniques, in addition to Arabic language characteristics. In this chapter, we introduce these

elements and survey some of the prior research on them.

2. 2. Document Retrieval

The problem of finding relevant information is not new. Early systems tried to classify knowledge into a set of known fixed categories. The first of these was completed in 1668 by the English philosopher John Wilkins [Sub92]. The problem with this approach is that categorizers commonly do not place documents into the categories where searchers expect to find them. No matter what categories a user thinks of, they will not match what someone searching expects to find. For example, users of e-mail systems place mails in folders or categories only to spend countless hours trying to find the same documents, because they cannot remember what category they used, or because the category they are sure they used does not contain the relevant document. Effective and efficient search techniques are needed to help users quickly find the information they are looking for. Another approach is to try to understand the content of the documents, ideally by loading them into the computer to be read and understood before users ask any questions; this entails the use of a document retrieval system.

The elementary definition of document retrieval is the matching of some stated user query against

useful parts of free-text records. These records could be any type of mainly unstructured text, such as

bibliographic records, newspaper articles, or paragraphs in a manual. User queries could range from

multi-sentence full descriptions of an information need to a few words. However, this definition is not

informative enough, because a document can be relevant even though it does not use the same words as

those provided in the query. The user is not generally interested in retrieving documents with exactly the

same words, but with the concepts that those words represent. To this end, many models are discussed.

2.2.1. Document Retrieval Models

Several events have recently had a major effect on the progress of document retrieval research. First, the evolution of computer hardware has made it more realistic to run sophisticated search algorithms against massive amounts of data with acceptable response times. Second, Internet access has created a need for effective text searching systems. These two events have contributed to


creating an interest in accelerating research to produce more effective search methodologies, including greater use of natural language processing techniques.

A great variety of document retrieval models is described in the information retrieval literature.

From a mathematical point of view, the techniques currently in use can be classified into four types: Boolean or set-theoretic, vector or algebraic, probabilistic, and hybrid models.

A model is characterized by four parameters:

- Representations for documents and queries.

- Matching strategies for assessing the relevance of documents to a user query.

- Methods for ranking query output.

- Mechanisms for acquiring user-relevance feedback.

In the following paragraphs, we describe instances of each type in the context of the model

parameters.

2.2.1.1. Set-theoretic Models

The standard Boolean model [WaK79, BuK81, SaM83] represents documents by a set of index terms, each of which is viewed as a Boolean variable valued as True if it is present in a document. No term weighting is allowed. Queries are specified as arbitrary Boolean expressions formed by linking terms through the standard logical operators AND, OR, and NOT. The retrieval status value (RSV) is a measure of query-document similarity. In the Boolean model, the RSV equals 1 if the query expression evaluates to True, and 0 otherwise. All documents whose RSV equals 1 are considered relevant to the query; a small sketch of this evaluation is given below.
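The following minimal sketch illustrates Boolean retrieval (an illustration only: the toy documents, the query representation in disjunctive normal form, and the rsv helper are assumptions, not code from the thesis):

```python
# Documents represented as sets of index terms (presence/absence only, no weights).
docs = {
    "d1": {"information", "retrieval", "arabic"},
    "d2": {"information", "clustering"},
    "d3": {"retrieval", "web"},
}

def rsv(query, terms):
    # Query given in disjunctive normal form: a list of conjunctions,
    # each literal being a (term, negated) pair. The RSV is 1 or 0.
    for conjunction in query:
        if all((term not in terms) if negated else (term in terms)
               for term, negated in conjunction):
            return 1
    return 0

# (information AND retrieval) OR (NOT clustering)
query = [[("information", False), ("retrieval", False)],
         [("clustering", True)]]

relevant = [d for d, terms in docs.items() if rsv(query, terms) == 1]
print(relevant)  # ['d1', 'd3']: d2 fails both disjuncts, so it is not retrieved
```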

Even though this model is simple and user queries can employ arbitrarily complex expressions, retrieval performance tends to be poor. It is not possible to rank the output, since all retrieved documents have the same RSV, and weights cannot be assigned to query terms. The results are often counter-intuitive: for example, if the user query specifies 10 terms linked by the logical connective AND, a document that contains 9 of these terms is not retrieved. User relevance feedback is often used in IR systems to improve retrieval effectiveness [SaB90]. Typically, a user is asked to indicate the relevance or irrelevance of a few documents placed at the top of the output. Since the output is not ranked, however, selecting documents for relevance feedback elicitation is difficult.

The fuzzy-set model [Rad79, Boo80, Egg04] is based on fuzzy-set theory which allows partial

membership in a set, as compared with conventional set theory which does not. It redefines logical

operators appropriately to include partial set membership, and processes user queries in a manner similar

to the case of the Boolean model. Nevertheless, IR systems based on the fuzzy-set model have proved

nearly as incapable of discriminating among the retrieved output as systems based on the Boolean


model.

The strict Boolean and fuzzy-set models are preferable to other models in terms of computational

requirements, which are low in terms of both the disk space required for storing document

representations and the algorithmic complexity of indexing and computing query-document similarities.

2.2.1.2. Algebraic Models

Algebraic models usually represent documents and queries as vectors, matrices or tuples. These vectors, matrices or tuples are transformed, through a finite number of algebraic operations, into a one-dimensional similarity measurement that indicates the query-document RSV. The higher the RSV, the greater the document's relevance to the query.

The strength of this kind of model lies in its simplicity and its allowance for term weighting. Relevance feedback can easily be incorporated into it. However, the rich expressiveness of query specification inherent in the Boolean model is sacrificed.

This kind of model includes the standard vector-space model, known as the Salton model (highlighted in Section 2.2.2) [SaM83], the generalized vector space model [WZW85], the latent semantic model (detailed in Chapter 3) [DDF90], and the topic-based vector space model [BeK03].

2.2.1.3. Probabilistic Models

The probabilistic model, introduced by Robertson and Sparck Jones [RoS76], attempts to capture the IR problem within a probabilistic framework. To that end, the model takes term dependencies and relationships into account and tries to estimate the probability that a document will interest a user, by specifying the major parameters such as the weights of the query terms and the form of the query-document similarity.

The model is based on two main parameters, Pr(rel) and Pr(nonrel), the probabilities of relevance and non-relevance of a document to a user query. These parameters are computed using the probabilistic term weights [RoS76, GRG97] and the actual terms present in the document. Relevance is assumed to be a binary property, so that Pr(rel) = 1 - Pr(nonrel). In addition, the model uses two cost parameters, a1 and a2, to represent the loss associated with the retrieval of an irrelevant document and with the non-retrieval of a relevant document, respectively.
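One standard way of combining these quantities (a sketch of the usual expected-cost retrieval rule, stated here for illustration rather than quoted from the thesis) is to retrieve a document $d$ whenever the expected loss of retrieving it does not exceed the expected loss of discarding it:

$$ a_1 \, \Pr(\mathrm{nonrel} \mid d) \;\le\; a_2 \, \Pr(\mathrm{rel} \mid d), $$

which amounts to ranking documents by the ratio $\Pr(\mathrm{rel} \mid d) / \Pr(\mathrm{nonrel} \mid d)$ (or its logarithm) and retrieving those for which it exceeds the threshold $a_1 / a_2$.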

The model may use interaction with a user to improve its estimates, and it requires term-occurrence probabilities in the relevant and irrelevant parts of the document collection, which are difficult to estimate. However, the model serves an important function in characterizing retrieval processes and provides a theoretical justification for practices previously used on an empirical basis (for example, the introduction of certain term-weighting systems).


This class of models includes binary independence retrieval [RoS76], uncertain inference [CLR98], language models [PoC98], and divergence from randomness models [AmR02].

2.2.1.4. Hybrid Models

Many techniques are considered hybrid models; these combine models from the three classes seen above. Examples include the extended Boolean model (set-theoretic & algebraic) [Lee94] and inference network retrieval (set-theoretic & probabilistic) [TuC91].

To the best of our knowledge, the most recent model used for the Arabic language before our work [BoA05], in which the latent semantic model is utilized, was the standard vector space model [SaM83]. For this reason, we have been interested in the algebraic models, more particularly those based on vectors, as a starting point for our study.

2.2.2. Introduction to Vector Space Models

Based on the assumption that the meaning of a document can be derived from the document's constituent terms, vector-space models represent documents as vectors of terms $d = (t_1, t_2, \ldots, t_m)$, where $t_i$ $(1 \le i \le m)$ is a non-negative value denoting the single or multiple occurrences of term $i$ in document $d$. Thus, each unique term in the document collection corresponds to a dimension in the space. Similarly, a query is represented as a vector $q = (t'_1, t'_2, \ldots, t'_m)$, where $t'_i$ $(1 \le i \le m)$ is a non-negative value denoting the number of occurrences of term $i$ (or merely a 1 to signify its occurrence) in the query [BeC87]. Both the document vectors and the query vector provide the locations of the objects in the term-document space. By computing the distance between the query and other objects in the space, objects with similar semantic content to the query will presumably be retrieved.

more flexible than inverted indices since each term can be individually weighted, allowing that term to

become more or less important within a document or the entire document collection as a whole. Also, by

applying different similarity measures to compare queries to terms and documents, properties of the

document collection can be emphasized or de-emphasized.
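Before turning to similarity measures, the following snippet gives a concrete illustration of individual term weighting (a sketch over hypothetical toy data, using the common tf-idf scheme rather than any particular scheme studied in the thesis): a term is weighted by its frequency in a document and by the inverse of its document frequency in the collection.

```python
import math

# Toy collection: each document is a list of terms.
docs = [
    ["house", "ownership", "loan"],
    ["home", "ownership", "mortgage"],
    ["web", "home", "page"],
]

def tf_idf(doc, collection):
    # Weight of term t in doc: (frequency of t in doc) * log(N / document frequency of t).
    n = len(collection)
    weights = {}
    for t in set(doc):
        tf = doc.count(t)
        df = sum(1 for d in collection if t in d)
        weights[t] = tf * math.log(n / df)
    return weights

print(tf_idf(docs[1], docs))
# "mortgage" occurs in only one document, so it receives a higher weight
# than the more widespread terms "home" and "ownership".
```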

For example, the dot product similarity measure $M(q, d) = q \cdot d$ finds the distance between the query and a document in the space, where the operation "$\cdot$" is the inner product multiplication, with the inner product of two $m$-vectors $X = \langle x_i \rangle$ and $Y = \langle y_i \rangle$ defined to be $X \cdot Y = \sum_{i=1}^{m} x_i \, y_i$.


The inner product or the dot product favors long documents over short ones since they contain more

terms and hence their product increases.

On the other hand, by computing the angle between the query and a document rather than the distance, the cosine similarity measure cos(q, d) = (q · d) / (||q|| ||d||) de-emphasizes the lengths of the vectors. X · Y is the inner product defined above, and ||X|| is the Euclidean length of the vector X, defined as ||X|| = sqrt(Σ_{i=1}^{m} x_i^2).

In some cases, the directions of the vectors are a more reliable indication of the semantic similarities of

the objects than the distance between the objects in the term-document space [FrB92].
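
A minimal Python sketch of these two measures; the three-term query and document vectors below are invented for illustration and simply show how the dot product favors the longer document while the cosine favors the matching direction:

import math

def dot(x, y):
    """Inner product X . Y = sum_i x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine(x, y):
    """cos(x, y) = (x . y) / (|x| |y|); insensitive to vector lengths."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot(x, y) / (norm_x * norm_y)

query = [1, 1, 0]          # hypothetical query vector over three terms
short_doc = [1, 1, 0]      # short document, same direction as the query
long_doc = [5, 5, 1]       # long document with many repeated terms

print(dot(query, short_doc), dot(query, long_doc))        # the dot product favors the long document
print(cosine(query, short_doc), cosine(query, long_doc))  # the cosine favors the matching direction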

Vector-space models, by placing documents and queries in a term-document space and computing

similarities between the queries and the documents, allow the results of a query to be ranked according

to the similarity measure used. Unlike lexical matching techniques that provide no ranking or a very

crude ranking scheme (for example, ranking one document before another document because it contains

more occurrences of the search terms), the vector-space models, by basing their rankings on the

Euclidean distance or the angle measure between the query and documents in the space, are able to

automatically guide the user to documents that might be more conceptually similar and of greater use

than other documents.

Vector-space models, specifically the latent semantic model, were developed to eliminate many of

the problems associated with exact, lexical matching techniques. In particular, since words often have

multiple meanings (polysemy), it is difficult for a lexical matching technique to differentiate between

two documents that share a given word, but use it differently, without understanding the context in

which the word was used. Also, since there are many ways to describe a given concept (synonymy),

related documents may not use the same terminology to describe their shared concepts. A query using

the terminology of one document will not retrieve the other related documents. In the worst case, a

query using terminology different from that used by related documents in the collection may not retrieve

any documents using lexical matching, even though the collection contains related documents [BeC87].

For example, suppose a text collection contains documents on house ownership and web home pages, with some documents using the word house only, some using the word home only, and some using both words. For a

query on home ownership, traditional lexical matching methods fail to retrieve documents using the

word house only, which are obviously related to the query. For the same query on home ownership,

lexical matching methods will also retrieve irrelevant documents about web home pages.

2. 3. Document Clustering

Document clustering has been studied in the field of document retrieval for several decades. With the aim of reducing search time, the first approaches were attempted by Salton [Sal68], Litofsky [Lit69], Crouch [Cro72], Van Rijsbergen [Van72], Prywes & Smith [PrS72], and Fritzche [Fri73]. Based on these studies, Van Rijsbergen specifies, in his book [Van79], that two, often conflicting, criteria are frequently used when choosing a cluster method for experimental document retrieval.

The first one, and the most important from his point of view, is the theoretical soundness of the

method, meaning that the method should satisfy certain criteria of adequacy. Below, we list some of the

most important of these criteria:

1) The method produces a clustering which is unlikely to be altered drastically when further objects

are incorporated, i.e. it is stable under growth;

2) The method is stable in the sense that small errors in the description of the objects lead to small

changes in the clustering;

3) The method is independent of the initial ordering of the objects.

These conditions have been adapted from Jardine and Sibson [JaS71]. The point is that any cluster

method which does not satisfy these conditions is unlikely to produce any meaningful experimental

results.

The second criterion for choice, considered as the overriding consideration in the majority of

document retrieval experimental works, is the efficiency of the clustering process in terms of speed and

storage requirements. Efficiency is really a property of the algorithm implementing the cluster method.

It is sometimes useful to distinguish the cluster method from its algorithm, but in the context of

document retrieval this distinction becomes slightly less useful, since many cluster methods are defined

by their algorithm, so no explicit mathematical formulation exists.

The current information explosion, fueled by the availability of hypermedia and the World-Wide

Web, has led to the generation of an ever-increasing volume of data, posing a growing challenge for

information retrieval systems to efficiently store and retrieve this information [WMB94]. A major issue

that document databases are now facing is the extremely high rate of update. Several practitioners have

complained that existing clustering algorithms are not suitable for maintaining clusters in such a

dynamic environment, and they have been struggling with the problem of updating clusters without

frequently performing complete re-clustering [CaD90, Can93, Cha94]. To overcome this problem, on-

line clustering approaches have been proposed.

In the following, we explain the clustering procedure in the context of document retrieval, we survey a taxonomy of clustering methods by focusing on the categories we need, and we give an overview of some recent studies in both classical and on-line clustering fields, after specifying the definition of

clustering by comparing this approach to other classification approaches.

2.3.1. Definition

In supervised classification, or discriminant analysis, a collection of labeled (pre-classified) patterns

is provided; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given

labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a

new pattern. In the case of clustering (unsupervised classification), the problem is to group a given

collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters

also, but these category labels are data driven; that is, they are obtained solely from the data.

2.3.2. Clustering Documents in the Context of Document Retrieval

The basic idea of clustering is that similar documents are grouped together to form clusters; the so-called cluster hypothesis states that closely associated documents tend to be relevant to the same requests. Since grouping similar documents will accelerate searching, especially if we create hierarchies of clusters by grouping clusters to form super-clusters and so on, we have been interested in surveying and studying this approach.

On the other hand, even though clustering is a traditional approach in the text retrieval context [FaO95], the knowledge of traditional methods is useful background information for newer developments, and the variations or extensions of these methods are at the heart of newer methods; we therefore consider this study to be of potential value.

To this end, two document clustering procedures will be involved: cluster generation and cluster

search [SaW78].

2.3.2.1. Cluster Generation

A cluster generation method first consists of the indexation of documents, then their partitioning into groups. Many cluster generation methods have been proposed. Unfortunately, no single method meets both requirements of soundness and efficiency. Thus, there are two classes of methods:

- “Sound” methods that are based on the document-document similarity matrix.

- Iterative methods that are more efficient and proceed directly from the document vectors.

a- Methods based on the Similarity matrix

These methods usually require O(n²) time (or more), where n is the number of documents, and apply graph theoretic techniques (see Section 2.3.3). A document-to-document similarity function has to be chosen, to measure how closely two documents are related.

b- Iterative Methods

This class consists of methods that operate in less than quadratic time (that is, O(n log n) or O(n²/log n)) on average [FaO95]. These methods are based directly on the item (document) descriptions and they do not require the similarity matrix to be computed in advance. The price for the increased efficiency is the sacrifice of theoretical soundness; the final classification depends on the order in which the documents are processed, or else on the existence of a set of “seed-points” around which the classes are to be constructed.

Although some experimental evidence exists indicating that iterative methods can be effective for

information retrieval purposes [Dat71], specifically in on-line clustering [KWX01, KlJ04, KJR06], most

researchers prefer to work with the theoretically more attractive hierarchical grouping methods, while

attempting, at the same time, to save computation time. This can be done in various ways by applying

the expensive clustering process to a subset of the documents only and then assigning the remaining un-

clustered items to the resulting classes; or by using only a subset of the properties for clustering

purposes instead of the full keyword vectors; or finally by utilizing an initial classification and applying

the hierarchical grouping process within each of the initial classes only [Did73, Cro77, Van79].

2.3.2.2. Cluster Search

A cluster search may be conducted by identifying the clusters that appear most similar to a given query

item. It is carried out by first comparing a query formulation with the cluster centroids. This may then be

followed by a comparison between the query and those documents, whose corresponding query-centroid

similarity was found to be sufficiently large in the earlier comparison. Thus, searches can be conducted

rapidly because a large portion of documents are immediately rejected, the search being concentrated in

areas where substantial similarities are detectable between queries and cluster centroids.
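
A minimal Python sketch of this two-step search, reusing the cosine function from the earlier sketch; the centroid threshold is an arbitrary illustrative value, not one prescribed in the literature cited here:

def centroid(vectors):
    """Mean vector of a cluster's documents."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_search(query, clusters, centroid_threshold=0.3):
    """Compare the query with the cluster centroids first, then only with the
    documents of clusters whose centroid similarity is sufficiently large."""
    results = []
    for cluster in clusters:                       # each cluster: a list of (doc_id, vector)
        c = centroid([vec for _, vec in cluster])
        if cosine(query, c) < centroid_threshold:  # the whole cluster is rejected at once
            continue
        for doc_id, vec in cluster:
            results.append((cosine(query, vec), doc_id))
    return sorted(results, reverse=True)           # ranked list of (similarity, doc_id)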

2.3.3. Clustering Methods’ Taxonomy

Many taxonomic representations of clustering methodology are possible. Based on the discussion in Jain et al. [JMF99], data clustering methods can be divided into hierarchical and partitional

approaches. Hierarchical algorithms produce a nested series of partitions, by finding successive clusters

using previously established ones, whereas partitional algorithms produce only one, by determining all

clusters at once. But this taxonomy, represented in Figure 2.1, must be supplemented by a specification

of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their

placement in the taxonomy.

Figure 2.1. A taxonomy of clustering approaches.

- Agglomerative vs. divisive [JaD88, KaR90]: An agglomerative clustering (bottom-up) starts with

one-point (singleton) clusters and recursively merges two or more most appropriate clusters. A

divisive clustering (top-down) starts with one cluster of all data points and recursively splits the

most appropriate cluster. The process continues until a stopping criterion (frequently, the

requested number k of clusters) is achieved.

- Monothetic vs. polythetic [Bec59]: A monothetic class is defined in terms of characteristics that

are both necessary and sufficient in order to identify members of that class. This way of defining

a class is also termed the Aristotelian definition of a class [Van79]. A polythetic class is defined

in terms of a broad set of criteria that are neither necessary nor sufficient. Each member of the

category must possess a certain minimal number of defining characteristics, but none of the

features has to be found in each member of the category. This way of defining classes is

associated with Wittgenstein's concept of “family resemblances” [Van79]. In short, a monothetic type is one in which all members are identical on all characteristics, whereas a polythetic type is one in which all members are similar, but not identical.

- Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its

operation and in its output. A fuzzy clustering method assigns degrees of membership in several

clusters, that do not have hierarchical relations with each other, to each input pattern. A fuzzy

clustering can be converted to a hard clustering by assigning each pattern to the cluster with the

largest measure of membership.

- Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to

optimize a squared error function. This optimization can be accomplished using traditional

techniques or through a random search of the state space consisting of all possible labelings.

- Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large,

and constraints on execution time or memory space affect the architecture of the algorithm. The

early history of clustering methodology does not contain many examples of clustering algorithms

designed to work with large data sets, but the advent of data mining has fostered the

development of clustering algorithms that minimize the number of scans through the pattern set,

reduce the number of patterns examined during execution, or reduce the size of data structures

used in the algorithm’s operations [JMF99].

2.3.3.1. Hierarchical Clustering

Hierarchical clustering builds a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points or items covered by their common parent. Such an approach allows exploring data on different levels of granularity, easy handling of any similarity or distance form, and application to any attribute types. However, it has disadvantages related to the vagueness of termination criteria, and to the fact that the algorithm does not revisit (intermediate) clusters once they are constructed in order to improve them.

Most hierarchical clustering algorithms are variants of the single-link [SnS73], where each item in a

class is linked to at least one other point in the class; and complete-link algorithms [Kin67], where each

item is linked to all other points in the class.

2.3.3.2. Partitional Clustering

A partitional clustering algorithm obtains a single partition of the data instead of a clustering

structure, such as the dendrogram. Partitional methods have advantages in applications involving large

data sets for which the construction of a dendrogram is computationally prohibitive. A problem

accompanying the use of a partitional algorithm is the choice of the number of desired output clusters.

The partitional techniques usually produce clusters by optimizing a criterion function defined either

locally (on a subset of the feature vectors) or globally (defined over all of the feature vectors).

Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly

computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with

different starting states, and the best configuration obtained from all of the runs is used as the output

clustering.

The most intuitive and frequently used criterion function in partitional clustering techniques is the

squared error criterion, which tends to work well with isolated and compact clusters. The k-means is the

simplest and most commonly used algorithm employing a squared error criterion [Mac67] (see Section

4.3.1 for more details concerning this algorithm). Several variants of the k-means algorithm have been

reported in the literature. One of them will be studied in Chapter 5.
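
As a rough Python sketch of the basic procedure (not the exact variant studied in Chapter 5), the following alternates assignment to the nearest centroid with centroid recomputation; as noted above, in practice it would be run several times from different seed points and the best configuration kept:

import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means with the squared-error criterion: assign each document to
    the nearest centroid, recompute the centroids, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)            # seed points chosen from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(v)
        for j, members in enumerate(clusters):
            if members:                           # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters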

2.3.3.3. Graph-Theoretic Clustering

Graph-theoretic clustering is basically a subclass of the partitional taxonomy, but even hierarchical approaches are related to this category of algorithms, given that single-link clusters are sub-graphs of the minimum spanning tree of the data [GoR69, Zah71], and complete-link clusters are maximal complete sub-graphs related to the node colorability of graphs [BaH76]. In graph-theoretic

algorithms, the data is represented as nodes in a graph and the dissimilarity between two objects is the

“length” of the edge between the corresponding nodes. In several methods, a cluster is a sub-graph that

remains connected after the removal of the longest edges of the graph [JaD88]; for example, in [Zah71]

the minimal spanning tree of the original graph is built and then the longest edges are deleted. However,

some other graph-theoretic methods rely on the extraction of cliques [AGG98], and are then more

related to squared error methods.

Based on graph-theoretic clustering, there has been significant interest recently in spectral clustering

using kernel methods [NJW02]. Spectral clustering techniques make use of the spectrum of the

similarity matrix of the data to cluster the points, instead of the distances between these points. The

implementation of a spectral clustering algorithm is formulated as a graph partitioning problem in which the weight of each edge is the similarity between the points corresponding to the vertices it connects, with the goal of finding the minimum weight cuts in the graph. This problem can be addressed by means of linear algebra methods, in particular eigenvalue decomposition techniques, from which the term “spectral” derives. These methods can roughly be divided into two main categories: spectral graph cuts [Wei99], containing ratio-cut [HaK92], normalized cut [ShM00], and min–max cut [DHZ01]; and eigenmaps methods [RoS00, ZhZ02], such as Laplacian eigenmaps [BeN03] and Hessian eigenmaps [DoG03].

2.3.3.4. Incremental Clustering

Incremental clustering is based on the assumption that it is possible to consider data points one at a time and assign them to existing clusters. A new data point is assigned to a cluster without affecting the existing clusters significantly. This kind of algorithm is employed to improve the chances of finding the global optimum. Data are stored in secondary memory and data points are transferred to main memory one at a time for clustering. Only the cluster representations are stored permanently in main memory to alleviate space limitations [Dun03, AMC05]. Therefore, the space requirements of an incremental algorithm are very small, since only the cluster centroids need to be kept; such an algorithm is

iterative, and therefore its time requirements are also small.
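
A minimal Python sketch of this idea, reusing the cosine function from the earlier sketch; the similarity threshold is an arbitrary illustrative value, and only the centroids and member counts are kept in memory:

def incremental_clustering(stream, threshold=0.5):
    """Assign each arriving document to the closest existing centroid if it is
    similar enough; otherwise start a new cluster."""
    centroids, counts = [], []
    for vec in stream:                                   # documents arrive one at a time
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            j = sims.index(max(sims))
            counts[j] += 1
            # update the centroid as a running mean of its members
            centroids[j] = [(c * (counts[j] - 1) + x) / counts[j]
                            for c, x in zip(centroids[j], vec)]
        else:
            centroids.append(list(vec))
            counts.append(1)
    return centroids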

2.3.4. Document Clustering Methods Used for IR

Many “sound” document clustering methods have been proposed in the context of information retrieval. Single-link is one of the first methods used for this purpose [Van79]. However, a disadvantage of this method, and probably of every cluster generation method, is that it requires (at least) one empirically decided constant: a threshold on the similarity measure or a desirable number of clusters. This constant greatly affects the final partitioning.

The method proposed by Zahn [Zah71] is an attempt to circumvent this problem. He suggests

finding a minimum spanning tree for the given set of points (documents) and then deleting the

“inconsistent” edges. An edge is inconsistent if its length l is much larger than the average length lavg of

its incident edges. The connected components of the resulting graph are the suggested clusters. Again,

the method is based on an empirically defined constant (threshold on the definition of “inconsistent”

edge). However, the results of the method are not very sensitive to the value of this constant.
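
A rough Python sketch of this idea under simplifying assumptions (a dense symmetric distance matrix and an illustrative inconsistency factor of 2); it is not Zahn's exact formulation, only the general build-MST / delete-inconsistent-edges / take-components scheme:

def zahn_clusters(dist, factor=2.0):
    """Cluster points by building a minimum spanning tree over the distance
    matrix, deleting edges much longer than the average length of their
    incident MST edges, and returning the connected components."""
    n = len(dist)
    # Prim's algorithm for the minimum spanning tree.
    in_tree, best, parent = [False] * n, [float("inf")] * n, [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], parent[v] = dist[u][v], u
    # An edge is "inconsistent" if it is much longer than the average length
    # of the other MST edges incident to its two endpoints.
    incident = {i: [] for i in range(n)}
    for a, b, w in edges:
        incident[a].append(w)
        incident[b].append(w)
    kept = []
    for a, b, w in edges:
        neighbours = incident[a] + incident[b]
        neighbours.remove(w)
        neighbours.remove(w)                      # exclude the edge itself (listed at both endpoints)
        avg = sum(neighbours) / len(neighbours) if neighbours else w
        if w <= factor * avg:
            kept.append((a, b))
    # Connected components of the remaining graph are the suggested clusters.
    labels = list(range(n))
    def find(i):
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i
    for a, b in kept:
        labels[find(a)] = find(b)
    return [find(i) for i in range(n)]            # points sharing a label share a cluster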

Many iterative methods have appeared in the literature. The simplest and fastest one seems to be the

“single pass” method [SaW78].

Hybrid methods may be used. Salton and McGill [SaM83] suggest using an iterative method to

create a rough partition of the documents into clusters and then applying a graph-theoretic method to

subdivide each of the previous clusters. Another hybrid approach is mentioned by Van-Rijsbergen

[Van79]. Some documents are sampled from the document collection and a core clustering is constructed using an O(n²) method for the sample of documents. The remainder of the documents is assigned to the

existing clusters using a fast assignment strategy.

2. 4. Dimensionality Reduction

As storage technologies evolve, the amount of available data explodes in both dimensions: the number of samples and the dimension of the input space. Therefore, dimensionality reduction techniques are needed to explore and analyze such huge data sets. In high-dimensional data, many dimensions are often irrelevant. These irrelevant dimensions can confuse analysis algorithms by hiding useful information in noisy data. As the number of dimensions in a dataset increases, distance measures become increasingly meaningless: in very high dimensions, additional dimensions spread out the points until they are almost equidistant from each other.

Various dimensionality reduction methods have been proposed, including both term transformation and term selection techniques. Feature transformation techniques attempt to generate a smaller set

of “synthetic” terms by creating combinations of the original terms. These techniques are

very successful in uncovering latent structure in datasets. However, since they preserve the relative

distances between documents, they are less effective when there are large numbers of irrelevant terms

that hide the differences between sets of similar documents in a sea of noise. In addition, since the synthetic terms are combinations of the original ones, it may be very difficult to interpret them in the context of the domain. Term selection methods, on the other hand, have the advantage of selecting the most relevant dimensions from a dataset, and of revealing groups of documents that are similar within a subset of their terms.

2.4.1. Term Transformation

Term transformation techniques, also known as term extraction, apply a mapping of the multidimensional space into a space of fewer dimensions. This means that the original term space is transformed by applying algebraic transformation methods. These methods can be broadly classified into two groups: linear and non-linear methods.

- Linear techniques include independent component analysis (ICA) [Com94], principal component analysis (PCA) [Dun89], factor analysis [LaM71], and singular value decomposition (SVD, detailed in Section 3.2.3) [GoV89].

- Non-linear methods are themselves subdivided into two groups: those providing a mapping and those giving a visualization. The non-linear mapping methods include techniques such as kernel PCA [SSM99] and Gaussian process latent variable models (GPLVM) [Law03]. Non-linear visualization methods, which are based on proximity data (that is, distance measurements), include Locally Linear Embedding (LLE) [RoS00], Hessian LLE [DoG03], Laplacian Eigenmaps [BeN03], Multidimensional Scaling (MDS) [BoG97], Isometric Maps (ISOMAPS) [TSL00], and Local Tangent Space Alignment (LTSA) [ZhZ02].

The transformations generally preserve the original, relative distances between documents. Term

transformation is often a preprocessing step, allowing the analysis algorithm to use just a few of the newly

created synthetic terms. A few algorithms have incorporated the use of such transformations to identify

important terms and iteratively improve their performance [HiK99, DHZ02]. While often very useful,

these techniques do not actually remove any of the original terms from consideration. Thus, information

from irrelevant dimensions is preserved, making these techniques ineffective at revealing sets of similar

documents when there are large numbers of irrelevant terms that mask the sets. Another disadvantage of

using combinations of terms is that they are difficult to interpret, often making the algorithm results less

useful. Because of this, term transformations are best suited to datasets where most of the dimensions

are relevant, while many are highly correlated or redundant.
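
As a minimal sketch of SVD-based feature transformation (in Python with NumPy; the small count matrix is invented for illustration and is not data from this thesis), documents can be projected onto a few leading singular directions of the term-by-document matrix:

import numpy as np

def truncated_svd(term_doc_matrix, k):
    """Project documents onto the k leading singular directions of the
    term-by-document matrix A ~ U_k S_k V_k^T (feature transformation)."""
    A = np.asarray(term_doc_matrix, dtype=float)        # terms x documents
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = (np.diag(s_k) @ Vt_k).T                # documents in the reduced space
    return U_k, s_k, doc_coords

# Example: a 5-term x 4-document count matrix reduced to 2 synthetic dimensions.
A = [[2, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 2],
     [0, 2, 0, 1],
     [1, 0, 1, 0]]
U2, s2, docs_2d = truncated_svd(A, k=2)
print(docs_2d)   # each row: a document described by 2 synthetic terms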

2.4.2. Term Selection

2.4.2.1. Definition

Term selection (also known as subset selection) generally refers to selecting the set of feature terms that is most informative for executing a given machine learning task, while removing irrelevant or redundant terms. This process ultimately leads to a reduction of the dimensionality of the original term space, but the selected term set should contain sufficient, or even more reliable, information about the original data set. To this end, many criteria are used [BlL97, LiM98, PLL01, YuL03].

There are two approaches to term selection:

Forward selection starts with no terms and adds them one by one, at each step adding the term that decreases the error the most, until no further addition significantly decreases the error.

Backward selection starts with all the terms and removes them one by one, at each step removing the term whose removal decreases the error the most (or increases it only slightly), until any further removal would increase the error significantly.
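
A minimal Python sketch of greedy forward selection; here `error_of` stands for whatever evaluation function the task provides (for example, a cross-validated classification error), and the tolerance value is purely illustrative:

def forward_selection(terms, error_of, tolerance=1e-4):
    """Greedy forward selection: start with no terms and repeatedly add the
    term whose inclusion decreases the error the most, stopping when no
    addition yields a significant decrease."""
    selected = []
    current_error = error_of(selected)
    while True:
        candidates = [t for t in terms if t not in selected]
        if not candidates:
            break
        scored = [(error_of(selected + [t]), t) for t in candidates]
        best_error, best_term = min(scored)
        if current_error - best_error < tolerance:   # no significant improvement
            break
        selected.append(best_term)
        current_error = best_error
    return selected

Backward selection is the mirror image: start from the full term set and greedily remove terms while the error does not increase significantly.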

2.4.2.2. Feature Selection Methods

Term selection methods have relied heavily on the analysis of the characteristics of a given data set through statistical or information-theoretical measures. For text learning tasks, they primarily rely on the vocabulary-specific characteristics of a given textual data set to identify good term features. Although the statistics themselves do not take the meaning of the text into account, these methods have proved to be useful for text learning tasks (e.g., classification and clustering) [SAS04].

Many feature selection approaches have been proposed. Below, we review some of these approaches chronologically.

Kira and Rendell [KiR92] described a statistical feature selection algorithm called RELIEF that uses

instance based learning to assign a relevance weight to each feature.

John et al. [JKP94] addressed the problem of irrelevant features and the subset selection problem.

They presented definitions for irrelevance and for two degrees of relevance (weak and strong). They also

state that features selected should depend not only on the features and the target concept, but also on the

induction algorithm. Further, they claim that the filter model approach to subset selection should be

replaced with the wrapper model.

Pudil et al. [PNK94] presented “floating” search methods in feature selection. These are sequential

search methods characterized by a dynamically changing number of features included or eliminated at

each step. They were shown to give very good results and to be computationally more effective than the

branch and bound method.

Koller and Sahami [KoS96] examined a method for feature subset selection based on Information

Theory: they presented a theoretically justified model for optimal feature selection based on using cross-

entropy to minimize the amount of predictive information lost during feature elimination.

Jain and Zongker [JaZ97] considered various feature subset selection algorithms and found that the

sequential forward floating selection algorithm, proposed by Pudil et al. [PNK94], dominated the other

algorithms tested.

Dash and Liu [DaL97] gave a survey of feature selection methods for classification.

In a comparative study of feature selection methods in statistical learning of text categorization (with

a focus on aggressive dimensionality reduction), Yang and Pedersen [YaP97] evaluated document

frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI) and term strength

(TS); and found IG and CHI to be the most effective.
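
For illustration, the χ² (CHI) score of a term with respect to a category can be computed from a 2×2 contingency table roughly as follows (a sketch of the usual formulation, not necessarily the exact variant evaluated in [YaP97]):

def chi_square(n_tc, n_t, n_c, n):
    """Chi-square statistic between a term t and a category c, given:
    n_tc documents of category c containing t, n_t documents containing t,
    n_c documents of category c, and n documents in total."""
    a = n_tc                      # term present, category c
    b = n_t - n_tc                # term present, other categories
    c = n_c - n_tc                # term absent, category c
    d = n - n_t - n_c + n_tc      # term absent, other categories
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom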

Blum and Langley [BlL97] focused on two key issues: the problem of selecting relevant features and

the problem of selecting relevant examples.

Kohavi and John [KoJ97] introduced wrappers for feature subset selection. Their approach searches

for an optimal feature subset tailored to a particular learning algorithm and a particular training set.

Yang and Honavar [YaH98] used a genetic algorithm for feature subset selection.

Liu and Motoda [LiM98] wrote their book on feature selection which offers an overview of the

methods developed since the 1970s and provides a general framework in order to examine these

methods and categorize them.

Vesanto and Ahola [VeA99] proposed to visually detect correlation using a self-organizing maps

based approach (SOM).

Makarenkov and Legendre [MaL01] try to approximate an ultra-metric in the Euclidean space or to preserve the set of the k-nearest neighbors.

Weston et al. [WMC01] introduced a method of feature selection for SVMs which is based upon

finding those features which minimize bounds on the leave-one-out error. The method was shown to be

superior to some standard feature selection algorithms on the data sets tested.

Xing et al. [XJK01] successfully applied feature selection methods (using a hybrid of filter and

wrapper approaches) to a classification problem in molecular biology involving only 72 data points in a

7130 dimensional space. They also investigated regularization methods as an alternative to feature

selection, and showed that feature selection methods were preferable in the problem they tackled.

Mitra et al. [MMP02] use a similarity measure that corresponds to the lowest eigenvalue of

correlation matrix between two features.

See Miller [Mil02] for a book on subset selection in regression.

Forman [For03] presented an empirical comparison of twelve feature selection methods. Results

revealed the surprising performance of a new feature selection metric, ‘Bi-Normal Separation’ (BNS).

Dhillon et al. [DKN03] present two term selection techniques, the first based on the term variance

quality measure, while the second is based on co-occurrence of “similar” terms in “the same context”.

Guyon and Elisseeff [GuE03] gave an introduction to variable and feature selection. They

recommend using a linear predictor of your choice (e.g. a linear SVM) and select variables in two

alternate ways: (1) with a nested subset selection method performing forward or backward selection or

with multiplicative updates; (2) with a variable ranking method using correlation coefficient or mutual

information.

Guérif et al. [GBJ05] used a similar idea to Vesanto and Ahola’s work [VeA99] and integrated a

weighting mechanism in the SOM training algorithm to reduce the redundancy side effects.

More recently, some approaches have been proposed to address the difficult issue of eliminating irrelevant features in the unsupervised learning context [Bla06, GuB06]. These approaches use partition quality measures such as the Davies-Bouldin index [DaB79, GuB06], the Wemmert and Gancarski index, or the entropy [Bla06]. In addition, Guérif and Bennani [GuB07] have extended the w-k-means algorithm proposed by Huang et al. [HNR05] to the SOM framework and base their feature selection approach on the weighting coefficients learned during the optimization process.

2. 5. Studied Languages

2.5.1. English Language

English is a West Germanic language originating in England. It was the second most widely spoken language in the world5, and is used extensively as a second language and as an official language throughout the world, especially in Commonwealth countries and in many international organizations. English is the dominant international language in communication, science, business, aviation, entertainment, radio and diplomacy. The influence of the British Empire is the primary reason for the initial spread of the language far beyond the British Isles. Following World War II, the growing economic and cultural influence of the United States has significantly accelerated the spread of the language.

Hence, many studies have been devoted to this language, and it possesses a very rich base of freely available corpora, which helped us to evaluate a number of our studies.

5 http://www.photius.com/rankings/languages2.html, Ethnologue, 13th Edition, Barbara F. Grimes, Editor. © 1996,

Summer Institute of Linguistics, Retrieved on 10-05-2007.

2.5.2. Arabic Language

Arabic is currently the second most widely spoken language in the world, with an estimated number

of native speakers larger than 422 million6. Arabic is the official language in more than 24 countries7.

Since it is also the language of religious instruction in Islam, many more speakers have at least a passive

knowledge of the language. Until the advent of Islam in the seventh century CE, Arabic was primarily a

regional language. The Qur’an, Islam’s holy book, was revealed to the Prophet Muhammad (Peace be

upon him) in Arabic, thereby giving the language great religious significance.

Muslims believe that to fully understand the message of the Qur’an, it must be read in its original

language: Arabic. Thus, the importance of the Arabic language extends well beyond the borders of the

Arab world. There are over 1.5 billion Muslims worldwide, and they all strive to learn Arabic in order to

read and pray in the language of revelation. Hence, Arabic has seen a very rapid growth. Statistics show

that since 1995, when the first Arabic newspaper “Asharq Alawsat” (Middle East) was launched online8,

the number of Arabic websites has been growing exponentially. By 2000 there were about 20 thousand

Arabic sites on the web, and by 2006 the number was estimated at around 100 million.

2.5.3. Arabic Forms

There are three forms of Arabic: Classical, Modern Standard, and Colloquial. The Qur’an became the fixed standard for Arabic, particularly for the written form of the language. Arabs consider the “Classical Arabic” of the Qur’an as the ultimate in linguistic beauty and perfection. The contemporary “Modern Standard Arabic,” based on the classical form of the language, is used in literature, print media, and formal communication such as news broadcasts; while the “Colloquial Arabic”, or locally spoken dialect, varies from country to country and region to region throughout the Arab world.

The written Arabic has changed comparatively little since the seventh century; spoken Arabic has

assumed many local and regional variations. It has also incorporated foreign words; for example, in the

twentieth century, many new non-Arabic words have found their way into the language, particularly

terms relating to modern technology. Although there are Modern Standard Arabic equivalents for “computer”, “telephone”, “television”, and “radio”, most Arabs, in speaking, will use the English or French versions of these words.

6 http://encarta.msn.com/media_701500404/Languages_Spoken_by_More_Than_10_Million_People.html, Microsoft ®

Encarta ® 2006, Retrieved on 10-05-2007. 7 http://en.wikipedia.org/wiki/List_of_official_languages, Retrieved on 10-05-2007. 8 www.asharqalawsat.com.

2.5.4. Arabic Language Characteristics

Arabic is a Semitic language, like Hebrew, Aramaic, and Amharic. Unlike Latin-based alphabets, the orientation of writing in Arabic is from right to left. The Arabic alphabet consists of 28 letters, many of which parallel letters in the Roman alphabet (see Table 2.1). The letters are strung together to form words in one way only; there is no distinction between printing and cursive as there is in English. Neither are there capital and lowercase letters; all the letters are the same.

Arabic letter | Corresponding | Pronunciation || Arabic letter | Corresponding | Pronunciation
ا | a* | Alif  || ض | d  | Daad
ب | b  | Baa   || ط | t  | Taa
ت | t  | Taa   || ظ | z  | Thaa
ث | th | Thaa  || ع | ‘  | Ayn
ج | j  | Jiim  || غ | gh | Ghayn
ح | h  | Haa   || ف | f  | Faa
خ | kh | Kha   || ق | q  | Qaaf
د | d  | Daal  || ك | k  | Kaaf
ذ | dh | Thaal || ل | l  | Laam
ر | r  | Raa   || م | m  | Miim
ز | z  | Zaayn || ن | n  | Nuun
س | s  | Siin  || ه | h  | Haa
ش | sh | Shiin || و | w* | Waaw
ص | s  | Saad  || ي | y* | Yaa

* when Alif, Waaw or Yaa is used as a consonant

Table 2.1. Arabic letters.

The shape of the letter, however, changes depending on neighboring characters and their placement within a word. Table 2.2 shows the four different shapes of the letter “غ” ‘gh’ (Ghayn). In general, all the letters are connected to one another except (ا، د، ذ، ر، ز، و), which cannot be attached on the left.

Isolated | End | Middle | Beginning
غ | ـغ | ـغـ | غـ

Table 2.2. Different shapes of the letter “غ” ‘gh’ (Ghayn).

In Arabic, there are three long vowels aa, ii and uu, represented by the letters “ا” ‘a’ (alif) [a:], “ي” ‘y’ (yaa) [i:], and “و” ‘w’ (waaw) [u:] respectively. Diacritics, called respectively “فتحة” ‘fatha’ [æ], “كسرة” ‘kasra’ [i], “ضمة” ‘damma’ [u], “فتحتين” ‘fathatayn’ [an], “كسرتين” ‘kasratayn’ [in], “ضمتين” ‘dammatayn’ [un], “سكون” ‘sukuun’ (no vowel), and “شدة” ‘shaddah’, are placed above or below a consonant to mark short vowels and gemination or tashdeed (consonant doubling), and make the difference between words having the same representation. For example, if we consider the English words “some”, “sum” and “same”, non-diacritization (vowellessness) would reduce these words to “sm”. For Arabic examples see Table 2.3 (refer to Table A.1 for the mapping between the Arabic letters and their Buckwalter transliteration). Diacritics appear in the Holy Qur’an, and with less consistency in other religious texts, classical poetry, and textbooks for children and foreign learners; they also appear occasionally in complex texts to avoid ambiguity. In everyday writing, however, they are omitted, and the reader recognizes the words as a result of experience as well as the context.

Word | 1st interpretation | 2nd interpretation | 3rd interpretation
كتب “ktb” | كَتَبَ “kataba” Wrote | كُتِبَ “kutiba” Has been written | كُتُبٌ “kutubN” Books
مدرسة “mdrsp” | مَدْرَسَةٌ “madorasapN” School | مُدَرِّسَةٌ “mudar~isapN” Teacher | مُدَرَّسَةٌ “mudar~asapN” Taught

Table 2.3. Ambiguity caused by the absence of vowels in the words “كتب” ‘ktb’ and “مدرسة” ‘mdrsp’.

In addition to singular and plural constructs, Arabic has a form called “dual” that indicates precisely two of something. For example, a pen is “قلم” (qalam), two pens are “قلمين” (qalamayn), and pens are “أقلام” (aqlaam). As in French, Spanish, and many other languages, Arabic nouns are either feminine or masculine, and the verbs and adjectives that refer to them must agree in gender. In written Arabic, case endings are used to designate parts of speech (subject, object, prepositional phrase, etc.), in a similar fashion to Latin and German.

English imposes a large number of constraints on word order. Arabic, however, is distinguished by

its high syntactical flexibility. This flexibility includes: the omission of some prepositional phrases

associated with verbs; the possibility of using several prepositions with the same verb while preserving

the meaning; allowing more than one matching case between the verb and the verbal subject, and the

adjective and its broken plural, and the sharpness of pronominalization phenomena where the pronouns

usually indicate the original positions of words before their extra-positioning, fronting and omission. In

other words, Arabic allows a great deal of freedom in the ordering of words in a sentence. Thus, the

syntax of the sentence can vary according to transformational mechanisms such as extra-position, fronting and omission, or according to syntactic replacement, such as an agent noun in place of a verb.

2.5.4.1. Arabic Morphology

Arabic words are divided into three types: noun, verb, and particle [Abd87, Als99]. Nouns and verbs

are derived from a closed set of around 10,000 roots [Ibn90]. Generally speaking, in English, the root is

sometimes called the word base or stem; it is the part of the word that remains after the removal of

affixes [Alku91]. In Arabic, however, the base or stem is different from the root [Ala90]. In Arabic, the

root is the original form of the word before any transformation process. The roots are commonly three or

four letters and are rarely five letters. Most Arabic words are derived from roots according to specific

rules, by applying templates and schemes, in order to construct groups of words whose meanings relate

to each other. Extensive word families are constructed by adding prefixes, infixes (letters added inside

the word), and suffixes to a root. There are about 900 patterns [AlA04a]; some of them are more

complex when the gemination is used. Table 2.4 shows an example of 3 letter roots’ templates. Those

words that are not derived from roots do not seem to follow a similar set of well-defined rules. They are

divided into two kinds: primitive (not derived from another a verbal root) like lion “ا�” ‘Asd’, or

borrowing from foreign languages like oxide “<8اآ” ‘Aksyd’. Instead they may have group showing

their family resemblances (for an example see Table 2.5).

CCC | فعل | ktb | كتب | Writing notion
CaCaCa | فعل | kataba | كتب | Wrote
CaACiC | فاعل | kaAtib | كاتب | Writer
CaCuwC | فعول | katuwb | كتوب | skilled writer
CiCaAC | فعال | kitaAb | كتاب | Book
CuCay~iC | فعيّل | kutay~ib | كتيّب | handbook
maCOCaC | مفعل | makOtab | مكتب | Desk
maCOCaCap | مفعلة | makOtabap | مكتبة | Library
CaCaACiyC | فعاليل | kataAtiyb | كتاتيب | Quran school

“C” stands for the letters that are part of the root. An underlined “C” stands for a letter that is doubled. “a, i, u, …” designate vowels, and “m” represents a derivation consonant.

Table 2.4. Some templates generated from roots, with examples from the root (“كتب” “ktb”).

Arabic Word | Transliteration | Pattern | Transliteration | English Word
أكسد | >akOsada | فعلل | faEOlala | Oxidize
مؤكسد | mu&akOsad | مفعلل | mufaEOlal | Oxidized
أكسدة | >aksadah | فعللة | faEOlalah | Oxidation
تأكسد | ta>aksud | تفعلل | tafaEOlul | Oxidation

Table 2.5. Derivations from a borrowed word.

When using an Arabic dictionary, one does not simply look up words alphabetically. The three letter

root must first be determined, and then it can be located alphabetically in the dictionary. Under the root,

the different words that belong to that word family are listed. The number of unique Arabic words (or

surface forms) is estimated to be 6 x 10^10 words [Att00].

In this thesis, a word is any Arabic surface form, a stem is a word without any prefixes or suffixes,

and a root is a linguistic unit of meaning, which has no prefix, suffix, or infix (for more details see the

Appendix A). However, often irregular roots, which contain double or weak letters, lead to stems and

words that have letters from the root that are deleted or replaced.

2.5.4.2. Word-form Structures

Arabic word-forms can be roughly considered as equivalent to graphic words. In the Arabic language, a word can signify a full sentence, due to its composed structure based on the agglutination of grammar elements, where the prefixes and suffixes contribute to its form.

Prefixes

Arabic prefixes are sets of letters and articles attached to the beginning of the lexical word and written as part of it. A small inventory of the prefixes in Arabic yields the following grammatical categories: the definite article “ال” ‘Al’ (the), the connectives “ف” ‘fa’ and “و” ‘wa’ (and), the prepositions “ب” ‘bi’ (with), “ل” ‘li’ (for, to), “ك” ‘ka’ (as), the particle of the future used with verbs “س” ‘sa’ and the conjunctive particle “ل” ‘li’ (in order to), the negation “لا” ‘lA’ and the conditional particle “ل” ‘la’, and the interrogative particle “أ” ‘>’ (alif-hamza). We must take into consideration that in written language the vowel is omitted. This means that, in practice, those grammatical categories are reduced to no more than one consonant which is written onto the word, with the exception of the definite article, the negation, and the interrogative particle alif-hamza. The one-consonant prefixes are called consonant particles. The fact that they consist of one sole consonant complicates their identification in a text.

Indeed, many words in Arabic do start with one of these consonants. It is true that many Arabic words are

composed of three consonants, but this is not always the case. It might be possible to identify those

prefixes by comparing prefixed words to a huge database of lexical forms in order to define which

words contain prefixes and which do not. The outcome of this process, however, is not clear at all.

However, the above mentioned prefixes also occur in combination. This means that in practice two or three prefixes can be linked to a word. The three most frequently used combinations of prefixes are: (1) a combination of a connective and a preposition (for instance: “وب” ‘wa-bi’, in written language “وب” ‘wb’, meaning: and with), (2) a combination of a preposition and the article (for instance: “بال” ‘bi-Al’, in written language “بال” ‘bAl’, meaning: with the), and (3) a combination of three particles, which is most commonly the combination of a connective, a preposition and the article (for instance: “وبال” ‘wa-bi-Al’, in written language “وبال” ‘wbAl’, meaning: and with the).

Suffixes

In Arabic, suffixes are sets of letters, articles, and pronouns attached to the end of the word and written as part of it. There are 17 used as possessive suffixes. Besides, there is the suffix “ا” ‘A’ (alif), which is used as an undefined accusative, and there is the suffix of the energetic, “نّ” ‘nã’ (nna). The possessive suffixes consist of one or two consonants. It is obvious that a one-consonant suffix is more difficult to identify than a two-consonant one. Moreover, there will always remain combinations which are ambiguous. The suffix “ها” ‘hA’, for instance, of the third person singular feminine can easily be mixed up with the undefined accusative “ا” ‘A’ (alif) of a word ending with the consonant “ه” ‘h’.

The representation below shows a possible word structure. Note that the writing and reading of an

Arabic word are from right to left.

Postfix Suffix Scheme Prefix Antefix

- The antefixes are prepositions or conjunctions.

- The prefixes and suffixes express grammatical features and indicate the functions: noun case,

verb mood and modalities (number, gender, person …).

- The postfixes are personal pronouns.

- The scheme is a stem.

Example: The word “أتتذكروننا؟” ‘>atata*akãrunanaA’ expresses in English the sentence “Do you remember us?”. The segmentation of this word gives the following constituents:

نا | ون | تذكّر | ت | أ

Antefix : “ أ ” ‘>’ Interrogative conjunction.

Prefix : “ت” ‘ta’ Verbal prefix.

Scheme: “تذكّر” ‘ta*ak~ar’.

Suffix: “ون” ‘wn’ Verbal suffix.

Postfix: “نا” ‘naA’ Pronoun suffix.

2.5.5. Anomalies

As is generally known, the Arabic language is complicated for natural language processing because of the combination of two main language characteristics. The first is the agglutinative nature of the language, and the second is the vowellessness of the language, which causes problems of ambiguity at different levels and complicates the identification of words.

2.5.5.1. Agglutination

The first problem is the identification of words in sentences. As in most European languages, Arabic

words can, to a certain degree, be identified in computer terms as a string of characters between blanks.

Two blanks in a text serve as a marker for the separation of strings of characters, but those strings of

characters do not always coincide with words. Some Arabic grammatical categories which are

considered words in other languages appear to be affixes. Those affixes are directly linked to the words

in Arabic (as is explicated in Section 2.5.4.2), which means that a string of characters between two

blanks can contain more than one word so that multiword combinations are found which are not

separated by blanks.

The string is ambiguous, as the affixes could be an attached particle or a part of the word. Thus a form such as “فلكي” ‘flky’ can be read as “فلكي” (falaky), meaning astronomer, or “فلكي” (falikay), which means then for, or “فلكيّ” (falikayyi), which means then for ironing or burning. Note that there is no deterministic way to tell whether the first letter is part of the word or a prefix.

2.5.5.2. The Vowelless Nature of the Arabic Language

The second problem in Arabic is the vowellessness of the words in sentences. This causes problems not only for the previously mentioned multiword combinations, but also at the word level. The vowellessness affects the meaning of words. As an example, we take the string of characters consisting of the consonants “ك” (kaf) and “ل” (lam). A reader could identify these two consonants as the noun “كل” (kullo), which means “all”, or the verb “كل” (kul) (the verb “أكل” (akala) in the imperative for the second person singular masculine), which means “eat”. Another example of such ambiguity is the string “مدافع” ‘mdAfE’, which could mean, according to its vocalization, “مدافع” ‘mudaAfiE’ “defender” or “مدافع” ‘madaAfiE’ “cannons”.

Also it affects the grammatical labeling of words, which is especially the case for verbs. The

different persons of the verb form, both in the present and past tenses, are in most cases only identifiable

by means of vowels which are omitted. The verb form “كتبت” ‘ktbt’, for example, can refer to four possible persons: i.e. “كتبتُ” ‘katabOtu’ for the first person singular, “كتبتَ” ‘katabOta’ for the second person singular masculine, “كتبتِ” ‘katabOti’ for the second person singular feminine, and “كتبتْ” ‘katabatO’ for the third person singular feminine. It is almost impossible for a computer program to determine the subject of these verbs. Only the context can help in defining the correct person of a verb form. In this respect some help might be expected from a minimal form of text categorization. Indeed, in newspaper text, the first person singular is less likely to occur, whereas in literature this person might occur more abundantly. Nevertheless, it seems quite difficult to tag texts automatically when they are not vocalized or when the larger context cannot be taken into account.

2.5.6. Early Work

Although academia has made significant achievements in the Arabic text retrieval field, the complex morphological structure of the Arabic language provides many challenges. Hence, research and development (R&D) in Arabic text retrieval still has a long way to go.

Existing Arabic text retrieval systems can be classified into three groups: Full-form-based IR, Morphology-based IR, and Non-rule-based IR.

2.5.6.1. Full-form-based IR

Despite academic research, most of the commercial Arabic IR systems presented by search engines are very primitive, all using a very basic string matching search. This includes what are classified as native Arabic search engines, meaning engines owned or managed in or by Arab companies or institutions, such as the Sakhr web engine Johaina-sakhr and ayna; those considered Unicode multilingual engines, such as AltaVista and Google; and web directories, where documents are classified based on subject categorization, such as Naseej (the first Arabic portal site on the Internet, launched in early 1997 to serve the growing number of Arabs on the Internet), Al-Murshid, Art Arab, and Yasalaam9.

The issue with these types of search engines, where the search is literal, is their limitations. Although they are based on the simplest search and retrieval method, which has the advantage that all the returned documents without a doubt contain the exact term for which the user is looking, they also have the major disadvantage that many, if not most, of the documents containing the terms in different forms will be missed. Given the many ambiguities of written Arabic, the success rate of these engines is quite low. For example, if the user searches for “كتاب” (kitaab), which means “book” in English, he or she will not find documents that only contain “الكتاب” (al-kitaab), which means “the book”.

9 All these links are retrieved on 10-05-2007.

2.5.6.2. Morphology-based IR

The efforts that have been made in the academic environment to evaluate more sophisticated systems give an idea about the next generation of Arabic search engines. Evaluation has been performed on systems using multiple approaches to incorporating morphology. The different classifications of Arabic morphological analysis techniques proposed in the literature are reviewed in the work of Al-Sughaiyer and Al-Kharashi [AlA04a]. However, in this work we adapt the Larkey et al. classification [LBC02], in which Arabic stemmers are classified into four different classes, namely: manually constructed dictionaries, algorithmic light stemmers which remove prefixes and suffixes, morphological analyses which attempt to find roots, and statistical stemmers.

Constructed Dictionaries: Manually constructed dictionaries of words with stemming information

are in surprisingly wide use. Al-Kharashi and Evens worked with small text collections, for which they

manually built dictionaries of roots and stems for each word to be indexed [AlE94]. Tim Buckwalter10

developed a set of lexicons of Arabic stems, prefixes, and suffixes, with truth tables indicating legal

combinations. The BBN group used this table-based stemmer in TREC - 2001 [XFW01].

Algorithmic Light Stemmers: Light stemming refers to a process of stripping off a small set of

prefixes and/or suffixes, without trying to deal with infixes, or recognize patterns and find roots

[LBC02, Dar03]. Although light stemming can correctly conflate many variants of words into large stem

classes, it can fail to conflate other forms that should go together. For example, broken (irregular)

plurals for nouns and adjectives do not get conflated with their singular forms, and past tense verbs do

not get conflated with their present tense forms, because they retain some affixes and internal differences, like the noun "سدود" (soduud), the plural of "سد" (sad), which means "dam".
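As an illustration only, the following minimal Python sketch shows the general shape of such a light stemmer; the prefix and suffix lists here are hypothetical examples, far smaller than the affix sets used by the published light stemmers cited in this section.

# A minimal light-stemming sketch; the affix lists are illustrative, not those of Al-Stem.
PREFIXES = ["\u0648\u0627\u0644", "\u0627\u0644", "\u0648"]            # wal-, al-, wa-
SUFFIXES = ["\u0647\u0627", "\u0627\u062a", "\u0648\u0646", "\u0629"]  # -haa, -aat, -uun, -a(t)

def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, never shortening below min_len."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word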

Morphological Analyzers: Several morphological analyzers have been developed for Arabic

[AlA89, Als96, Bee96, KhG99, DDJ01, GPD04, TEC05]11 but few have received a standard IR

evaluation. Such analyzers find the root, or any number of possible roots for each word. Since most

verbs and nouns in Arabic are derived from triliteral (or, rarely, quadriliteral) roots, identifying the

underlying root of each word theoretically retrieves most of the documents containing a given search

term regardless of form. However, there are some significant challenges with this approach.

Determining the root for a given word is extremely difficult, since it requires a detailed morphological,

syntactic and semantic analysis of the text to fully disambiguate the root forms. The issue is complicated

further by the fact that not all words are derived from roots. For example, loan words (words borrowed

from another language) are not based on root forms, although there are even exceptions to this rule. For

10 Buckwalter, T. Qamus: Arabic lexicography, http://www.qamus.org/lexicon.htm, Retrieved on 10-10-2007 11 http://zeus.cs.pacificu.edu/shereen/research.htm#stemming, Retrieved on 10-10-2007


example, some loan words that have a structure similar to triliteral roots, such as the English word film "فيلم", are handled grammatically as if they were root-based, adding to the complexity of this type of search. Finally, the root can serve as the foundation for a wide variety of words with related meanings. The root "كتب" (ktb) is used for many words related to writing, including "كتب" (kataba), which means "to write"; "كتاب" (kitaab), which means "book"; "مكتب" (maktab), which means "office"; and "كاتب" (kaatib), which means "author". But the same root is also used for "regiment/battalion": "كتيبة" (katyba). As a result, searching based on root forms results in very high recall, but precision is usually quite low.

2.5.6.3. Statistical Stemmers Within the statistical stemmer class, we distinguish between two kinds of stemmers: those that group word variants using clustering techniques, and n-gram stemmers. The former model consists in grouping, as a conflation or equivalence class, words that result in a common root after applying a specific algorithm. These equivalence classes are not overlapping: each word belongs to exactly one class. Based on co-occurrence analysis and a variant of EMIM (expected mutual information) [Van79, ChH89], which measures the proportion of word co-occurrences that are over and above what would be expected by chance, statistical stemmers for the Arabic language were used to refine stem-based and root-based stemmers [LBC02]; they were also applied to n-gram stemmers for the English and Spanish languages [XuC98].

Statistical stemming applied to the best Arabic stemmers (Darwish light stemmer modified by

Larkey [LBC02], and Khoja root-based stemmer [KhG99]12) changes classes a great deal, but does not

improve (or hurt) overall retrieval performance. This may be attributed to the clustering method having a high bias against low-frequency variants.

The second statistical model, the n-gram, generates a document vector by moving a window of n characters in length through the text, enabling a statistical description of the language by learning the occurrence probability of each group of these n characters.

N-gram stemmers have different challenges primarily caused by the significantly larger number of

unique terms in an Arabic corpus, and the peculiarities imposed by the Arabic infix structure that

reduces the rate of correct n-gram matching.

Published studies comparing the use of stems against roots for information retrieval are contradictory. Older studies revealed that words sharing a root are semantically related, and root indexing is

12 http://zeus.cs.pacificu.edu/shereen/research.htm#stemming, Retrieved on 10-10-2007


reported to outperform stem and word indexing on retrieval performance [HKE97, AAE99, Dar02].

However, later works on the TREC collection showed two different results. Darwish (as cited by Larkey

et al. [LBC02]) found no consistent difference between root and stem while Al-Jlayl & Frieder,

Goweder et al. and Thagva et al. [AlF02, GPD04, TEC05] showed that stem-based retrieval is more

effective than root-based retrieval. The older studies showing the superiority of roots over stems are based on small and nonstandard test collections, which makes their results hard to generalize.

Similarly, the work of Larkey et al. [LBC02] showed that the statistical stemmer, based on co-occurrence, is still inferior to good light stemming and morphological analysis. In addition, the work of Mustafa and Al-Radaideh [MuA04] indicated that the digram method offers a better performance than the trigram method with respect to conflation precision and conflation recall ratios, but in either case the n-gram approach does not appear to provide a good performance compared to the light stemming approach. Hence, we can conclude that Al-Stem (Darwish's stem-based stemmer, modified by Larkey) was, up to the date this study was carried out, the best known published stemmer.

2. 6. Arabic Corpus In attempts to study and evaluate IR systems, morphological analyzers, and machine translation systems for the Arabic language, researchers initiated the creation of corpora. Among these Arabic corpora, some are available, such as the AFP corpus, the Al-Hayat newspaper collection, Arabic Gigaword, treebanks, and the ICA. However, with the exception of the initial version of the ICA, comprising about 448 files totaling approximately 13.5 MB in uncompressed form, which has been made available for free, all the other corpora are not free.

2.6.1. AFP Corpus In 2001 LDC released the Arabic Newswire catalog number LDC2001T55, a corpus composed of

articles from the Agence France Presse (AFP) Arabic Newswire. The corpus size is 869 megabytes

divided over 383,872 documents. The corpus was tagged using SGML and was trans-coded to Unicode

(UTF-8). The corpus includes articles from May 13th 1994 to December 20th 2000 with approximately

76 million tokens and 666 094 unique words.

2.6.2. Al-Hayat Newspaper Al-Hayat newspaper is a collection from the European Language Resources Distribution Agency

(ELRA) distributed under the catalog reference ELRA W0030 Arabic Data Set. The corpus was

developed in the course of a research project at the University of Essex, in collaboration with the Open

University.

The corpus contains Al-Hayat newspaper articles with value added for Language Engineering and


Information Retrieval applications development purposes.

The data have been distributed into seven subject-specific databases, thus following the Al-Hayat

subject tags: General, Car, Computer, News, Economics, Science, and Sport.

Mark-up, numbers, special characters and punctuation have been removed. The size of the total file

is 268 MB. The dataset contains 18 639 264 distinct tokens in 42 591 articles, organized in 7 domains.

2.6.3. Arabic Gigaword In 2003 LDC also released Arabic Gigaword, catalog number LDC2003T12, a bigger and richer corpus compiled from different sources, including Agence France Presse, Al Hayat News Agency, Al Nahar News Agency and Xinhua News Agency. There are 319 files, totalling approximately 1.1 GB in compressed form (4348 MB uncompressed, and 391 619 words).

Besides this technical information, little is known about investigations of these collections and their limitations in terms of richness and representativeness.

2.6.4. Treebanks A treebank is a text corpus in which each sentence has been annotated with syntactic structure.

Syntactic structure is commonly represented as a tree structure, hence the name treebank. Treebanks can

be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for

training or testing parsers.

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech

tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.

Treebanks can be created completely manually, where linguists annotate each sentence with

syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which

linguists then check and, if necessary, correct.

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the

BulTreeBank13 follows head-driven phrase structure grammar (HPSG)) but most try to be less theory-

specific. However, two main groups can be distinguished: treebanks that annotate phrase structure, such as the Penn Arabic Treebank14, and those that annotate dependency structure, such as the Prague Arabic Dependency Treebank15.

13 http://www.bultreebank.org/, Retrieved on 10-8-2007. 14 http://www.ircs.upenn.edu/arabic/, Retrieved on 10-8-2007. 15 http://ufal.mff.cuni.cz/padt/PADT_1.0/index.html, Retrieved on 10-8-2007.


2.6.5. Other Efforts The International Corpus of Arabic (ICA) by Al-Sulaiti and Atwell [AlA04b, AlA05] is an example of the efforts to build suitable resources that could be made available to researchers in areas related to Arabic. The ICA is designed on the principles of the ICE (International Corpus of English). This project aims at a new corpus that will include a wide range of sources representing contemporary Arabic. An initial version, the CCA (Corpus of Contemporary Arabic), has been made available for free use at http://www.comp.leeds.ac.uk/eric/latifa/research.htm16.

Other corpora are listed on the Latifa Al-Sulaiti Web site17, and in the NEMLAR survey on Arabic language resources and tools of 2005 [NiC05].

2. 7. Summary In this chapter, we highlight the different models of information retrieval, especially the vector space model. We give a taxonomy of clustering algorithms, and explain the usefulness and the use of clustering in the information retrieval process. We introduce dimension reduction techniques, and review chronologically the feature selection methods used for clustering. Moreover, we present the Arabic language characteristics, and underline previous work undertaken with the aim of improving Arabic retrieval. Finally, we present the existing available Arabic corpora.

16 Retrieved on 10-8-2007. 17 http://www.comp.leeds.ac.uk/eric/latifa/arabic_corpora.htm, Retrieved on 10-8-2007.


Chapter 3 Latent Semantic Model 3. 1. Introduction

As storage becomes more plentiful and less expensive, the amount of electronic information is

growing at an exponential rate, and our ability to search that information and derive useful facts is

becoming more cumbersome unless new techniques are developed. Traditional lexical (or Boolean)

document retrieval techniques become less useful. Large, heterogeneous collections are difficult to

search since the sheer volume of unranked documents returned in response to a query overwhelms the user. Vector-space approaches to document retrieval, on the other hand, allow the user to search for

concepts rather than specific words and rank the results of the search according to their relative

similarity to the query. One vector-space approach, Latent Semantic Analysis (LSA), is capable of

achieving significant retrieval performance gains over standard lexical retrieval techniques (see

[Dum91]) by employing a reduced-rank model of the term-document space.

LSA [DDF90], because of the way it represents terms and documents in a term-document space and models the implicit higher-order structure in the association of terms and documents, is considered a vector-space approach to conceptual document retrieval. It is useful in situations where traditional lexical document retrieval approaches fail. LSA estimates the semantic content of the documents in a collection and uses that estimate to rank the documents in order of decreasing relevance to a user's query.

Since the search is based on the concepts contained in the documents rather than the document's

constituent terms, LSA can retrieve documents related to a user's query even when the query and the

documents do not share any common terms.

In the following, we describe the latent semantic analysis model components, including the term-document representation, the weighting phase, the singular value decomposition method, and the

standard query methods used in LSA. Moreover, we evaluate the impact of the weighting schemes, and

we compare the LSA performance to the standard Vector-Space Model (VSM).

3. 2. Model Description The latent semantic document retrieval model builds upon the prior research in document retrieval

and, using the singular value decomposition (SVD) [GoV89] to reduce the dimensions of the term-

document space, attempts to solve the synonymy and polysemy problems (Section 2.2.2) that plague

automatic document retrieval systems. LSA explicitly represents terms and documents in a rich, high-

dimensional space, allowing the underlying (“latent”), semantic relationships between terms and

documents to be exploited during searching.


LSA relies on the constituent terms of a document to suggest the document's semantic content.

However, the LSA model views the terms in a document as somewhat unreliable indicators of the

concepts contained in the document. It assumes that the variability of word choice partially obscures the

semantic structure of the document. By reducing the dimensionality of the term-document space, the

underlying, semantic relationships between documents are revealed, and much of the “noise”

(differences in word usage, terms that do not help distinguish documents, etc.) is eliminated. LSA

statistically analyses the patterns of word usage across the entire document collection, placing

documents with similar word usage patterns near each other in the term-document space, and allowing

semantically-related documents to be near each other even though they may not share terms [Dum91].

Compared to other document retrieval techniques, LSA performs surprisingly well. In one test,

Dumais [Dum91] found LSA provided 30% more related documents than standard word-based retrieval

techniques when searching the standard Med collection (see Section 3.3.1). Over five standard document

collections, the same study indicated LSA performed an average of 20% better than lexical retrieval

techniques. In addition, LSA is fully automatic and easy to use, requiring no complex expressions or

syntax to represent the query.

The following sections detail the LSA model steps.

3.2.1. Term-Document Representation In the LSA model, terms and documents are first represented by an $m \times n$ incidence matrix $A = [a_{ij}]$. Each of the m unique terms in the document collection is assigned a row in the matrix, while each of the n documents in the collection is assigned a column in the matrix. The non-zero element $a_{ij}$ indicates not only that term i occurs in document j, but also the number of times the term appears in that document. Since the number of terms in a given document is typically far less than the number of terms in the entire document collection, A is usually very sparse [BeC87].
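For illustration, such a raw incidence matrix can be built as in the following sketch (the tokenization is naive and the vocabulary ordering is arbitrary; these are simplifying assumptions, not part of the model itself).

from collections import Counter

def term_document_matrix(documents):
    """Build a dense m x n term-document count matrix A = [a_ij]."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for doc in tokenized for t in doc})    # the m unique terms
    index = {t: i for i, t in enumerate(terms)}
    A = [[0] * len(documents) for _ in terms]                # m rows, n columns
    for j, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            A[index[term]][j] = count                        # a_ij = frequency of term i in document j
    return terms, A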

3.2.2. Weighting The benefits of weighting are well-known in the document retrieval community [Jon72, Dum91,

Dum92]. LSA typically uses both a local and global weighting scheme to increase or decrease the

relative importance of terms within documents and across the entire document collection, respectively.

A combination of the local and global weighting functions is applied to each non-zero element of A,

$a_{ij} = L(i,j) \times G(i)$,   (1)

or

$a_{ij} = \frac{L(i,j)}{G(i)}$,   (2)

where L(i,j) is the local weighting function for term i indicating its importance in the document j, and


G(i) is the global weighting function for term i indicating its overall importance in the collection.

In addition to formula (1) [Dum91, Dum92] and formula (2)18, formula (3) [BeB99] is also utilized in this dissertation,

$a_{ij} = \frac{L(i,j) \times G(i)}{N(j)}$   (3)

where N(j), the document length normalization, is used to penalize the term weight for the document j

in accordance with its length. Such weighting functions are used to differentially treat terms and

documents to reflect knowledge that is beyond the collection of the documents.

Some popular local weighting schemes include [Dum91, Dum92]:

- Term Frequency: tf or $f_{ij}$ is the integer representing the number of times term i appears in document j.

- Binary Weighting: is equal to 1 if a term occurs in the document and 0 otherwise,

$a_{ij} = \begin{cases} 1 & \text{if } f_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$.

- log(Term Frequency + 1): is used to damp the effects of large differences in frequencies, such

that an additional occurrence of term i in document j is considered more important at smaller

term frequency levels than at larger levels.

Four well-known global weightings are: Normal, GfIdf, Idf, and Entropy. Each is defined in terms of the term frequency $f_{ij}$, the document frequency $df_i$, which is the number of documents in which term i occurs, and the global frequency $gf_i$, which is the total number of times term i occurs in the whole collection. N is the number of documents, and M is the number of terms in the collection.

- Normal: $\frac{1}{\sqrt{\sum_{j}^{N} f_{ij}^2}}$

It normalizes the length of each row (term) to 1. This has the effect of giving high weight to infrequent terms. However, it depends only on the sum of the squared frequencies and not the distribution of those frequencies.

- GfIdf: $\frac{gf_i}{df_i}$

- Idf: $\log_2 \frac{N}{df_i}$

GfIdf and Idf are closely related. Both of them weight terms inversely by the number of different

18 http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf, Retrieved on 10-28-2007.


documents in which they appear; moreover, GfIdf increases the weight of frequently occurring

terms. However, neither method depends on the distribution of terms in documents. They depend

only on the number of different documents in which a term occurs.

- 1 - Entropy: $1 - \sum_{j}^{N} \frac{p_{ij} \log p_{ij}}{\log N}$, where $p_{ij} = \frac{f_{ij}}{gf_i}$

Entropy is a sophisticated weighting scheme that takes into account the distribution of terms over documents. The average uncertainty or entropy of a term is given by $\sum_{j}^{N} \frac{p_{ij} \log p_{ij}}{\log N}$. Subtracting this quantity from a constant assigns minimum weight to terms that are equally distributed over documents (i.e. where $p_{ij} = \frac{1}{N}$), and maximum weight to terms which are concentrated in a few documents.

Furthermore, there are other global weighting schemes, such as:

- Global Entropy: $\log\left(1 - \sum_{j}^{N} \frac{p_{ij} \log p_{ij}}{\log N}\right)$;

- Shannon Entropy: $-\sum_{j}^{N} p_{ij} \log p_{ij}$ [LFL98];

- and Entropy: $1 + \sum_{j}^{N} \frac{p_{ij} \log p_{ij}}{\log N}$ 19.

In general, all global weighting schemes give a weaker weight to frequent terms or to those occurring in a lot of documents.
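To make the combination of formula (1) concrete, the sketch below applies the log local weight together with the Entropy global weight (the last variant listed above); it assumes natural logarithms and a dense NumPy count matrix, which are implementation choices rather than requirements of the model.

import numpy as np

def log_entropy_weight(F):
    """Formula (1) with L = log(tf + 1) and G = 1 + sum_j (p_ij log p_ij) / log N."""
    F = np.asarray(F, dtype=float)                 # m x n term-document counts
    m, n = F.shape
    gf = F.sum(axis=1, keepdims=True)              # global frequency gf_i of each term
    p = np.divide(F, gf, out=np.zeros_like(F), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    G = 1.0 + plogp.sum(axis=1) / np.log(n)        # one global weight per term
    L = np.log(F + 1.0)                            # local weight
    return L * G[:, np.newaxis]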

Two main reasons make the use of normalization necessary:

- Higher Term Frequencies: Long documents usually use the same terms repeatedly. As a result,

the term frequency factors may be large for long documents, increasing the average contribution of

its terms towards the query-document similarity.

- More Terms: Generally, vocabulary is richer and more varied in long documents than shorter

ones. This enhances the number of matches between a query and a long document, increasing the

query-document similarity, and the chances of retrieval of long documents in preference over shorter

documents.

The normalization can be either explicit, or implicit as effectuated by the cosine-based measure (the angular distance between query q and document D):

19 http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf, Retrieved on 10-28-2007.


$\cos(D, q) = \frac{D \cdot q}{\|D\| \; \|q\|}$.

Various normalization techniques are used in document retrieval systems. Following is a review of some

commonly used normalization techniques [SBM96]:

- Cosine Normalization: $\frac{1}{\sqrt{\sum_{i}^{M} w_i^2}}$, where $w_i = local_w(i,j) \times global_w(i)$,

Cosine normalization is the most commonly used normalization technique in the vector-space model. It attacks both normalization reasons in one step: higher individual term frequencies augment the individual weighting values $w_i$, increasing the penalty on the term weights. Also, if a document is rich, the number of individual weights in the cosine factor (M in the above formula) increases, yielding a higher normalization factor.

- Maximum tf Normalization: Another popular normalization technique is normalization of individual tf weights for a document by the maximum tf in the document. The Smart system's augmented tf factor $0.5 + 0.5 \times \frac{tf}{\max tf}$, and the tf weights used in the INQUERY system, $0.4 + 0.6 \times \frac{tf}{\max tf}$, are examples of such normalization.

By restricting the tf factors to a maximum value of 1.0, this technique only compensates for the first

normalization reason (higher tf s), while it does not make any correction for the second reason

(more terms). Hence, the technique turns out to be a “weak” form of normalization and favors the

retrieval of long documents.

- Byte Length Normalization: More recently, a length normalization scheme based on the byte

size of documents has been used in the Okapi system. This normalization factor attacks both

normalization reasons in one shot.

Other classic weighting schemes are used in the literature such as: the Tfc, Ltc weighting [AaE99],

and the Okapi BM-25 weighting [RWH94, Dar03].

- Tfc: $\frac{f_{ij} \times idf_i}{\sqrt{\sum_{k=1}^{M} (f_{kj} \times idf_k)^2}}$

The tf×idf weighting, even though widely used, does not take into account that documents may be of different lengths. The tfc weighting is similar to the tf×idf weighting except for the fact that length normalization is used as part of the word weighting formula.


- Ltc: $\frac{\log(f_{ij} + 1) \times idf_i}{\sqrt{\sum_{k=1}^{M} \left(\log(f_{kj} + 1) \times idf_k\right)^2}}$

A slightly different approach uses the logarithm of the word frequency instead of the raw word

frequency, thus reducing the effects of large differences in frequencies.

- Okapi BM-25: $\frac{3 \times f_{ij}}{2\left(0.25 + 0.75 \times \frac{dl_j}{\frac{1}{N}\sum_{k=1}^{N} dl_k}\right) + f_{ij}} \times \log\frac{N}{df_i}$

where $dl_j$ is the length of document j.
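A sketch of this weighting, following the formula above with the constants 0.25, 0.75, 2 and 3 (i.e. k1 = 2 and b = 0.75 in the usual BM-25 parameterization), might look as follows; the dense matrix layout and the natural-log idf are assumptions made for the example, not details of the system described in this dissertation.

import numpy as np

def okapi_bm25_weight(F):
    """Okapi BM-25 weights for an m x n term-document count matrix F."""
    F = np.asarray(F, dtype=float)
    m, n = F.shape
    df = (F > 0).sum(axis=1)                        # document frequency df_i
    idf = np.log(n / np.maximum(df, 1))             # log(N / df_i)
    dl = F.sum(axis=0)                              # document lengths dl_j
    avg_dl = dl.mean()
    denom = 2.0 * (0.25 + 0.75 * dl / avg_dl) + F   # 2*(0.25 + 0.75*dl_j/avg_dl) + f_ij
    return (3.0 * F / denom) * idf[:, np.newaxis]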

Among the weighting schemes tried with LSA (Section 3.3.2.1), we find that the Okapi BM-25 scheme provides a 7.9% - 27.7% advantage over the term frequency scheme on all the English corpora used in this chapter.

3.2.3. Computing the SVD Once the $m \times n$ matrix A has been created and properly weighted, a rank-k approximation ($k \ll \min(m, n)$) to A, denoted $A_k$, is computed using an orthogonal decomposition known as the singular value decomposition (SVD) [GoV89] to reduce the redundant information of this matrix. The SVD, a technique closely related to eigenvector decomposition and factor analysis [CuW85], is defined for a matrix A as the product of three matrices:

$A = U S V^T$,

where the columns of U and V are the left and right singular vectors, respectively, corresponding to the monotonically decreasing (in value) positive diagonal elements of S, which are called the singular values of the matrix A. As illustrated in Figure 3.1, the first k columns of the U and V matrices and the first (largest) k singular values of A are used to construct a rank-k approximation to A via $A_k = U_k S_k V_k^T$. The columns of U and V are orthogonal, such that $U^T U = V^T V = I_r$, where r is the rank of the matrix A. A theorem due to Eckart and Young [GoR71] states that $A_k$, constructed from the k largest singular triplets20 of A, is the closest rank-k approximation (in the least squares sense) to A [BeC87].

20 The triple $\{U_i, \sigma_i, V_i\}$, where $S = diag(\sigma_0, \sigma_1, ..., \sigma_{k-1})$, is called the $i$th singular triplet. $U_i$ and $V_i$ are the left and right singular vectors, respectively, corresponding to the $i$th largest singular value, $\sigma_i$, of the matrix A.
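In practice, the rank-k factors can be obtained with a sparse SVD routine; the following SciPy-based sketch is one possible implementation, not necessarily the one used in the experiments of this chapter.

import numpy as np
from scipy.sparse.linalg import svds

def rank_k_approximation(A, k):
    """Return the k largest singular triplets of A and the rank-k matrix A_k."""
    U, s, Vt = svds(np.asarray(A, dtype=float), k=k)
    order = np.argsort(s)[::-1]                  # order triplets by decreasing singular value
    Uk, Sk, Vtk = U[:, order], np.diag(s[order]), Vt[order, :]
    Ak = Uk @ Sk @ Vtk                           # closest rank-k approximation (Eckart-Young)
    return Uk, Sk, Vtk, Ak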


Figure 3.1. A pictorial representation of the SVD ($A_k = U_k S_k V_k^T$). The shaded areas of U and V, as well as the diagonal line in S, represent $A_k$, the reduced representation of the original term-document matrix A.

With regard to LSA, $A_k$ is the closest k-dimensional approximation to the original term-document space represented by the incidence matrix A. As stated previously, by reducing the dimensionality of A, much of the "noise" that causes poor retrieval performance is thought to be eliminated. Thus, although a high-dimensional representation appears to be required for good retrieval performance, care must be taken not to reconstruct A. If A is nearly reconstructed, the noise caused by variability of word choice and by terms that span or nearly span the document collection will not be eliminated, resulting in poor retrieval performance [BeC87]. Generally, the choice of the reduced dimension is empirical; it depends on the nature of the corpus, the type of the queries used (whether they are long, or short and represented by keywords), and the weighting scheme performed. This is experimentally proved in [ABE08] and in Section 3.3.2.2.

It is worthwhile to point out that, in the context of text retrieval, document vectors could either refer to the columns in A or the columns in $V^T$, and term vectors could either refer to the rows in A or the rows in U. And the same nomenclature applies to the dimension-reduced model, only with the subscripts dropped off. It is important to differentiate between these two kinds of document vectors (or term vectors) to avoid confusion. Also note that the term vectors and document vectors in A may be referred to as the initial/original (term or document) vectors since they have not been subjected to dimension reduction; while in $A_k$ they may be referred to as the (term or document) concepts, because the term-document matrix reduction captures semantic structure (i.e. concepts) while it rejects the noise that results from term usage variations.

In addition to the fact that the left and right singular vectors specify the locations of the terms and

documents respectively, the singular values are often used to scale the term and document vectors,

allowing clusters of terms and documents to be more readily identified. Within the reduced space,

semantically-related terms and documents presumably lie near each other since the SVD attempts to


derive the underlying semantic structure of the term-document space [BeC87].

3.2.4. Query Projection and Matching In the LSA model, queries are formed into pseudo-documents that specify the location of the query in the reduced term-document space. Given q, a vector whose non-zero elements contain the weighted term-frequency counts of the terms that appear in the query (using the same weighting schemes applied to the document collection being searched), the pseudo-document $q'$ can be represented according to three standard philosophies described in [Yan08].

These three different philosophies on how to conduct the query q in the dimension-reduced model give rise to the following versions of the query method:

(I) Version A

Underlying Philosophy: Column vectors $(V^T(:,1), ..., V^T(:,n))$ in matrix $V^T$ are k-dimensional document vectors, their dimension having been reduced from m. The dimensionally-reduced $V^T(:,i)$ should carry some kind of latent semantic information captured from the original model and may be used for querying purposes. However, since $V^T(:,i)$ is k-dimensional while q is m-dimensional, q needs to be translated into some proper form in order to compare it with $V^T(:,i)$. Observing that $A = (A(:,1), ..., A(:,n))$ and $V^T = (V^T(:,1), ..., V^T(:,n))$, the equation $A = U S V^T$ leads to $(A(:,1), ..., A(:,n)) = U S (V^T(:,1), ..., V^T(:,n))$. Thus, for any individual column vector in A, $A(:,i) = U S V^T(:,i)$ for $1 \le i \le n$, which implies that $V^T(:,i) = S^{-1} U^T A(:,i)$ for $1 \le i \le n$. Treating q like a normal document vector $A(:,i)$, q will be transformed to $q'_a = S^{-1} U^T q$ and will have the same dimension as $V^T(:,i)$.

Query Method: First, use the formula $q'_a = S^{-1} U^T q$ to translate the original query q into a form comparable with any column vector $V^T(:,i)$ in matrix $V^T$. Then compute the cosine between $q'_a$ and each $V^T(:,i)$ for $1 \le i \le n$.

(II) Version B

Underlying Philosophy: As mentioned earlier, document vectors can mean two different things: either the column vectors $(V^T(:,1), ..., V^T(:,n))$ in $V^T$ or the column vectors $(A(:,1), ..., A(:,n))$ in A. In fact, the latter ones might be a better choice for serving as document vectors because they are rescaled from the dimensionally reduced U and V by a factor of S after the SVD process. To utilize $(A(:,1), ..., A(:,n))$ for querying purposes, only one further step on the basis of version A needs to be taken, which is to scale the k-dimensional query representation back to m dimensions, obtaining $q'_b$:


$q'_b = U S q'_a = U S (S^{-1} U^T q) = U U^T q$ 21.

Query Method: First, use the formula $q'_b = U U^T q$ to translate the original query q into a folded-in-plus-rescaled form comparable with any column vector $A(:,i)$ in matrix A. Then compute the cosine between $q'_b$ and each $A(:,i)$ for $1 \le i \le n$.

(III) Version B'

Underlying Philosophy: All the reasoning behind version B sounds good except for one thing: since the m-dimensional column vectors $(A(:,1), ..., A(:,n))$ will be used as document vectors, there is no need to fold in q and then rescale it back to m dimensions: the original query q (which is already m-dimensional) can be used directly for comparison with each m-dimensional $A(:,i)$ for $1 \le i \le n$.

Query Method: Compute the cosine between q and each $A(:,i)$ for $1 \le i \le n$.

The above three different versions of query method are summarized in Table 3.1, along with the

conventional technique of lexical matching.

                     Document Vectors               Query Vector                                     Applicable Literature
Lexical Matching     m-dim column vectors in A      m-dim original query vector q                    Many
Version A            k-dim column vectors in V^T    k-dim folded-in query vector S^{-1} U^T q        [BDO95] [BeF96] [FiB02] [Jia97] [Jia98] [LeB97] [Let96] [Wit97]
Version B            m-dim column vectors in A      m-dim folded-in-plus-rescaled vector U U^T q     [DDF89] [DDF90]
Version B'           m-dim column vectors in A      m-dim original query vector q                    [BCB92] [BDJ99] [Din99] [Din01] [HSD00] [KoO96] [KoO98] [Zha98] [ZhG02]

Table 3.1. Comparison between Different Versions of the Standard Query Method.

Based on the analysis of the three standard versions, it was proved that version B and version B' are essentially equivalent. On the other hand, the search for the best version of the standard query method has shown a marked advantage for version B compared to version A [Yan08]. However, the latter is

21 It should be pointed out that, because of the dimensional reduction, $U^T U = I$ while $U U^T \neq I$.


still considered valuable due to its important role in saving storage space.
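To make the folding-in step concrete, the sketch below implements the version A query method, $q'_a = S_k^{-1} U_k^T q$, followed by cosine matching against the columns of $V_k^T$; it reuses rank-k factors such as those returned by the SVD sketch above, and makes no claim about the exact implementation used in the experiments.

import numpy as np

def query_version_a(q, Uk, Sk, Vtk):
    """Rank documents for an m-dimensional query vector q (version A)."""
    q_a = np.linalg.inv(Sk) @ Uk.T @ q        # q'_a = S_k^{-1} U_k^T q  (k-dimensional)
    docs = Vtk.T                              # rows are k-dimensional document vectors
    sims = docs @ q_a / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_a) + 1e-12)
    return np.argsort(sims)[::-1]             # document indices, most similar first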

3. 3. Applications and Results As a starting point for any application, we create a vector-space model for our data [SaM83]. Documents are typically represented by a term-frequency vector whose dimension is equal to the

number of unique words in the corpus, and each of its components indicating how many times a

particular word occurs in the document. To further improve the effectiveness of our systems applied to

English language, we use the TreeTagger part of speech tagger [Sch94], and we remove stopwords22.

The tagging process was done without training and the results of the tagging are used as-is. In that

respect, the results we obtain from subsequent modules could only be better if the output of the tagger

was corrected and the software trained.

3.3.1. Data The English testing data used in our experiments, in this chapter and some following chapters, are formed by mixing documents from multiple topics arbitrarily selected from standard

information science test collections. The text objects in these collections are bibliographic citations

(consisting of the full text of document titles, and abstracts), or the full text of short articles. Table 3.2

gives a brief description and summarizes the sizes of the datasets used.

Cisi: document abstracts in library science and related areas published between 1969 and 1977 and

extracted from Social Science Citation Index by the Institute for Scientific Information.

Cran: document abstracts in aeronautics and related areas, originally used for tests at the Cranfield

Institute of Technology in Bedford, England.

Med: document abstracts in biomedicine received from the National Library of Medicine.

Reuters-21578: short articles belonging to the Reuters-21578 collection23. This collection consists of news stories that appeared in the Reuters newswire in 1987 and mostly concern business and the economy. It contains multiple overlapping categories.

Collection name    Cisi    Cran    Med     Reuters-21578
Document number    1460    1400    1033    21578

Table 3.2. Size of collections.

For the document retrieval task, we have picked 30 queries from each of the first three collections,

22 ftp://ftp.cs.cornell.edu/pub/smart/english.stop, Retrieved on 10-28-2007. 23 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, Retrieved on 10-28-2007.


where 15 are used in the training phase to identify the best reduced dimension of the LSA model and the

other 15 are used in the test phase.

3.3.2. Experiments In these experiments, we are interested in evaluating the effectiveness of the weighting schemes,

after which we compare the performances of the latent semantic analysis and the vector-space models.

3.3.2.1. Weighting Schemes Impact As an extension to our previous work [ABE05], we study the effectiveness of 22 weighting schemes, formed as combinations of the global and local weighting schemes defined in Section 3.2.2, in addition to the TFC, LTC, and Okapi BM-25 schemes.

To conform to the notations in the literature, we denote a weighting scheme by a three-letter code in which the first letter corresponds to the local factor, the second letter to the global factor, and the third letter to the normalization component. For example, using the weighting scheme 'nnn' leaves the term frequency vector unchanged, whereas the weighting schemes 'ntn' and 'ntc' produce, respectively, the well-known tf×idf and Tfc weights. For indicating the inverse of a weighting scheme, we use a letter bar (rendered in the tables below as an underscore before the barred letter, e.g. n_Gn). All the notations are listed in Appendix B.

Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP
lNn      0.14    nGn      0.19    n_Gn     0.21    n_Nn     0.24    l_1En    0.27
nNn      0.14    nnn      0.20    ntc      0.22    n_GEn    0.24    ltn      0.27
nGEn     0.17    l1En     0.20    n_1En    0.22    l_Nn     0.25    l_SEn    0.27
n1En     0.18    ltc      0.21    l_Gn     0.22    n_SEn    0.25    l_GEn    0.28
lGEn     0.18    lGn      0.21    l_En     0.23    ntn      0.25    Okapi    0.32

Table 3.3. Result of weighting schemes in increasing order for the Cisi corpus.


Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP
lNn      0.10    nGn      0.18    n_1En    0.24    ntc      0.27    l_SEn    0.32
nNn      0.10    n_Gn     0.19    l_Gn     0.24    n_GEn    0.27    l_1En    0.33
nGEn     0.12    nnn      0.20    l_Nn     0.24    ntn      0.28    ltn      0.33
n1En     0.14    l1En     0.20    lGn      0.25    n_SEn    0.29    l_GEn    0.34
lGEn     0.16    n_Nn     0.24    ltc      0.25    l_En     0.30    Okapi    0.47

Table 3.4. Result of weighting schemes in increasing order for the Cran corpus.

Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP
nNn      0.09    nGEn     0.14    nnn      0.18    n_1En    0.20    ntn      0.24
lNn      0.09    ntc      0.15    nGn      0.19    l_1En    0.22    ltn      0.25
n_Nn     0.12    lGEn     0.16    l1En     0.20    lGn      0.23    l_GEn    0.25
ltc      0.13    n1En     0.17    l_Gn     0.20    n_GEn    0.23    l_SEn    0.25
l_Nn     0.13    n_Gn     0.18    l_En     0.20    n_SEn    0.24    Okapi    0.26

Table 3.5. Result of weighting schemes in increasing order for the Med corpus.


Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP    Scheme   MIAP
nNn      0.14    n_Gn     0.24    l_Gn     0.26    l_En     0.30    l_1En    0.35
lNn      0.14    l_Nn     0.25    l1En     0.27    lGn      0.31    l_GEn    0.36
nGEn     0.18    n_Nn     0.25    ntc      0.27    n_GEn    0.32    l_SEn    0.36
lGEn     0.22    nnn      0.25    ltc      0.27    n_SEn    0.33    ltn      0.37
n1En     0.22    nGn      0.26    n_1En    0.28    ntn      0.34    Okapi    0.41

Table 3.6. Result of weighting schemes in increasing order for the Cisi-Med corpus.

The preceding experiments show that the choice of a weighting scheme is very important, because some schemes destroy the mean interpolated average precision (MIAP) (see Section C.2.3). As we can see, for example, in Table 3.5, term frequency indexation (nnn) gives a better result than the first ten schemes (nNn, lNn, n_Nn, ltc, l_Nn, nGEn, ntc, lGEn, n1En, n_Gn).

The experiments also show that the Okapi BM-25 weighting scheme gives the best results over all the other schemes, in all the examples. Moreover, we remark that there is a marked difference in the rank of the other weighting schemes from one corpus to another. For example, when evaluating the performance of the system, the well-known and widely used tf×idf scheme (ntn) is ranked between 18th and 21st.

3.3.2.2. Reduced Dimension k We would like to highlight that the best reduced dimension k is known to be an empirical value that varies, typically, between 100 and 300 [Dum94], depending on the characteristics of each corpus. However, this dependency is not limited to the corpus size and sparsity characteristics; it is also related to the choice of the weighting scheme, as experimentally proved in Table 3.7 and in [ABE08]. Furthermore, from this table and the results of [ABE08], we can conclude that the Okapi BM-25 weighting scheme has another advantage beyond outperforming all the studied schemes in retrieval: it has the smallest best reduced dimension k compared to the other schemes in all experiments, even in the case of the Arabic language [ABE08].


          nnn    l_SEn  n_SEn  l_1En  n_1En  l1En   n1En   l_GEn  n_GEn  lGEn   nGEn   l_En
Cisi      808    1440   1438   1458   1447   1395   487    1137   1424   427    442    1424
Cran      1204   772    1121   1189   961    1096   1370   904    971    1400   1365   886
Med       316    123    186    266    311    418    931    86     316    152    251    960
Cisi-Med  1329   437    1018   1436   2081   1260   889    333    1329   1622   564    534

          l_Nn   n_Nn   lNn    nNn    l_Gn   n_Gn   lGn    nGn    ntn    ntc    ltc    ltn    Okapi
Cisi      1458   1460   219    244    1433   1457   109    109    138    1460   1460   100    40
Cran      1387   1400   655    606    1366   1369   1400   1400   910    1400   1400   1070   262
Med       810    1032   1032   206    321    745    157    146    341    531    366    112    53
Cisi-Med  1487   2491   2493   491    502    1634   428    219    532    868    1778   257    70

Table 3.7. The best reduced dimension for each weighting scheme in the case of four corpora.

3.3.2.3. Latent Semantic Model Effectiveness In this section, given that we have used version A to model the data (see Section 3.2.4) in our experiments, we are interested in evaluating the performance improvement offered by the LSA model over the VSM.

After identifying the best weighting scheme and the most effective reduced dimension k for each data set in the training phase, we compare in Figure 3.2 the test-phase results of the LSA model to the VSM results.

The interpolated recall-precision curves of the four experiments strengthen what was known about the LSA model. By computing the LSA and VSM MIAPs, we remark that LSA provides a 2% - 10% advantage over the VSM, even when using just version A for modeling the data.


Figure 3.2. The interpolated recall-precision curves of the LSA and the VSM models (panels: Cisi, Cran, Med, and Cisi-Med).

3. 4. Summary In this chapter, we have first recalled the traditional technique of document retrieval, which is the Vector-Space Model (VSM). Then, we have described one of its extended models, Latent Semantic Analysis (LSA), which shows substantial advancement over the traditional VSM, even when only version A is used to model the data and queries. Also, we have presented three combination methods, found in the current weighting literature, for the local and global weighting functions and the length normalization. Through some experiments, we have compared the application of twenty-five weighting schemes, and the comparison has shown advantages for the Okapi BM-25 weighting scheme. The first advantage of this scheme is its high performance improvement of the information retrieval system, while the second is that it yields the smallest best reduced dimension k for the LSA model.


Chapter 4 Document Clustering based on Diffusion Map 4. 1. Introduction

A great challenge of text mining arises from the increasingly large text datasets and the high

dimensionality associated with natural language. In this chapter, a systematic study is conducted, for the

first time, in the context of the document clustering problem, using the recently introduced diffusion

framework and some characteristics of the singular value decomposition.

This study has two major parts: classical clustering and on-line clustering. In the first part, we propose to construct a diffusion kernel based on the cosine distance, we discuss the problem of the reduced dimension choice, and we compare the performance of the k-means algorithm in four different vector spaces: Salton's vector space, the latent semantic analysis space, the diffusion space based on the cosine distance, and another based on the Euclidean distance. We also propose two postulates indicating the optimal dimension to use for clustering as well as the optimal number of clusters to use in that dimension.

In the second part, we introduce single-pass clustering, one of the most popular methods used in online applications such as peer-to-peer information retrieval (P2P) [KWX01, KlJ04, KJR06] and topic detection and tracking (TDT) [HGM00, MAS03]. We present a new version of the classical single-pass clustering algorithm, called On-line Single-Pass clustering based on Diffusion Map (OSPDM).

4. 2. Construction of the Diffusion Map

Related to spectral graph cuts [Wei99, ShM00, MeS01] and eigenmaps [RoS00, ZhZ02, BeN03,

DoG03] methodologies, diffusion map first appeared in [CLL05]. In this section, we describe in brief its

construction in the case of a finite data set.

4.2.1. Diffusion Space

Given a corpus, D, of documents, construct a weight function $k(d_i, d_j)$, for $d_i, d_j \in D$ and $1 \le i, j \le N$, with $N = |D|$. $k(d_i, d_j)$ is also known as the kernel and satisfies the following properties:

• k is symmetric: $k(d_i, d_j) = k(d_j, d_i)$

• k is positivity preserving: for all $d_i$ and $d_j$ in the corpus D, $k(d_i, d_j) \ge 0$

• k is positive semi-definite: for any choice of real numbers $\alpha_1, ..., \alpha_N$, we have


$\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, k(d_i, d_j) \ge 0.$

This kernel represents some notion of affinity or similarity between the documents of D, as it

describes the relationship between pairs of documents in the corpus. In this sense, one can think of the

documents as being the nodes of a symmetric graph whose weight function is specified by k . The kernel

measures the local connectivity of the documents, and hence captures the local geometry of the

corpus, D. The idea behind the diffusion map is to construct the global geometry of the data set from the local information contained in the kernel k. The construction of the diffusion map involves the following steps. First, assuming that the transition probability $m_1$, in one time step, between documents $d_i$ and $d_j$ is proportional to $k(d_i, d_j)$, we construct an $N \times N$ Markov matrix by defining

$M(i,j) = m_1(d_i, d_j) = \frac{k(d_i, d_j)}{p(d_i)}$,

where p is the required normalization constant, given by $p(d_i) = \sum_{j} k(d_i, d_j)$.

The Markov matrix M reflects the first-order neighborhood structure of the graph. However, to capture information on larger neighborhoods, powers of the Markov matrix M are taken, inducing a forward running in time of the random walk and constructing a Markov chain. Thus, considering $M^t$, the $t$th power of M, the entry $m_t(d_i, d_j)$ represents the probability of going from document $d_i$ to $d_j$ in t time steps.

Increasing t corresponds to propagating the local influence of each node with its neighbors. In other words, the quantity $M^t$ reflects the intrinsic geometry of the data set defined via the connectivity of the graph in a diffusion process, and the time t of the diffusion plays the role of a scale parameter in the analysis. When the graph is connected, we have that [Chu97]:

$\lim_{t \to +\infty} m_t(d_i, d_j) = \phi_0(d_j)$,  where $\phi_0$ is the unique stationary distribution  $\phi_0(d_i) = \frac{p(d_i)}{\sum_{l} p(d_l)}$.

Using a dimensionality reduction function (the SVD in our approach), the Markov matrix M will have a sequence of r (where r is the matrix rank) eigenvalues in non-increasing order $\lambda_0 \ge \lambda_1 \ge ... \ge \lambda_l \ge ... \ge \lambda_{r-1}$, with corresponding right eigenvectors $\psi_l$.

The stochastic matrix $M^t$ naturally induces a distance between any two documents. Thus, we define the diffusion distance as

$D^2_{Diff_t}(d_i, d_j) = \sum_{l} \lambda_l^{2t} \left(\psi_l(d_i) - \psi_l(d_j)\right)^2$

and the diffusion map as the mapping from


the vector d, representing a document, to the vector

$\left( \lambda_0^t \psi_0(d),\; \lambda_1^t \psi_1(d),\; ...,\; \lambda_{n-1}^t \psi_{n-1}(d) \right)^T$,

for a value n. By retaining only the first n eigenvectors, we embed the corpus D in an n-dimensional Euclidean diffusion space, where $\{\psi_0, \psi_1, ..., \psi_{n-1}\}$ are the coordinate axes of the documents in this space. Note that typically $n \ll N$, and hence we obtain a dimensionality reduction of the original corpus.
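The construction described above can be summarized in a short sketch; it assumes a precomputed symmetric kernel matrix K and, following the choice made in this work, uses the SVD of the Markov matrix to obtain the spectrum and the vectors psi_l.

import numpy as np

def diffusion_map(K, n_dims, t=1):
    """Embed N documents in an n_dims-dimensional diffusion space."""
    p = K.sum(axis=1)                      # normalization constants p(d_i)
    M = K / p[:, np.newaxis]               # Markov matrix, rows sum to 1
    U, lam, Vt = np.linalg.svd(M)          # singular values in non-increasing order
    psi = Vt.T                             # columns play the role of the psi_l
    return (lam[:n_dims] ** t) * psi[:, :n_dims]   # coordinates lambda_l^t * psi_l(d)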

4.2.2. Diffusion Kernels

Following what is described in the preceding subsection, several choices for the kernel k are possible, all leading to different analyses of the data. Inspired by the work of [LaL06] on word clustering, we decided first to use the Gaussian kernel (a kernel based on the Euclidean distance) for document clustering, which is defined as

$k(d_i, d_j) = \exp\left( -\frac{\|d_i - d_j\|^2}{\varepsilon} \right)$,

where the parameter $\varepsilon$ specifies the size of

the neighborhoods defining the local geometry of the data. The smaller the parameter ε , the faster the

exponential decreases and hence the weight function k becomes numerically insignificant more quickly,

as we move away from the center.

However, as the experiments in Section 4.4.1 show, there are strong indications that this kernel is not the right choice for document clustering. For this reason, in addition to the fact that the cosine distance has emerged as an effective distance for measuring document similarity [Sal71, SGM00], we propose to use a kernel based on what is known as the cosine distance:

$D_{Cos}(d_i, d_j) = 1 - \frac{d_i^T \cdot d_j}{\|d_i\| \cdot \|d_j\|}$.

We define this kernel as

$k(d_i, d_j) = \exp\left( -\frac{D_{Cos}(d_i, d_j)}{\varepsilon} \right)$.

However, in the case where the vectors $d_i$ and $d_j$ are normalized, the two kernels are related, as shown by the equation

$D_{Cos}(d_i, d_j) = 1 - d_i^T d_j = \frac{1}{2}\|d_i - d_j\|^2$,

so the distinction between these two kernels could be ignored.
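As a small illustration, both kernels can be computed from a matrix whose rows are document vectors; epsilon is left as a user-chosen parameter, and the normalization step in the cosine kernel corresponds to the normalized case discussed above.

import numpy as np

def gaussian_kernel(X, eps):
    """k(d_i, d_j) = exp(-||d_i - d_j||^2 / eps) for the row vectors of X."""
    sq = (X ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(dists, 0.0) / eps)

def cosine_kernel(X, eps):
    """k(d_i, d_j) = exp(-D_Cos(d_i, d_j) / eps) with the cosine distance."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    d_cos = 1.0 - Xn @ Xn.T
    return np.exp(-d_cos / eps)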

4.2.3. Dimensionality Reduction

Reduction of the data dimensionality, thereby reducing the complexity of data representation and


speeding up similarity computation times, may lead to significant savings of computer resources and

processing time. However the selection of fewer dimensions may cause a significant loss of the

document local neighborhood information.

Different methods for reducing the dimensionality of the diffusion space have been investigated, such

as the Graph Laplacian [CLL05, LaL06], Laplace–Beltrami [CLL05, CoL06a, CoL06b], the Fokker–

Planck operator [CLL05, CoL06a], and the singular value decomposition [VHL05]. However in this

work, we have chosen to embed a low-dimensional representation of the approximate diffusion map

using the singular value decomposition for classical clustering, and the SVD-updating method [Obr94,

BDO95] for the on-line clustering, taking advantage of the results concerning these methods in the field

of document clustering [DhM01, Ler99].

4.2.3.1. Singular Value Decomposition

Singular value decomposition is used to rewrite an arbitrary rectangular matrix, such as a Markov

matrix, as a product of three other matrices: $M = U S V^T$, where U is a matrix of left singular vectors, S is a diagonal matrix of singular values, and V is a matrix of right singular vectors (for more details see Section 3.2.3). As the Markov matrix is symmetric, both left and right singular vectors provide a mapping from the document space to a newly generated abstract vector space. The elements $(\lambda_0, \lambda_1, ..., \lambda_{r-1})$ of the diagonal matrix, the singular values, appear in decreasing order of magnitude. One of the most important theorems of the SVD, the Eckart and Young theorem [GoR71], states that a matrix formed from the first k singular triplets $\{U_i, \lambda_i, V_i\}$ of the SVD (left vector, singular value, right vector

combination) is the best approximation to the original matrix that uses k degrees of freedom. The

technique of approximating a data set with another one having fewer degrees of freedom works well,

because the leading singular triplets capture the strongest, most meaningful, regularities of the data. The

latter triplets represent less important, possibly spurious, patterns. Ignoring them actually improves

analysis, though there is the danger that by keeping too few degrees of freedom, or dimensions of the

abstract vector space, some of the important patterns will be lost [LFL98].

In [DhM99, DhM01], Dhillon and Modha compared the closeness between the subspaces spanned

by the spherical k-means concept vectors and the singular vectors by using principal angles [BjG73,

GoV89, Arg03] (for more details, see Appendix D). Seeing that the concept vectors constitute an

approximation matrix comparable in quality to the SVD, they were interested in comparing a sequence

of singular subspaces to a sequence of concept subspaces, but since it is hard to directly compare the

sequences, they compared a fixed concept subspace to various singular subspaces, and vice-versa.


By focusing on the average cosine of the principal angles between the concept subspace of 64 dimensions and various singular subspaces, plotted in the two following figures for two different data sets, we remark that the average cosine gets closest to 1 when k, the number of singular vectors constituting the singular subspace, is very small, appearing in the figures to be approximately equal to 6.

Figure 4.1. Average cosine of the principal angles between 64 concept subspace and various singular

subspaces for the CLASSIC data set.

Figure 4.2. Average cosine of the principal angles between 64 concept subspace and various singular

subspaces for the NSF data set.

This fact means that the concept subspace is completely contained in the singular subspace constituted

of the first six singular vectors. Thus the minimum number k of independent variables, required to

describe the approximate behavior of the underlying system in the truncated SVD matrix $M_{n-1}$, where


$M_{n-1} = U_{n-1} S_{n-1} V_{n-1}^T$, is reduced by a factor of 10 compared to the one needed and used for information

retrieval (considered to be between 100-300).

On the other hand, Lerman in [Ler99] presented a procedure, in clustering context, for determining

the appropriate number of dimensions for the subspace. This procedure could be considered as a visual

inspection of the thresholding method used in [VHL05, CoL06a, LaL06] and proposed by Weiss

[Wei99], for an affinity matrix, such as the Markov matrix, because in this method, the number k of the

eigenvectors used for parametrizing the data is equal to the number of eigenvalues whose magnitude $\frac{\lambda_{k-1}}{\lambda_0}$ is greater than a given threshold $\eta > 0$; while in the Lerman procedure, based on the plot of the singular values in decreasing order and on the break or discontinuity in the slope, she shows that the number of degrees of freedom k is equal to the number of points on the left side of the discontinuity.

After reducing the dimension, documents are represented as k-dimensional vectors in the diffusion space, and can be clustered by using a standard clustering algorithm, such as k-means or single-pass.
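For example, once the documents are embedded in the reduced space, the clustering step itself reduces to a call to a standard k-means implementation; scikit-learn is used here purely for illustration, and the number of clusters is assumed to be known in advance.

from sklearn.cluster import KMeans

def cluster_documents(embedding, n_clusters):
    """Run k-means on reduced-dimension document coordinates
    (e.g. the diffusion-space embedding sketched in Section 4.2.1)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(embedding)      # one cluster label per document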

4.2.3.2. SVD-Updating

Suppose an $m \times n$ matrix A has been generated from a set of data in a specific space, and its SVD, denoted by SVD(A) and defined as:

$A = U S V^T$   (1)

has been computed. If more data (represented by rows or columns) must be added, three alternatives for

incorporating them currently exist: recomputing the SVD of the updated matrix, folding-in the new rows

and columns, or using the SVD-updating method developed in [Obr94].

Recomputing the SVD of a larger matrix requires more computation time and, for large problems,

may be impossible due to memory constraints. Recomputing the SVD allows the new p rows and q

columns to directly affect the structure of the resultant matrix by creating a new matrix $A_{(m+p) \times (n+q)}$, computing the SVD of the new matrix, and generating a different rank-k approximation matrix $A_k$, where

$A_k = U_k S_k V_k^T$, and $k \ll \min(m, n)$.   (2)

In contrast, folding-in, which is essentially the process described in Section 3.2.4 for query representation versions A and B, is based on the existing structure, the current $A_k$, and hence new rows and columns have no effect on the representation of the pre-existing rows and columns. Folding-in requires less time and memory but, following the study undertaken in [BDO95], has deteriorating effects


on the representation of the new rows and columns. On the other hand, as discussed in [Obr94, BDO95], the accuracy of the SVD-updating approach is comparable to that obtained when the SVD of $A_{(m+p) \times (n+q)}$ is explicitly computed.

The process of SVD-updating requires two steps, which involve adding new columns and new rows.

a- Overview

Let D denote the p new columns to process, then D is an $m \times p$ matrix. D is appended to the columns of the rank-k approximation of the $m \times n$ matrix A (i.e., $A_k$ from Equation (2)), so that the k largest singular values and corresponding singular vectors of

$B = (A_k \mid D)$   (3)

are computed. This is almost the same process as recomputing the SVD, only A is replaced by $A_k$. Let T denote a collection of q rows for SVD-updating. Then T is a $q \times n$ matrix. T is then appended to the rows of $A_k$ so that the k largest singular values and corresponding singular vectors of

$C = \begin{pmatrix} A_k \\ T \end{pmatrix}$   (4)

are computed.

b- SVD-Updating Procedures

In this section, we detail the mathematical computations required in each phase of the SVD-updating

process. SVD-updating incorporates new row or column information into an existing structured model

($A_k$ from Equation (2)) using the matrices D and T introduced in the overview above. SVD-updating exploits the previous singular values and singular vectors of the original matrix A as an alternative to recomputing the SVD of $A_{(m+p) \times (n+q)}$.

Updating Column. Let $B = (A_k \mid D)$ from Equation (3) and define $SVD(B) = U_B S_B V_B^T$. Then

$U_k^T B \begin{pmatrix} V_k & 0 \\ 0 & I_p \end{pmatrix} = (S_k \mid U_k^T D)$,

since $A_k = U_k S_k V_k^T$. If $F = (S_k \mid U_k^T D)$ and $SVD(F) = U_F S_F V_F^T$, then it follows that $U_B = U_k U_F$, $V_B = \begin{pmatrix} V_k & 0 \\ 0 & I_p \end{pmatrix} V_F$, and $S_B = S_F$.

Hence $U_B$ and $V_B$ are $m \times k$ and $(n+p) \times (k+p)$ matrices, respectively.


Updating Row. Let $C = \begin{pmatrix} A_k \\ T \end{pmatrix}$ from Equation (4) and define $SVD(C) = U_C S_C V_C^T$. Then

$\begin{pmatrix} U_k^T & 0 \\ 0 & I_q \end{pmatrix} C\, V_k = \begin{pmatrix} S_k \\ T V_k \end{pmatrix}$.

If $H = \begin{pmatrix} S_k \\ T V_k \end{pmatrix}$ and $SVD(H) = U_H S_H V_H^T$, then it follows that $U_C = \begin{pmatrix} U_k & 0 \\ 0 & I_q \end{pmatrix} U_H$, $V_C = V_k V_H$, and $S_C = S_H$.

Hence $U_C$ and $V_C$ are $(m+q) \times (k+q)$ and $n \times k$ matrices, respectively.
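A minimal Python/NumPy sketch of the column-updating formulas above is given below; the function name and the final check are ours, and the factors are kept in economy (rank-k) form. The check only verifies that the updated factors reproduce $(A_k \mid U_k U_k^T D)$, i.e., that the new columns enter through their projection onto the existing left singular subspace, which is what makes the method an approximation rather than an exact recomputation:

import numpy as np

def svd_update_columns(Uk, sk, Vk, D):
    # Append new columns D (m x p) to the rank-k model Uk @ diag(sk) @ Vk.T.
    k, p = sk.shape[0], D.shape[1]
    F = np.hstack([np.diag(sk), Uk.T @ D])            # F = (S_k | U_k^T D)
    UF, sF, VFt = np.linalg.svd(F, full_matrices=False)
    UB = Uk @ UF                                      # U_B = U_k U_F
    blk = np.block([[Vk, np.zeros((Vk.shape[0], p))],
                    [np.zeros((p, k)), np.eye(p)]])   # blockdiag(V_k, I_p)
    VB = blk @ VFt.T                                  # V_B = blockdiag(V_k, I_p) V_F
    return UB, sF, VB

rng = np.random.default_rng(0)
A, D, k = rng.random((8, 5)), rng.random((8, 2)), 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T
UB, sB, VB = svd_update_columns(Uk, sk, Vk, D)
lhs = UB @ np.diag(sB) @ VB.T
rhs = np.hstack([Uk @ np.diag(sk) @ Vk.T, Uk @ (Uk.T @ D)])
print(np.allclose(lhs, rhs))                          # True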

4. 3. Clustering Algorithms

In this section, we present the clustering algorithms that we use in this chapter, which are the k-

means algorithm, the single-pass algorithm and finally the on-line single-pass clustering based on

diffusion map (OSPDM).

4.3.1. k-means Algorithm

k-means [Mac67] is one of the simplest unsupervised learning algorithms that solve the clustering

problem. The procedure follows a simple and easy way to classify a given data set through a certain

number of clusters (assume k clusters) fixed a priori. The main idea is to find the centers of natural

clusters in the data by minimizing the total intra-cluster variance, or, the squared error function

$\sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$, where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the centroid $c_j$, which is the mean point of all the points $x_i^{(j)}$ of cluster j.

The algorithm starts by partitioning the input points into k initial sets, either at random or using some

heuristic data. It then calculates the centroid of each set. It constructs a new partition by associating each

point to the nearest centroid. Then the centroids are recalculated for the new clusters, and the algorithm is repeated by alternating these two steps until convergence, which is obtained when the points no longer switch clusters (or, alternatively, when the centroids no longer change).

The k-means algorithm is composed of the following steps:

1- Place k points into the space represented by the objects that are being clustered. These points

represent initial group centroids.

2- Assign each object to the group that has the closest centroid.


3- Recalculate the positions of the k centroids, when all objects have been assigned.

4- Repeat Steps 2 and 3 until the centroids no longer move.

This produces a separation of the objects into groups from which the metric to be minimized can be

calculated.
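For concreteness, a plain Python/NumPy sketch of these four steps is given below (the function name and the random initialization scheme are ours; the experiments in this chapter rely on standard k-means implementations):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Step 1: place k initial centroids (here: k random data points).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignments (hence the centroids) no longer move.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recalculate the position of each centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids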

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centers, and the quality of its final solution may, in practice, be much poorer than the global optimum. But, since the algorithm is considered fast, a common method is to run the k-means algorithm several times and return the best clustering found.

Another main drawback of the k-means algorithm is that it has to be told the number of clusters (i.e. k) to find. If the data are not naturally clustered, some strange results may be obtained.

4.3.2. Single-Pass Clustering Algorithm

Incremental clustering algorithms are often preferred to traditional clustering techniques, since they can be applied in a dynamic environment such as the web [WoF00, ZaE98]. Indeed, in addition to the traditional clustering objective, incremental clustering has the ability to process new data as they are added to the data collection [JaD88]. This allows dynamic tracking of the ever-increasing, large-scale information being put on the web every day without having to perform complete re-clustering.

Thus, various approaches, including a single-pass clustering algorithm, have been proposed [HaK03].

Algorithm

Single-pass clustering, as the name suggests, requires a single, sequential pass over the set of

documents it attempts to cluster. The algorithm classifies the next document in the sequence according

to a condition on the similarity function employed. At every stage, based on the comparison of a certain

threshold and the similarity between a document and a defined cluster, the algorithm decides on whether

a newly seen document should become a member of an already defined cluster or the center of a new

one. Usually, the description of a cluster is its centroid (the average vector of the document representations included in the cluster in question), and a document representation consists of a term-frequency vector.

Basically, the single-pass algorithm operates as follows:


For each document d in the sequence loop

1- find a cluster C that minimizes the distance D(C, d);

2- if D(C, d) < t then include d in C;

3- else create a new cluster whose only document is d;

End loop.

where t is the similarity threshold value, which is often derived experimentally.
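A corresponding Python/NumPy sketch of this loop is shown below, assuming the documents are given as rows of a NumPy array and taking D(C, d) = 1 − cosine(centroid of C, d); the function name and these particular choices are illustrative only:

import numpy as np

def single_pass(docs, t):
    clusters, centroids = [], []          # document indices and running centroids
    for i, d in enumerate(docs):
        if centroids:
            sims = [d @ c / (np.linalg.norm(d) * np.linalg.norm(c))
                    for c in centroids]
            best = int(np.argmax(sims))   # cluster minimizing D(C, d) = 1 - sim
            if 1.0 - sims[best] < t:      # step 2: include d in C
                clusters[best].append(i)
                centroids[best] = docs[clusters[best]].mean(axis=0)
                continue
        clusters.append([i])              # step 3: d starts a new cluster
        centroids.append(d.astype(float))
    return clusters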

4.3.3. The OSPDM Algorithm

In our approach, we are interested in taking advantage of the semantic structure and the documents’

dependencies created due to the diffusion map, in addition to the resulting reduced dimensionality by

using the SVD-updating, which leads to significant savings of computer resources and processing time.

More specifically, we take into consideration the studies in [DhM01, Ler99] where we have established

that the best reduced dimension related to the SVD method for document clustering is restricted to the first ten or so dimensions (for more details, see Section 4.2.3.1).

Hence, our approach in developing the OSPDM algorithm is summarized as follows:

Given a collection D of n documents, a new document d that should be added to the existing

collection D, and a set C of m clusters.

1- Generate the term-document matrix A from the set D.

2- Compute the Markov matrix M for the n documents.

3- Generate $SVD(M) = U_M S_M V_M^T$.

4- Choose the best reduced dimension for the clustering task, $M_k = U_k S_k V_k^T$.

5- Update the term-document matrix A by adding the column representing the document d and the needed rows, if the new document contains some new terms.

6- Update the Markov matrix M (as M is symmetric, one can update just the new row $R_M$).

7- Apply SVD-updating to $M_T = \begin{pmatrix} M_k \\ R_M \end{pmatrix}$:

a. Put $H = \begin{pmatrix} S_k \\ R_M V_k \end{pmatrix}$, and generate $SVD(H) = U_H S_H V_H^T$.

b. Compute $U_T = \begin{pmatrix} U_k & 0 \\ 0 & 1 \end{pmatrix} U_H$.

c. Compute $V_T = V_k V_H$, and $S_T = S_H$ (for the next iteration).

8- Update the centroids of the m clusters, by using the reduced dimension k of the matrix $U_T$.

9- Apply a step of the single-pass clustering:

a. Find the cluster $C_i$ that minimizes the distance $D(C_i, U_{T_k}(n+1, 1:k))$.

b. If $D(C_i, U_{T_k}(n+1, 1:k)) < t$, then include d in $C_i$, with t a specified threshold, and set n = n + 1.

c. Else create a new cluster $C_{m+1}$, represented by $U_{T_k}(n+1, 1:k)$, and set m = m + 1.
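The following Python/NumPy sketch condenses steps 2 and 6–9 for a single incoming document (q = 1). The helper names, the plain cosine-kernel construction of the Markov matrix, and the use of the Euclidean distance between embedded vectors as the dissimilarity D are our own simplifying assumptions, not a verbatim transcription of the implementation used in the experiments:

import numpy as np

def cosine_markov(A):
    # Row-stochastic Markov matrix from the cosine kernel between the
    # document columns of the term-document matrix A (step 2).
    X = A / np.linalg.norm(A, axis=0, keepdims=True)
    K = X.T @ X
    return K / K.sum(axis=1, keepdims=True)

def ospdm_step(Uk, sk, Vk, new_row, centroids, clusters, t):
    # Steps 6-7: fold the new document's Markov row into the rank-k SVD.
    k = sk.shape[0]
    H = np.vstack([np.diag(sk), new_row @ Vk])        # H = [S_k ; R_M V_k]
    UH, sH, VHt = np.linalg.svd(H, full_matrices=False)
    blk = np.block([[Uk, np.zeros((Uk.shape[0], 1))],
                    [np.zeros((1, k)), np.eye(1)]])
    UT = blk @ UH                                     # U_T = blockdiag(U_k, 1) U_H
    VT, sT = Vk @ VHt.T, sH                           # kept for the next iteration
    # Steps 8-9: single-pass assignment of the new embedded coordinates.
    x = UT[-1]
    if centroids:
        dists = [np.linalg.norm(x - c) for c in centroids]
        j = int(np.argmin(dists))
        if dists[j] < t:
            clusters[j].append(x)
            centroids[j] = np.mean(clusters[j], axis=0)
            return UT, sT, VT
    clusters.append([x])
    centroids.append(x)
    return UT, sT, VT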

4. 4. Experiments and Results

The testing data used, for evaluating the effective power of our algorithms, are formed by mixing

documents from multiple topics arbitrarily selected from our evaluation database, presented in Section

3.3.1. At each run of the test, documents from a selected number, k, of topics are mixed, and the mixed

document set along with the cluster number, k, are provided to the clustering process.

4.4.1. Classical Clustering

We have applied the diffusion process to several examples where we evaluate the results of the k-

means algorithm applied in four different vector spaces: Salton space; LSA space, where SVD is applied

to the term-document matrix; diffusion space based on the Euclidean distance; and diffusion space based

on the cosine distance. The evaluation is carried out through the comparison of the averages of accuracy (Acc) and mutual information (MI), defined in Section C.3, resulting from thirty k-means runs.
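The exact definitions of Acc and MI are those of Appendix C.3; purely as an illustration, the Python sketch below gives one common formulation (accuracy under the best one-to-one matching of clusters to classes, and un-normalized mutual information in nats), which may differ in normalization and scaling from the measures actually reported in the tables:

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, truth):
    labels, truth = np.asarray(labels), np.asarray(truth)
    k = int(max(labels.max(), truth.max())) + 1
    cont = np.zeros((k, k), dtype=int)            # contingency table
    for l, c in zip(labels, truth):
        cont[l, c] += 1
    row, col = linear_sum_assignment(-cont)       # best cluster-to-class matching
    return cont[row, col].sum() / len(labels)

def mutual_information(labels, truth):
    labels, truth = np.asarray(labels), np.asarray(truth)
    mi = 0.0
    for l in np.unique(labels):
        for c in np.unique(truth):
            p = np.mean((labels == l) & (truth == c))
            if p > 0:
                mi += p * np.log(p / (np.mean(labels == l) * np.mean(truth == c)))
    return mi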

Example 1. (Cisi and Med) In this example, the data set contains all documents of the collections

Cisi and Med. Figure 4.3 shows the two collections in the diffusion space at power t = 1, (a, c, e) for the

cosine kernel, and (b, d, f) for the Gaussian kernel, respectively, in 1, 2 and 3 dimensions. From this

figure, it appears clearly that the collections are better represented in the embedding space using the

cosine kernel.


Figure 4.3. Representation of our data set in various diffusion spaces: (a, c, e) cosine kernel and (b, d, f) Gaussian kernel, in 1, 2 and 3 dimensions. Magenta points represent documents of the Med collection; cyan points represent documents of the Cisi collection.


However, we still do not know how they will be represented in other dimensions. To answer this

question, we cluster the embedded data set into k = 2 clusters. Table 4.1 recalls the results of running the

k-means program for several dimensions in the two diffusion spaces, as well as the LSA space. In bold,

we represent the best result over all dimensions for each space. Table 4.3 shows the best results of k-

means in Salton space, cosine diffusion space, and LSA space.

Spaces               1-Dim          2-Dim          3-Dim          4-Dim
                     Acc    MI      Acc    MI      Acc    MI      Acc    MI
Gaussian diffusion   58.60  0.03    58.68  0.15    58.78  0.21    58.68  0.10
Cosine diffusion     98.60  89.18   90.11  69.33   77.96  41.80   75.82  34.47
LSA                  98.63  90.40   94.61  78.41   93.78  77.04   90.76  67.08

Spaces               5-Dim          10-Dim         20-Dim         100-Dim
                     Acc    MI      Acc    MI      Acc    MI      Acc    MI
Gaussian diffusion   58.60  0.05    58.60  0.05    58.60  0.05    58.60  0.05
Cosine diffusion     76.11  33.80   75.76  32.25   71.63  29.75   59.49  8.12
LSA                  83.72  46.4    73.31  26.82   65.95  12.64   61.16  5.497

Table 4.1. Performance of different embedding representations using k-means for the set Cisi and Med.

From the results of Table 4.1, we can see that the diffusion embedding one obtains is very sensitive to the choice of the diffusion kernel, and that the data representation in higher dimensions produces worse results, confirming the Dhillon and Modha results [DhM01] discussed in Section 4.2.3.1.

Moreover, by comparing in Table 4.2 the running time of the diffusion process, at t = 1, using the two

kernels, we find that the process needs just about 36 seconds to build the 2-dimension diffusion space

based on the cosine kernel, while for the Gaussian kernel it takes about 31 minutes, indicating that the

cosine kernel takes advantage of the “word × document” matrix sparsity in the computation of the

Markov matrix and the SVD.

                 Cosine kernel   Gaussian kernel
Distance         9 s             7 s
Markov matrix    2 s             14 s
SVD              25 s            31 min

Table 4.2. The process running time for the cosine and the Gaussian kernels.

In Figure 4.4, we plot the first two coordinates of some powers of the Markov matrix M (a, c, e, g, i)


for the cosine kernel, and (b, d, f, h, j) for the Gaussian kernel, respectively, for t equal to 2, 4, 10, 100

and 1000.

Figure 4.4. Representation of our data set in cosine and Gaussian diffusion spaces for various t time iterations: (a, c, e, g, i) cosine kernel and (b, d, f, h, j) Gaussian kernel, for t equal to 2, 4, 10, 100 and 1000. Magenta points represent documents of the Med collection; cyan points represent documents of the Cisi collection.

From this figure, we remark that when the value of the power t increases, the data of the two collections become more merged. This is because, in this case, the data points get connected by a larger number of paths. Moreover, we remark that the rate at which the data dependencies change is larger in the Gaussian diffusion space than in the cosine space, showing that the cosine distance is more stable than the Euclidean distance.

On the basis of these results, and the fact that we are using un-normalized data, we have decided to exclude from our subsequent experiments the diffusion space based on the Euclidean distance (Gaussian diffusion space) and the use of the Markov matrix power. Thus, we will restrict our comparisons to the cosine diffusion space for t equal to 1, LSA, and Salton spaces.


Spaces             Acc     MI
Cosine diffusion   98.60   89.18
Salton             95.72   83.61
LSA                98.63   90.40

Table 4.3. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cisi and Med.

On the other hand, despite the fact that Table 4.3 shows that the cosine diffusion representation is only marginally better in both accuracy and mutual information compared to the Salton representation, we should not forget the considerable gain in computation time. A k-means algorithm running in Matlab sometimes takes more than two hours when documents are represented in the Salton

space, where the length of a document vector is determined by the number of the collection terms,

usually in the thousands; while with the cosine diffusion representation, the running time is just a few

seconds, in view of the fact that the length of a document vector in the embedded space is very small,

reduced by a factor that may be larger than 1000. However, in the case of the LSA representation, we

remark that for this set of documents, k-means performs almost as accurately as in the case of the cosine

diffusion representation.

To pick the number of dimensions for the embedding space, as shown in Figure 4.5, we plot the first

100 singular values of the cosine diffusion map in the bottom curve. To help identify the discontinuity in

the slope of the singular value curve, we plot on the top part of the figure the difference between each

successive pair of singular values, magnified by a factor of 10 and displaced by 1 from the origin for

emphasis.

Figure 4.5. Representation of the first 100 singular values of the cosine diffusion map on the set Cisi

and Med.


The first singular value is always unity, corresponding to the flat singular vector. The next singular value is markedly greater than the rest of the singular values; following Lerman's method [Ler99], this indicates that the optimal dimension is approximately 1. This is also confirmed by the results of Table 4.1.

Figure 4.6. Representation of the first 100 singular values of the Cisi and Med term-document matrix.

Using the same technique, to pick the reduced dimension for the LSA space, we remark that for this

set of documents the discontinuity point in Figure 4.6 corresponds to the best reduced dimension found

in Table 4.1.

After clustering the set of documents into two clusters, in this step we use the diffusion process and the Buchaman-Wollaston & Hodgeson method to verify that the documents of each resulting cluster, referenced by C1 and C2, need not be further partitioned.

Given that it is well known that for well separated data, the number of empirical histogram peaks is

equal to the number of components, the Buchaman-Wollaston and Hodgeson method [BuH29] consists

in fitting each peak to a distribution. Based on this, and on the Kullback-Leibler (KL) divergence

[KuL51], the Jensen-Shannon divergence [FuT04], and the accumulation function [Rom90] to compare

between the approximation distribution and the data histogram distribution, we determine, as is shown in

Table 4.4, that we have very good approximations of the histograms of clusters C1 and C2, each

represented, respectively, in Figure 4.7 and Figure 4.8, by only one normal distribution.


Figure 4.7. Histogram representation of the cluster C1 documents.

Figure 4.8. Histogram representation of the cluster C2 documents.

                    Cluster C1   Cluster C2
Kullback-Leibler    8e-15        2e-15
Jensen-Shannon      -4e-17       8e-17
Accumulation        1e-16        3e-17

Table 4.4. Measure of the difference between the approximated and the histogram distributions.

In Figure 4.9 and Figure 4.10, we represent the first hundred singular values of documents from

clusters C1 and C2, respectively, in the cosine diffusion space. We remark that the discontinuity point of

the slope coincides with the largest singular value, which means that the other singular values are

meaningless. Thus, we could represent a document in the embedded cosine diffusion space in 1-

dimension.


Figure 4.9. Representation of the first 100 singular values of the cosine diffusion map on the cluster C1.

Figure 4.10. Representation of the first 100 singular values of the cosine diffusion map on the cluster

C2

Example 2. (Cran, Cisi, and Med) Here, we mix all documents of the three collections: Cran, Cisi,

and Med. In Table 4.5, we present the results of the k-means algorithm running in five different

dimensions for the LSA and the cosine diffusion spaces. Table 4.6 shows the optimal performance of k-

means in the cosine diffusion, Salton and LSA spaces.

Spaces             Dim1           Dim2           Dim3           Dim4           Dim5
                   Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion   93.21  78.72   98.45  92.14   97.05  90.29   94.38  87.26   92.67  86.34
LSA                89.67  72.84   86.22  81.27   86.74  82.11   92.32  86.44   79.05  67.32

Table 4.5. Performances of different embedding representations using k-means for the set Cran, Cisi and Med.


Spaces             Acc     MI
Cosine diffusion   98.45   92.14
Salton             73.03   62.35
LSA                92.32   86.44

Table 4.6. Performance of k-means in cosine diffusion, Salton, and LSA spaces for the set Cran, Cisi and Med.

From Tables 4.3 and 4.6, we remark that k-means performs much better in the cosine diffusion space compared to the Salton space, and better than in the LSA space. However, for this set of documents, the singular-value discontinuity technique does not work for the LSA space, because the marked slope discontinuity (shown in Figure 4.12) is around the 3rd singular value, indicating an optimal dimension equal to 2, while the best dimension found in Table 4.5 is the 4th.

Figure 4.11. Representation of the first 100 singular values of the cosine diffusion space on the set

Cran, Cisi and Med.

Figure 4.12. Representation of the first 100 singular values of the Cran, Cisi and Med term-document matrix.


With the objective of reclustering the documents of the three resulting clusters C1, C2 and C3, we conclude from the results of Figures 4.13 to 4.15 and Table 4.7 that these sets of documents could not be further refined. Also, from Figures 4.16 to 4.18, we establish that the slope discontinuity point of the singular values for a set of documents representing one cluster in the cosine diffusion space always coincides with the largest singular value.

Figure 4.13. Histogram representation of the cluster C1 documents.

Figure 4.14. Histogram representation of the cluster C2 documents.


Figure 4.15. Histogram representation of the cluster C3 documents.

                    Cluster C1   Cluster C2   Cluster C3
Kullback-Leibler    6e-17        1e-16        2e-16
Jensen-Shannon      2e-16        1e-16        1e-16
Accumulation        4e-17        4e-17        1e-17

Table 4.7. Measure of the difference between the approximated and the histogram distributions.

Figure 4.16. Representation of the first 100 singular values of the cosine diffusion map on cluster C1.


Figure 4.17. Representation of the first 100 singular values of the cosine diffusion map on cluster C2.

Figure 4.18. Representation of the first 100 singular values of the cosine diffusion map on cluster C3.

The previous results suggest two postulates for the cosine diffusion space:

Dimension Postulate: The optimal dimension of the embedding for the cosine diffusion space is

equal to the number, d, of the singular values on the left side of the discontinuity point after excluding

the largest (first) singular value. When d is equal to zero, the data will be represented in 1-dimension.

Cluster Postulate: The optimal number of clusters in a hierarchical step is equal to d+1, where d is

the optimal dimension provided by the dimension postulate of the same step.

Example 3. (Cran, Cisi, Med, and Reuters_1) For this example, we use just 500 documents from

each collection of Cran, Cisi, and Med, mixed with 425 documents from the Reuters collection. From

Table 4.8, representing the results of the k-means algorithm running in five different dimensions for the

LSA and the cosine diffusion spaces, and Table 4.9, representing its optimal performance in the cosine

diffusion, Salton and LSA spaces, it appears that k-means performs better in the cosine diffusion space

compared to both of the other spaces. However, we are interested in more than that.


Spaces             Dim1           Dim2           Dim3           Dim4           Dim5
                   Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion   77.87  66.02   84.73  78.88   95.93  93.90   99.22  96.66   98.04  95.61
LSA                66.85  66.42   87.74  81.41   82.28  83.30   70.82  67.44   62.83  57.28

Table 4.8. Performance of different embedding cosine diffusion and LSA representations using k-means for the set Cran, Cisi, Med and Reuters_1.

Spaces             Acc     MI
Cosine diffusion   99.22   96.66
Salton             71.68   71.62
LSA                87.74   83.30

Table 4.9. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cran, Cisi, Med and Reuters_1.

Effectively, in the following we are concerned with validating our postulates in the cosine diffusion

space. Based on the dimension and cluster postulates, Figure 4.19, representing the first hundred

singular values for the chosen set of documents, indicates that the embedding dimension for this set of

data should be equal to two, and the number of clusters should be equal to 3.

Figure 4.19. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,

Cisi, Med and Reuters_1.

Thus, we run the 3-means program in the 2-dimension cosine diffusion space, and we present the

generated confusion matrix in Table 4.10.


      Cran   Cisi   Med   Reuters
C1    493    0      59    0
C2    2      499    0     0
C3    5      1      441   425

Table 4.10. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into 3 clusters in the 2-dimension cosine diffusion space.

Figure 4.20. Representation of the first clusters of the hierarchical clustering (k-means clustering of the set Cran-Cisi-Med-Reuters_1 into clusters C1, C2 and C3; S denotes the remaining set kept for further clustering).

From Figure 4.20, representing the first clusters of the hierarchical clustering of the set of collections

Cran-Cisi-Med-Reuters_1, we choose to exclude the documents belonging to the cluster C2 from further

decomposition, based on the fact that it is sufficiently distant from the other clusters. We then rerun the

k-means algorithm on the rest of the document set, which we call S.

By executing the cosine diffusion map process in S, we get the singular values presented in Figure

4.21.

Figure 4.21. Representation of the first 100 singular values of the cosine diffusion map on the data set S.


From this figure, the cluster postulate suggests that this set of documents should be clustered into 3

clusters in 2 dimensions. The generated confusion matrix in Table 4.11 and the clusters shown in Figure

4.22 present the results of this experiment.

      Cran   Cisi   Med   Reuters
C1    493    0      4     0
C2    2      0      493   0
C3    3      1      3     425

Table 4.11. The confusion matrix for the set S in 2-dimension cosine diffusion space.

Figure 4.22. Representation of the set S clusters.

By combining the confusion matrices in Tables 4.10 and 4.11, we get Table 4.12, which shows that the number of misclassified documents in this matrix is equal to 15.

      Cran   Cisi   Med   Reuters
C1    493    0      4     0
C2    2      499    0     0
C3    2      0      493   0
C4    3      1      3     425

Table 4.12. The resultant confusion matrix.

To verify the validity of the dimension postulate, and to argue for our choice of running 3-means in 2-

dimensional cosine diffusion space, we evaluate k-means performance in multiple dimensions for the

data sets C2 and S. In the case of the C2 set, we restrict the performance computation to the mutual information and omit the calculation of the accuracy, as it is a multi-class metric. The results of Tables 4.13 and 4.14 show that the best embedding dimension to partition these three


clusters is equal to two, as indicated by the slope discontinuities shown in Figures 4.19 and 4.21.

Spaces             1-Dim   2-Dim   3-Dim   4-Dim
Cosine diffusion   18.89   84.69   76.4    69.47

Table 4.13. Mutual information of different embedding cosine diffusion representations using k-means to exclude the cluster C2 from the set Cran, Cisi, Med and Reuters_1.

Spaces             1-Dim          2-Dim          3-Dim          4-Dim
                   Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion   90.87  72.32   99.08  95.07   97.55  93.76   86.51  79.79

Table 4.14. Performance of different embedded cosine diffusion representations using k-means for the set S.

In order to verify the results of the hierarchical clustering suggested by the two postulates, we run 4-

means in 4-dimension cosine diffusion space, which is indicated in Table 4.8 as the best reduced

dimension for clustering the Cran-Cisi-Med-Reuters_1 set in one step.

By presenting, in Table 4.15, the generated confusion matrix from partitioning the entire collection

into 4 clusters in the 4-dimensional cosine diffusion space, we remark that this matrix indicates the

existence of 15 misclassified documents, which is identical to the number of misclassified documents in

the confusion matrix resulting from combining the confusion matrices of the hierarchical steps, presented

in Table 4.12.

      Cran   Cisi   Med   Reuters
C1    492    0      6     0
C2    3      500    0     0
C3    2      0      493   0
C4    3      0      1     425

Table 4.15. The confusion matrix for the set Cran-Cisi-Med-Reuters_1 clustered into 4 clusters in the 4-dimension cosine diffusion space.

In this example, we have not just validated our postulates, but moreover we have established that the

relation between them is mutual in each step of the hierarchical process. The results of Table 4.8 show

that the reduced dimension deduced graphically for a set of data depends on the number of clusters into which the data will be partitioned. Thus, when we consider that the number of clusters of the set Cran, Cisi, Med, and Reuters_1 is known (equal to 4), Table 4.8 indicates that the resulting reduced dimension, equal to 4, is different from the one deduced graphically from Figure 4.19.

Example 4. (Cran, Cisi, Med, and Reuters_2) To make sure that the need for many hierarchical clustering steps does not depend on the number of clusters, especially when this number is larger than 3, as in the case of Example 3, we have chosen 500 documents from each of the collections Cran, Cisi, and Med, different from those used in Example 3, and then we have mixed them with the 425 Reuters documents used in Example 3.

From Figure 4.23, we can see that the marked slope discontinuity around the 4th singular value indicates the optimal dimension shown in Table 4.16 (equal to 3) and the correct number of clusters (equal to 4) from the first hierarchical step.

Figure 4.23. Representation of the first 100 singular values of the cosine diffusion map on the set Cran,

Cisi, Med and Reuters_2.

Spaces             Dim1           Dim2           Dim3           Dim4           Dim5
                   Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion   72.05  57.74   86.37  79.06   98.16  96.08   97.92  95.39   96.97  94.69
LSA                80.16  71.04   88.94  83.98   86.82  86.74   72.04  66.99   67.52  60.67

Table 4.16. Performance of different embedding cosine diffusion and LSA representations using k-means for the set Cran, Cisi, Med and Reuters_2.


Spaces             Acc     MI
Cosine diffusion   98.16   96.08
Salton             71.44   69.44
LSA                88.94   86.74

Table 4.17. Performance of k-means in cosine diffusion, Salton and LSA spaces for the set Cran, Cisi, Med and Reuters_2.

Example 5. (Reuters) In this example, we do a small experiment to assess how our approach to

clustering documents based on the cosine diffusion map and our proposed postulates responds to non-

separated data. To this end, we have mixed documents of four Reuters categories.

Spaces             1-Dim          2-Dim          3-Dim          4-Dim          5-Dim
                   Acc    MI      Acc    MI      Acc    MI      Acc    MI      Acc    MI
Cosine diffusion   38.99  8.68    50.95  28.88   66.26  49.33   66.44  56.75   62.57  49.08
LSA                36.13  7.62    40.46  12.06   39.33  10.92   38.81  10.49   38.64  9.25

Table 4.18. Performance of different embedding cosine diffusion and LSA representations using k-means for Reuters.

Spaces             Acc     MI
Cosine diffusion   66.44   56.75
Salton             46.59   35.22
LSA                40.46   12.06

Table 4.19. Performance of k-means in cosine diffusion, Salton and LSA spaces for Reuters.

Figure 4.24. Representation of the first 100 singular values of the cosine diffusion map on Reuters.


From Tables 4.18 and 4.19, we see that the k-means algorithm works much better in the

cosine diffusion space compared to both Salton and LSA spaces. However, we remark that in this

example, where documents are overlapping, the performance of k-means in the cosine diffusion space is

not as high as it is in the cases of the previous examples, where the documents are well separated. On the

other hand, the results exhibited in Figure 4.24 and Table 4.18 show that the discontinuity point

corresponds to the best reduced dimension, which means that our first proposed postulate is still valid,

whereas the second is not adequate for the case of overlapping clustering.

Examples 1-4 strongly suggest that the proposed postulates produce good results at identifying well

separated clusters. However, when data are not well separated, the notion of a cluster is no longer well

defined in the literature.

Comparing our results to the ones stated for spectral clustering [Von06], we find that our cluster postulate conforms to spectral clustering algorithms, where, to construct k clusters, the first k eigenvectors are used. Although our cluster postulate indicates that the number of clusters is one more than the best dimension, this difference is due to the fact that we normalize the eigenvectors by the one corresponding to the largest eigenvalue; by including the information of the first eigenvector in the other eigenvectors, we have excluded the use of this vector.

The idea of using the same approach, hierarchical clustering, in the LSA space confronts the obstacle of the unknown number of clusters, because the results in this space do not give us any indication about the choice of this number.

Seeing that the k-means clustering algorithm gave us similar results in both the LSA and the cosine

spaces for the data set of Example 1, we have decided to go in greater depth into the comparison

between these two spaces, and to undertake a statistical study.

By comparing the LSA and the diffusion map process in the flowcharts of Figure 4.25, we remark

that the SVD in the LSA method is applied to a term-document matrix, while in the DM approach, it is

applied to a document-document matrix. Thus, in LSA, the singular vector of a document gives its

relationship with the collection terms, while in the DM approach, a singular vector informs one about a

relationship between the collection documents.


Figure 4.25. The LSA and Diffusion Map processes. (LSA process: term-document matrix → singular value decomposition → dimension reduction. Diffusion Map process: term-document matrix → document-document Markov matrix → singular value decomposition → dimension reduction.)

Even though each topic has its own specific terms to describe it, this does not negate the fact that

close topics could have some specific terms in common. Thus, when we classify documents based on

their terms, as the number of topics in the same collection decreases, clusters become more separated.

Reciprocally, when we have a large number of topics, we could get overlapping clusters due to the terms

in common between these topics. However, if clustering is based on the relationship between

documents, this problem will be minimized.

For the statistical study, we have used 45 sets of data formed by documents from the Reuters,

Ohsumed [HBL94] and/or Cran and Cisi collections. These sets were formed such that 15 of them contain 2 clusters, 15 contain 3 clusters, and 15 contain 4 clusters.

The statistical study we undertake is based on the dependent t-test [PTS92], because we have repeated measures. The t-test equation in this case is defined by $t = \dfrac{\sqrt{N}\, \bar{X}_D}{S_D}$, where each component of the vector D is the difference between a pair of the accuracies or the mutual information values in the cosine diffusion and the LSA spaces, for a set of data. N is the length of the vector D, or explicitly, the number of data sets. $\bar{X}_D$ is the mean of D, and $S_D$ is its standard deviation, defined by $S_D = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} (d_i - \bar{X}_D)^2}$.
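A short Python/NumPy illustration of this dependent t-test is given below (the function name and the toy data are ours; the standard deviation uses the same normalization as in the formula above):

import numpy as np

def paired_t(diff):
    # diff: per-data-set differences between the two spaces (e.g., Acc_DM - Acc_LSA).
    diff = np.asarray(diff, dtype=float)
    return np.sqrt(len(diff)) * diff.mean() / diff.std()   # std uses 1/N

# Toy example with 15 hypothetical accuracy differences.
rng = np.random.default_rng(1)
print(paired_t(rng.normal(2.0, 3.0, size=15)))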


Clusters’ number    Two             Three           Four
                    Acc     MI      Acc     MI      Acc     MI
T-test value        -3.02   -3.44   0.56    0.62    1.76    3.08

Table 4.20. The statistical results for the performance of the k-means algorithm in cosine diffusion and LSA spaces.

From the results of Table 4.20, we can conclude that the k-means clustering algorithm performs very differently in the cosine diffusion space than in the LSA space, because the absolute values of the t-test statistics are much larger than the statistical significance threshold, which is usually equal to 0.05 [All07], for the three cases. Moreover, the results show that when there are only 2 topics, which implies that the term distributions for these 2 topics are disjoint, the k-means algorithm performs better in the LSA space than in the cosine diffusion space; while for multiple topics (in the cases of 3 and 4 clusters), when documents on different topics may use overlapping vocabulary, k-means performance is better in the cosine diffusion space. Furthermore, the performance difference of the k-means algorithm in the two spaces becomes larger when the number of clusters increases. On the other hand, we remark that these results conform to the ones shown in Tables 4.3, 4.6, 4.9 and 4.17.

If we take into consideration that, in the real-world clustering environment, the data sets usually

contain more than two clusters, we can conclude that the k-means algorithm performs well in the cosine

diffusion space.

4.4.2. On-line Clustering

In the following, we evaluate the results of the single-pass clustering algorithm applied in three different vector spaces: Salton space, diffusion space and updated diffusion

diffusion space, for three data sets. The evaluation is done by comparing the accuracy and the mutual

information.

The data set Cisi-Med contains all documents of the collections Cisi and Med. In Cran-Cisi-Med,

we mix all documents of the three collections: Cran, Cisi, and Med. Finally, in Cran-Cisi-Med-Reuters,

we use 500 documents from each collection of Cran, Cisi, and Med, mixed with the 425 Reuters

documents.


Set                      Salton           DM               Upd-DM
                         Acc    MI        Acc    MI        Acc    MI
Cisi-Med                 87.16  65.52     91.29  72.12     91.41  72.56
Cran-Cisi-Med            60.82  37.21     80.5   69.29     79.83  68.25
Cran-Cisi-Med-Reuters    26.07  0.24      81.61  84.08     77.87  83.89

Table 4.21. Performances of the single-pass clustering.

From the results of Table 4.21, we can see that the performance of the single-pass clustering algorithm in the diffusion space is better than in the Salton space, while it is almost identical to the performance in the updated diffusion space. More precisely, the slight performance decrease in the updated diffusion space is due to the updating process, while the dramatic drop of the mutual information measure in the Salton space, when the Reuters collection is incorporated, is due to the inexact cluster number obtained in this case, even after trying a variety of threshold values.

On the other hand, given that the embedded space is restricted to the first ten dimensions, the single-pass algorithm requires less computation time in both the diffusion and the updated diffusion spaces than in the Salton space, which more than compensates for the updating process runtime.

4. 5. Summary

In this chapter, we have proposed a process, based on the cosine diffusion map and the singular

value decomposition, to cluster on-line and off-line documents. The experimental evaluation of our

approach for classical clustering has not only shown its effectiveness, but has furthermore helped to

formulate two postulates, based on the slope discontinuity of the singular values, for choosing the

appropriate reduced dimension in the cosine diffusion space, and finding the optimal number of clusters

for well separated data. Thus, our approach has shown many advantages compared to other clustering

methods in the literature. Firstly, the use of the cosine distance to construct the kernel has

experimentally indicated a better representation of the un-normalized data in the diffusion space than the

Gaussian kernel, and minimized the computational cost by taking advantage of the “word x document”

matrix sparsity. Secondly, the running time of the k-means algorithm in the reduced dimension of the

diffusion map space is very much lower than in the Salton space. Thirdly, we formulated a simple way

to find the right reduced dimension, where a learning phase is not needed. Fourthly, the estimation of the

optimal number of clusters is immediate, not as in other approaches where some criteria are optimized

as a function of the number of clusters; in addition, our approach indicates this number even when there

is just one cluster. Finally, data representation in the cosine diffusion space has shown a non-trivial

statistical improvement in the case of multi-topic clustering compared to the representation in LSA


space.

Similarly, the on-line clustering algorithm, based on mapping the data into low-dimensional feature

space and maintaining an up-to-date clustering structure by using the singular value decomposition

updating method, has enhanced efficiency, specifically for stationary text data.


Chapter 5
Term Selection

5. 1. Introduction

As storage technologies evolve, the amount of available data explodes in both dimensions: the number of samples and the input space dimension. Therefore, one needs dimension reduction techniques to explore and analyze such huge data sets, which may lead to significant savings of computer resources and processing time.

Many feature transformation approaches have been proposed for the information retrieval context, while feature selection methods are generally used in machine learning tasks (see Section 2. 4). In this chapter, we propose to supplement, in the context of information retrieval, the feature transformation method, based in our work on the singular value decomposition (SVD), with term selection.

While latent semantic analysis, based on the SVD, has shown an effective improvement in information retrieval, it has a major difficulty in the learning phase, which is the high dimensionality of the feature space. To address this issue, we propose to use Yan's approach, which consists in extracting the generic terms [Yan05]. This approach was first proposed by Yan to improve the LSA performance; however, by studying the benefit of the generic term extraction on the English data collection (introduced in Section 3.3.1), we have remarked that this technique could be used as a dimensionality reduction method.

5. 2. Generic Terms Definition

Generic Terms are an obvious minority among all the terms in the context of text retrieval; they have

a relatively well-balanced and consistent occurrence across the majority of (if not all of) the document collection topics. Due to their distribution, which contrasts with that of their majority counterparts, the Domain Specific Terms, which have a relatively concentrated occurrence in very few topics (in the extreme case, just one topic), these terms are found to noticeably affect information retrieval performance [Yan05]. While in Yan's work the extraction of generic terms was implemented with the aim of improving the retrieval performance, our approach consists in using the same algorithm, but this time with the objective of keeping the same performance while reducing the feature space.

5. 3. Generic Terms Extraction

In order to present the generic term extracting algorithm, we first need to state some

definitions [Yan05], besides recalling the spherical k-means algorithm [DhM01].

Definition 1: The Concept Vector c of a set of n term vectors termi (1≤i≤n) is their normalized

mean (this definition is adapted from [DhM01]).


Given that the n term vectors do not diverge from each other too much, their Concept Vector can be

seen as a normalized representative vector for these n term vectors.

Mathematically, following Definition 1, we have:

$\mathbf{c} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} \mathbf{term}_i}{\left\| \frac{1}{n}\sum_{i=1}^{n} \mathbf{term}_i \right\|}$   (1)

Given that a certain document collection has t terms (keywords), we can use the spherical k-means clustering algorithm to partition these t terms into k clusters. Mathematically, we have:

$\bigcup_{j=1}^{k} Cluster_j = \{\mathbf{term}_i : 1 \le i \le t\}$   (2)

and

$Cluster_i \cap Cluster_j = \emptyset$  (i, j ∈ [1, k] and i ≠ j)   (3)

Definition 2: The Affinity between a term vector and a cluster of term vectors is the cosine of

the term vector and the Concept Vector of the cluster.

The Affinity between a term and a cluster of terms, with a range of values between −1 and 1

inclusive, indicates how closely (in terms of the absolute value of the Affinity) and in which manner (the

Affinity being positive or negative) this term is related to this cluster.

Mathematically, given a term vector term and a cluster Cluster = {term(1), term(2), ..., term(w)}, their Affinity is defined as

affinity(term, Cluster) = $\dfrac{\mathbf{term}}{\|\mathbf{term}\|} \bullet \dfrac{\mathbf{c}}{\|\mathbf{c}\|} = \dfrac{1}{\|\mathbf{term}\|}\, \mathbf{term} \bullet \mathbf{c} = \mathbf{term} \bullet \mathbf{c}$   (4)

(the simplifications holding because the term vectors and the concept vector are normalized), where $\mathbf{c} = \dfrac{\sum_{i=1}^{w} \mathbf{term}_{(i)}}{\left\| \sum_{i=1}^{w} \mathbf{term}_{(i)} \right\|}$.

Definition 3: The Affinity Set between a term vector and a partition of all terms in a document

collection is the set of Affinity values between the said term vector and each cluster of the said partition.

The Affinity Set records a number of Affinity values for a particular term across all the clusters of a

certain partition.

Mathematically, given a term vector term and a partition having k clusters $\{Cluster_j\}_{j=1}^{k}$, the Affinity Set AFN between this term and this partition is defined as:

AFN = {affinity(term, Cluster_j) : 1 ≤ j ≤ k}   (5)


Definition 4: The Characteristic Quotient (or CQ) of a term vector with respect to a partition

of all terms in a document collection is the standard deviation of the Affinity Set defined between this

term vector and this partition over the mean of all the members in the said Affinity Set.

The Characteristic Quotient of a term vector with respect to a partition provides a sensible estimate

(educated guess) on how evenly (or unevenly) the meaning of this term participates across all the

clusters of this partition.

Mathematically, given a term vector term, a partition having k clusters $\{Cluster_j\}_{j=1}^{k}$, and their Affinity Set AFN, the Characteristic Quotient of this term vector with respect to this partition is defined as:

CQ = $\dfrac{stdv(AFN)}{mean(AFN)}$   (6)

where stdv() and mean() are defined as:

mean({$x_i$ : 1 ≤ i ≤ n}) = $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

stdv({$x_i$ : 1 ≤ i ≤ n}) = $\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
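To make Definitions 1–4 concrete, the following Python/NumPy sketch (helper names are ours) computes concept vectors, affinities and Characteristic Quotients for unit-normalized term vectors; terms with the lowest CQ values are the candidates for Generic Terms:

import numpy as np

def concept_vector(terms):
    # Normalized mean of a set of term vectors given as rows (Definition 1).
    m = terms.mean(axis=0)
    return m / np.linalg.norm(m)

def affinity(term, cluster_terms):
    # Cosine of the term vector and the cluster's concept vector (Definition 2).
    c = concept_vector(cluster_terms)
    return float(term @ c) / np.linalg.norm(term)

def characteristic_quotient(term, clusters):
    # stdv/mean of the affinities over all clusters of the partition (Definitions 3-4).
    afn = np.array([affinity(term, cl) for cl in clusters])
    return afn.std(ddof=1) / afn.mean()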

Now we may formally define Generic Terms and Domain Specific Terms.

Definition 5: For a particular document collection, given all the terms and a meaningful partition

of these terms, the Generic Terms are those terms whose Characteristic Quotients are below the value

of GEN_CUTOFF.

Definition 6: For a particular document collection, given all the terms and a meaningful partition

of these terms, the Domain Specific Terms are those terms whose Characteristic Quotients are above or

equal to the value of GEN_CUTOFF.

The following two points shall clarify Definition 5 and Definition 6:

(I) The phrase meaningful partition refers to a partition that groups terms in such a way so

that terms of similar meanings are most likely located in the same cluster of the partition.

(II) GEN_CUTOFF is a small constant chosen to differentiate between Generic Terms and

Domain Specific Terms. More discussions on GEN_CUTOFF shall follow shortly.


Comparing Definition 5 to the intuitive definition (characterization) of Generic Terms at the

beginning of the current Section, we have the following observations:

(I) The new definition has the same spirit of the old one: A meaningful partition of terms into

many clusters stated in the new definition resembles a sensible grouping of documents

into many topics implied in the old one. The old definition was based on the distribution

pattern of Generic Terms over a range of document topics; the new one is based on the

participation (Affinity) pattern of Generic Terms among a number of term clusters.

(II) The new definition has an advantage over the old one: Definition 5 is a working definition of Generic Terms, which makes it possible to devise an algorithm identifying all the Generic Terms in a given document collection. In the new definition:

Characteristic Quotients are mathematically well-defined; a meaningful partition of terms

is obtainable through a clustering algorithm called Spherical k-means; and the value of

GEN_CUTOFF can be determined experimentally through trial and error.

The rationale behind the new definition of Generic Terms and Domain Specific Terms is as follows:

In a meaningful partition of terms, terms of similar meanings are grouped together cluster by cluster.

The Characteristic Quotient of a term vector with respect to this partition indicates how evenly (or

unevenly) the meaning of this term relates to all the clusters of this partition. The bigger the CQ, the more uneven the relationship, and the stronger the tendency for this term to be categorized as a Domain Specific Term. On the other hand, the smaller the CQ, the more even the relationship, and the stronger the tendency for this term to be categorized as a Generic Term. Therefore, the value of CQ may be used to identify a term as a Generic Term or a Domain

Specific Term for that matter.

It is worth noting that a limited number of terms may sit on the borderline between Generic Terms

and Domain Specific Terms, whatever the actual value of GEN_CUTOFF is. Therefore increasing the

value of GEN_CUTOFF may allow some previously categorized borderline-case Domain Specific

Terms to be newly identified as Generic Terms, and vice versa.

Practically, Yan used a simpler but equally effective method to avoid the process of determining the

actual value of GEN_CUTOFF. He set up a goal to identify a fixed number (say ng) of generic terms so

that those terms whose Characteristic Quotients are among the lowest ng of all terms are automatically

identified as generic terms, with the rest of the terms simultaneously being identified as domain specific ones. In this way, he eliminated the GEN_CUTOFF value without compromising the validity of the generic term identification process.


5.3.1. Spherical k-means

Spherical k-means [DhM01] is a variant of the well known “Euclidean” k-means algorithm [DuH73]

that uses cosine similarity [Ras92]. This algorithm partitions the high dimensional unit sphere using a

collection of great hypercircles, and hence Dhillon and Modha refer to this algorithm as the spherical k-

means algorithm. The algorithm computes a disjoint partitioning of the document vectors, and, for each

partition, computes a centroid normalized to have unit Euclidean norm. The normalized centroids

contain valuable semantic information about the clusters, and, hence, they refer to them as concept

vectors. The spherical k-means algorithm has a number of advantages from a computational perspective:

it can exploit the sparsity of the text data, it can be efficiently parallelized [DhM00], and converges

quickly (to a local maximum). Furthermore, from a statistical perspective, the algorithm generates concept

vectors that serve as a “model” which may be used to classify future documents. An adapted version of

the algorithm for term clustering is given in the top level of the GTE algorithm.
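The following is a minimal sketch of the spherical k-means iteration described above, written for a dense matrix of term vectors; the implementation details used in the thesis (sparse matrices, parallelization, the exact value of the stopping threshold τ) are not reproduced here, and the step numbers in the comments refer to the listing of Section 5.3.2.

```python
import numpy as np

def spherical_kmeans(X, k, tau=1e-4, seed=0):
    """X: (n, d) matrix whose rows are the (term) vectors to cluster.
    Returns cluster labels and unit-norm concept vectors."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # Step (I): normalize the vectors

    def concept_vectors(labels):
        C = np.zeros((k, X.shape[1]))
        for j in range(k):
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                C[j] = s / np.linalg.norm(s)               # normalized centroid
        return C

    labels = rng.integers(0, k, size=X.shape[0])           # random initial partition
    concepts = concept_vectors(labels)
    coherence = (X * concepts[labels]).sum()               # sum of term^T c over the partition
    while True:
        labels = (X @ concepts.T).argmax(axis=1)           # Step (II): most similar concept vector
        concepts = concept_vectors(labels)                  # Steps (III)-(IV): new partition and concepts
        new_coherence = (X * concepts[labels]).sum()        # Step (V): objective value
        if new_coherence - coherence < tau:                 # Step (VI): stop when improvement < tau
            return labels, concepts
        coherence = new_coherence
```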

5.3.2. Generic Term Extracting Algorithm
Based on the previous definitions and discussions, we present the generic term extracting (GTE) algorithm. Step (I) through Step (VI) are the spherical k-means sub-algorithm (adapted from [DhM01]) for achieving a meaningful partition of all terms; originally designed to analyze documents, it is used here to analyze terms. Step (VII) through Step (IX) are procedures for extracting generic terms one at a time. A

top-level flowchart of the GTE algorithm is represented in Figure 5.1.

Step (I) Initialization: (i) normalize all term vectors termi, where i ∈ [1, t];
(ii) t = t0, loop = 1;
(iii) randomly assign the t terms to k clusters, thereby obtaining {Cluster0,j}, j ∈ [1, k];
(iv) compute the concept vectors {c0,j}, j ∈ [1, k].

Step (II) For each termi (i ∈ [1, t]), compute ci* = arg max over c(loop,j) of termi^T c(loop,j), where j ∈ [1, k]. Note that if two or more concept vectors c(loop,1), ..., c(loop,n) attain this maximum, then randomly assign one of them to ci*.

Step (III) For each j ∈ [1, k], compute the new partition: Cluster(loop+1,j) = {termi : ci* = c(loop,j), 1 ≤ i ≤ t}.

Step (IV) For each j ∈ [1, k], compute the new concept vectors:
c(loop+1,j) = Σ term / ‖Σ term‖, where both sums run over term ∈ Cluster(loop+1,j).

Step (V) Compute:
improvement = Σ(j=1..k) Σ(term ∈ Cluster(loop+1,j)) term^T c(loop+1,j) − Σ(j=1..k) Σ(term ∈ Cluster(loop,j)) term^T c(loop,j).

Step (VI) Case A: IF improvement ≥ τ, then (i) loop = loop + 1; (ii) go to Step (II).
Case B: IF improvement < τ, then continue with the next step.

Step (VII) For each i ∈ [1, t], compute: AFNi = {affinityi,j : j ∈ [1, k]} = {termi^T c(loop+1,j) : j ∈ [1, k]}.

Step (VIII) Compute i* = arg min over i of stdv(AFNi) / mean(AFNi), where i ∈ [1, t].

Step (IX) Case A: IF num_gen_term < MAX_GEN_TERM, then
(i) delete termi* from the term pool {termi, i ∈ [1, t]};
(ii) add termi* to gen_term_lst, the Generic Term list;
(iii) t = t − 1, loop = 1, num_gen_term = num_gen_term + 1;
(iv) go to Step (II).
Case B: IF num_gen_term = MAX_GEN_TERM, then continue with the next step.

Step (X) Stop.

Figure 5.1 represents the extraction of a generic term (i.e., its identification and removal) from the pool of all terms. The partition of terms is re-adjusted by re-running the clustering sub-algorithm before the next generic term is identified; this measure prevents the earlier-identified generic terms from exerting compounding effects on the ones still to be identified. Therefore, there are as many rounds of running the clustering sub-algorithm as there are generic terms to be identified.


[Figure 5.1 flowchart: Start → spherical k-means clustering sub-algorithm → extracting a generic term → is the termination criterion satisfied? (No: back to the clustering sub-algorithm; Yes: Exit)]

Figure 5.1. Top-Level Flowchart of the GTE Algorithm.

The GTE algorithm is guaranteed to terminate for the following three reasons:
(I) The spherical k-means clustering algorithm (Step (I) through Step (VI)) is known to converge [DhM01].
(II) Step (VII), Step (VIII) and Step (IX) Case A are sequential procedures.
(III) The termination criterion in Step (IX) Case B guarantees that the steps mentioned in the above two points are iterated no more than MAX_GEN_TERM times.
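A compact sketch of the outer extraction loop is given below; it reuses the spherical_kmeans function sketched in Section 5.3.1 and keeps only the structure of Steps (VII)-(IX). The variable names and the stopping parameters are illustrative, and the real implementation works on the sparse term-document matrix rather than a dense array.

```python
import numpy as np

def extract_generic_terms(T, k, max_gen_terms, tau=1e-4):
    """T: (t, d) matrix of term vectors (rows). Returns the indices of the
    extracted generic terms, in extraction order."""
    remaining = list(range(T.shape[0]))            # indices of terms still in the pool
    generic = []
    while len(generic) < max_gen_terms:
        X = T[remaining]
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        _, concepts = spherical_kmeans(X, k, tau)   # Steps (I)-(VI): re-cluster the pool
        affinities = X @ concepts.T                  # Step (VII): affinities to all clusters
        cq = affinities.std(axis=1) / affinities.mean(axis=1)
        i_star = int(np.argmin(cq))                  # Step (VIII): most evenly spread term
        generic.append(remaining.pop(i_star))        # Step (IX): extract it and loop again
    return generic
```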

5. 4. Experiments and Results
With the aim of studying Yan's approach, which consists in improving the LSA performance by extracting generic terms, we propose first to index the Cisi, Cran, and Med collections (introduced in Section 3.3.1) in the native space, composed of the unique terms occurring in the documents, and then to apply the GTE algorithm, extracting various numbers of terms.

In Table 5.1, we give the index size of each collection in the native space and in the noun phrase space (used for indexation in Chapter 3); we remark that, even for these moderate-sized text collections, the index size in the native space reaches tens of thousands.


Collection Native space NP space

Cisi 9161 2087

Cran 6633 1914

Med 12173 2769

Table 5.1. Index size in the native and Noun phrase spaces.

In Table 5.2, Table 5.3, and Table 5.4, we report, respectively for the Cisi, Cran, and Med collections, the mean interpolated average precision (MIAP) for several indexes; the number of generic terms excluded from each index is indicated in the tables. We note that all these results are obtained with the best reduced dimension found in the training phase of the LSA model.

Extracted generic terms   MIAP      Extracted generic terms   MIAP
0         0.28                      1300      0.28
100       0.28                      1400      0.28
200       0.28                      1450      0.28
300       0.28                      1460      0.28
400       0.28                      1470      0.28
500       0.28                      1480      0.28
600       0.28                      1490      0.28
700       0.28                      1495      0.27
800       0.28                      1500      0.27
900       0.28                      2000      0.27
1000      0.28                      3000      0.27
1100      0.28                      3560      0.28 *
1200      0.28                      3570      0.25

Table 5.2. The MIAP measure for the collection Cisi in different indexes.


Extracted generic terms   MIAP      Extracted generic terms   MIAP
0         0.51                      600       0.50
100       0.52                      700       0.50
200       0.51                      800       0.50
300       0.51                      900       0.50
400       0.51                      1000      0.50
500       0.51                      1100      0.50
550       0.51                      1200      0.50
560       0.51                      1300      0.50
570       0.51 *                    1400      0.50
575       0.50                      1500      0.48
580       0.50                      -         -

Table 5.3. The MIAP measure for the collection Cran in different indexes.

Extracted generic terms   MIAP      Extracted generic terms   MIAP
0         0.66                      1200      0.66
100       0.66                      1300      0.66
200       0.66                      1400      0.66
300       0.66                      1410      0.66
400       0.66                      1420      0.66 *
500       0.66                      1425      0.65
600       0.66                      1430      0.65
700       0.66                      1450      0.65
800       0.66                      1500      0.65
900       0.66                      2000      0.65
1000      0.66                      3000      0.64
1100      0.66                      3500      0.63

Table 5.4. The MIAP measure for the collection Med in different indexes.

By analyzing these tables, we remark that there is no apparent improvement in the LSA performance: the existing improvement is very small, affecting the third decimal place in the case of the Cisi and Med collections, while in the case of the Cran collection it affects the second decimal place for an improvement of less than 1%. Moreover, this improvement is observed only when no more than the first six hundred generic terms are excluded. For this reason, we propose to use the generic term extracting algorithm as a dimensionality reduction technique, keeping the same performance achieved in the native space while excluding larger numbers of generic terms. As shown by the rows marked with an asterisk in the tables, this approach allows us to exclude respectively 3560, 570, and 1420 terms from the Cisi, Cran, and Med collections, which represent respectively about 38.8%, 8.6%, and 11.7% of the index size in the native space.

On the other hand, by comparing these results to those achieved in Section 3.3.2.1, and recalled in

Table 5.5, we remark a large trade-off between sizes indicated in Table 5.1 and performances indicated

in Table 5.5, especially for Cran and Med collections.

Collection Native space NP space

Cisi 0.28 0.32

Cran 0.51 0.47

Med 0.66 0.26

Table 5.5. LSA performance in the native and Noun phrase spaces.

The reduction of the dimension may lead to significant savings in computer resources and processing time. However, poor feature selection may dramatically degrade the information retrieval system's performance. This is clearly observed when NP indexation is used or when a large number of terms is excluded by the GTE algorithm: in the case of excluding 3570, 1500, and 3500 generic terms respectively from the Cisi, Cran, and Med collections, we get a performance degradation of 3%. Thus, by removing many terms, the risk of removing potentially useful information about the meaning of the documents becomes larger. It is then clear that, in order to obtain optimal (cost-)effectiveness, the reduction process must be performed with care.

5. 5. The GTE Algorithm Advantage and Limitation
Few term selection methods have the advantage of taking into account the interactions between terms [CRJ03]; usually, the role of each term is evaluated independently of the others. By analysing the results of the Cisi collection in Table 5.2, we remark that the LSA performance after excluding 3560 terms reaches the same level as in the native space, even though this performance had previously decreased by 1%. This remark shows that the exclusion of a specific number of terms, using the GTE algorithm, could positively or negatively affect the LSA concepts because of the interactions between terms.
Although the GTE algorithm has this advantage, it has a limitation: it does not proceed automatically in the elimination of terms, owing to the fact that the performance is not monotone.

5. 6. Summary
By proposing to supplement, in the context of information retrieval, the feature transformation method based on singular value decomposition with term selection, we have used Yan's approach. Initially, this approach, which consists in extracting generic terms, was proposed to improve the performance of the LSA model; however, we have used it for reducing the index size. In fact, the exclusion of generic terms not only reduces the storage requirement, but it can also influence a large number of LSA concepts in an unpredictable way.


Chapter 6 Information Retrieval in Arabic Language
6. 1. Introduction
Arabic texts are becoming widely available but, owing to the challenges posed by the characteristics of Arabic, freely available corpora, automatic processing tools, and established standard IR-oriented algorithms are still lacking for this language.

In order to develop an Arabic IR system, we think that the improvement of former systems may yield a predictive model to accelerate their processing and to obtain reliable results. So, with the objective of a specific study and a possible performance improvement of Arabic information retrieval systems, we have created an analysis corpus and a reference corpus, specialized in the environment field, and we have proposed to use the latent semantic analysis method to cure the problems arising from the vector-space model. We have also studied how linguistic processing and weighting schemes could improve the LSA method, and we have compared the performance of the vector-space model and the LSA approach for the Arabic language.

As is generally known, the Arabic language is complicated for natural language processing due to

two main language characteristics. The first is the agglutinative nature of the language and the second is

the aspect of the vowellessness of the language, causing ambiguity problems at different levels. In this

work, we are especially interested in the agglutination problem.

6. 2. Creating the Test Set
"A corpus, or a set of textual documents, could be seen as a language sample. For this reason, corpora are used for the automatic processing of natural language. The more extensive and varied the corpus, the more representative the sample." [Lap00].
A text corpus represents real usage of a language, and provides an objective reference to analyze or even obtain formal descriptions of a language. A reference corpus must satisfy two requirements: one is to be sufficiently large; the other is to allow a diversity of usages, such as training and testing.

6.2.1. Motivation
In recent work within the framework of information retrieval and automatic processing of the Arabic language, some sizeable newspaper corpora (cf. Section 2.6) have started to become available. However, they are not free, and the topics treated by these corpora remain of a general nature, without covering a specialized scientific field such as the environment. For these two reasons, we have been interested in building our own corpus.

6.2.2. Reference Corpus

6.2.2.1. Description
The development of the corpus proceeded in two stages: Web harvesting and text normalization.

These steps were executed by native speakers.

As preliminary processing, the source of our corpus was chosen from archived articles cited on the Web sites "Al-Khat Alakhdar"24 and "Akhbar Albiae"25, whose subjects cover various environmental topics such as pollution, noise effects, water purification, soil degradation, forest preservation, climate change and natural disasters. Thus, we have chosen for this corpus the name [AR-ENV], where AR designates the Arabic language and ENV the environmental theme of the corpus.

The search for documents on the chosen topics was carried out with keywords, which must be precise yet allow a wide spectrum of documents to be found in terms of genre. Two search strategies are possible: the first, in breadth, reviews most of the documents returned by a single query; the second, in depth, examines only the first documents and explores their links. We conducted a thorough search on the top twenty results using combinations of keywords such as environment, pollution, and noise. To ensure proper coverage of the subject, we expanded the search using synonyms in the search engines of both Web sites and in the Arabic version of the Google search engine26, and by relying on terms found in the visited pages, such as noise pollution or degradation. We note that, while gathering the corpus, we focused on the variety of the parameters involved in the creation of a document, such as the document producer(s), the context of production or usage, the date of production, and the document size, which varies between one paragraph and forty pages.

On the other hand, for each article, we have saved its URL, converted it from its original HTML form to a text file in UNICODE format, then unified its content as explained in the mutation process in Section 6.3.1.1 of this chapter.

The corpus in its primary phase [ABE08] consists of 1 060 documents, containing 475 148 tokens of which 54 705 are distinct, and of 30 queries: 15 of them are used in the training phase for choosing the best reduced dimension of the LSA model, and the other 15 for testing. These statistics are summarized in Table 6.1.

24 http://www.greenline.com.kw/home.asp, Retrieved on 10-22-2007. 25 http://www.4eco.com/, Retrieved on 10-22-2007. 26 http://www.google.com/intl/ar/.


Statistics [AR-ENV] Corpus

Document Number 1 060

Query Number 30

Token Number 475 148

Distinct Word Number 54 705

Table 6.1. [AR-ENV] Corpus Statistics.

The creation of the queries was adjusted and calibrated manually to the collection documents. The first queries were inspired by a first reading of the corpus documents; after that, general and ambiguous queries were excluded. Each query is subdivided into three parts, containing a short title, a descriptive sentence, and a narrative part specifying the relevance criteria, as illustrated by the example in Table 6.2, which includes the original Arabic and an English version of the query text. The average length of queries containing just the title of the information needed is limited to approximately 2.70 tokens per query; but when the logical parts "description" or "narration" are taken into account, the average length of the queries becomes approximately 16.17 tokens per query.

<title> [Arabic query title] </title>
<desc> [Arabic query description] </desc>
<narr> [Arabic query narrative] </narr>

<title> The forest preservation </title>

<desc> Look for articles on preservation of the forest. </desc>

<narr> Relevant documents take up the prosecution of the forest destroyers, the ways of promoting

reforestation, and the deployment of green-space area. </narr>

Table 6.2. An example illustrating the typical approach to query term selection.

The relevance assessment was also performed manually, by having native Arabic-speaking reviewers read and check the whole document collection. After picking out all documents specified as relevant for a given query, a document is admitted as relevant to that particular query according to a majority rule: i.e., a document is defined as relevant to a particular query only if at least three out of five reviewers agree on its relevance.

The size of this corpus, used in our study as a reference corpus, although still modest, can guarantee

that the articles discuss a wide range of subjects and that their content is, to some extent, heterogeneous.


The selected articles were published over a period of three years, from 2003 to 2006. We believe that, due to this period of time, most of the topics covered by the corpus are well represented.

6.2.2.2. Corpus Assessments
The characteristics of any corpus determine the performance of the IR and NLP techniques that use that corpus as a resource or dataset. Therefore, linguists carry out a variety of tests to evaluate the appropriateness of the data. These measures and evaluations vary with the task, the language, and the techniques. Assessment tools can rearrange such a corpus store so that various observations can be made. Using corpus assessment tools, we first validate the collections by applying statistical and probability tests, such as Zipf's law and the Token-to-Type Ratio. These tests address the ultimate problem closely linked with corpus size and representativeness. They are useful for describing the frequency distribution of the words in the corpus. They are also well-known tests for gauging data sparseness and providing evidence of any imbalance in the dataset.

Zipf’s law

According to Zipf’s law, if we count up how often each word occurs in a corpus and then list these

words in the order of their frequency of occurrence, then the relationship between the frequency of a

given word f and its position in the list (its rank r) is such that their product is a constant k: f · r = k.

Ideally, a simple graph for the above equation using logarithmic scale will show a straight line with a

slope of –1. So the situation in the corpus was checked by starting with one file and increasingly adding

more files to a corpus and checking the behavior of the relation between the rank and the frequency. An

enhanced theory of Zipf’s law is the Mandelbrot distribution. Mandelbrot notes that “although Zipf’s

formula gives the general shape of the curves, it is very bad in reflecting the details” [MaS99]. So to

achieve a closer fit to the empirical distribution of words, Mandelbrot derived the following formula for

a relation between the frequency and the rank:

f = P (r + ρ)^(−b)

where P, b, and ρ are parameters of the text that collectively measure the richness of the text’s use of

words. The common factor is that there is still a hyperbolic relation between the rank and the frequency

as in the original equation of Zipf’s law. If this formula is graphed on doubly logarithmic axes, it closely

approximates a straight line descending with a slope –b just as Zipf’s law describes (See Figure 6.1).

The graph shows Rank on the X-axis versus Frequency on the Y-axis, using logarithmic scales. The magenta line corresponds to the ranks and frequencies of words over all the documents of our corpus. The straight cyan line shows the relationship between Rank and Frequency predicted by Zipf's formula f · r = k.


Figure 6.1. Zipf’ law and word frequency versus rank in the [AR-ENV] collection.

Token-to-Type Ratio (TTR)

Token-to-type ratio is another measure used to evaluate a collection or a dataset for its

appropriateness to be used in an IR or NLP task. The measure reflects mainly the sparseness of the data

[Sch02, Ars04].

Text length    Bengali (CILL)    English (Brown)    Arabic (Al-Hayat)
100            1.204             1.449              1.19
1 600          2.288             2.576              1.774
6 400          3.309             4.702              2.357
16 000         4.663             5.928              2.771
20 000         5.209             6.341              2.875
1 000 000      10.811            20.408             8.252

Table 6.3. Token-to-type ratios for fragments of different lengths, from various corpora.

The measure is obtained by dividing the number of tokens (text length) by the number of distinct words (types). It is sensitive to sample size, with lower ratios (i.e. a higher proportion of "new words") expected for smaller (and therefore sparser) samples. A 1 000-word article might have a TTR of 2.5; a shorter one might reach 1.3; 4 million words will probably give a token-to-type ratio of about 50, and so on. The factors that influence the TTR for raw textual data include various morphosyntactic features and orthographic conventions (see Table 6.3).
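As an illustration, a minimal sketch of how such token-to-type ratios can be computed for fragments of increasing length; the whitespace tokenization is a simplification of the preprocessing actually used.

```python
def token_to_type_ratios(tokens, lengths=(100, 1600, 6400, 16000, 20000)):
    """For each fragment length, divide the number of tokens by the
    number of distinct word forms (types) in that fragment."""
    ratios = {}
    for n in lengths:
        fragment = tokens[:n]
        if fragment:
            ratios[n] = len(fragment) / len(set(fragment))
    return ratios

# tokens = open("corpus.txt", encoding="utf-8").read().split()   # placeholder
# print(token_to_type_ratios(tokens))
```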


For instance, the presence of a case system in a language will lead to a comparatively lower token-to-type ratio. Arabic, a language with a highly inflective morphology, has a very low token-to-type ratio compared to English [Yah89]. Figure 6.2 shows the TTR for our collection. The results confirm the former findings of Yahya [Yah89], Goweder & De Roeck [GoD01] and Abdelali [Abd04].

Figure 6.2. Token-to-type ratios (TTR) for the [AR-ENV] collection.

Other Measures

To measure the lexical richness of the [AR-ENV] corpus, we have also used in the context of lexical

categories the lexical coverage and the grammatical category distribution measures. For further details

on these two metrics, consult Boulaknadel's dissertation [Bou08].

6.2.3. Analysis Corpus
The process undertaken in the work presented in this chapter is based on the opposition of two corpora of different properties. The first corpus is an analysis set [AR-RS] and the second is a reference set [AR-ENV]. As a first step of our study, we apply our experimental protocol to the analysis corpus. This approach has the advantage of ensuring that the observed results are stable, independent of a particular corpus, and not coincidental.
Our analysis corpus is a set of articles retrieved from the Web27 on the Royal Speeches, published between March 2002 and December 2006. The total size of this corpus is 101 072 occurrences corresponding roughly to 20 107 types, with 10 queries for the training of the best dimension choice of the LSA model and 10 others for testing. Each query is consistent with the form described in Table 6.2.

27 http://www.maec.gov.ma/arabe, Retrieved on 12-26-2007.


6. 3. Experimental Protocol
In contrast to other languages such as French and English, Arabic is an agglutinative language, where words are preceded and followed by prefixes and suffixes. Moreover, diacritic marks are commonly used in this language. Therefore, an adaptation of the standardized information retrieval system presented in Figure 6.3 is needed, specifically in the preprocessing phase.

[Figure 6.3 flowchart: the corpus and the user query pass through natural language processing (tokenization, stop-word list, stemming), producing a vector-model index and query whose correspondence is computed.]

Figure 6.3. A standardized information retrieval system.

6.3.1. Corpus Processing

6.3.1.1. Arabic Corpus Pre-processing
The pre-processing of a corpus helps to format the textual data and prepare it for subsequent processing. Generally, tokenization, stop-word removal and stemming are the basic natural language processes used in an information retrieval system (see Figure 6.3). However, particular languages, as is the case for Arabic, need other natural language processes to meet their specific requirements.
The absence of a standardized Arabic information retrieval system was the first challenge confronting our study. To deal with this problem, in addition to acquiring a good knowledge of the Arabic language characteristics and anomalies (introduced in Section 2.5.5), we have investigated many Arabic studies, research works and preprocessing tools (referenced in Section 2.5.6). Thus, we managed to propose the system presented in Figure 6.4, and tried to improve its performance further.

Page 119: Information Retrieval

Information Retrieval in Arabic Language

___________________________________________________________________________________________

Fadoua Ataa Allah’s Thesis

101

[Figure 6.4 flowchart: the corpus and the user query pass through diacritic removal, mutation, tokenization, stop-word removal, stemming, and a second stop-word removal, producing a vector-model index and query whose correspondence is computed.]

Figure 6.4. An information retrieval system for Arabic language.

As visualized in Figure 6.4, the removal of diacritics and the mutation of document and query

content are the preliminary processes of an Arabic information retrieval system (AIRS), then a

tokenization process takes place. Stop words are removed before and after a stemming phase, and finally

the remaining tokens are indexed to be ready for a query evaluation.

All the mentioned pre-processes are described below according to their order of appearance in

Figure 6.4.

Diacritic Elimination
Our corpus contains almost no diacritized text; the few words carrying vowels are detected and their diacritics eliminated. The process removes all the diacritics except the diacritic 'shaddah' (ّ), since 'shaddah' is placed above a consonant letter as a sign of the duplication of that consonant; thus, it acts like a letter. In modern Arabic writing, people rely on their knowledge of the language and of the context while writing Arabic text. The Arabic surface form can be fully, partially, or entirely free of diacritics. The incompleteness of the surface orthography in most standard written Arabic makes written Arabic words ambiguous. Thus, removing diacritics is of great importance for normalizing the queries and the collection.

Mutation
The mutation consists in normalizing letters that appear in several distinct forms, in combination with various letters, to one form. The process changes the letters 'hamza-above-alif' (أ), 'hamza-under-alif' (إ) and 'alif-madah' (آ) to plain 'alif' (ا). The reason behind this conversion is that most people do not formally write the appropriate 'alif' at the beginning of a word; thus, the letter 'alif' is a source of ambiguity. For example, the verb أخذ, which means take in English, and the plural noun أحرف, which means letters in English, can also be written اخذ and احرف. The normalization process preserves the word sense intact. The same holds for words that contain 'hamza-under-alif', such as إنسان, which means human in English and can be written as انسان. Similarly, the letter 'ta-marbutah' (ة) that occurs at the end of an Arabic word, and which mostly indicates a feminine noun, is in many cases written as 'ha' (ه), which makes the word ambiguous. To resolve the ambiguity, we replace any occurrence of ة at the end of a word with ه, since a feminine noun ending in ة may alternately appear in Arabic text with either ending. We further replace the sequence of 'alif-maksoura' (ى) positioned before the last letter of a word and 'hamza' (ء) positioned at the end of the word with ئ ('alif-maksoura-mahmozah'). Similarly, we replace the sequence of 'ya' (ي) positioned before the last letter of a word and ء positioned at the end of the word with ئ.
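A sketch of the mutation step as described above; the substitution of the alif-maksoura/ya + hamza sequences is a simplified approximation of the rules given in the text.

```python
import re

def mutate(text):
    """Normalize letter variants that cause spelling ambiguity in Arabic text."""
    text = re.sub("[أإآ]", "ا", text)       # hamza/madda alif forms -> plain alif
    text = re.sub(r"ة\b", "ه", text)         # word-final ta-marbutah -> ha
    text = re.sub(r"[ىي]ء\b", "ئ", text)     # word-final alif-maksoura/ya + hamza -> ئ
    return text
```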

Tokenization
In natural language processing, the basic tokenization process consists in recognizing words by relying on white-space characters and punctuation marks as explicit delimiters. But, due to its complicated morphology and agglutinative character, the Arabic language needs an enhanced, purpose-designed tokenizer to detect and separate words. However, in this system only a basic tokenizer is used.

Stop Word Removal
By comparing the flowcharts of Figure 6.3 and Figure 6.4, it appears clearly that the Arabic stemming process is both preceded and followed by a stop-word removal process; this is done for two reasons. The first reason is that some prefixes are part of some stop-words. For example, in the demonstrative pronouns الذين meaning those, and الذي and التي meaning that, respectively for a masculine and a feminine person or thing, the definite article ال meaning the, considered as a prefix in Arabic, is a main part of these pronouns and should not be eliminated. The second reason is that, when the stemming process is applied without being followed by a stop-word removal step, the majority of stop-words might not be eliminated. For example, the token قبله meaning before him becomes, after stemming, قبل meaning before, which is a stop-word.

Stemming
As reviewed in Section 2.5.6, at the time this study was undertaken, the light stemming of Darwish as modified by Larkey [LBC02] was the best-performing language process for information retrieval in Arabic [AlF02, GPD04, TEC05]. For this reason, we have chosen to use this approach in the stemming phase of the proposed AIR system.
The chosen approach is used not to produce the linguistic root of a given Arabic surface form, but to remove the most frequent suffixes and prefixes, including duals and plurals for masculine and feminine,


possessive forms, definite articles, and pronouns. A detailed list of these prefixes and suffixes is

presented in Section A.2.4.
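As an illustration of the light-stemming idea only, the following sketch strips a few of the most frequent Arabic prefixes and suffixes; the affix lists are an illustrative subset, not Larkey's exact lists, which are given in Section A.2.4.

```python
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]                 # illustrative subset only
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]  # illustrative subset only

def light_stem(token, min_len=3):
    """Remove frequent Arabic prefixes and suffixes while keeping the stem long enough."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= min_len:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= min_len:
            token = token[:-len(s)]
            break
    return token
```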

6.3.1.2. Processing Stage
In this stage, the retrieval process is performed. To the best of our knowledge, the vector-space model (VSM), known as Salton's model [SaM83], was the only model used for retrieval in Arabic before our work [BoA05], where the latent semantic analysis (LSA) model is used. The first model is highlighted in Section 2.2.2, while the second is detailed in Chapter 3.

6.3.2. Evaluations
With the objective of improving the performance of the proposed system, we evaluate its effectiveness, together with some other suggestions described and discussed below, on both created corpora: the analysis corpus [AR-RS] and the reference corpus [AR-ENV].

6.3.2.1. Weighting Schemes' Impact
In the case of the extended vector-space model, latent semantic analysis, we have explored the effect of five weighting schemes in two study cases: (a) short queries and (b) long queries. Based on the four best-performing weighting schemes for the LSA model found in the work of Ataa Allah et al. [ABE05], in addition to the Okapi BM-25, we compare in Figure 6.5 the performance of the four weighting schemes log(tf+1)xIdf, TfxIdf, Ltc, and Tfc on the reference corpus [AR-ENV]. We remark that the log(tf+1)xIdf scheme improves the model, while the other three show improvements and degradations among themselves over the recall rates. However, the Okapi BM-25 weighting scheme improves the LSA method further, reaching a gain of 6.17% and 6.15% respectively for short and long queries over log(tf+1)xIdf, and of 16.75% and 33.33% when the data are not weighted. This gain is due to the normalization factor that characterizes the Okapi BM-25 weighting scheme compared to the other schemes (see Section 3.2.2).

Likewise, we have seen similar behavior of the five weighting schemes on the analysis corpus. The

Okapi BM-25 weighting scheme has increased the model performance by 4.48% and 3.08% respectively

for short and long queries over the log(tf+1)xIdf, and by 11.01% and 49.46% when no weighting

scheme is used.


Figure 6.5. Comparison between the performances of the LSA model for five weighting schemes: (a) short queries; (b) long queries.

6.3.2.2. Basic Language Processing Usefulness
In order to conflate related tokens, we have stemmed the tokens of our corpora. The primary benefit of this preprocessing is the reduction of the database index size. While the "tokens × documents" matrix of the reference corpus [AR-ENV] has 54 705 tokens, after the stemming process this number is reduced to 22 553, and to 22 491 tokens after the elimination of stop-words.
We have carried out some experiments to evaluate the usefulness of these preprocessing steps for Arabic information retrieval in two study cases. In the first case, no weighting was applied, while in the second the Okapi BM-25 weighting is used.

The results of Figure 6.6-a (reference corpus, short queries, no weighting) show that the improvement made by the use of the stop-word list is not very significant; however, the gain given by the combination of the stemming process and the elimination of stop-words is more interesting, reaching 3.45%.
Also on the analysis corpus, the results showed that the use of a stop-word list is not necessary, while a significant improvement, equal to 5.01%, was observed for the combination of the stemming process and the elimination of stop-words.
For long queries, the use of the stop-word list brings a gain of 3.27% compared to the case where no preprocessing is used, and of 21.06% when stemming is applied.

Similarly, in the case of the reference corpus (long queries, Figure 6.6-b), we find that the elimination of stop-words presents a benefit of 3.22% compared to the case where all preprocessing approaches are ignored, and a benefit of 12.47% compared with the stemming approach. These results reflect the importance of implementing a second stage of stop-word removal in an Arabic retrieval system (the process presented after the stemming in Figure 6.4). We also note that the combination of stemming


and the elimination of stop-words in this case is more interesting and gives a gain of 17.47%.

Effectively, removing stop words has a significantly positive effect for stemmed Arabic, but not for

unstemmed Arabic. This difference is due to the fact that stem classes for stop words contain larger

numbers of unrelated word variants than stem classes for other words.

By comparing the results of short and long queries, we note that the effect of the stop-word removal

process depends mainly on the queries’ model.

Figure 6.6. Language processing benefit: (a, b) no weighting, short and long queries; (c, d) Okapi BM-25 weighting, short and long queries.

In Figure 6.6-c (reference corpus, short queries, Okapi BM-25 weighting), the results show that the

performance provided by the stemming reached a gain of 4.80%, while for long queries (Figure 6.6-d), it

gives a gain of 3.50%. Similarly, the same observations were found on the analysis corpus, as the


stemming brings a benefit of 6.31% in the case of short queries and 4.65% in the case of long queries.

However, the elimination of stop-words is not significant for the two corpora, for both query models, long and short.

We note that, to avoid the cost of testing each corpus token against the stop-word list in order to eliminate the listed ones, we can use weighting schemes such as Okapi BM-25, log(tf+1)xIdf, TfxIdf, Ltc and Tfc, which minimize the effect of these tokens in particular and of high-frequency tokens in general. This is confirmed by the results of one of our previous works [BoA05].

Based on the latter conclusion, we propose a new system for Arabic information retrieval (see Figure 6.7), where the stop-word removal phase is excluded. Thus, we choose to use a light-stemmed corpus weighted by the Okapi BM-25 scheme for the remaining experiments, unless otherwise mentioned.

[Figure 6.7 flowchart: the corpus and the user query pass through diacritic removal, mutation, tokenization, and stemming, producing a vector-model index and query whose correspondence is computed.]

Figure 6.7. A new information retrieval system suggested for Arabic language.

6.3.2.3. The LSA Model Benefit
In this section, we compare the performances of the standard vector-space model (VSM) and the latent semantic analysis (LSA) model, in the case of our Arabic corpora [AR-RS] and [AR-ENV].


Figure 6.8. A comparison between the performances of the VSM and the LSA models: (a) short queries; (b) long queries.

For the reference corpus, the curves of Figure 6.8-a, representing the results in the case of short queries, show a significant statistical difference of 15.90% in favor of the LSA model over the standard VSM, while the curves in Figure 6.8-b, representing the results in the case of long queries, show a gain of 16.30%. Similarly, the experiments performed on the analysis corpus showed an improvement of 12.77% for the LSA model in the case of short queries, and of 13.73% in the case of long queries.
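A minimal sketch of the LSA retrieval step on top of a weighted term-document matrix; the dimension k corresponds to the "best reduced dimension" selected on the training queries, and the query fold-in convention shown here is one common choice rather than the thesis' exact formulation.

```python
import numpy as np

def lsa_rank(A, q, k):
    """A: (n_terms, n_docs) weighted term-document matrix; q: (n_terms,) weighted query.
    Returns document indices ranked by cosine similarity in the rank-k latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T        # truncated SVD: A ~ Uk diag(sk) Vk^T
    q_hat = (q @ Uk) / sk                             # fold the query in as a pseudo-document
    sims = Vk @ q_hat / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return np.argsort(-sims)                          # document indices, best first
```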

6.3.2.4. The Impact of Weighting Queries
We want to point out that we have used weighted queries in the former experiments. In this part, however, we are interested in studying the contribution of the weighting for two query models: short and long.
We have remarked that query weighting gives, in the case of the analysis corpus [AR-RS], an improvement of 3.52% for short queries over the long ones, and a benefit of 2.70%, presented in Figure 6.9-a, in the case of the validation corpus [AR-ENV]. The fact that these results do not comply with what is found in the literature [Sav02] has pushed us to seek the cause of this difference.

Figure 6.9. Weighting queries' impact: (a) weighted queries; (b) un-weighted queries.

To this end, we have decided to evaluate the performance of the information retrieval system with un-weighted queries. This experiment shows that, compared to short queries, long queries increase the performance by 5.03% for the analysis corpus and 4.30% for the reference corpus, while short queries decrease the performance by 3.72% and 3.60% respectively for the analysis and reference corpora.
Given that short queries, containing just the key words of the information needed, better reflect real usage, especially that of the Internet, while long queries containing the "description" or "narration" parts move the user away from the reality of the Web, we suggest using an information retrieval system where queries are weighted.

6.3.2.5. Noun Phrase Indexation
Still with the aim of improving the performance of the Arabic information retrieval system, we have chosen to study the effect of indexation by noun phrases (NP) [ABE06], which appear to be better suited to indicating semantic entities than single terms [Ama02]. To this end, we need to adapt our system to the new approach.

a- Arabic Information Retrieval System based on NP Extraction
In this approach, the corpus preprocessing differs from that used in the suggested Arabic information retrieval system presented in Figure 6.7, first in the use of the Buckwalter transliteration as a second step of the system, and second in the use of part-of-speech (POS) tagging and NP extraction processes performed before stemming. These new processes are defined in Section A.3 and commented on below.


[Figure 6.10 flowchart: the corpus and the user query pass through diacritic removal, transliteration, tokenization, POS tagging, NP extraction, and stemming, producing an LSA index and query whose correspondence is computed.]

Figure 6.10. Arabic Information Retrieval System based on NP Extraction.

Buckwalter Transliteration

Taking into account that, at the time this work was done, no tool for Arabic POS tagging operated directly on Arabic script, we have applied the Buckwalter transliteration, which consists in converting Arabic characters into Latin ones.

Part Of Speech Tagging

The part-of-speech (POS) tagging task consists in analyzing texts in order to assign an appropriate syntactic category to each word (noun, verb, adjective, preposition, etc.).
For the Arabic language, many part-of-speech taggers have been developed, which we can classify into different categories. The techniques of the first class are based on tagsets derived from an Indo-European tagset, whereas the tagsets used in the second category are derived from traditional Arabic grammatical theory. The taggers of the third class are considered hybrid, based on statistical and rule-based techniques, while in the fourth category machine learning is used.

The works on Arabic tagging that we are aware of are Diab's POS tagger [DHJ04], which combines techniques of the first and the fourth classes; the Arabic Brill POS tagger [Fre01], which uses techniques of the first and the third categories; and APT [Kho01], based on the second and the third class techniques.

In this study, simply to be consistent with the Base Phrase (BP) Chunker [DHJ04] (see Section A.3.3 for the chunker definition), we have chosen to use Diab's tagger. In this tagger, a large set of Arabic tags has been mapped (by the Linguistic Data Consortium) to a small subset of the English tagset introduced with the English Penn Treebank.


Noun Phrase Extraction

In this step, we are interested in Noun Phrases (NP) at syntagmatic level of the linguistic analysis.

For that, we adapted the SVM-BP chunker, based on a supervised machine learning perspective using Support Vector Machines (SVMs) trained on the Arabic TreeBank, which consists in creating non-recursive base phrases such as noun phrases, adjectival phrases, verb phrases, preposition phrases, etc.

b- Noun Phrase Indexation Effect
Assuming that complex term indexation could constitute a better representation of the text content than single terms, we have adopted a noun phrase (NP) indexing method where the text is processed while keeping the information related to the document's syntagmatic relations.

To this end, we have tested the performance of the AIRS based on NP extraction by comparing the behavior of the system under two indexation strategies with the performance of the suggested AIRS. 'Strategy 1' designates indexation based on single terms, performed by the AIRS based on NP extraction after dropping the part-of-speech tagging and noun phrase extraction steps; this makes the system equivalent to the suggested AIRS presented in Figure 6.7, the only difference being the transliteration process. 'Strategy 2' is based on noun phrase indexation, where a new vector is created corresponding to the noun phrases extracted from each document, while 'Strategy 3' is based on indexation by single terms supplemented with noun phrases.

Figure 6.11. Influence of the NP and single-term indexations on the IRS performance.

By comparing the curves of Figure 6.11, we remark that the use of noun phrases in the indexation process drops the performance below that of the system based on single terms. However, in the third strategy, when we attempt to remedy the situation by combining single terms and noun phrases, we remark that this approach outperforms the second strategy but not the first one. We conclude that the system based on single terms always yields the best performance at low recall rates, which is the most important range for a user, since a user is more interested in the relevant documents at the top of the returned list.

c- Discussion
In this section, we discuss the results presented in the previous subsection and attempt to reason about some of their properties. We do so by first giving a small overview of the studies undertaken in the field of NP indexation.

Previous studies for the English, French, and Chinese languages showed that the use of noun phrases in representing the document content could improve the effectiveness of an automatic information retrieval system. Mitra et al. [MBS97] showed that re-indexing with noun phrases the first 100 documents retrieved by the SMART system gives a benefit at low recall. However, TREC campaigns showed that noun phrase indexing approaches do not necessarily enhance the retrieval performance, and that this improvement can depend on the size of the collection and on the query topic [Fag87, EGH91, ZTM96]. The results of the PRISE system [SLP97], based on noun phrase extraction by the Tagged Text Parser, are a good example of the difficulty of evaluating the effect of syntagmatic analysis on the IRS, since the performances obtained were not significant.

In line with this, our experimental results show that, for the Arabic language, NP-based indexing decreases the retrieval performance compared to single-term-based indexing. We could explain this drop by the noun phrase size and the lack of normalization; for example, كارثة تلوث الهواء "air pollution disaster" and تلوث الهواء "air pollution" should be normalized under "air pollution". It could also be explained by the use of a morpho-syntactic parser and chunker based on supervised learning depending on an annotated corpus, and not on specific syntactic pattern rules.

We think that the use of a morpho-syntactic parser and chunker based on syntactic pattern rules, the use of a part-of-speech tagger based on statistical and rule-based techniques, or a tagset derived from Arabic grammatical theory could resolve the specified problems and be more effective. With this aim, a deep study was undertaken by Boulaknadel [Bou08].

6. 4. Summary
In this chapter, we have presented an evaluation of the vector-space model and the LSA method, while performing linguistic processing and using weighting schemes, on the Arabic analysis and reference corpora that we created for this purpose.
The experiments undertaken showed that light stemming increases the performance of the Arabic information retrieval system, especially when the Okapi BM-25 scheme is used, thus confirming the


fact that linguistic preprocessing is an essential step in the information retrieval process. The study also showed that the elimination of stop-words in retrieval for Arabic, a language known for its agglutinative character, can be avoided by applying weighting schemes that address the issue of high-frequency words in a corpus. However, noun phrase based indexing, even when supplemented by single-term based indexing, decreases this performance.

On the other hand, by comparing the performance of the vector-space model to that of the LSA model, we remark an important improvement in favor of the latter. By evaluating the influence of weighting schemes on the query models, we ascertain the usefulness of short queries, which represent the reality of the Web.
Similarly to the results of Chapter 3, the experiments of this chapter also showed that Okapi BM-25 is the best weighting scheme among those evaluated in this work.

To conclude the study carried out on Arabic retrieval, we can state that the suggested system, based on light stemming, the Okapi BM-25 scheme, short weighted queries, and the LSA model, could be used as a standardized system.


Chapter 7 Conclusion and Future Work
7. 1. Conclusion
This dissertation advances a review of information retrieval models and of the taxonomy of clustering algorithms, while explaining the utility of clustering in the information retrieval context. Besides, it has studied the state of the art in dimensionality reduction techniques, especially feature extraction methods, and in Arabic information retrieval techniques, after recalling the Arabic language characteristics and Arabic corpora.

The key contribution of this work lies in providing an Arabic information retrieval system based on light stemming, the Okapi BM-25 weighting scheme, and the latent semantic model, by building analysis and reference Arabic corpora, improving prior models addressing Arabic document retrieval problems, and comparing specific weighting schemes. In addition, other approaches have been proposed in document clustering and dimensionality reduction.

In clustering, we have proposed to use the diffusion map space based on the cosine kernel, where the results of the k-means clustering algorithm have shown that the indexation of documents in this space is more effective than in the diffusion map space based on the Gaussian kernel, the Salton space and the LSA space, especially in the multi-cluster case. Moreover, the use of the k-means algorithm in this space has met the requirements of soundness and efficiency. We have also provided, when the singular value decomposition method is used in the construction of the diffusion space, a technique for resolving the problem of specifying the number of clusters, and another for choosing the dimension of the cosine diffusion space. Furthermore, we have improved the single-pass algorithm by using the diffusion approach based on the updating singular value decomposition technique, which is potentially useful in widespread on-line applications that require real-time updating, such as peer-to-peer information retrieval.

In dimensionality reduction, we have supplemented the singular value decomposition, used in feature transformation, with a term selection method based on the generic term extracting algorithm.
This dissertation also thoroughly addressed the impact of term weighting in retrieval based on the latent semantic model, for varying combinations of the local and global weighting functions in addition to the normalization function. The effectiveness of 25 different index term weightings was explored, and the best one, Okapi BM-25, was identified.

7. 2. Limitations
Experimental research inherently has limitations. The work presented in this dissertation is limited in

the following main ways:


The non-availability of free Arabic corpora needed for information retrieval and clustering evaluation, and the considerable human effort required to construct static test collections, such as those used in TREC, were the reasons confining us to an Arabic reference corpus of 1 060 documents, with a total size of 5.34 megabytes, for retrieval, and to English corpora for clustering.
Hardware limitations: most of the time, we only had access to a machine with a 1.80 gigahertz dual-core processor and 1 gigabyte of random access memory.

7. 3. Prospects
There are several directions in which this research can proceed. These directions can be categorized into the following broad areas:
- Automating the generic term extraction algorithm.
- Adapting the generic term extraction algorithm to other kinds of data.
- Applying the diffusion map approach to multimedia data.
- Extending our Arabic reference corpus, and trying to classify its content into non-overlapping groups; this way it could serve for both retrieval and clustering evaluation.
- Improving our system performance by using the results of the noun phrase study undertaken by Boulaknadel [Bou08], and semantic query expansion.
- Implementing a full Arabic search engine based on the studies undertaken in this dissertation and those planned as further work.


Appendix A Natural Language Processing
A.1. Introduction

Document Retrieval is essentially a matter of deciding which documents in a collection should be

retrieved to satisfy a user's need for information. The user's information need is represented by a query

or profile, and contains one or more search terms, in addition perhaps to some additional information

such as importance weights. Hence, the retrieval decision is based on the comparison between the query

terms and the index terms (important words or phrases) appearing in the document itself. The decision

may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document

has to the query.

Unfortunately, the words that appear in documents and in queries often have many morphological

variants. Thus, pairs of terms such as “computing” and “computation” will not be recognized as

equivalent without some form of natural language processing (NLP).

In this appendix, we introduce some of these processes, more precisely those used and mentioned in the thesis. NLP techniques can be classified into two categories: basic techniques and advanced ones.

A.2. Basic Techniques

Because n-grams, tokenization, transliteration, stemming, and stop word removal are less sophisticated than the other natural language processing techniques, they are considered basic techniques.

A.2.1. N-grams

An n-gram is a sub-sequence of n items from a given sequence. It is a popular technique in statistical

natural language processing. For parsing, words are modeled such that each n-gram is composed of n

words. For a sequence of words, (for example “the dog smelled like a skunk”), the 3-grams would be:

“the dog smelled”, “dog smelled like”, “smelled like a”, and “like a skunk”. For sequences of characters,

the 3-grams that can be generated from “good morning” are “goo”, “ood”, “od ”, “d m”, “ mo”, “mor”

and so forth. Some practitioners preprocess strings to remove spaces, others do not. In almost all cases,

punctuation is removed by preprocessing. n-grams can also be used for sequences of words or, in fact,

for almost any type of data.

By converting an original sequence of items to n-grams, it can be embedded in a vector space, thus

allowing the sequence to be compared to other sequences in an efficient manner.
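As a concrete illustration of this construction (a minimal sketch of our own, not code from the thesis), the following Python function generates word and character n-grams as in the examples above:

```python
def ngrams(items, n):
    """Return the list of n-grams (as tuples) of a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word 3-grams of a sentence, as in the example above.
words = "the dog smelled like a skunk".split()
print(ngrams(words, 3))
# [('the', 'dog', 'smelled'), ('dog', 'smelled', 'like'), ...]

# Character 3-grams of a string (spaces kept, punctuation assumed removed).
chars = list("good morning")
print(["".join(g) for g in ngrams(chars, 3)])
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', ...]
```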

A.2.2. Tokenization

Tokenization, or word segmentation, is a fundamental task of almost all NLP systems. In languages


that use word separators in their writing, tokenization seems easy: every sequence of characters between

two white spaces or punctuation marks is a word. This works reasonably well, but exceptions are

handled in a cumbersome way. On the other hand, there are other languages that do not use word

separators, as is the case for the Arabic language. They need much more complicated processing, closer to

morphological analysis or part-of-speech tagging. Tokenizers designed for those languages are generally

very tied to a given system and language.
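As a simple illustration of whitespace-and-punctuation tokenization for languages that use word separators, here is a minimal sketch (a regular-expression tokenizer of our own, not the one used in the thesis):

```python
import re

def simple_tokenize(text):
    """Split a text into word tokens on whitespace and punctuation."""
    # \w+ keeps runs of letters/digits/underscores; punctuation is dropped.
    return re.findall(r"\w+", text, flags=re.UNICODE)

print(simple_tokenize("He reckons the current account deficit will narrow."))
# ['He', 'reckons', 'the', 'current', 'account', 'deficit', 'will', 'narrow']
```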

A.2.3. Transliteration

Transliteration is the practice of transcribing a word or text written in one writing system into

another writing system. It is also the system of rules for that practice.

Technically, from a linguistic point of view, it is a mapping from one system of writing into another.

Transliteration attempts to be exact, so that an informed reader should be able to reconstruct the original

spelling of unknown transliterated words. To achieve this objective transliteration may define complex

conventions for dealing with letters in a source script which do not correspond with letters in a goal

script.

This is opposed to transcription, which maps the sounds of one language to the script of another

language. Still, most transliterations map the letters of the source script to letters pronounced similarly in

the goal script, for some specific pair of source and goal language.

It is not to be confused with translation, which involves a change in language while preserving

meaning. Here we have a mapping from one alphabet into another.

Specifically for the Arabic language, many transliteration systems are utilized, such as: Deutsche Morgenländische Gesellschaft, adopted by the International Convention of Orientalist Scholars in Rome28; ISO/R 233, replaced by ISO 233 in 1984; BS 4280, developed by the British Standards Institute29; and SATTS, a one-to-one mapping to Latin Morse equivalents used by the US military. However, in our work, we have used the Buckwalter transliteration30,31.

The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is

an ASCII only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the

more common romanization schemes that add morphological information not expressed in Arabic script.

Thus, for example, a “و” (waaw) will be transliterated as “w” regardless of whether it is realized as a

vowel [u:] or a consonant [w]. Only when the “و” (waaw) is modified by a “ء” (hamza) “ ؤ ” does the

transliteration change to ‘&’. The unmodified letters are straightforward to read (except for maybe ‘*’ = "ذ" (thaal), ‘E’ = "ع" (ayn), and ‘v’ = "ث" (thaa)), but the transliterations of letters with diacritics and the harakat take some time to get used to; for example, the nunated i‘rab [un], [an], [in] appear as N, F, K, and the sukun (no vowel) as o. "ة" (Ta marbouta) is p.

28 http://www.dmg-web.de, Retrieved on 10-7-2007.
29 http://www.bsi-global.com/index.xalter, Retrieved on 10-7-2007.
30 http://www.qamus.org/transliteration.htm, Retrieved on 10-7-2007.
31 http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/buckwalter-about.html, Retrieved on 10-7-2007.

ا = A, ب = b, ت = t, ث = v, ج = j, ح = H, خ = x, د = d, ذ = *, ر = r, ز = z, س = s, ش = $, ص = S, ض = D, ط = T, ظ = Z, ع = E, غ = g, ف = f, ق = q, ك = k, ل = l, م = m, ن = n, ه = h, و = w, ي = y, ى = Y, ة = p, ء = ', ئ = {, ؤ = &, أ = >, إ = <, آ = |, ـ (tatweel) = _.
Diacritics: fatha = a, damma = u, kasra = i, shadda = ~, sukun = o, fathatan = F, dammatan = N, kasratan = K.

Table A.1. Buckwalter Transliteration.
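As an illustration of how such a strictly one-to-one mapping can be applied, here is a minimal Python sketch; the mapping dictionary below is abbreviated and written by us from Table A.1, and it is not the exact code used in the thesis:

```python
# Abbreviated Arabic-to-Buckwalter map (see Table A.1 for the full scheme).
AR2BW = {
    "\u0627": "A",  # ا alif
    "\u0628": "b",  # ب baa
    "\u062A": "t",  # ت taa
    "\u0643": "k",  # ك kaaf
    "\u0648": "w",  # و waaw
    "\u0624": "&",  # ؤ waaw with hamza
    "\u0629": "p",  # ة taa marbouta
}

def buckwalter(text):
    """Transliterate Arabic characters one-to-one; leave unknown characters unchanged."""
    return "".join(AR2BW.get(ch, ch) for ch in text)

print(buckwalter("\u0643\u062A\u0628"))  # the root k-t-b -> 'ktb'
```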

A.2.4. Stemming

A stemmer is a computer program or algorithm which determines a stem form of a given inflected

(or, sometimes, derived) word form, generally a written word form. The stem need not be identical to

the morphological root of the word; it is usually sufficient that related words map to the same stem, even

if this stem is not in itself a valid root.

A stemmer for English, for example, should identify the string “cats” (and possibly “catlike”, “catty”

etc.) as based on the root “cat”, and “stemmer”, “stemming”, “stemmed” as based on “stem”. English

stemmers are fairly trivial (with only occasional problems, such as “dries” being the third-person

singular present form of the verb “dry”, “axes” being the plural of “axe” as well as “axis”); but

stemmers become harder to design as the morphology, orthography, and character encoding of the target

language becomes more complex. For example, an Italian stemmer is more complex than an English one

(because of more possible verb inflections), a Russian one is more complex (more possible noun

declensions), an Arabic one is even more complex (due to nonconcatenative morphology and a writing

system without vowels), and so on.


Prefixes
  One-character: t, l, A, y, m
  Two-character: Al, bt, tt, yt, lt, mt, wt, st, nt, bm, lm, wm, km, fm, ll, wy, ly, sy, fy, wA, fA, lA, bA
  Three-character: wAl, fAl, bAl

Suffixes
  One-character: p, h, y, A
  Two-character: At, An, tA, tk, ty, th, tm, hm, hn, hA, km, wA, wn, wh, yp, nA, yn, yh

Table A.2. Prefixes and suffixes list (in Buckwalter transliteration).

The Arabic light stemmer32, Darwish's stemmer modified by Larkey [LBC02], used in this work identifies 3 three-character, 23 two-character, and 5 one-character prefixes, and 18 two-character and 4 one-character suffixes that should be removed in stemming. The prefixes and suffixes to be removed are shown in Table A.2.

32 http://www.glue.umd.edu/~kareem/research/, Retrieved on 4-15-2005
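The following is a minimal sketch of such prefix/suffix stripping, written by us in Python over Buckwalter-transliterated tokens; the affix lists are abbreviated from Table A.2, and details such as the minimum stem length are our own assumptions rather than the exact behaviour of the stemmer used in the thesis:

```python
# Abbreviated affix lists in Buckwalter transliteration (see Table A.2).
PREFIXES = ["wAl", "fAl", "bAl", "Al", "wA", "fA", "bA", "lA", "t", "l", "A", "y", "m"]
SUFFIXES = ["At", "An", "hA", "hm", "hn", "wn", "yn", "yp", "p", "h", "y", "A"]

def light_stem(token, min_len=3):
    """Strip at most one prefix and one suffix, keeping the stem at least min_len long."""
    for pre in PREFIXES:                      # longer prefixes are tried first
        if token.startswith(pre) and len(token) - len(pre) >= min_len:
            token = token[len(pre):]
            break
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= min_len:
            token = token[:-len(suf)]
            break
    return token

print(light_stem("wAlktAb"))  # strips the 'wAl' prefix -> 'ktAb' (illustrative)
```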

A.2.5. Stop Words

Stop words are those words which are so common that they are useless to index or use in search


engines or other search indexes. Usually articles, adverbials, or adpositions are stop words. In Arabic, some obvious stop words would be "from" "من" (min), "to" "إلى" (ailaa), "he" "هو" (huwa), and "she" "هي" (hiya).

It should be noted that there is no definitive list of stop words, as they can depend on the purpose of the search. Full phrase searches, for example, would not want words removed. Also, if the search uses a stemming algorithm, then many words may not be needed in that search's stop list.

A.3. Advanced Techniques

Because anaphora resolution, chunking, lexical acquisition, lemmatization, noun phrase (NP) extraction, part-of-speech (POS) tagging, phrase name identification, root extraction, sentence parsing, synonym expansion, and word sense disambiguation require text structure analysis, they are considered advanced NLP techniques. In the following, we recall the definitions of a root, POS tagging, chunking, and NP extraction.

A.3.1. Root

The root is the primary lexical unit of a word, which carries the most significant aspects of semantic

content and cannot be reduced into smaller constituents. Content words in nearly all languages contain,

and may consist only of, root morphemes. However, sometimes the term root is also used to describe the

word minus its inflectional endings, but with its lexical endings in place. For example, chatters has the

inflectional root or lemma chatter, but the lexical root chat. Inflectional roots are often called stems, and

a root in the stricter sense may be thought of as a monomorphemic stem.

Roots can be either free morphemes or bound morphemes. Root morphemes are essential for

affixation and compounds.

The root of a word is a unit of meaning (morpheme) and, as such, it is an abstraction, though it can

usually be represented in writing as a word would be. For example, it can be said that the root of the

English verb form running is run, or the root of the French verb accordera is accorder, since those words

are clearly derived from the root forms by simple suffixes that do not alter the roots in any way. In

particular, English has very little inflection, and hence a tendency to have words that are identical to

their roots. But more complicated inflection, as well as other processes, can obscure the root; for

example, the root of mice is mouse (still a valid word), and the root of interrupt is, arguably, rupt, which

is not a word in English and only appears in derivational forms (such as disrupt, corrupt, rupture, etc.).

The root rupt is written as if it were a word, but it's not.

This distinction between the word as a unit of speech and the root as a unit of meaning is even more

important in the case of languages where roots have many different forms when used in actual words, as


is the case in Semitic languages. In these, roots are formed by consonants alone, and different words

(belonging to different parts of speech) are derived from the same root by inserting vowels. For

example, in Arabic, the root "كتب" 'ktb' represents the idea of writing, and from it we have "كَتَبَ" 'kataba' "he wrote", and "كُتِبَ" 'kutiba' "has been written", along with other words such as "كُتُبٌ" 'kutubN' "books".

A.3.2. POS Tagging

Part-of-speech (POS) tagging is the annotation of words with the appropriate POS tags based on the

context in which they appear. POS tags divide words into categories based on the role they play in the

sentence in which they appear. POS tags provide information about the semantic content of a word

(“Did he cross the desert?” vs. “Did he desert the army?”). Nouns usually denote “tangible and

intangible things”, whereas prepositions express relationships between “things”. Most POS tag sets

make use of the same basic categories. The most common set of tags contains seven different tags

(Article, Noun, Verb, Adjective, Preposition, Number, and Proper Noun). Currently the most widely

used tag sets are those for the Penn Tree Bank33 (45 tags) and for the British National Corpus34 (BNC

Enriched Tagset also known as the C7 Tagset).

Most tagging algorithms fall into one of two classes: Rule-based taggers and Stochastic taggers.

Rule-based taggers generally involve a large database of hand-written disambiguation rules which

specify, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner;

while, stochastic taggers generally resolve tagging ambiguities by using a training corpus, to compute

the probability of a given word having a given tag in a given context. However, the Transformation

based tagger or the Brill tagger shares features of both tagging architectures. Like the rule-based tagger,

it is based on rules which determine when an ambiguous word should have a given tag. Like the

stochastic taggers, it has a machine-learning component: the rules are automatically induced from a

previously tagged training corpus.
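As an illustration only (not part of the thesis), a pre-trained stochastic tagger can be invoked in a few lines; the sketch below assumes the NLTK library and its default English tagging model are available:

```python
import nltk

# Assumes the tokenizer and tagger resources have been installed,
# e.g. via nltk.download(...).
sentence = "Did he desert the army?"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Did', 'VBD'), ('he', 'PRP'), ('desert', 'VB'), ('the', 'DT'), ('army', 'NN'), ('?', '.')]
```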

A.3.3. Chunking

Text chunking (“light parsing”) is an analysis of a sentence which subsumes a range of tasks. The

simplest is finding ‘noun groups’ or ‘base NPs’. More ambitious systems may add additional chunk

types, such as verb groups, or may seek a complete partitioning of the sentence into chunks of different

types. But they do not specify the internal structure of these chunks, nor their role in the main sentence.

The following example identifies the constituent groups of the sentence “He reckons the current

account deficit will narrow to only $1.8 billion in September”: [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only $1.8 billion] [PP in] [NP September].

33 http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html, Retrieved on 10-1-2007.
34 http://www.natcorp.ox.ac.uk/docs/c7spec.html, Retrieved on 10-1-2007.

Researchers focused on chunking task apply grammar-based methods, combining lexical data with

finite state or other grammar constraints, while others work on inducing statistical models either directly

from the words or from automatically assigned part-of-speech classes.

A.3.4. Noun Phrase Extraction

Noun phrase extraction is the continuation of a chain of finite-state tools which includes a tokenizer, a part-of-speech tagger, and a chunker. This step is used in order to identify phrases whose head is a noun or a pronoun, optionally accompanied by a set of modifiers. The modifiers may be:
- determiners: articles (the, a), demonstratives (this, that), numerals (two, five, etc.), possessives (my, their, etc.), and quantifiers (some, many, etc.); in English, determiners are usually placed before the noun;
- adjectives (the red ball);
- complements, in the form of an adpositional phrase (such as: the man with a black hat) or a relative clause (the books that I bought yesterday).
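As an illustration of chunk-based noun phrase extraction (not the noun phrase study of the thesis), the following sketch uses NLTK's regular-expression chunker over POS-tagged tokens; the grammar is a simple pattern of our own:

```python
import nltk

# A simple NP pattern: optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

sentence = "the man with a black hat bought the red ball"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunker.parse(tagged)

# Collect the word sequences of all NP subtrees.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)  # e.g. ['the man', 'a black hat', 'the red ball']
```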


Appendix B
Weighting Schemes' Notations

Each weighting scheme can be decomposed into three steps: a local, a global, and a normalization step. For all measures we use the following symbols:

f_ij: term frequency, the number of times term i appears in document j.
df_i: document frequency, the number of documents in which term i occurs.
gf_i: global frequency, the total number of times term i occurs in the whole collection.
p_ij = f_ij / gf_i.
N: the number of documents in the collection.
M: the number of terms in the collection.

Local component for term i in document j, w_local(i, j):
  n   None, no change            f_ij
  b   Binary                     1
  l   Natural log                log(f_ij + 1)

Global component for term i, w_global(i):
  n   None, no global change     1
  t   Idf (inverse document frequency)   log2(N / df_i)
  N   Normal                     1 / sqrt( sum_j f_ij^2 )
  G   GfIdf                      gf_i / df_i
  E   Entropy                    1 + sum_j [ p_ij log(p_ij) / log(N) ]
  EG  Global Entropy             log( 1 - sum_j [ p_ij log(p_ij) / log(N) ] )
  ES  Shannon Entropy            - sum_j p_ij log(p_ij)
  E1  1 - Entropy                1 - sum_j [ p_ij log(p_ij) / log(N) ]

Normalization component for document j, w_norm(j):
  n   None, no normalization     1
  c   Cosine                     1 / sqrt( sum_{i=1..M} ( w_local(i, j) · w_global(i) )^2 )

Table B.1. List of term weighting components.
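To illustrate how a weighting scheme is assembled from these three components, here is a small Python/NumPy sketch of the "l t c" combination of Table B.1 (natural-log local weight, idf global weight, cosine normalization) applied to a term-by-document matrix; it is our own illustration, not the exact code of the thesis:

```python
import numpy as np

def ltc_weight(F):
    """F is an M x N term-by-document matrix of raw frequencies f_ij."""
    M, N = F.shape
    local = np.log(F + 1.0)                       # l: natural log local weight
    df = np.count_nonzero(F, axis=1)              # df_i: documents containing term i
    glob = np.log2(N / np.maximum(df, 1))         # t: inverse document frequency
    W = local * glob[:, None]                     # combine local and global components
    norms = np.sqrt((W ** 2).sum(axis=0))         # c: cosine normalization per document
    return W / np.maximum(norms, 1e-12)

F = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]], dtype=float)
print(ltc_weight(F))
```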


Appendix C
Evaluation Metrics

C.1. Introduction

Evaluation metrics are effective tools for evaluating many criteria, such as the execution efficiency,

the storage efficiency, the performance or the effectiveness of a system.

Execution efficiency measures the time taken by a system to perform a computation. Storage

efficiency is measured by the number of bytes needed to store data. However, effectiveness is the most

common measure used in experimentations.

In what follows, we introduce the common qualitative measures used in retrieval and clustering evaluation tasks that we have utilized in our work.

C.2. IR Evaluation Metrics

Information retrieval is devoted to finding “relevant” documents, not finding simple matches to patterns. Yet, often when information retrieval systems are evaluated, they are found to miss numerous relevant documents. Moreover, users have become complacent in their expectation of accuracy of information retrieval systems.

In Figure C.1, we illustrate the critical document categories that correspond to any issued query. Namely, in the collection there are documents which are retrieved, and there are documents which are relevant. In a perfect system, these two sets would be equivalent: we would only retrieve relevant documents. In reality, systems retrieve many non-relevant documents. To measure effectiveness, two ratios are used: precision and recall, denoting respectively purity and completeness.

Figure C.1. The computation of Recall and Precision. (The figure shows the document collection, with the retrieved set and the relevant set overlapping in the retrieved-and-relevant documents.)

C.2.1. Precision

Precision is the ratio of the number of relevant documents retrieved to the total number retrieved.

Page 143: Information Retrieval

Evaluation Metrics

___________________________________________________________________________________________

Fadoua Ataa Allah’s Thesis

125

Precision provides an indication of the quality of the answer set. However, this does not consider the total number of relevant documents. A system might have good precision by retrieving ten documents and finding that nine are relevant (a 0.9 precision), but the total number of relevant documents also matters. If there were only nine relevant documents, the system would be a huge success; however, if millions of documents were relevant and desired, this would not be a good result set.

Precision = (number of relevant and retrieved documents) / (total number of retrieved documents)

C.2.2. Recall

Recall considers the total number of relevant documents. It is the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant. When the total number of relevant documents in the collection is unknown, an approximation of the number is obtained.

Recall = (number of relevant and retrieved documents) / (total number of relevant documents in the collection)
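A minimal sketch of these two ratios for a single query, written by us in Python (the variable names are ours):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query from two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant and retrieved documents
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5]))  # (0.5, 0.666...)
```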

Precision and Recall are two complementary measures of retrieval performance. For a particular

query, it is usually possible to sacrifice one so as to boost the other. For example, lowering the retrieval

criteria so that more documents are retrieved will most likely increase the Recall rate; however, at the same time, this strategy will also probably admit many more non-relevant documents into the retrieval result, with the likely consequence of decreasing the Precision rate, and vice versa, as illustrated in

Figure C.2. Therefore it is usually recommended that a balance between these two measures be sought

for users' best needs.

Figure C.2. The Precision-Recall trade-off. (The figure plots precision against recall: the ideal returns only relevant documents; a system may return relevant documents but miss many useful ones, or return most relevant documents but include lots of junk.)

For IR effectiveness, precision and recall are used together but in different ways. For example,


Precision at n measures the precision after a fixed number of documents have been retrieved, while precision at specific recall levels is the precision after a given fraction of the relevant documents has been retrieved. Another, and the most commonly reported, measure is the interpolated precision-recall curve, showing the interaction between precision and recall.

C.2.3. Interpolated Recall-Precision Curve

The Interpolated Recall-Precision curve (IRP curve) is one of the standard TREC performance evaluation measures35, developed to enable averaging and performance comparison between different systems [SaM83]. This measure combines precision and recall to produce a single metric of retrieval effectiveness. It depicts how precision changes over a range of recall (usually from 0 to 1 in increments of 0.1).

Mathematically, an N-point IRP curve is drawn by connecting points generated by the following

formula in the order of i (0 ≤ i ≤ N-1):

(R_i, P_i),  with  R_i = i / (N - 1)  and  P_i = max { Precision(j) : 1 ≤ j ≤ m, Recall(j) ≥ R_i },    (1)

where m is the total number of retrieved documents, R_i = i / (N - 1) is the given recall at the ith rank, and P_i is the interpolated precision at that recall, i.e. the maximum value of Precision(j) with j ranging from 1 to m such that Recall(j) is no less than the given recall R_i. The functions Recall(j) and Precision(j) are defined as follows: given the list of the m retrieved documents ranked in descending order of their relevancy scores, n the total number of relevant documents in the collection, and Relevant(j) the number of relevant documents among the top j ranked documents,

Recall(j) = Relevant(j) / n    and    Precision(j) = Relevant(j) / j.

Note that while precision is not defined at a recall of 0, this interpolation rule does define an interpolated value for the recall level 0.

Derived from the IRP curve, a single numerical value T, denoting the area covered between this curve and the horizontal axis (the axis of Recall), may be used to crudely estimate the overall retrieval performance of a particular query (Figure C.3). In other words, this single value T (called the Average Precision, AP) indicates the average interpolated precision over the full range (i.e. between 0 and 1) of recall for a particular query.

35 http://trec.nist.gov/pubs/trec10/appendices/measures.pdf.

Figure C.3. Interpolated Recall-Precision Curve. (The figure plots interpolated precision against recall from 0 to 1, with T the area under the curve.)

The average precisions of multiple query results are combined by taking their mean. The resulting mean is called the Mean Interpolated Average Precision (MIAP).
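The following Python sketch (our own, with hypothetical variable names) computes the 11-point interpolated precisions, the average precision T of formula (1), and the MIAP over several queries:

```python
def interpolated_average_precision(ranked_relevance, n_relevant, n_points=11):
    """ranked_relevance: list of booleans, True if the j-th retrieved document is relevant."""
    hits, recalls, precisions = 0, [], []
    for j, is_rel in enumerate(ranked_relevance, start=1):
        hits += is_rel
        recalls.append(hits / n_relevant)             # Recall(j)
        precisions.append(hits / j)                   # Precision(j)
    interp = []
    for i in range(n_points):
        r = i / (n_points - 1)                        # the given recall level R_i
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / n_points                     # area T, the average interpolated precision

# MIAP: mean of the average precisions of several queries.
queries = [([True, False, True, True], 4), ([False, True], 3)]
miap = sum(interpolated_average_precision(rel, n) for rel, n in queries) / len(queries)
print(miap)
```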

C.3. Clustering Evaluation

As for IR systems, the evaluation of a document clustering algorithm usually measures its effectiveness rather than its efficiency, by comparing the clusters it produces with a “ground truth” consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Generally, to evaluate a single cluster, purity and entropy are used, while accuracy and mutual information are used for the entire clustering [SGM00], [Erk06].

C.3.1. Accuracy

Accuracy (Acc) is the degree of veracity. It is closely related to precision, which is also called reproducibility or repeatability because it is the degree to which further measurements or calculations will show the same or similar results.

The results of calculations or a measurement can be accurate but not precise; precise but not accurate; neither; or both. A result is called valid if it is both accurate and precise.

Mathematically, the accuracy is defined as follows:

Let l_i be the label assigned to document d_i by the clustering algorithm, and α_i be d_i's actual label in the corpus. Then, accuracy is defined as

Acc = (1/n) Σ_{i=1}^{n} δ(map(l_i), α_i),    where δ(x, y) = 1 if x = y, and δ(x, y) = 0 otherwise,

n being the total number of documents. map(l_i) is the function that maps the output label set of the clustering algorithm to the actual label set of


the corpus. Given the confusion matrix of the output, a best such mapping function can be efficiently

found by Munkres's algorithm [Mun57].
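A small Python sketch of this accuracy computation, using SciPy's implementation of the Munkres (Hungarian) assignment algorithm to find the best cluster-to-class mapping; the example labels are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred, true):
    """pred, true: integer label arrays of the same length n."""
    pred, true = np.asarray(pred), np.asarray(true)
    k = max(pred.max(), true.max()) + 1
    confusion = np.zeros((k, k), dtype=int)          # confusion[l, a] = #docs with (pred=l, true=a)
    for l, a in zip(pred, true):
        confusion[l, a] += 1
    rows, cols = linear_sum_assignment(-confusion)   # Munkres algorithm on the negated counts
    return confusion[rows, cols].sum() / len(pred)

print(clustering_accuracy([0, 0, 1, 1, 1], [1, 1, 0, 0, 1]))  # 0.8
```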

C.3.2. Mutual Information

Mutual Information (MI) is a symmetric measure of the degree of dependency between the clustering and the categorization. If the cluster and the class are independent, neither of them contains information about the other, and their mutual information is equal to zero. Formally, the MI metric does not require a mapping function, and it is generally used because it successfully captures how related the labelings and the categorizations are, without a bias towards smaller clusters.

If L = {l_1, l_2, ..., l_k} is the output label set of the clustering algorithm, and A = {α_1, α_2, ..., α_k} is the categorization set of the corpus, with the underlying assignments of documents to these sets, the MI of these two sets is defined as:

MI(L, A) = Σ_{l_i ∈ L, α_j ∈ A} P(l_i, α_j) · log_2 [ P(l_i, α_j) / ( P(l_i) · P(α_j) ) ],

where P(l_i) and P(α_j) are the probabilities that a document is labeled as l_i and α_j by the algorithm and in the actual corpus, respectively, and P(l_i, α_j) is the probability that these two events occur together. These values can be derived from the confusion matrix. We map the MI metric to the [0, 1] interval by normalizing it with the maximum possible MI that can be achieved with the corpus. The normalized MI is defined as MI_norm(L, A) = MI(L, A) / MI(A, A).
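A short Python sketch of this normalized mutual information, computed directly from the two label vectors (our own illustration):

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """MI between two labelings, in bits, estimated from their joint distribution."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    mi = 0.0
    for l in np.unique(labels_a):
        for a in np.unique(labels_b):
            p_joint = np.mean((labels_a == l) & (labels_b == a))
            if p_joint > 0:
                p_l, p_a = np.mean(labels_a == l), np.mean(labels_b == a)
                mi += p_joint * np.log2(p_joint / (p_l * p_a))
    return mi

def normalized_mi(pred, true):
    """Map MI to [0, 1] by dividing by the maximum achievable MI on this corpus, MI(A, A)."""
    return mutual_information(pred, true) / mutual_information(true, true)

print(normalized_mi([0, 0, 1, 1, 1], [1, 1, 0, 0, 1]))
```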


Appendix D
Principal Angles

The concept of principal angles (sometimes called canonical angles) allows one to characterize or measure, in a natural way, how two subspaces differ, by generalizing the notion of an angle between two lines to higher-dimensional subspaces of R^d [BjG73, GoV89, Arg03].

Recall that for two non-zero vectors x, y ∈ R^d, the acute angle between them is denoted by ∠(x, y) = arccos( |⟨x, y⟩| / (‖x‖ ‖y‖) ), and by definition 0 ≤ ∠(x, y) ≤ π/2.

Consider F and G, two subspaces of R^d. A set of angles between these two subspaces, called the principal (or canonical) angles, can be defined recursively. Let two real-valued matrices F and G be given, each with d rows, and their corresponding column spaces F and G, which are subspaces of R^d. Assume that

p = dim(F) ≥ dim(G) = q ≥ 1.

Then the principal angles θ_l ∈ [0, π/2] between F and G may be defined recursively for l = 1, 2, ..., q by

cos θ_l = max_{f ∈ F} max_{g ∈ G} f^T g,   subject to the constraints ‖f‖ = 1, ‖g‖ = 1, f^T f_j = 0, g^T g_j = 0, j = 1, 2, ..., l-1.

The vectors (f_1, ..., f_q) and (g_1, ..., g_q) are called the principal vectors of the pair of subspaces. Intuitively, θ_1 is the angle between the two closest unit vectors f_1 ∈ F and g_1 ∈ G; θ_2 is the angle between the two closest unit vectors f_2 ∈ F and g_2 ∈ G such that f_2 and g_2 are, respectively, orthogonal to f_1 and g_1. Continuing in this manner, always searching in subspaces orthogonal to the principal vectors that have already been found, the complete set of principal angles and principal vectors is obtained.

The average cosine of the principal angles between the subspaces F and G is written as (1/q) Σ_{l=1}^{q} cos θ_l. For algorithms to compute the principal angles, see [BjG73, Arg03].
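One standard way to compute the principal angles, following [BjG73], is via the singular values of Q_F^T Q_G, where Q_F and Q_G are orthonormal bases of the two column spaces; the cosines of the principal angles are exactly those singular values. A small NumPy sketch of this computation (our own illustration, not code from the thesis):

```python
import numpy as np

def principal_angles(F, G):
    """Principal angles (in radians) between the column spaces of F (d x p) and G (d x q)."""
    QF, _ = np.linalg.qr(F)                  # orthonormal basis of span(F)
    QG, _ = np.linalg.qr(G)                  # orthonormal basis of span(G)
    sigma = np.linalg.svd(QF.T @ QG, compute_uv=False)
    sigma = np.clip(sigma, -1.0, 1.0)        # guard against rounding outside [-1, 1]
    return np.arccos(sigma)                  # cos(theta_l) are the singular values

F = np.random.rand(50, 3)
G = np.random.rand(50, 2)
angles = principal_angles(F, G)
print(angles, np.mean(np.cos(angles)))       # the angles and their average cosine
```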


References

[AaE99] K. Aas, and L. Eikvil, ‘Text Categorisation: A Survey’, Technical Report, June 1999, Norwegian

Computing Center.

[Abd04] A. Abdelali, ‘Localization in Modern Standard Arabic’, Journal of the American Society for

Information Science and Technology, Vol. 55, No.1 (2004), pp. 23-28.

[ACS04] A. Abdelali, J. Cowie, and H. Soliman, ‘Arabic Information Retrieval Perspectives’, In Proceedings

of JEP-TALN 2004 Arabic Language Processing, Fez , Morocco, April, 2004.

[Abd87] A. Abdul-Al-Aal, An-Nahw Ashamil, Maktabat Annahda Al-Masriya, Cairo, Egypt, 1987.

[AAE99] H. Abu-Salem, M. Al-omari, and M. Evens, ‘Stemming methodologies over individual query words

for an Arabic information retrieval system’, Journal of the American Society for Information Science,

Vol. 50, No. 6 (1999), pp. 524-529.

[AMC05] Z. Abu Bakar, M. Mat Deris, and A. Che Alhadi, ‘Performance Analysis of Partitional and

Incremental Clustering’, Seminar Nasional Aplikasi Teknologi Informasi 2005, Yogyakarta,

Indonesia, June, 2005.

[AGG98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, ‘Automatic Subspace Clustering of High

Dimensional Data for Data Mining Applications’, In Proceedings of the ACM International

Conference on the Management of Data, June, 1998, pp. 94-105.

[Ala90] M.A. Al-Atram, ‘Effectiveness of Natural Language in Indexing and Retrieving Arabic Documents’,

King Abdulaziz City for Science and Technology Project number AR-8-47, Riyadh, Saudi Arabia,

1990.

[AlA89] S. S. Al-Fedaghi, and F. S. Al-Anzi, ‘A New Algorithm to Generate Arabic Root-Pattern Forms’, In

Proceedings of the 11th National Computer Conference, King Fahd University of Petroleum &

Minerals, Dhahran, Saudi Arabia, 1989, pp. 391-400.

[Alg87] M. Al-Gasimi, ‘Arabization of the MINISIS System’, In Proceedings of the First King Saud University

Symposium on Computer Arabization, Riyadh. Saudi Arabia, April, 1987, pp. 13-26.

[AlF02] M. Al-Jlayl, and O. Frieder, ‘On Arabic search: Improving the retrieval effectiveness via light

stemming approach’, In Proceedings of the 11th ACM International Conference on Information and

Knowledge Management, 2002, pp. 340-347.

[Alka91] I.A. Al-Kharashi, ‘MICRO-AIRS: A Microcomputer-based Arabic Information Retrieval System

Comparing Words, Stems, and Roots as Index Terms’, Ph.D. Dissertation, Illinois Institute of

Technology, Illinois, USA, 1991.


[AlE94] I.A. Al-Kharashi, and M. W. Evens, ‘Comparing Words, Stems, and Roots as Index Terms in an

Arabic Information Retrieval System, Journal of the American Society for Information Science, Vol.

45, No. 8 (1994), pp. 548-560.

[Alku91] M. Al-Khuli, A dictionary of theoretical linguistics: English-Arabic with an Arabic-English glossary,

Library of Lebanon, Beirut, Lebanon, 1991.

[Als99] M. Al-Saeedi, Awdah Almasalik ila Alfiyat Ibn Malek, Dar ihyaa al oloom, Beirut, Lebanon, 1999.

[Als96] R. Al-Shalabi, Design and Implementation of an Arabic Morphological System to Support Natural

Language Processing, Ph.D. Dissertation, Computer Science, Illinois Institute of Technology, Chicago,

1996.

[AlA04a] I.A. Al-Sughaiyer, and I.A. Al-Kharashi, ‘Arabic Morphological Analysis Techniques: a

Comprehensive Survey’, Journal of the American Society for Information Science and Technology,

Vol.55, No. 3, February, 2004, pp.189-213.

[AlA04b] L. Al-Sulaiti, and E. Atwell, ‘Designing and Developing a Corpus of Contemporary Arabic’, In

Proceedings of the 6th Teaching and Language Corpora Conference, Granada, Spain, 2004, pp.92.

[AlA05] L. Al-Sulaiti, and E. Atwell, ‘Extending the Corpus of Contemporary Arabic’, In Proceedings of

Corpus Linguistics Conference, Vol. 1, No. 1 (2005), pp. 15-24.

[All07] M. P. Allen, ‘The t test for the simple regression coefficient’, Chapter in Understanding Regression

Analysis, Springer US, 1997, pp. 66-70.

[Ama02] M. Amar, Les Fondements théoriques de l’indexation: une approche linguistique, ADBS editions,

Paris, France, 2000.

[AmR02] G. Amati, and C. J. Van Rijsbergen, ‘Probabilistic Models of Information Retrieval based on

Measuring the Divergence from Randomness’, ACM Transactions on Information Systems (TOIS),

Vol. 20, No. 4 (2002), pp. 357-389.

[Arg03] M.E. Argentati, Principal Angles between Subspaces as Related to Rayleigh Quotient and Rayleigh

Ritz Inequalities with Applications to Eigenvalue Accuarcy and an Eigenvalue Solver, Ph.D.

Dissertation, University of Colorado, USA, 2003.

[Ars04] A. Arshad, ‘Beyond Concordance Lines: Using Concordances to Investigating Language

Development’, Internet Journal of e-Language Learning and Teaching, Vol. 1, No. 1 (2004), pp. 43-

51.

[ABE05] F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, ‘Amélioration de la performance de

l’analyse sémantique latente pour des corpus de petite taille’, Revue des Nouvelles technologies de

l’Information (RNTI), Vol. 1 (2005), pp. 317.

[ABE06] F. Ataa Allah, S. Boulaknadel, A. El Qadi, and D. Aboutajdine, ‘Arabic Information Retrieval System

Based on Noun Phrases’, Information and Communication Technologies, Vol. 1, No. 24-28, Damask,


Syria, April, 2006, pp. 1720 - 1725.

[ABE08] F. Ataa Allah, S. Boulaknadel A. El Qadi, and D. Aboutajdine, ‘Evaluation de l’Analyse Sémantique

Latente et du Modèle Vectoriel Standard Appliqués à la Langue Arabe’, Revue de Technique et

Science Informatiques, sent on February 2006, accepted on January 2007, and to appear on 2008.

[Att00] A. M. Attia, ‘A Large-Scale Computational Processor of the Arabic Morphology, and Applications’,

A Master’s Thesis, Faculty of Engineering, Cairo University, Cairo, Egypt, 2000.

[BaH76] F. B. Backer, and L. J. Hubert, ‘A Graphtheoretic Approach to Goodness-of-Fit in Complete-Link

Hierarchical Clustering’, Journal of the American Statistical Association, Vol. 71, 1976, pp. 870-878.

[BCB92] B. T. Bartell, G. W. Cottrell, and R. K. Belew, ‘Latent Semantic Indexing is an Optimal Special Case

of Multidimensional Scaling’, Proceedings of the 15th Annual International ACM SIGIR Conference

on Research and Development in Information retrieval, 1992, pp. 161-167.

[BeK03] J. Becker, and D. Kuropka, ‘Topic-based Vector Space Model’, In Proceedings of the 6th

International Conference on Business Information Systems, Colorado Springs, June, 2003, pp. 7-12.

[Bec59] M. Beckner, The Biological Way of Thought, Columbia University Press, New York, 1959.

[Bee96] K. R. Beesley, ‘Arabic finite-state Morphological Analysis and Generation’ In Proceedings of the 16th

International Conference on Computational Linguistics (COLING-96), Vol. 1, pp. 89-94, 1996.

[BeC87] N. Belkin and W. B. Croft, ‘Retrieval Techniques’, In M. Williams, editor, Annual Review of

Information Science and Technology (ARIST), Vol. 22, Chap. 4. Elsevier Science Publishers B.V.,

1987, pp. 109-145.

[BeN03] M. Belkin and P. Niyogi, ‘Laplacian Eigenmaps for Dimensionality Reduction and Data

Representation’, Neural Computation, Vol. 6, No. 15, 2003, pp. 1373-1396.

[BeB99] M. W. Berry, and M. Browne, Understanding Search Engines: Mathematical Modeling and Text

Retrieval, Siam Book Series: Software, Philadelphia, 1999.

[BDJ99] M. W. Berry, Z. Drmac, and E. R. Jessup, ‘Matrices, Vector Spaces, and Information Retrieval’,

Society for Industrial and Applied Mathematics Review, Vol. 41, No. 2 (1999), pp. 335-362.

[BDO95] M. W. Berry, S. T. Dumais, and G. W. O'Brien, ‘Using Linear Algebra for Intelligent Information

Retrieval’, Society for Industrial and Applied Mathematics Review, Vol. 37, No. 4 (1995), pp. 573-

595.

[BeF96] M. W. Berry, and R. D. Fierro, ‘Low-Rank Orthogonal Decompositions for Information Retrieval

Applications’, Numerical Linear Algebra with Applications, Vol. 3, No. 4 (1996), pp. 301-328.

[BjG73] A. Bjorck, and G. Golub, ‘Numerical Methods for Computing Angles between Linear Subspaces’,

Journal of Mathematics of Computation, Vol. 27, No. 123 (1973), pp. 579-594.

[Bla06] A. Blansché. Classification non Supervisée avec Pondération D’attributs par des Méthodes


Evolutionnaires, Ph.D. Dissertation, Louis Pasteur University- Strasbourg I, September, 2006.

[BlL97] A. Blum, and P. Langley, ‘Selection of Relevant Features and Examples in Machine Learning’,

Journal of Artificial Intelligence, Vol. 97 (1997), pp. 245-271.

[Boo80] A. Bookstein, ‘Fuzzy Requests: An Approach to Weighted Boolean Searches’, Journal of the

American Society for Information Science, Vol. 31 (1980), pp. 240-247.

[BoG97] I. Borg, and P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer-

Verlag, New York, USA, 1997.

[BoA05] S. Boulaknadel, and F. Ataa Allah, ‘Recherche d’Information en Langue Arabe : Influence des

Paramètres Linguistiques et de Pondération de LSA’, In Actes des Rencontres des Etudiants

Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL), Paris

Dourdan, Vol. 1 (2005), pp. 643-648.

[Bou08] S. Boulaknadel, Recherche d'Iinformation en Langue Arabe, Ph.D. Dissertation, Mohamed V

University, Morocco, 2008.

[BuH29] H. G. Buchaman-Wollaston, and W. G. Hodgeson, ‘A New Method of Treating Frequency Curves in

Fishery Statistics, with some Results’, Journal of the International Council for the Exploration of the

Sea, Vol. 4 (1929), pp. 207-225.

[BuK81] D. Buell, and D. H. Kraft, ‘Threshold Values and Boolean Retrieval Systems’, Journal of Information

Processing and Management, Vol. 17, No. 3 (1981), pp. 127-36.

[Can93] F. Can, ‘Incremental Clustering for Dynamic Information Processing’, ACM Transactions on

Information Processing Systems, Vol. 11 (1993), pp. 143-164.

[CaD90] F. Can, and N.D. Drochak II, ‘Incremental Clustering for Dynamic Document Databases’, In

Proceeding of the 1990 Symposium on Applied Computing, 1990, pp. 61-67.

[Cha94] B.B. Chaudhri, ‘Dynamic Clustering for Time Incremental Data’, Pattern Recognition Letters, Vol.

15, No. 1 (1994), pp. 27-34.

[Chu97] F.R.K. Chung, ‘Spectral Graph Theory’, Conference Board of the Mathematical Sciences Conference

Regional Conference Series in Mathematics, May, 1997, No. 92.

[ChH89] K. Church and P. Hanks, ‘Word Association Norms, Mutual Information, and Lexicography’, In

Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989, pp.

76-83.

[CRJ03] J. Clech, R. Rakotomalala, and R. Jalam, ‘Sélection multivariée de termes’, In Proceedings of the 35th

Journées de Statistiques, Lyon, France, 2003, pp. 933-936.

[CoL06a] R.R. Coifman and S. Lafon, ‘Diffusion Maps,’ Applied and Computational Harmonic Analysis, Vol.

21, No. 1 (2006), pp. 6-30.


[CoL06b] R.R. Coifman and S. Lafon, ‘Geometric Harmonics: A Novel Tool for Multiscale Out-of-Sample

Extension of Empirical Functions’, Applied and Computational Harmonic Analysis, Vol. 21, No. 1

(2006), pp. 31-52.

[CLL05] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker, ‘Geometric

Diffusions as a Tool for Harmonics Analysis and Structure Definition of Data: Diffusion Maps,’

Proceedings of the National Academy of Sciences, Vol. 102, No. 21 (2005), pp. 7426-7431.

[Com94] P. Comon, ‘Independent component analysis, A new concept?’, In Proceeding of Signal Processing,

Vol. 36, No. 3 (1994), pp. 287-314.

[CLR98] F. Crestani, M. Lalmas, C. J. Van Rijsbergen, and I. Campbell, ‘“Is this Document Relevant? …

Probably”: A Survey of Probabilistic Models in Information Retrieval’, ACM Computing Surveys

(CSUR), Vol. 30, No. 4 (1998), pp. 528-552.

[Cro77] W.B. Croft, ‘Clustering large files of documents using the single link method’, Journal of the

American Society for Information Science, Vol. 28 (1977), pp. 341-344.

[Cro72] D. Crouch, ‘A clustering algorithm for large and dynamic document collections’, Ph.D. Dissertation,

Southern Methodist University, 1972.

[CuW85] J.K. Cullum, and R.A. Willoughby, Lanczos algorithms for large symmetric eigenvalue computations

– Vol. 1 Theory, (Chapter 5: Real rectangular matrices), Brikhauser, Boston, 1985.

[DaL97] M. Dash, and H. Liu, ‘Feature Selection for Classification’, Journal of Intelligent Data Analysis, Vol.

1, No. 1-4 (1997), pp. 131-156.

[Dar02] K. Darwish, ‘Building a Shallow Arabic Morphological Analyzer in One Day’, In Proceedings of the

Association for Computational Linguistics, 2002, pp. 47-54.

[Dar03] K. Darwish, Probabilistic Methods for Searching OCR-Degraded Arabic Text, Doctoral Dissertation,

University of Maryland, College Park, Maryland, 2003.

[DDJ01] K. Darwish, D. Doermann, R. Jones, D. Oard, and M. Rautiainen, ‘TREC-10 Experiments at

Maryland: CLIR and Video’, In Proceedings of the 2001 Text Retrieval Conference National Institute

of Standards and Technology, November, 2001, pp. 552.

[Dat71] R.T. Dattola, ‘Experiments with a fast clustering algorithm for automatic classification’, In The

SMART Retrieval System-Experiments in Automatic Document Processing, G. Salton Edition,

Prentice-Hall, Englewood Cliffs, New Jersey, 1971, Chap. 12.

[DaB79] D.L. Davies, and D.W. Bouldin, ‘A Cluster Separation Measure’, IEEE Transactions on Pattern

Analysis and Machine Intelligence, Vol. 1, No. 2 (1979), pp. 224-227.

[DDF89] S. Deerwester, S. Dumais, G. Furnas, G.W. Furnas, R.A. Harshman, T.K. Landauer, K.E. Lochbaum,

and L.A. Streeter, ‘Computer information retrieval using latent semantic structure’, U. S. Patent, No.

4 (1989), pp. 839-853.


[DDF90] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, ‘Indexing by Latent

Semantic Analysis’, Journal of the American Society for Information Science, Vol. 41, No. 6 (1990),

pp. 391-407.

[DKN03] I.S. Dhillon, J. Kogan, and M. Nicholas, ‘Feature Selection and Document Clustering’, In M.W.

Berry, editor, A Comprehensive Survey of Text mining, Springer-Verlag, 2003.

[DhM99] I.S. Dhillon, and D.S. Modha, ‘Concept Decompositions for Large Sparse Text Data using

Clustering,’ Technical Report RJ 10147 (95022), IBM Almaden Research Center, 1999.

[DhM00] I.S. Dhillon and D.S. Modha, ‘A parallel data-clustering algorithm for distributed memory

multiprocessors’, In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Vol.

1759 (2000), pp. 245-260.

[DhM01] I.S. Dhillon and D.S. Modha, ‘Concept Decompositions for Large Sparse Text Data using

Clustering,’ Machine Learning, Vol. 42, No. 1-2 (2001), pp. 143-175.

[DHJ04] M. Diab, K. Hacioglu, and D. Jurafsky, ‘Automatic Tagging of Arabic Text: from Raw Text to Base

Phrase Chunks’, In Proceedings of the Human Language Technology conference and the North

American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, USA,

May, 2004, pp. 149-152.

[Did73] E. Diday, ‘The Dynamic Cluster Method and Sequentialization in Nonbierarchical Clustering’,

International Journal of Computer and Information Science, Vol. 2, No. 1(1973), pp. 63-69.

[DHZ01] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, ‘A min–max cut algorithm for graph partitioning and

data clustering’, In Proceedings of IEEE International Conference on Data Mining, 2001, pp. 107-

114.

[DHZ02] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, ‘Adaptive Dimension Reduction for Clustering High

Dimensional Data’, In Proceedings of the 2nd International IEEE Conference on Data Mining,

December, 2002, pp. 147-154.

[Din99] C. H. Ding, ‘A Similarity-based Probability Model for Latent Semantic Indexing’, Proceedings of the

22nd ACM SIGIR Conference, August, 1999, pp. 59-65.

[Din01] C. H. Ding, ‘A Probabilistic Model for Dimensionality Reduction in Information Retrieval and

Filtering’, In Proceedings of the 1st SIAM Computational Information Retrieval Workshop, 2000.

[DoG03] D.L. Donoho, and C. Grimes, ‘Hessian Eigenmaps: New Locally Linear Embedding Techniques for

High-Dimensional Data’, In Proceedings of Nat’l Academy of Sciences, Vol. 100, No. 10 (2003), pp.

5591-5596.

[DuH73] R.O. Duda, and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, NY, USA,

1973.


[Dum91] S. Dumais, ‘Improving the Retrieval of Information from External Sources’, Behavior Research

Methods, Instruments, & Computers, Vol. 23, No. 2 (1991), pp. 229-236.

[Dum92] S. Dumais, ‘Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval’, Technical

Memorandum Tm-ARH-017527, Bellcore, 1992.

[Dum94] S. Dumais, ‘Latent Semantic Indexing (LSI) and TREC-2’, Technical Memorandum Tm-ARH-

023878, Bellcore, 1994.

[Dun03] M. H. Dunham, Data Mining: Introductory And Advanced Topics, New Jersey: Prentice Hall, 2003.

[Dun89] G. H. Dunteman, Principal Component Analysis, Sage Publications, Newbury Park, California, USA,

1989.

[Egg04] L. Egghe, ‘Vector Retrieval, Fuzzy Retrieval and the Universal Fuzzy IR Surface for IR Evaluation’,

Journal of Information Processing and Management, Vol. 40, No. 4 (2004), pp. 603-618.

[EGH91] D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Lefferts, and I. Monarch, ‘Automatic Indexing

using Selective NLP and First-Order Thesauri’, In Proceedings of the Conference on Intelligent Text

and Image Handling, Barcelona, Spain, 1991, pp. 394-401.

[Fag87] J. Fagan, Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of

Syntactic and Non-syntactic methods, Doctoral dissertation, Cornell University, 1987.

[FaO95] C. Faloutsos, and D.W. Oar, A Survey of Information Retrieval and Filtering Methods. Technical

Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park, 1995.

[FiB02] R.D. Fierro and M.W. Berry, Efficient Computation of the Riemannian SVD in Total Least Squares

Problems in Information Retrieval, in S. Van Huffel and P. Lemmerling (Eds.), Total Least Squares

and Errors-in-Variables Modeling: Analysis, Algorithms, and Applications, Kluwer Academic

Publishers, 2002, pp. 349-360.

[For03] G. Forman, ‘An Extensive Empirical Study of Feature Selection Metrics for Text Classification’,

Journal of Machine Learning Research, Vol. 3 (2003), pp. 1289-1305.

[FrB92] W. Frakes and R. Baeza-Yates, editors, ‘Information Retrieval: Data Structures & Algorithms’,

Prentice Hall, Englewood Cliffs, New Jersey, 1992.

[Fre01] A. Freeman, ‘Brill's POS Tagger and a Morphology Parser for Arabic’, In Proceedings of the 39th

Annual Meeting of Association for Computational Linguistics and the 10th Conference of the

European Chapter, Workshop on Arabic Language Processing: Status and Prospects, Toulouse,

France, July, 2001.

[Fri73] M. Fritzche, ‘Automatic Clustering Techniques in Information Retrieval’, Diplomarbeit, Institut für

Informatik der Universität Stuttgart 1973.

[FuT04] B. Fuglede, and F. Topsoe, ‘Jensen-Shannon Divergence and Hilbert Space Embedding’, In IEEE

International Symposium on Information Theory, July, 2004, pp. 31.


[GeO01] F.C. Gey, and D.W. Oard, ‘The TREC-2001 Cross-Language Information Retrieval Track: Searching

Arabic Using English, French or Arabic Querie’, In Proceedings of the 2001 Text Retrieval

Conference, National Institute of Standards and Technology, November, 2001, pp. 16-26.

[GoR71] G. Golub, and C. Reinsch, Handbook for Automatic Computation II, Linear Algebra, Springer-

Verlag, New York, 1971.

[GoV89] G. Golub, and C. Van Loan, Matrix Computations, Johns-Hopkins, Baltimore, Maryland, 2nd Edition,

1989.

[GoD01] A. Goweder, and A. De Roeck, ‘Assessment of a Significant Arabic Corpus’, In Proceedings of the

39th Annual Meeting of the Association for Computational Linguistics, Arabic language Processing,

Toulouse, France, 2001, pp. 73-79.

[GPD04] A. Goweder, M. Poesio, A. De Roeck, and J. Reynolds, ‘Identifying Broken Plurals in Unvowelised

Arabic Text’, In Proceedings of Empirical Methods In Natural Language Processing, Geneva, July,

2004, pp. 246-253.

[GoR69] J. C. Gower, and G. J. S. Ross, ‘Minimum Spanning Trees and Single-Linkage Cluster Analysis’,

Applied Statistics, Vol. 18, No. 1 (1969), pp. 54–64.

[GRG97] V. N. Gudivada, V. V. Raghavan, W. I. Grosky, and R. Kasanagottu, ‘Information Retrieval on the

World-Wide Web’, IEEE Internet Computing, Vol. 1, No. 5 (1997), pp. 58-68.

[GuB06] S. Guérif, and Y. Bennani, ‘Selection of Clusters Number and Features Subset during a two-levels

Clustering Task’, In Proceeding of the 10th IASTED International Conference Artificial Intelligence

and Soft Computing, August, 2006, pp. 28-33.

[GuB07] S. Guérif, and Y. Bennani, ‘Dimensionality Reduction through Unsupervised Features Selection’, In

Proceeding of the 10th International Conference on Engineering Applications of Neural Networks,

Thessaloniki, Hellas, Greece, August, 2007, pp. 98-106.

[GBJ05] S. Guérif, Y. Bennani, and E. Janvier, ‘µ-som: Weighting Features During Clustering’, In Proceeding

of the 5th Workshop On Self-Organizing Maps, September, 2005, pp. 397-404.

[GuE03] I. Guyon, and A. Elisseeff, ‘An Introduction to Variable and Feature Selection’, Journal of Machine

Learning Research, Vol. 3 (2003), pp. 1157-1182.

[HaK03] K.M. Hammouda, and M.S. Kamel, ‘Incremental Document Clustering Using Cluster Similarity

Histograms’, In Proceeding of the IEEE International Conference on Web Intelligence, June, 2003,

pp. 597-601.

[HaK92] L. Hagen, and A.B. Kahng, ‘New Spectral Methods for Ratio Cut Partitioning and Clustering’, IEEE

Transaction Computer-Aided Design of Integrated Circuits and Systems, Vol. 11, No. 9, September,

1992, pp. 1074–1085.

[HGM00] V. Hatzivassiloglou, L. Gravano, and A. Maganti, ‘An Investigation of Linguistic Features and


Clustering Algorithms for Topical Document Clustering’, In Proceedings of the 23rd Annual

International ACM SIGIR Conference on Research and Development in Information Retrieval,

Athens, Greece, July, 2000, pp. 224-231.

[HeP96] M. A. Hearst and J. O. Pedersen, ‘Reexamining the Cluster Hypothesis: Scatter/Gather on retrieval

results,’ Proceedings of the 19th International ACM Conference on Research and Development in

Information Retrieval, Zurich, Switzerland, August, 1996, pp. 76-84.

[HBL94] W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam, ‘Ohsumed: An Interactive Retrieval

Evealuation and new large Test Collection for Research’, In Proceedings of the 17th Annual

International ACM Conference on Research and Development in Information Retrieval, Dublin,

Ireland, 1994, pp. 192-201.

[HiK99] A. Hinneburg, and D. A. Keim, ‘Optimal Grid-Clustering: Towards Breaking the Curse of

Dimensionality in High-Dimensional Clustering’, In Proceedings of the 25th International Conference

on Very Large Data Bases, Edinburgh, 1999, pp. 506-517.

[HKE97] I. Hmeidi, K. Kanaan, and M. Evens, ‘Design and Implementation of Automatic Indexing for

Information Retrieval with Arabic Documents’, Journal of the American Society for Information

Science, Vol. 48, No. 10 (1997), pp. 867-881.

[Yan05] H. Yan, ‘Techniques for Improved LSI Text Retrieval’, Ph.D. Dissertation, Wayne State University,

Detroit, Michigan, USA, 2005.

[Yan08] H. Yan, W. I. Grosky, and F. Fotouhi, ‘Augmenting the power of LSI in text retrieval: Singular value

rescaling’, Journal of Data and Knowledge Engineering, Vol. 65 (2008), pp. 108-125.

[HNR05] J.Z. Huang, M.K. Ng, H. Rong, and Z. Li, ‘Automated Variable Weighting in k-means Type

Clustering’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 5 (2005),

pp. 657-668.

[HSD00] P. Husbands, H. Simon, and C. H. Ding. ‘On the Use of the Singular Value Decomposition for Text

Retrieval’, Computational information Retrieval, M. W. Berry, Ed. Society for Industrial and Applied

Mathematics, Philadelphia, PA, 2001, pp. 145-156.

[Ibn90] Ibn Manzour, Lisan Al-Arab, Arabic Encyclopedia, 1290.

[JaD88] A. Jain, and R. Dubes, ‘Algorithms for Clustering Data’, Prentice Hall, Englewood Cliffs, N.J., 1988.

[JMF99] A.K. Jain, M.N. Murty, and P.J. Flynn, ‘Data clustering: a review’, ACM Computing Surveys, Vol.

31, No. 3 (1999), pp. 264-323.

[JaZ97] A. Jain, and D. Zongker, ‘Feature Selection: Evaluation, Application, and Small Sample

Performance’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2

(1997), pp. 153-158.

[JaS71] N. Jardine, and R. Sibson, Mathematical Taxonomy, Wiley, London and New York, 1971.


[Jia97] J. Jiang, ‘Using Latent Semantic Indexing for Data Mining’, MS Thesis, Department of Computer

Science, University of Tennessee, December, 1997.

[Jia98] E.P. Jiang, ‘Information retrieval and Filtering Using the Riemannian SVD’, Ph.D. Dissertation,

Department of Computer Science, University of Tennessee, August, 1998.

[JKP94] G.H. John, R. Kohavi, and K. Pfleger, ‘Irrelevant features and the subset selection problem’, In

Proceedings of the 11th International Conference on Machine Learning, San Francisco, CA, USA,

1994, pp. 121-129.

[Jon72] K.S. Jones, ‘A Statistical Interpretation of Term Specificity and its Application in Retrieval’, Journal

of Documentation, Vol. 28, No. 1 (1972), pp. 11-21.

[KaR90] L. Kaufman, and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis,

Wiley series in probability and mathematical statistics-Applied probability and statistics, A Wiley-

Interscience, New York, NY, 1990.

[Kin67] B. King, ‘Step-wise clustering procedures’, Journal of the American Statistical Association, Vol. 69

(1967), pp. 86-101.

[KlJ04] I.A. Klampanos, and J.M. Jose, ‘An Architecture for Information Retrieval over Semi-Collaborating

Peer-to-Peer Networks’, In Proceedings of the 2004 ACM Symposium on Applied Computing, Vol. 2

(2004), Nicosia, Cyprus, March, pp. 1078-1083.

[KJR06] I.A. Klampanos, J.M. Jose, and C.J.K. van Rijsbergen, ‘Single-Pass Clustering for Peer-to-Peer
Information Retrieval: The Effect of Document Ordering’, Proceedings of the 1st International

Conference on Scalable information Systems, Hong Kong, May, 2006, Article 36.

[Kho01] S. Khoja, ‘APT: Arabic Part-of-speech Tagger’, In Proceedings of the Student Workshop at the 2nd

Meeting of the North American Chapter of the Association for Computational Linguistics, 2001, pp.

20-25.

[KhG99] S. Khoja, and R. Garside, Stemming Arabic text, Technical Report, Computing Department,

Lancaster University, Lancaster, September, 1999.

[KiR92] K. Kira, and L. A. Rendell, ‘A Practical Approach to Feature Selection’, In Proceedings of the 9th

International Conference on Machine Learning, San Francisco, CA, USA, 1992, pp. 249-256.

[KoJ97] R. Kohavi, and G. H. John, ‘Wrappers for feature subset selection’, Journal of Artificial Intelligence,

Vol. 97, No. 1-2 (1997), pp. 273-324.

[KoO96] T.G. Kolda, and D.P. O'Leary, ‘Large Latent Semantic Indexing via a Semi-Discrete Matrix

Decomposition’, Technical Report, No. UMCP-CSD CS-TR-3713, Department of Computer Science,

Univ. of Maryland, 1996.

[KoO98] T.G. Kolda, and D.P. O'Leary, ‘A Semi-Discrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval’, ACM Transactions on Information Systems, Vol. 16, No. 4 (1998), pp. 322-346.

[KoS96] D. Koller, and M. Sahami, ‘Toward Optimal Feature Selection’, In Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 284-292.

[KWX01] B. Krishnamurthy, J. Wang, and Y. Xie, ‘Early Measurements of a Cluster-Based Architecture for P2P Systems’, Internet Measurement Workshop, ACM SIGCOMM, San Francisco, USA, November, 2001.

[KuL51] S. Kullback, and R. A. Leibler, ‘On Information and Sufficiency’, Annals of Mathematical Statistics, Vol. 22 (1951), pp. 79-86.

[LaL06] S. Lafon, and A.B. Lee, ‘Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 9 (2006), pp. 1393-1403.

[LFL98] T.K. Landauer, P. W. Foltz, and D. Laham, ‘An Introduction to Latent Semantic Analysis’, Discourse Processes, Vol. 25 (1998), pp. 259-284.

[Lap00] E. Laporte, ‘Mot et niveau lexical’, Ingénierie des langues, 2000, pp. 25-46.

[LBC02] L. S. Larkey, L. Ballesteros, and M. Connell, ‘Improving Stemming for Arabic Information Retrieval: Light Stemming and Cooccurrence Analysis’, In Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, August, 2002, pp. 275-282.

[LaM71] D. N. Lawley, and A. E. Maxwell, Factor Analysis as a Statistical Method, 2nd edition, American Elsevier Publication, New York, USA, 1971.

[Law03] N. D. Lawrence, ‘Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data’, In Proceedings of Neural Information Processing Systems, December, 2003.

[Lee94] J.H. Lee, ‘Properties of Extended Boolean Models in Information Retrieval’, Proceedings of the 17th Annual International ACM SIGIR Conference, Dublin, Ireland, 1994, pp. 182-190.

[Ler99] K. Lerman, ‘Document Clustering in Reduced Dimension Vector Space’, Unpublished Manuscript, 1999, http://www.isi.edu/~lerman/papers/papers.html.

[Let96] T.A. Letsche, ‘Toward Large-Scale Information Retrieval Using Latent Semantic Indexing’, MS Thesis, Department of Computer Science, University of Tennessee, August, 1996.

[LeB97] T.A. Letsche, and M.W. Berry, ‘Large-Scale Information Retrieval with Latent Semantic Indexing’, Information Sciences, Vol. 100, No. 1-4 (1997), pp. 105-137.

[Leu01] A. Leuski, ‘Evaluating Document Clustering for Interactive Information Retrieval’, Proceedings of the ACM 10th International Conference on Information and Knowledge Management, Atlanta, Georgia, November, 2001, pp. 33-40.

[Lit69] B. Litofsky, ‘Utility of automatic classification systems for information storage and retrieval’, Ph.D. Dissertation, University of Pennsylvania, 1969.

[LiM98] H. Liu, and H. Motoda, ‘Feature Selection for Knowledge Discovery & Data Mining’, The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, Boston, USA, 1998.

[Mac67] J. B. MacQueen, ‘Some Methods for Classification and Analysis of Multivariate Observations’, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, Vol. 1 (1967), pp. 281-297.

[MaL01] V. Makarenkov, and P. Legendre, ‘Optimal Variable Weighting for Ultrametric and Additive Trees and k-means Partitioning: Methods and Software’, Journal of Classification, Vol. 18 (2001), pp. 245-271.

[MAS03] J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi, ‘Topic Detection and Tracking with Spatio-Temporal Evidence’, In Proceedings of the 25th European Conference on Information Retrieval Research, 2003, pp. 251-265.

[MaS99] C. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.

[MeS01] M. Meila, and J. Shi, ‘A Random Walk’s View of Spectral Segmentation’, In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, January, 2001.

[Mil02] A. Miller, Subset Selection in Regression, 2nd edition, Chapman & Hall/CRC, 2002.

[MBS97] M. Mitra, C. Buckley, A. Singhal, and C. Cardie, ‘An Analysis of Statistical and Syntactic Phrases’, In Proceedings of the 5ème Conférence de Recherche d’Information Assistée par Ordinateur, Montreal, Canada, June, 1997, pp. 200-214.

[MMP02] P. Mitra, C.A. Murthy, and S.K. Pal, ‘Unsupervised Feature Selection Using Feature Similarity’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3 (2002), pp. 301-312.

[Mun57] J. Munkres, ‘Algorithms for the Assignment and Transportation Problems’, Journal of the Society for Industrial and Applied Mathematics, Vol. 5, No. 1 (1957), pp. 32-38.

[MuA04] S. H. Mustafa, and Q. A. Al-Radaideh, ‘Using N-grams for Arabic Text Searching’, Journal of the American Society for Information Science and Technology, Vol. 55, No. 11, September, 2004, pp. 1002-1007.

[NJW02] A. Ng, M. Jordan, and Y. Weiss, ‘On Spectral Clustering: Analysis and an Algorithm’, In Proceedings of the 14th Advances in Neural Information Processing Systems, 2002.

[NiC05] M. Nikkhou, and K. Choukri, ‘Report on Survey on Arabic Language Resources and Tools in Mediterranean Countries’, ELDA, NEMLAR, 2005.

[Obr94] G. W. O’Brien, ‘Information Management Tools for Updating an SVD Encoded Indexing Scheme’, Master’s Thesis, The University of Tennessee, Knoxville, TN, 1994.

[PLL01] J.M. Pena, J.A. Lozano, P. Larranaga, and I. Inza, ‘Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, June, 2001, pp. 590-603.

[PoC98] J.M. Ponte, and W.B. Croft, ‘A Language Modeling Approach to Information Retrieval’, Proceedings of the 21st Annual International ACM SIGIR Conference, 1998, pp. 275-281.

[PTS92] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd edition, Cambridge University Press, 1992, 994 pp.

[PrS72] N.S. Prywes, and D.P. Smith, ‘Organization of Information’, Annual Review of Information Science and Technology, Vol. 7 (1972), pp. 103-158.

[PNK94] P. Pudil, J. Novovicova, and J. Kittler, ‘Floating Search Methods in Feature Selection’, Journal of Pattern Recognition Letters, Vol. 15, No. 11 (1994), pp. 1119-1125.

[Rad79] T. Radecki, ‘Fuzzy Set Theoretical Approach to Document Retrieval’, Journal of Information Processing and Management, Vol. 15 (1979), pp. 247-259.

[Ras92] E. Rasmussen, ‘Clustering Algorithms’, In Information Retrieval: Data Structures and Algorithms, 1992, pp. 419-442.

[RoS76] S.E. Robertson, and K. Sparck Jones, ‘Relevance Weighting of Search Terms’, Journal of the American Society for Information Science, Vol. 27, No. 3 (1976), pp. 129-146.

[RWH94] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, and M. Gatford, ‘Okapi at TREC-3’, In Proceedings of TREC-3, November, 1994, pp. 109-126.

[Rom90] P. M. Romer, ‘Endogenous Technological Change’, Journal of Political Economy, Vol. 98, No. 5 (1990), pp. 71-102.

[RoS00] S.T. Roweis, and L.K. Saul, ‘Nonlinear Dimensionality Reduction by Locally Linear Embedding’, Science, Vol. 290, No. 5500 (2000), pp. 2323-2326.

[SaB90] G. Salton, and C. Buckley, ‘Improving retrieval performance by relevance feedback’, Journal of the American Society for Information Science, Vol. 41, No. 4 (1990), pp. 288-297.

[Sal68] G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, New York, 1968.

[Sal71] G. Salton, The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice-Hall Inc, Englewood Cliffs, New Jersey, 1971.

[SaM83] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw Hill Publishing Company, New York, 1983.

[SaW78] G. Salton, and A. Wong, ‘Generation and Search of Clustered Files’, ACM Transactions on Database Systems, Vol. 3, No. 4, December, 1978, pp. 321-346.

[Sav02] J. Savoy, ‘Morphologie et recherche d’information’, Technical Report, CLEF, 2002.

[SSM99] B. Scholkopf, A. J. Smola, and K. Muller, ‘Kernel Principal Component Analysis’, In B. Scholkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 327-352.

[Sch94] H. Schmid, ‘Probabilistic Part-of-Speech Tagging Using Decision Trees’, Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, July, 1994, pp. 172-176.

[Sch02] N. Schmitt, ‘Using corpora to teach and assess vocabulary’, Chapter in Corpus Studies in Language Education, Melinda Tan (Ed.), IELE Press, 2002, pp. 31-44.

[SAS04] Y. Seo, A. Ankolekar, and K. Sycara, ‘Feature Selection for Extracting Semantically Rich Words’, Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, March, 2004.

[ShM00] J. Shi, and J. Malik, ‘Normalized Cuts and Image Segmentation’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8 (2000), pp. 888-905.

[SBM96] A. Singhal, C. Buckley, and M. Mitra, ‘Pivoted document length normalization’, Proceedings of the 19th Annual International ACM SIGIR Conference, Zurich, Switzerland, August, 1996, pp. 21-29.

[SnS73] P. H. A. Sneath, and R. R. Sokal, Numerical Taxonomy, Freeman, London, UK, 1973.

[SGM00] A. Strehl, J. Ghosh, and R. Mooney, ‘Impact of Similarity Measures on Web-page Clustering’, In Proceedings of the AAAI Workshop on AI for Web Search, K. Bollacker (Ed.), TR WS-00-01, AAAI Press, July, 2000, pp. 58-64.

[SLP97] T. Strzalkowski, F. Lin, and J. Perez-Carballo, ‘Natural Language Information Retrieval: TREC-6 Report’, In Proceedings of the 6th Text Retrieval Conference, 1997, pp. 347-366.

[Sub92] J.L. Subbiondo, John Wilkins' Theory of Meaning and the Development of a Semantic Model, In John Wilkins and 17th-Century British Linguistics, Chap. 5: Wilkins' Classification of Reality, Joseph L. Subbiondo (Ed.), Amsterdam, 1992, pp. 291-308.

[TEC05] K. Taghva, R. Elkhoury, and J. Coombs, ‘Arabic Stemming Without A Root Dictionary’, In Proceedings of Information Technology: Coding and Computing, Las Vegas, NV, April, 2005, pp. 152-157.

[TSL00] J.B. Tenenbaum, V. de Silva, and J. C. Langford, ‘A Global Geometric Framework for Nonlinear Dimensionality Reduction’, Science, Vol. 290 (2000), pp. 2319-2323.

[TuC91] H. Turtle, and W. Croft, ‘Evaluation of an Inference Network-based Retrieval Model’, ACM Transactions on Information Systems, Vol. 9, No. 3 (1991), pp. 187-222.

[VHL05] U. Vaidya, G. Hagen, S. Lafon, A. Banaszuk, I. Mezic, and R. R. Coifman, ‘Comparison of Systems using Diffusion Maps’, Proceedings of the 44th IEEE Conference on Decision and Control and the European Control Conference, Seville, Spain, December, 2005, pp. 7931-7936.

[Van72] C.J. Van Rijsbergen, Automatic Information Structuring and Retrieval, Ph.D. Dissertation, University of Cambridge, 1972.

[Van79] C.J. Van Rijsbergen, Information Retrieval, Second Edition, Butterworths Publishing Company, London, 1979.

[Von06] U. Von Luxburg, ‘A Tutorial on Spectral Clustering’, Technical Report TR-149, Max Planck Institute for Biological Cybernetics, 2006.

[WaK79] W. G. Waller, and D. H. Kraft, ‘A mathematical model of a weighted Boolean retrieval system’, Journal of Information Processing & Management, Vol. 15, No. 5 (1979), pp. 235-245.

[Wei99] Y. Weiss, ‘Segmentation Using Eigenvectors: A Unifying View’, Proceedings of the IEEE International Conference on Computer Vision, Vol. 14 (1999), pp. 975-982.

[VeA99] J. Vesanto, and J. Ahola, ‘Hunting for Correlations in Data Using the Self-Organizing Map’, In Proceedings of the International ICSC Congress on Computational Intelligence Methods and Applications, 1999, pp. 279-285.

[WMC01] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, ‘Feature Selection for SVMs’, Advances in Neural Information Processing Systems, Vol. 13 (2001), pp. 668-674.

[Wit97] D. I. Witter, ‘Downdating the Latent Semantic Indexing Model for Information Retrieval’, MS Thesis, Department of Computer Science, University of Tennessee, 1997.

[WMB94] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, NY, 1994.

[WoF00] W. Wong, and A. Fu, ‘Incremental Document Clustering for Web Page Classification’, In Proceedings of the International Conference on Information Society, Japan, 2000.

[WZW85] S.K.M. Wong, W. Ziarko, and P.C.N. Wong, ‘Generalized Vector Spaces Model in Information Retrieval’, In Proceedings of the 8th Annual International ACM SIGIR Conference, Montreal, Quebec, Canada, 1985, pp. 18-25.

[XJK01] E.P. Xing, M.I. Jordan, and R.M. Karp, ‘Feature Selection for High-Dimensional Genomic Microarray Data’, In Proceedings of the 18th International Conference on Machine Learning, San Francisco, CA, USA, 2001, pp. 601-608.

[XuC98] J. Xu, and W.B. Croft, ‘Corpus-Based Stemming using Co-occurrence of Word Variants’, ACM Transactions on Information Systems, Vol. 16, No. 1 (1998), pp. 61-81.

[XFW01] J. Xu, A. Fraser, and R. Weischedel, ‘TREC 2001 Crosslingual Retrieval at BBN’, In TREC 2001, Gaithersburg: NIST, 2001.

[Yah89] A. H. Yahya, ‘On the Complexity of the Initial Stages of Arabic Text Processing’, First Great Lakes Computer Science Conference, Kalamazoo, Michigan, U.S.A., October, 1989, pp. 18-20.

[YaH98] J. Yang, and V. Honavar, ‘Feature Subset Selection Using a Genetic Algorithm’, IEEE Intelligent Systems, Vol. 13, No. 2 (1998), pp. 44-49.

[YaP97] Y. Yang, and J.O. Pedersen, ‘A Comparative Study of Feature Selection in Text Categorization’, In Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, USA, 1997, pp. 412-420.

[YuL03] L. Yu, and H. Liu, ‘Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter Solution’, In Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 856-863.

[Zah71] C. T. Zahn, ‘Graph-Theoretic Methods for Detecting and Describing Gestalt Clusters’, IEEE Transactions on Computers, Vol. 20, No. 1 (1971), pp. 68-86.

[ZaE98] O. Zamir, and O. Etzioni, ‘Web Document Clustering: A Feasibility Demonstration’, In Proceedings of the 21st International ACM SIGIR Conference, Melbourne, Australia, 1998, pp. 46-54.

[ZeH01] S. Zelikovitz, and H. Hirsh, ‘Using LSI for Text Classification in the Presence of Background Text’, In Proceedings of the ACM 10th International Conference on Information and Knowledge Management (CIKM’01), Atlanta, Georgia, November, 2001, pp. 113-118.

[ZTM96] C. Zhai, X. Tong, N. Milic-Frayling, and D. A. Evans, ‘Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report’, In Proceedings of the 5th Text Retrieval Conference, Gaithersburg, MD, November, 1996, pp. 347-358.

[ZhZ02] Z. Zhang, and H. Zha, ‘Principal Manifolds and Nonlinear Dimension Reduction via Local Tangent Space Alignment’, Technical Report CSE-02-019, Dept. of Computer Science and Eng., Pennsylvania State University, Pennsylvania, USA, 2002.

[ZhG02] R. Zhao, and W. I. Grosky, ‘Negotiating the Semantic Gap: from Feature Maps to Semantic Landscapes’, Pattern Recognition, Vol. 35, No. 3 (2002), pp. 593-600.