
Information Processing and Management 42 (2006) 136–154

www.elsevier.com/locate/infoproman

Qualitative evaluation of automatic assignment of keywords to images

Chih-Fong Tsai *, Ken McGarry, John Tait

School of Computing and Technology, University of Sunderland, Sunderland SR6 0DD, UK

Received 6 July 2004; accepted 1 November 2004

Available online 10 December 2004

Abstract

In image retrieval, most systems lack user-centred evaluation since they are assessed by some chosen ground truth

dataset. The results reported through precision and recall assessed against the ground truth are thought of as being

an acceptable surrogate for the judgment of real users. Much current research focuses on automatically assigning

keywords to images for enhancing retrieval effectiveness. However, evaluation methods are usually based on system-level

assessment, e.g. classification accuracy based on some chosen ground truth dataset. In this paper, we present a qualitative

evaluation methodology for automatic image indexing systems. The automatic indexing task is formulated as one

of image annotation, or automatic metadata generation for images. The evaluation is composed of two individual

methods. First, the automatic indexing annotation results are assessed by human subjects. Second, the subjects are asked

to annotate some chosen images as the test set whose annotations are used as ground truth. Then, the system is tested by

the test set whose annotation results are judged against the ground truth. Only one of these methods is reported for most

systems on which user-centred evaluations are conducted. We believe that both methods need to be considered for full

evaluation. We also provide an example evaluation of our system based on this methodology. According to this study,

our proposed evaluation methodology is able to provide deeper understanding of the system's performance.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Qualitative evaluation; Image annotation; Image retrieval; Statistical analysis

1. Introduction

Evaluation is a critical issue for Information Retrieval (IR). Assessment of the performance or the value

of an IR system for its intended task is one of the distinguishing features of the subject. The type of

0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.ipm.2004.11.001

* Corresponding author.

E-mail addresses: [email protected] (C.-F. Tsai), [email protected] (K. McGarry), [email protected] (J. Tait).


evaluation to be considered depends on the objectives of the retrieval system. In general, retrieval perfor-

mance evaluation is based on a test reference collection, e.g. TREC, and on an evaluation measure, e.g.

precision and recall (Baeza-Yates & Ribeiro-Neto, 1999).

Saracevic (1995) reviews the history and nature of evaluation in IR and describes six different levels of IR

evaluation from system to user levels. However, most IR evaluations are only based on the system level(s) and lack user-centred evaluation. To achieve a more comprehensive picture of IR performance and users' needs, both system- and user-centred evaluations are needed. That is, we need to evaluate at different levels

as appropriate and/or against different types of relevance (Dunlop, 2000). Examples of some recent studies

focusing on user judgments are Belkin et al. (2001), Hersh et al. (2001), and Spink (2002).

Due to the advances in computing and multimedia technologies, the size of image collections is increas-

ing rapidly. Content-Based Image Retrieval (CBIR) has been an active research area for the last decade

whose main goal is to design mechanisms for searching large image collections. Similar to traditional

IR, studies on user issues of image retrieval are lacking (Fidel, 1997; Rasmussen, 1997).

Current CBIR systems index and retrieve images based on their low-level features, such as colour, texture, and shape, and it is difficult to find desired images based on these low-level features, because they have

no direct correspondence to high-level concepts in humans' minds. This is the so-called semantic gap problem. Bridging the semantic gap in image retrieval has attracted much work generally focussing on making

systems more intelligent and automatically understanding image contents in terms of high-level concepts

(Eakins, 2002). Image annotation systems, i.e. automatic assignment of one or multiple keywords to an im-

age, have been developed for this purpose (Barnard et al., 2003; Kuroda & Hagiwara, 2002; Li & Wang,

2003; Park, Lee, & Kim, 2004; Tsai, McGarry, & Tait, 2003; Vailaya, Figueiredo, Jain, & Zhang, 2001).

To evaluate the annotation results, most of these systems are only based on some chosen dataset with

ground truth, such as Corel. However, the problem is that currently there is no standard image dataset

for evaluation, like the web track of TREC for IR (Craswell, Hawking, Wilkinson, & Wu, 2003). As IR

systems also need to consider human subjects for evaluation, quantitative evaluation of current annotation systems is insufficient to validate their performance. Therefore, user-centred evaluation of image annotation systems is also necessary.

This paper is organised as follows. Section 2 reviews related work on conducting qualitative evaluation

for image retrieval related algorithms, systems, etc. Section 3 presents our qualitative evaluation methodology for image annotation systems. Section 4 shows an example of assessing our image annotation system

based on the proposed methodology. Section 5 provides some discussion of the user-centred evaluation.

Finally, some conclusions are drawn in Section 6.

2. Related work

For human assessment of image retrieval systems, the general approach is to ask human subjects to evaluate the systems' outputs directly. For example, a questionnaire can be devised to ask the human judges to

rank the level of preference for each specific retrieved image, or the ease with which they were able to find

desired images. For image annotation, keywords associated with their images can be selected as relevant or

irrelevant by the judges. Then, conclusions can be drawn from the analysis of the qualitative data gathered.

An alternative approach is to ask human subjects to decide their desired outputs by manual processing

on a given dataset. That is, people are asked to choose relevant images for some specific queries according to

their subjective opinions. In the case of image annotation, people might be asked to annotate a given set of

images. If the results show a certain degree of consistency, they can be treated as ground truth. Then, the system can be tested by the same dataset given to the human subjects and its outputs can be compared with

the ground truth. Note that the difference between a ground truth dataset such as Corel and the human annotation gathered here is that the former is based on some unknown, selected professional indexers, whereas the


latter can be naïve or real amateur users. In other words, one obtains a different 'ground truth' depending on whether the data is derived from the viewpoint of a professional indexer, or from the viewpoints of more general classes of users. In addition, we believe that the latter ground truth could much better represent general users' opinions.

We classify qualitative evaluation of image retrieval related work into four research domains. They are

CBIR and/or relevance feedback algorithms/systems, image annotation algorithms/systems, user needs/

behaviour in searching, and others such as edge detection and image segmentation. In addition, the papers

describing their evaluation can be further classified into the two types of evaluation methods described

above. Table 1 shows the results, where Type I represents 'results assessment by humans' and Type II 'pre-definition of a ground truth dataset by humans for further testing or evaluation'. Note that we are not interested in the technical issues of these works and they are not described here.

According to this table, it is clear that most qualitative evaluation in literature is based on the Type I

method to directly assess the results/outputs of their systems to draw conclusions. Few studies consider the Type II method to compare their systems' outputs with the correct ones as defined by some human subjects in advance.

For the domain of CBIR and/or relevance feedback, users are generally asked to evaluate the retrieval

results of these systems for validation, but few ask some chosen human subjects to pre-define the retrieval

results for some given queries in order to assess the systems' retrieval results. None of them consider both evaluation

methods at the same time.

Table 1
A comparison of related qualitative evaluation methods for image retrieval

CBIR and/or relevance feedback
  Type I: Barnard and Shirahatti (SPIE'03); Ciocca and Schettini (IP&M'99); Cox et al. (IEEE TOIP'00); Gudivada and Raghavan (IP&M'97); Markkula et al. (IR'01); McDonald et al. (SIGIR'01); Mehtre et al. (IP&M'98); Muller et al. (MDM/KDD'00); Rodden et al. (SIGCHI'01); Sanchez et al. (IP&M'03)
  Type II: Black et al. (CIVR'02); Han and Myaeng (SIGIR'96); Markkula et al. (IR'01); Minka and Picard (PR'97); Sanchez et al. (IP&M'03); Squire and Pun (PR&M'98); Wu and Narasimhalu (IP&M'98)

Image annotation
  Type I: Barnard and Shirahatti (SPIE'03); Wang et al. (ICIP'03)
  Type II: Black et al. (CIVR'02); Gorkani and Picard (ICPR'94); Liu et al. (CVPR'01); Wang et al. (ICIP'03)

User needs and/or behaviour in searching, etc.
  Type I: Armitage and Enser (JOIS'97); Chen, H.-L. (IP&M'01); Choi and Rasmussen (IP&M'02); Efthimiadis and Fidel (SIGIR'00); Goodrum and Spink (IP&M'01); Jorgensen (IP&M'98); Jose et al. (SIGIR'98); Markkula and Sormunen (IR'00); McDonald and Tait (SIGIR'03)
  Type II: Conniss et al. (2000); Saracevic et al. (ASIS'90); Xie (ASIS'97), etc.

Others (edge detection, image segmentation, etc.)
  Type I: Heath et al. (CVIU'98); Martin et al. (ICCV'01); Shaffrey et al. (ACIVS'02)
  Type II: Martin et al. (ICCV'01)


It should be noted that studies of user needs/behaviour in search do not necessarily conduct Type II eval-

uation since their goal is to understand user behaviour in the context of information searching rather than

to assess systems in terms of retrieval effectiveness. Some studies, which aim to model information seeking

behaviour, focus on contextual studies of people in the workplace without considering a front-end system

(Conniss, Ashford, & Graham, 2000).

For the domain of edge detection, image segmentation, etc., few systems are assessed by independent

users. Generally success in identifying the pre-determined regions of interest is assessed. Only Martin,

Fowlkes, Tal, and Malik (2001) consider both evaluation methods. They found that different human seg-

mentations of the same image are highly consistent.

For the domain of image annotation systems, which is the main focus of this paper, although a number

of image annotation systems have been reported in the literature, most of them are evaluated by using some

chosen ground truth datasets for analysing classification accuracy. However, they generally lack user-centred evaluation. Therefore, it is hard to draw a valid general conclusion for image retrieval from this data. Only Wang, Li, and Lin (2003) consider both evaluation methods. However, for their Type II method only the first author pre-assigned some keywords to four images as the ground truth test set to compare the system's performance. Therefore, the data collection strategy is not very objective and reliable. Moreover,

they did not report how reliable and consistent the judgments for the Type I method were and of course it is

not available for their Type II method. It can be said that if the judgments are not consistent, the validation

may not be reliable.

In conclusion, for evaluating an image annotation system by human subjects there is no work that con-

siders both Type I and II evaluation methods with both statistical analysis of consistency and of the significance level of the results. That is, while conducting qualitative evaluation, no answers are produced to the

questions: ‘‘how reliable are the assessments of the automatic annotation results?’’ and ‘‘how consistent are the

human annotations with the automatic indexing annotations?’’. This is the reason why our evaluation methodology is proposed to validate our image annotation system and could be considered for future image

odology is proposed to validate our image annotation system and could be considered for future image

annotation systems.

3. The evaluation methodology

The conclusion of Section 2 motivates proposing a user-centred evaluation methodology for existing im-

age annotation systems in terms of effectiveness, i.e. quality and accuracy of image annotation. Fig. 1 shows

the evaluation procedure. It is composed of the Type I and Type II evaluation methods described above.

Both types of evaluation contain three steps, which are research question formulation, data collection, and

data analysis, which can provide different kinds of understanding of an image annotation system.

[Fig. 1. The evaluation procedure: the system outputs (image annotations) feed the Type I evaluation (research questions for human judgments, data collection from human judgments, results assessment) and the Type II evaluation (research questions for human annotations, data collection from human annotations, results comparison), with data analysis by statistical measures in both cases.]

For the


Type I evaluation method, the system is tested against some chosen dataset and the annotation results are

assessed by some human subjects, thus 'results assessment'. For the Type II evaluation method, some

human subjects are asked to annotate a chosen set of images and then, the system is tested by annotating

this set of images and comparing the results with the human annotations, thus 'results comparison'. The following subsections describe these three steps.

3.1. Research questions

To conduct qualitative evaluation for image annotation systems, the first step is to formulate some re-

search questions to collect appropriate data and materials for further analyses. Depending on the research

aims and objectives, different research questions can be answered. The main research questions of the Type

I and Type II evaluation methods are based on first examining the reliability (i.e. consistency) of human

judgments (Type I) and annotations (Type II). This is because if data collected from human performance are not consistent to a certain degree, it is difficult to design a reliable validation scheme for a system. We

think that this is the first and a required step to conduct user-centred (qualitative) evaluation. Then, the

second research questions of both evaluation methods attempt to understand the system performance by

some comparisons. The final questions we are interested in are the agreement of human judgments and

annotations in terms of concept- or keyword-based image annotation.

3.1.1. Research questions for Type I evaluation

• Question 1: are the judgments correlated and consistent?

The first research question for Type I evaluation, 'results assessment', is to see how well human judgments

are correlated with system annotation results. In order to make this evaluation reliable, we also need to

assess how consistent humans are in their judgments of the relevance of the assigned annotations.

• Question 2: does the system outperform the baseline?

A random guessing approach is used as the baseline in this work. This randomly assigns keyword annota-

tions to images. We think that an image annotation system should outperform the random guessing approach as an absolute minimum. Currently there are a number of automatic image annotation systems

reported in the literature, but no agreed baseline for comparison. In the future a more demanding baseline

reflecting the performance of real systems could be adopted without changing the evaluation framework,

but it is difficult to establish what such a more demanding baseline should be right now.

• Question 3: which class(es) obtain most agreement between the system annotations and human judgments

and which show little agreement if any?

Human judges might, for example, accept the system's assignment of annotation keywords corresponding

to concrete objects (grass, trees, car) but be more likely to disagree with the assignment of more abstract

annotations (festival, dance, happy). Further, there might be more disagreement between human judgments about the appropriate assignment of these abstract annotations.

3.1.2. Research questions for Type II evaluation

• Question 1: are the human annotations consistent?

The first research question for Type II evaluation, 'results comparison', is to determine whether and to what

extent different human judges and the same human judges on different occasions produce the same anno-

tations for a set of images.


• Question 2: do the annotation results of the system show compatibility with the benchmark?

The benchmark is based on the human annotations. This research question is to see whether the system has

similar performance to the benchmark, i.e. whether the system assigns similar annotations to those of humans.

• Question 3: which class(es) obtain most agreement between the benchmark and the system performance and

which show little agreement, if any?

The question here is whether there are some classes (perhaps concrete keywords) for which the system has high

levels of agreement with the benchmark annotations, and others (perhaps abstract) in which such high lev-

els of agreement are not achieved.

To answer these questions for evaluation, some null hypotheses can be formulated which are the reverse

of what we believe. Then, some statistical measures can be used such as the t-test to test the hypotheses for

data analysis. Section 3.3 describes this issue in detail.

3.2. Data collection

3.2.1. The judges

To evaluate image annotation systems, we believe that any judges who can recognise the association between image contents and keywords are qualified. That is, when looking at an image, acceptable subjects

only need to reliably link its content to some relevant keywords to judge the relevance of the assigned key-

words by the system. However, cultural issues and the subject's background and expertise may affect the

judgments. Therefore, we consider using native speaker subjects who are not experts in image indexing

as the focus group for evaluation.

The number of human subjects, however, may vary with the experiments' goals. As the goal is to

evaluate image annotation results rather than users' behaviours in searching (Schamber, 1994) or the level of

satisfaction with retrieval results (Applegate, 1993) for example, our initial study focuses on a small group of controlled human subjects. This makes it easier to maintain a consistent experimental setting and procedure.

Clearly, a larger sample would be needed if we wished to make general claims about user behaviour.

3.2.2. The test set

The size of the test set for qualitative evaluation should not be so large as to make the manual annotation

and assessment work excessively labour intensive. For example, people may be unwilling to assess a very large

number of images to determine whether those images have relevant associated keywords. In addition, their

judgments or annotations may be affected by long assessment or annotation sessions (Black Jr., Fahmy, & Panchanathan, 2002). On the other hand, a very small test set, say five images, used to test a 100-category classification system may not be adequate, since it is very difficult to ensure it is a representative sample

of the full test set or there are enough judgments or annotation decisions to assess statistical significance, unless

a large group of subjects is used. For example, the Type II method of Wang et al. (2003) suffers from this problem of an overly small sample size, making it difficult to reliably draw conclusions from the experimental results.

Although it depends on the task, we think that the range from 50 to 100 images should be appropriate

for an initial study while making the human judgments or annotations reasonably consistent. If these num-

bers were too large, our analysis would show inconsistent results, although clearly one would need to undertake experiments with different numbers of images to determine whether an overly small sample is

the primary cause of the inconsistency, and this has not been done in the work reported here.

3.2.3. The tools and rules for human judgments and annotations

Once the test set is chosen, a tool or interface needs to be provided for the human subjects to assess the

system's outputs and annotate images for the Type I and II methods respectively. It should be as simple as


possible to avoid compounding factors from interface issues (for example ambiguous command button

labelling, confusing screen layout and so on). In addition, how to assess the results and annotate images

should be defined before collecting the data. That is, ‘‘what keywords can be thought of as relevant (Type

I)?’’ and ‘‘which and how many relevant keywords can be assigned to images (Type II)?’’

3.2.4. Data representation and quantisation

As the data are collected qualitatively, i.e. relevant or perhaps irrelevant keywords associated with their

images are selected (Type I) or images are annotated (Type II), the next step is to consider the data quan-

tisation for qualitative data representations.

• For the Type I evaluation method, the rate of classification accuracy is measured by

$\frac{\sum m_i}{\sum N_i}$

where $\sum m_i$ and $\sum N_i$ denote the total number of the relevant keyword i selected and the total number of keyword i assigned

by the system. For example, the system assigns the keyword grass to 10 images from the test set, and the

judge selects 4 images out of 10 assignments as relevant, then the system has 40% annotation accuracy

for grass.

• For the Type II evaluation method (i.e. results comparison), we consider a simple and direct data quantisation method. If the annotator assigns two relevant keywords (e.g. keywords i and j) to an image, and

there is at least or only one keyword assigned to the image by the system (e.g. keywords i, k, l, m, n)

which is the same as one of the two keywords assigned by the annotator, e.g. keyword i, the system

has 100% annotation accuracy for keyword i, but 0% for keywords k, l, m, n for this image. Note that

in practice it is unlikely that a system scores 100% accuracy because we will be assigning only a very small

proportion of the available keywords to each image. In our example evaluation discussed in the next sec-

tion, we constrain the system to assign five keywords out of 150 controlled keywords for each image (cf.

Section 4.1.2) and the human subjects to assign between two and five keywords out of these 150 to each image (cf. Section 4.2.3).
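As an illustration of the two quantisation rules just described (this sketch is ours, not part of the original paper; all function and variable names are hypothetical), the following Python code computes the Type I per-keyword accuracy from a judge's selections and the Type II per-image comparison against a human annotation.

```python
from collections import defaultdict

def type1_accuracy(system_assignments, judged_relevant):
    """Type I: accuracy for keyword i = (sum of m_i) / (sum of N_i),
    i.e. assignments of i judged relevant over all assignments of i."""
    assigned = defaultdict(int)
    relevant = defaultdict(int)
    for image_id, keyword in system_assignments:      # one entry per assignment
        assigned[keyword] += 1
        if (image_id, keyword) in judged_relevant:     # ticked by the judge
            relevant[keyword] += 1
    return {kw: relevant[kw] / assigned[kw] for kw in assigned}

def type2_accuracy(system_keywords, human_keywords):
    """Type II: for one image, a system keyword scores 1 if the human
    annotator also assigned it to that image, and 0 otherwise."""
    return {kw: 1.0 if kw in human_keywords else 0.0 for kw in system_keywords}

# The 'grass' example from the text: assigned to 10 images, 4 judged relevant -> 0.4.
system = [(i, "grass") for i in range(10)]
judged = {(i, "grass") for i in range(4)}
print(type1_accuracy(system, judged))                       # {'grass': 0.4}

# The Type II example: human assigns {i, j}; system assigns {i, k, l, m, n}.
print(type2_accuracy({"i", "k", "l", "m", "n"}, {"i", "j"}))
```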

3.3. Data analysis

Next we need to statistically analyse the collected data. The aim is to measure the correlation between

different human judgments or annotations and to measure the level of significance of these results. Analysis

of inter-informant agreement and levels of significance is generally lacking in related work and a systematic approach to this problem in the context of content-based image retrieval has not previously been reported.

Two statistical tools are used to answer the research questions of Section 3.1. They are Pearson Product-

Moment Correlation Coefficient and the t-test (Pagano, 2001). The Pearson product-moment correlation

coefficient is the most widely used measure of correlation. It is a measure of the degree of relationship be-

tween two variables, i.e. we can know whether or not one can predict another. The t-test is typically used to

compare the means of two populations and determine whether or not the means in two sample populations

are significantly different.

Therefore, the correlation coefficient can measure the consistency of human judgments and annotations, which can answer Research Question 1 of both evaluation methods. That is, the average result of correlation for each pair of the judgments and/or annotations can show the reliability and consistency of user-centred evaluation.
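For instance, the average pairwise correlation could be computed as in the sketch below (our own illustration with made-up per-image judgment counts; it assumes SciPy is available):

```python
from itertools import combinations
from scipy.stats import pearsonr

# Hypothetical data: for each judge, the number of keywords judged relevant per image.
judgments = {
    "judge1": [3, 4, 2, 5, 1, 4, 3, 2],
    "judge2": [2, 4, 2, 4, 1, 3, 3, 2],
    "judge3": [3, 5, 1, 5, 2, 4, 2, 2],
}

# Average the Pearson correlation coefficient over every pair of judges.
rs = [pearsonr(judgments[a], judgments[b])[0] for a, b in combinations(judgments, 2)]
print("mean pairwise r =", sum(rs) / len(rs))
```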

For answering Research Question 2, the distribution t-test can be used to assess the result of two differ-

ent approaches (Type I: the annotation system vs. random guessing; Type II: the annotation system vs.

human annotations) for the confidence level of data distribution. The following hypotheses correspond to

the research questions of Types I and II evaluations. Note that for Research Question 3 of both evaluation


methods, a null hypothesis is not made because we do not assume which classes obtain most agreement

between two systems in this study.

3.3.1. Hypothesis testing for Type I evaluation

• Null hypothesis 1 for Question 1 (Type I): the judgments are not correlated with a low level of signifi-

cance. That is, we intend first of all to test whether the human judgments are consistent or correlated to a certain degree using the correlation coefficient analysis. In addition, we believe that the judgments are cer-

tainly correlated with a high level of significance.

• Null hypothesis 2 for Question 2 (Type I): the system does not outperform the random guessing

approach. That is, we assume that the results of our system outperform the random guessing approach

within a high level of significance.

3.3.2. Hypothesis testing for Type II evaluation

• Null hypothesis 1 for Question 1 (Type II): the human annotations are not correlated with a low level of

significance. To test this hypothesis, we assume that the human annotations are certainly correlated with

a high level of significance via the correlation coefficient analysis.

• Null hypothesis 2 for Question 2 (Type II): the annotation results of the system are not compatible with

human annotations. That is, we hypothesise that the system annotations are similar to human annotations within a high level of significance.

4. An experimental example

4.1. Type I evaluation: human assessments for the annotation results

4.1.1. The judges

We asked five judges (PhD research students) who are not experts in image indexing and retrieval to de-

cide whether the keywords which are assigned by our system and the random guessing approach are rele-

vant to that image. There were three male judges and two females who were all English first language

speakers.

4.1.2. The test set

We considered two datasets. One was the Corel image collection and the other was supplied by the University of Washington. 1 Our prototype system, CLAIRE (CLAssifying Images for REtrieval), is imple-

mented based on a two-level learning framework. Colour and texture classifiers are used for low-level clas-

sification as a first-level learning machine and a high-level concept classifier which learns from the outputs

of the first-level classifiers is used for the final decisions (image annotation) as the second-level learning de-

vice (Tsai, McGarry, & Tait, 2004). In addition, each image is first resized to a 128 × 128 pixel resolution

and then partitioned into five equal-sized patches based on the tiling scheme shown in Fig. 2. That is, each

image contains four tiles corresponding to the four quadrants of the image and one tile for the centre subimage. This scheme was adopted because of the expectation that one of the major subjects of interest in a

1 Available at: http://www.cs.washington.edu/research/imagedatabase/groundtruth/.


Fig. 2. The tiling scheme.


photograph is usually placed at or close to the centre of the image. Each tile subimage is run through the

learning machines separately, so each image has five keywords assigned by CLAIRE.
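The tiling step can be sketched as follows (our own illustration using the Pillow library; the function name and the 64 × 64 tile size are assumptions consistent with a 128 × 128 image):

```python
from PIL import Image

def five_tiles(path):
    """Resize an image to 128x128 and return the four quadrant tiles
    plus the centre subimage, each 64x64 pixels."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    return [
        img.crop((0, 0, 64, 64)),       # top-left quadrant
        img.crop((64, 0, 128, 64)),     # top-right quadrant
        img.crop((0, 64, 64, 128)),     # bottom-left quadrant
        img.crop((64, 64, 128, 128)),   # bottom-right quadrant
        img.crop((32, 32, 96, 96)),     # centre subimage
    ]

# Each tile would be classified separately by the colour/texture classifiers,
# so that every image ends up with five assigned keywords.
```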

In these experiments CLAIRE is trained on an extract from Corel to assign 60 keywords to unseen

images, such as sky, tree, grass, building, etc. In particular, the trained learning machines assign one of

the 60 keywords to each of the five tiles in the unseen image. Note that we did not consider the pre-defined conceptual categories of Corel for training and testing in this test. The testing images are outside of the

training examples and composed of the two datasets. At the beginning, we manually selected 800 images

whose contents reflect the 60 keywords, in which 650 images are from the first dataset (Corel) and 150

images from the second (Washington). Each of the 60 keywords is assigned to at least 10 images out of

the 800. Then, we randomly selected 60 images from the 800 images as the test set for human judgments.

Next, 40 images out of these 60 were annotated with keywords by CLAIRE and 20 were assigned keywords

by the random guessing approach. Note that the five judges do not know which images were processed by

which approach.

CLAIRE has been developed as a step towards an integrated content-based image retrieval system in

which users are able to query an image database using a combination of analytic keyword queries, simi-

larity searching and browsing. An assumption of our framework is that users of such a system will be happy

with automatic indexing strategies which focus on achieving high recall in response to (initial) keyword que-

ries. This is because, compared to text, dealing with irrelevant images (we posit) requires a low cognitive

and interaction load if the system provides an interface incorporating thumbnail presentation of images,

the use of spatial metaphors, query by (multiple) image example and so on. An exploration of these

assumptions goes well beyond this paper and forms part of our future research programme.

However, because we are trying to obtain high recall through automatic indexing, it is more important to

annotate an image with a relevant keyword than it is to assign an irrelevant keyword to that image. We

therefore report as successful any annotation of any image with a relevant keyword, regardless of the num-

ber of irrelevant keywords assigned. Of course this approach cannot be taken to the limit: one could, for

example, in principle assign all the keywords in the system's vocabulary to every image.

The application of our evaluation framework here never assigns more than five keywords to any image,

so even given the limited scale of vocabulary used in the experiments reported in this paper, only a small

proportion of keywords are assigned.

In a sense the restriction on the number of keywords assigned is acting as a balancing precision enhanc-

ing measure. However the use of evaluation frameworks which fully balance precision and recall, like

reporting at the 11 standard points of recall (Hersh, Buckley, Leone, & Hickam, 1994), precision at 10

(Xu & Croft, 1998), van Rijsbergen's F-measure and so on (as are commonly adopted in text retrieval)

go well beyond any evaluations reported in Section 2, and would require further development of our eval-

uation framework. We will return to this issue in the conclusion.

It is worth noting that in this test CLAIRE only assigns 47 of the possible 60 keywords to any image in

the experimental set.

One further point needs to be made about our approach to indexing images by automatically annotating

them with keywords. We have formulated the problem of accurately assigning keywords as the task of

correctly identifying that class of images to which a human indexer is likely to find a particular keyword


relevant. Having formulated the task in this way it is often convenient to discuss our experiments and their

results using the language and metrics of automatic classification. We are conscious of sometimes sliding

between the language of automatic classification and that of automatic indexing where one or the other

seems to aid comprehension or readability. We hope the reader will find that acceptable.

4.1.3. The assessment tool and rules

We designed a simple interface shown in Fig. 3, a Microsoft Access database form, for the judges to eval-

uate whether the images have any relevant keywords assigned. The judges were asked to tick one or more of

the five selection boxes corresponding to five keywords per image if the keyword(s) are relevant to their

associated images; otherwise, they are asked to move to the next image for more judgments. They can also

go back to change their selections if any. As described above, our system assigns five keywords to an image,

which correspond to each of the five tiles. Therefore, if some of the five keywords were duplicated and rel-

evant, they were asked to select all of them. For example, in Fig. 3 if any one of the five judges thinks that grass is relevant to the image, the third and fifth selection boxes corresponding to grass will be selected.

That is, if the two grass keywords are selected, the system is thought to be able to assign 2 relevant key-

words out of 5 (i.e. 40% accuracy) in this case, but 20% (1 relevant keyword out of 5) if only one occurrence

of the keyword was considered. In retrospect it might have been useful to adopt other approaches to nor-

malising for the effects of duplicated system generated keywords. However, in practice the alternatives seem

unlikely to substantially change the results. To complete this assessment, each judge spent about 10–15 min.

4.1.4. Results

As described in Section 3.2.4, the collected data/materials are quantised for rates of classification accu-

racy or precision. For example, in Fig. 3 if a judge selects sky, grass, mountain, and grass (from the second

to fifth keywords) as relevant to the image, then the system assigns 4 relevant keywords out of 5 to the im-

age. The following shows the results of human judgments.

4.1.4.1. Human judgments for the number of relevant keywords assigned. Fig. 4 shows the number of auto-

matically assigned relevant keywords for each image of the test set as assessed by the judges. This result

shows something about the subjectivity and variability of human judgments. For example, Judge 1 is

Fig. 3. The interface for the judges to select relevant keyword(s) if any.


consistently more likely to judge a keyword as relevant than Judge 5. This illustrates the necessity of mea-

suring the correlation between the judgments as part of our claim for the Research Question 1/Null hypoth-

esis 1.

4.1.4.2. Result comparisons. Fig. 5 shows the results of our system and the random guessing approach from the judgments. The results show that CLAIRE has not only lower errors in the context of assigning five irrel-

evant keywords to an image, but also higher accuracies for assigning 1–5 relevant keywords to an image

than the random guessing approach.

4.1.4.3. Annotation accuracy and disagreement between different judges. Fig. 6 shows the rate of classification

accuracy for those classes or keywords for which there is an especially high level of variation between

the assessments from different human judges. In other words the keywords or classes for which selection(s)

(i.e. classification rates) are well correlated are omitted. As CLAIRE assigns 47 different keywords to the test set, there are 13 keywords about which the five judges have different opinions. Although human

judgments for image annotation are subjective, under the scale of 60-category classification these general

(non-professional) users agree with most of the keywords assigned to the images. The following statistical data analysis will measure the extent of (dis)agreements between different judges and the automatic indexing system.

[Fig. 4. The number of relevant keyword(s) of the test set selected by the judges: for each count of relevant keywords (0 to 5), the number of images given that count, plotted separately for Judges 1–5.]

[Fig. 5. Average numbers of images which have 0, 1, 2, 3, 4, and 5 relevant keywords assigned, shown as relevant/irrelevant rates for CLAIRE and the random guessing approach.]

[Fig. 6. The keywords which have particularly different selections from the judges: per-judge precision for the keywords tree, train, street, sky, sailing, ocean, mountain, ground, grass, clouds, cityscape, building, and boats.]

4.1.5. Data analysis

In this section we present two results from our analysis of the behaviour of the human judges in the Type I experiments. They are, first, the number of keywords assigned by each judge on the test set as shown in

Fig. 4 and, second, the actual keywords associated with each image, measured as classification accuracy or

precision, partly shown in Fig. 6. For the judgments about the number of relevant keywords over the test

set, we obtained r = 0.8385 across the five judges, which is significant at the 0.05 level. This shows there is a sig-

nificant (p > 0.95) level of correlation between the number of automatically assigned keywords judged as

relevant by different judges. For the particular keywords associated with each image, we obtained

r = 0.9468 across the five judges, which is significant at the 0.01 level. This demonstrates that there is a very sig-

nificant (p > 0.99) degree of correlation between the automatically assigned keywords judged as relevant by different judges. Therefore, the human judgments are highly correlated and thus consistent and reliable. As

a result, the Null hypothesis 1 (that the judgments are not correlated) is rejected.

The difference in performance of CLAIRE and the random guessing approach was assessed by using the

t-test. We obtained t = 2.77 > 2.306 (α = 0.05 and df = 4). This rejects the Null hypothesis 2, that there would be no difference in the human judges' assessment of the keywords assigned by CLAIRE and the random guessing approach. Therefore, there is statistically significant evidence that our system improves on or

outperforms the random guessing approach.
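A test of this kind could be run, for example, as a paired t-test over per-judge accuracy scores for the two approaches; the sketch below is our own illustration with hypothetical numbers (the paper does not state the exact form of the test), using SciPy.

```python
from scipy.stats import ttest_rel

# Hypothetical per-judge accuracy (proportion of assigned keywords judged relevant).
claire_accuracy   = [0.55, 0.48, 0.52, 0.60, 0.50]
guessing_accuracy = [0.18, 0.22, 0.15, 0.20, 0.19]

# Paired test across the same five judges (df = 4 for a paired design).
t_stat, p_value = ttest_rel(claire_accuracy, guessing_accuracy)
print(t_stat, p_value)   # reject the null hypothesis if p_value < 0.05 and t_stat > 0
```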

For Question 3, the qualitative data show that for most of the 60 keywords the five judges agree on whether or not a particular keyword is being assigned to relevant images. (Remember, however, in practice

CLAIRE only ever assigned 47 of the 60 keywords, but the random guessing approach may assign any of

the 60.) However, there are 13 keywords, shown in Fig. 6, for which there is significant variation between

the behaviour of the five judges. In addition, over half of the judges have different selections for the clouds,

grass, ground, and tree classes. That is, we obtained only r = 0.4965 for these four classes, but r = 0.8938 for the other

nine classes.

Although human judgments are subjective, the results imply that they are consistent under the scale of

the 60 categories. That is, (untrained) people do not have very variable opinions about the relevance of the 60 keywords corresponding to their images. Therefore, this evaluation can validate that our system im-

proves upon or outperforms the random guessing approach.


Table 2
The 150 keywords

100 concrete classes: agates, antelope, antique, balloon, beaches, bobsled, bonsai, botany, beads, building, buses, butterfly, cactus, cars, cards, castles, cats, children, churches, clothing, coast, cuisines, dessert, dogs, dogsled, doors, drinks, everglade, fabric, firearms, firework, flags, flora, flower, flower bed, foliage, fractals, fruit, graffiti, hawk, homes, horses, jewellery, lighthouse, machinery, mammals, men, marble, masks, minerals, monuments, mountain, mushroom, old dish, old doll, orchids, owls, palaces, penguin, perennial, pills, plants, polo, predator, primates, pub signs, puma, pyramids, race car, reptile, road, rock form, rodeo, roses, sail, sculpture, shells, stamps, steam engine, subsea, tall ship, texture, things, tools, train, tulips, valley, vegetable, volcano, war plane, water fall, waves, wildcats, wild bird, wild fish, wild goat, wild nests, whale, work ship, women.

50 abstract classes: architecture, autumn, aviation, ballet, barbecue, barnyard, battles, com. tech., couples, cruise, dawn, desert, estate, farm, fashion, festival, fitness, forests, fountain, game, gardens, glamour, golf, harbours, industry, interior, kitchen, leisure, market, nature, night, office, old works, parades, park, pastoral, rafting, ruins, rural, scene, space, sports, summer, sunsets, surfing, tropical, vineyard, waterway, wet sport, winter.

This discussion also shows how a Type I evaluation method can provide a powerful means of assessing

image retrieval performance and hence help enhance the level of image retrieval system effectiveness.

4.2. Type II evaluation: annotation comparison between the system and humans

4.2.1. The annotators

To assess consistency of evaluation, the five human subjects who participated in the Type I evaluation were asked

to annotate a given set of images. Note that these two types of evaluation were conducted at different times.

4.2.2. The test set

In this test the five human subjects were asked to annotate a given number of images with the 150 key-

words listed in Table 2. The selection of these 150 keywords is based on the corresponding categories on the

Corel dataset (CD1, 7, and 8). One third of the 150 categories were selected to be abstract concepts, such as

festival, parades, tropical, etc., and two thirds of the 150 categories were concrete concepts, such as car, but-

terfly, antelope, etc. We follow WordNet, 2 taking concrete concepts to be a physical object or entity and

abstract to be marked as an abstraction, human activity, or an assemblage of multiple physical objects or

entities. This gives a comparable but specific definition of different levels of high-level concepts as used in

2 Available at: http://www.cogsci.princeton.edu/~wn/.


previous work, like internal category structure (i.e. levels of categorisation) (Rosch, 1973), generic/specific/

abstract levels (Jorgensen, Jaimes, Benitez, & Chang, 2001) and Level 2 and 3 (Eakins, 2002) where Level 2

involves some degree of inference about the identity of objects and Level 3 involves complex reasoning

about the significance of the objects or scene depicted.
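One rough way to operationalise this WordNet-based split, assuming the NLTK WordNet interface (our illustration, not the tool used by the authors; synset identifiers depend on the WordNet version), is to check whether a keyword's first noun sense descends from physical entity or from abstraction:

```python
from nltk.corpus import wordnet as wn

PHYSICAL = wn.synset("physical_entity.n.01")
ABSTRACT = wn.synset("abstraction.n.06")   # 'abstract entity' in WordNet 3.0

def concept_type(word):
    """Classify a keyword as 'concrete' or 'abstract' from the hypernym
    closure of its first noun sense."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return "unknown"
    hypernyms = set(synsets[0].closure(lambda s: s.hypernyms()))
    if PHYSICAL in hypernyms:
        return "concrete"
    if ABSTRACT in hypernyms:
        return "abstract"
    return "unknown"

print(concept_type("butterfly"))   # expected: concrete
print(concept_type("festival"))    # expected: abstract
```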

For the experiment with the human subjects we manually selected 3000 unseen images in which each of the 150 keywords (or really here categories) has 20 images. Then, 30 sets of images including 24 reflecting

concrete concepts and 6 reflecting abstract concepts were randomly chosen from the 150 keywords. Next,

two images indexed by each of the 30 keywords were randomly selected to be the test set which therefore

contains 60 images.

For system annotation, CLAIRE was first trained using the 150 categories, so each category has 30

training examples. Note that these categories do not include the Corel classes for which Muller, Marc-

hand-Maillet, and Pun (2002) would predict problematically good performance. Next, the chosen test set

containing the 60 images is used to test the trained system for annotation.

4.2.3. The annotation tool and rules

Similar to the assessment tool, Fig. 7 shows the interface for human annotations. During annotation, the

150 keywords shown in Table 2 were provided on paper to help the subjects to annotate the 60 images. Five

blanks associated with each image allow the subjects to annotate images with this controlled set of key-

words. The annotation requirement is to assign at least two relevant keywords to each image and so the

maximum is five keywords per image. Subjects can go back to change their annotations; otherwise, they are asked to move on to the next annotation. On average, every annotator took about 40 min to finish this task, spending about 45 s per image.

4.2.4. Results

The collected data and material, i.e. human annotations, are used as a set of ground truths in which

annotations of each subject are somewhat different and treated individually. The keyword annotation re-

sults of CLAIRE were compared with the annotations of each subject in turn. Following the quantisation

method for this comparison described in Section 3.2.4, Fig. 8 shows the annotation result of the system

compared with the five sets of human annotation. On average, our system produces 22.2% annotation accuracy under the scale of 150 categories, which is composed of 21.6% and 20.83% annotation accuracy for

concrete and abstract keywords respectively.

Fig. 7. The interface for subjects to annotate images.

[Fig. 8. Annotation accuracy of only/at least one relevant keyword(s) per image, for each of the five judges' annotation sets (accuracies in the range of roughly 0–30%).]

This graph also shows that each general user has his/her own opinion about the relevance of some key-

words to particular images. Further, in fact these human annotations must be somewhat different from the

ground truth of the Corel dataset since our human subjects were asked to provide multiple keywords, whereas Corel would provide only one per image (in effect). Furthermore, in some cases the human subjects did not select the Corel category at all. For example, there is an image which belongs to the game category of Corel,

but four of the five judges assign parades and three judges out of the five assign children and others to the

image.

On average, the human subjects assigned 3.07 keywords to an image. When we examine the agreement

between the human annotations, most judges (i.e. at least three) assign the same 0.93 abstract keywords and

1.4 concrete keywords to each of the 60 images. It is interesting that the proportion of the randomly chosen

abstract and concrete keywords in the test set is 1:4 (i.e. 6 abstract vs. 24 concrete keywords) from the Corel dataset shown in Table 2, but the proportion of abstract to concrete keywords assigned by at least 3 subjects is 17:37 (i.e. near 1:2), which is very different from the Corel ground truth of 1:4. This also shows

that human subjects may not agree with the pre-defined categories of Corel.

For many practical image retrieval applications users' views about the relevance or otherwise of a par-

ticular keyword to a particular image are more important than the perhaps differing views of professional

indexers. The existence of a difference is illustrated by the gap between the assessments of our users and the

Corel categories just noted. A general user may not find the Corel categories especially useful since they

appear to poorly reflect intuitive notions of relevance. This is of great importance in automatic image index-

ing research, since it suggests that a system which poorly mimics Corel in terms of categorisation behaviour might in fact do a good job of reflecting more intuitive notions of the relevance of keywords.

Therefore this implies that using some chosen ground truth dataset, or certainly at least one in which

each image is associated with a single keyword or category (like Corel) is insufficient to validate image

annotation systems as it does not adequately reflect human indexing behaviour. This leads to the need

for additional user-centred evaluation to fully assess and understand the performance of these systems,

which is our claim for the proposed evaluation methodology in this paper.

4.2.5. Data analysis

According to the comparison results shown in Fig. 8, for the correlation coefficient of considering only/at

least one relevant keyword assigned by the system, we obtained r = 0.678 from the human annotations

which is significant at the 0.01 level (p > 0.99). (For considering the relevance of the five assigned keywords

per image, we obtained r = 0.709 from the human annotations which is also significant at the 0.01 level

(p > 0.99).) The results show that the human annotations are moderately correlated when using 150 con-

trolled keywords.


As Fig. 8 shows the system annotation performance based on human annotation, our system is not com-

patible with human annotation when using a controlled vocabulary of 150 keywords. That is, the difference

in performance of CLAIRE and the human annotations was assessed by using the t-test. We obtained

t = −39.575 (α = 0.00 and df = 4). Therefore, we retain Null Hypothesis 2.

For Question 3, considering issues of inter-annotator agreement amongst human annotators, there are

54 keywords for which there was a clear majority view (i.e. three out of five judges) about the images to

which they were relevant. Therefore, for our controlled set of 150 keywords, there was broad agreement

about the images to which they applied.

5. Discussion

These results, especially those of Section 4.2.5 show that even a state of the art automatic image indexing

system like CLAIRE cannot match the performance of human annotators in terms of annotation accuracy,

especially when the classification scale (i.e. number of words in the indexing vocabulary) increases. How-

ever a closer inspection of these results indicates that the performance actually achieved may be useful

in the context of building practical image retrieval systems which can take initial keyword queries. For

example, presenting 20 thumbnail images on an initial query result screen appears not to be too many.

On average, around four relevant images should be presented, and this would be a useful basis on which

to perform relevance feedback as a query refinement process.

Looking over the complete set of results obtained from our two types of evaluation, some other interest-

ing issues come to light. By examining the correlation results of Type I evaluation, the scale of the keyword

vocabulary and the evaluation methodology are the key factors affecting the correlation between human

judgments of relevance and (more or less) spontaneous annotations.

First, as the number of keywords becomes larger, human subjects may have to make more difficult

choices in deciding the annotations to pick for an image. This could decrease the correlation level between

human judgments as well as annotations. In other words, the more fine-grained the human annotators' decisions have to be in assigning a fixed number of keywords to a given image, the harder it is to replicate the set of keywords chosen (whether automatically or manually) and the more likely it is that a subsequent

judge will disagree with the relevance of a particular keyword for a particular image.

Second, and on the other hand, the Type I evaluation method, i.e. directly assessing the system's outputs for the relevance of the keywords assigned, could increase the assessed effectiveness of an image annotation

system compared to a Type II evaluation, as happened here. This is because an automatic image indexing

system cannot match increasingly variable human keyword selections as the number of available keywords

increases, whilst the keywords actually assigned may still be judged relevant to the image, even when they are not the keywords which the judge would have selected.

This last point is extremely important. It may be that the current apparent ceiling in automatic image

indexing system performance, in which keyword vocabularies or numbers of categories can only be in-

creased with corresponding decreases in assignment accuracy, may be an artefact of the evaluation task.

6. Conclusion

Evaluation is a critical issue for information retrieval, and to fully understand the performance of IR systems it is necessary to consider both system- and user-centred evaluations. Image retrieval has become

an active research area, and in image retrieval, much of the current research effort is focused on automat-

ically annotating or indexing images to facilitate search in image databases. Most of the existing automatic

Most of the existing automatic image annotation systems are evaluated against their annotation or classification accuracy on some chosen ground truth dataset, such as Corel, and such datasets are not necessarily ideal test collections. Since there is currently no standard image dataset for evaluation, full human-centred assessments are required, but these are difficult to design and expensive to administer.

We have presented a qualitative evaluation methodology for systems which automatically assign keywords to images. It is composed of two evaluation methods: the first is based on a human assessment of annotation accuracy, and the second on the construction of a comparable pre-defined ground truth for further evaluation. Most systems or approaches reported in the literature use only one of these two methods, making it difficult to assess the extent to which the results can be extrapolated to real annotation and retrieval situations. In this paper, we show that this two-stage human-centred evaluation methodology provides deeper understanding of not only the performance of an image annotation system but also the consistency of human judgments about the relevance of high-level concept terms to images. The combination allows us to draw well-founded conclusions from relatively simple and modest-scale human-centred evaluations.

Turning to the CLAIRE system that we used as our example for evaluation: according to the first evaluation method (assessment of results by humans), the performance of CLAIRE is promising. For a fixed vocabulary of 150 index terms, the system assigns rather different keywords from those chosen by the human annotators. However, this is an assessment of indexing results rather than of retrieval results. Using the user assessment of the image annotation results (reported in Section 4.1.4), we have indications that the CLAIRE system was sufficiently accurate at assigning keywords to be potentially useful for practical image retrieval, provided it was combined with techniques such as thumbnail browsing, relevance feedback, or other querying and browsing techniques. Our evaluation methodology allowed us to draw this conclusion despite the fact that the system's indexing did not successfully emulate human indexing performance.

The application of the evaluation methodology reported in this paper must be regarded as an initial step towards a more systematic and comprehensive evaluation of image retrieval systems. Amongst areas which require further exploration are: precision of automatic indexing versus precision in querying; appropriate baselines against which to measure performance; comfort and usability of automatic indexing strategies in the context of integrated image retrieval systems; differences between individual searcher behaviour and preferences according to task and context; and the production of standardised test collections and judged query sets against which indexing strategies may be assessed in isolation.

At present we are working to embed the CLAIRE indexing engine in a complete image retrieval system to explore some of these issues.

This study indicates that combining different evaluation strategies can produce new results and a deeper understanding of system performance. We hope that future systems will be assessed robustly by following a detailed evaluation methodology such as the one proposed in this paper.

Acknowledgement

The authors would like to thank Chris Stokoe, James Malone, Sheila Garfield, Mark Elshaw, and Jean Davison for participating in the system evaluation.
