Information Processing and Management 42 (2006) 136–154
www.elsevier.com/locate/infoproman
Qualitative evaluation of automatic assignment of keywords to images
Chih-Fong Tsai *, Ken McGarry, John Tait
School of Computing and Technology, University of Sunderland, Sunderland SR6 0DD, UK
Received 6 July 2004; accepted 1 November 2004
Available online 10 December 2004
Abstract
In image retrieval, most systems lack user-centred evaluation since they are assessed by some chosen ground truth
dataset. The results reported through precision and recall assessed against the ground truth are thought of as being
an acceptable surrogate for the judgment of real users. Much current research focuses on automatically assigning
keywords to images for enhancing retrieval effectiveness. However, evaluation methods are usually based on system-level
assessment, e.g. classification accuracy based on some chosen ground truth dataset. In this paper, we present a qualitative
evaluation methodology for automatic image indexing systems. The automatic indexing task is formulated as one
of image annotation, or automatic metadata generation for images. The evaluation is composed of two individual
methods. First, the automatic indexing annotation results are assessed by human subjects. Second, the subjects are asked
to annotate some chosen images as the test set whose annotations are used as ground truth. Then, the system is tested by
the test set whose annotation results are judged against the ground truth. Only one of these methods is reported for most
systems on which user-centred evaluation is conducted. We believe that both methods need to be considered for full
evaluation. We also provide an example evaluation of our system based on this methodology. According to this study,
our proposed evaluation methodology is able to provide a deeper understanding of the system's performance.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Qualitative evaluation; Image annotation; Image retrieval; Statistical analysis
1. Introduction
Evaluation is a critical issue for Information Retrieval (IR). Assessment of the performance or the value
of an IR system for its intended task is one of the distinguishing features of the subject. The type of
0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2004.11.001
* Corresponding author.
E-mail addresses: [email protected] (C.-F. Tsai), [email protected] (K. McGarry), [email protected] (J. Tait).
evaluation to be considered depends on the objectives of the retrieval system. In general, retrieval performance evaluation is based on a test reference collection, e.g. TREC, and on an evaluation measure, e.g.
precision and recall (Baeza-Yates & Ribeiro-Neto, 1999).
Saracevic (1995) reviews the history and nature of evaluation in IR and describes six different levels of IR
evaluation from system to user levels. However, most IR evaluations are only based on the system level(s) and lack user-centred evaluation. To achieve a more comprehensive picture of IR performance and users' needs, both system- and user-centred evaluations are needed. That is, we need to evaluate at different levels
as appropriate and/or against different types of relevance (Dunlop, 2000). Examples of some recent studies
focusing on user judgments are Belkin et al. (2001), Hersh et al. (2001), and Spink (2002).
Due to the advances in computing and multimedia technologies, the size of image collections is increasing rapidly. Content-Based Image Retrieval (CBIR) has been an active research area for the last decade
whose main goal is to design mechanisms for searching large image collections. Similar to traditional
IR, studies on user issues of image retrieval are lacking (Fidel, 1997; Rasmussen, 1997).

Current CBIR systems index and retrieve images based on their low-level features, such as colour, texture, and shape, and it is difficult to find desired images based on these low-level features, because they have
no direct correspondence to high-level concepts in humans' minds. This is the so-called semantic gap problem. Bridging the semantic gap in image retrieval has attracted much work, generally focussing on making
systems more intelligent and automatically understanding image contents in terms of high-level concepts
(Eakins, 2002). Image annotation systems, i.e. automatic assignment of one or multiple keywords to an image, have been developed for this purpose (Barnard et al., 2003; Kuroda & Hagiwara, 2002; Li & Wang,
2003; Park, Lee, & Kim, 2004; Tsai, McGarry, & Tait, 2003; Vailaya, Figueiredo, Jain, & Zhang, 2001).

To evaluate the annotation results, most of these systems rely only on some chosen dataset with
ground truth, such as Corel. However, the problem is that currently there is no standard image dataset
for evaluation, like the web track of TREC for IR (Craswell, Hawking, Wilkinson, & Wu, 2003). As IR
systems also need to consider human subjects for evaluation, quantitative evaluation of current annotation
systems is insufficient to validate their performance. Therefore, user-centred evaluation of image annotation systems is also necessary.
This paper is organised as follows. Section 2 reviews related work on conducting qualitative evaluation
for image retrieval related algorithms, systems, etc. Section 3 presents our qualitative evaluation methodology for image annotation systems. Section 4 shows an example of assessing our image annotation system
based on the proposed methodology. Section 5 provides some discussion of the user-centred evaluation.
Finally, some conclusions are drawn in Section 6.
2. Related work
For human assessment of image retrieval systems, the general approach is to ask human subjects to directly evaluate the systems' outputs. For example, a questionnaire can be devised to ask the human judges to
rank the level of preference for each specific retrieved image, or the ease with which they were able to find
desired images. For image annotation, keywords associated with their images can be selected as relevant or
irrelevant by the judges. Then, conclusions can be drawn from the analysis of the qualitative data gathered.
An alternative approach is to ask human subjects to decide their desired outputs by manual processing
on a given dataset. That is, people are asked to choose relevant images for some specific queries according to their subjective opinions. In the case of image annotation, people might be asked to annotate a given set of images. If the results show a certain degree of consistency, they can be treated as ground truth. Then, the system can be tested by the same dataset given to the human subjects and its outputs can be compared with
the ground truth. Note that the difference between a ground truth dataset such as Corel and the human annotations gathered here is that the former is based on some unknown selected professional indexers, while the
latter can be naïve or real amateur users. In other words, one obtains a different 'ground truth' depending on whether the data is derived from the viewpoint of a professional indexer, or from the viewpoints of more general classes of users. In addition, we believe that the latter ground truth could much better represent general users' opinions.

We classify qualitative evaluation of image retrieval related work into four research domains. They are
CBIR and/or relevance feedback algorithms/systems, image annotation algorithms/systems, user needs/
behaviour in searching, and others such as edge detection and image segmentation. In addition, the papers
describing their evaluation can be further classified into the two types of evaluation methods described
above. Table 1 shows the results, where Type I represents 'results assessment by humans' and Type II 'pre-defining a ground truth dataset by humans for further testing or evaluation'. Note that we are not interested in the technical issues of these works, so they are not described here.
According to this table, it is clear that most qualitative evaluation in literature is based on the Type I
method to directly assess the results/outputs of their systems to draw conclusions. Few studies consider the Type II method to compare their systems' outputs with the correct ones as defined by some human subjects in advance.
For the domain of CBIR and/or relevance feedback, users are generally asked to evaluate the retrieval
results of these systems for validation, but few ask some chosen human subjects to pre-define the retrieval
results for some given queries against which to assess the systems' retrieval results. None of them consider both evaluation
methods at the same time.
Table 1
A comparison of related qualitative evaluation methods for image retrieval

CBIR and/or relevance feedback
  Type I: Barnard and Shirahatti (SPIE'03); Ciocca and Schettini (IP&M'99); Cox et al. (IEEE TOIP'00); Gudivada and Raghavan (IP&M'97); Markkula et al. (IR'01); McDonald et al. (SIGIR'01); Mehtre et al. (IP&M'98); Muller et al. (MDM/KDD'00); Rodden et al. (SIGCHI'01); Sanchez et al. (IP&M'03)
  Type II: Black et al. (CIVR'02); Han and Myaeng (SIGIR'96); Markkula et al. (IR'01); Minka and Picard (PR'97); Sanchez et al. (IP&M'03); Squire and Pun (PR&M'98); Wu and Narasimhalu (IP&M'98)

Image annotation
  Type I: Barnard and Shirahatti (SPIE'03); Wang et al. (ICIP'03)
  Type II: Black et al. (CIVR'02); Gorkani and Picard (ICPR'94); Liu et al. (CVPR'01); Wang et al. (ICIP'03)

User needs and/or behaviour in searching, etc.
  Type I: Armitage and Enser (JOIS'97); Chen, H.-L. (IP&M'01); Choi and Rasmussen (IP&M'02); Efthimiadis and Fidel (SIGIR'00); Goodrum and Spink (IP&M'01); Jorgensen (IP&M'98); Jose et al. (SIGIR'98); Markkula and Sormunen (IR'00); McDonald and Tait (SIGIR'03)
  Type II: Conniss et al. (2000); Saracevic et al. (ASIS'90); Xie (ASIS'97), etc.

Others (edge detection, image segmentation, etc.)
  Type I: Heath et al. (CVIU'98); Martin et al. (ICCV'01); Shaffrey et al. (ACIVS'02)
  Type II: Martin et al. (ICCV'01)
It should be noted that studies of user needs/behaviour in search do not necessarily conduct Type II eval-
uation since their goal is to understand user behaviour in the context of information searching rather than
to assess systems in terms of retrieval effectiveness. Some studies, which aim to model information seeking
behaviour, focus on contextual studies of people in the workplace without considering a front-end system
(Conniss, Ashford, & Graham, 2000).

For the domain of edge detection, image segmentation, etc., few systems are assessed by independent users. Generally, success in identifying pre-determined regions of interest is assessed. Only Martin,
Fowlkes, Tal, and Malik (2001) consider both evaluation methods. They found that different human segmentations of the same image are highly consistent.
For the domain of image annotation systems, which is the main focus of this paper, although a number
of image annotation systems have been reported in the literature, most of them are evaluated by using some
chosen ground truth datasets for analysing classification accuracy. However, they generally lack user-centred evaluation. Therefore, it is hard to draw a valid general conclusion for image retrieval from this data.

Only Wang, Li, and Lin (2003) consider both evaluation methods. However, for their Type II method
only the first author pre-assigned some keywords to four images as the ground truth test set to compare the
system's performance. Therefore, the data collection strategy is neither very objective nor reliable. Moreover,
they did not report how reliable and consistent the judgments for the Type I method were, and of course this is not available for their Type II method. If the judgments are not consistent, the validation may not be reliable.
In conclusion, for evaluating an image annotation system by human subjects there is no work that considers both Type I and II evaluation methods with statistical analysis of both the consistency and the significance level of the results. That is, while conducting qualitative evaluation, no answers are produced to the
questions: ‘‘how reliable are the assessments of the automatic annotation results?’’ and ‘‘how consistent are the
human annotations with the automatic indexing annotations?''. This is why our evaluation methodology is proposed to validate our image annotation system, and why it could be considered for future image
annotation systems.
3. The evaluation methodology
The conclusion of Section 2 motivates the proposal of a user-centred evaluation methodology for existing image annotation systems in terms of effectiveness, i.e. the quality and accuracy of image annotation. Fig. 1 shows
the evaluation procedure. It is composed of the Type I and Type II evaluation methods described above.
Both types of evaluation contain three steps: research question formulation, data collection, and data analysis. These steps can provide different kinds of understanding of an image annotation system. For the
[Figure: flow diagram. The system outputs (image annotations) feed both the Type I evaluation (human judgments: research questions for human judgments, data collection from human judgments, results assessment) and the Type II evaluation (human annotations: research questions for human annotations, data collection from human annotations, results comparison), each followed by data analysis with statistical measures.]

Fig. 1. The evaluation procedure.
Type I evaluation method, the system is tested against some chosen dataset and the annotation results are
assessed by some human subjects, thus 'results assessment'. For the Type II evaluation method, some
human subjects are asked to annotate a chosen set of images and then, the system is tested by annotating
this set of images and comparing the results with the human annotations, thus 'results comparison'. The following subsections describe these three steps.
3.1. Research questions
To conduct qualitative evaluation for image annotation systems, the first step is to formulate some research questions to collect appropriate data and materials for further analyses. Depending on the research
aims and objectives, different research questions can be answered. The main research questions of the Type
I and Type II evaluation methods are based on first examining the reliability (i.e. consistency) of human
judgments (Type I) and annotations (Type II). This is because if data collected from human performance are not consistent to a certain degree, it is difficult to design a reliable validation scheme for a system. We
think that this is the first and a required step to conduct user-centred (qualitative) evaluation. Then, the
second research questions of both evaluation methods attempt to understand the system performance by
some comparisons. The final questions we are interested in are the agreement of human judgments and
annotations in terms of concept- or keyword-based image annotation.
3.1.1. Research questions for Type I evaluation
• Question 1: are the judgments correlated and consistent?
The first research question for Type I evaluation, 'results assessment', is to see how well human judgments are correlated with system annotation results. In order to make this evaluation reliable, we also need to assess how consistent humans are in their judgments of the relevance of the assigned annotations.

• Question 2: does the system outperform the baseline?
A random guessing approach is used as the baseline in this work. This randomly assigns keyword annotations to images. We think that an image annotation system should outperform the random guessing approach as an absolute minimum. Currently there are a number of automatic image annotation systems
reported in the literature, but no agreed baseline for comparison. In the future a more demanding baseline
reflecting the performance of real systems could be adopted without changing the evaluation framework,
but it is difficult to establish what such a more demanding baseline should be right now.
• Question 3: which class(es) obtain most agreement between the system annotations and human judgments
and which show little agreement if any?
Human judges might, for example, accept the system's assignment of annotation keywords corresponding to concrete objects (grass, trees, car) but be more likely to disagree with the assignment of more abstract annotations (festival, dance, happy). Further, might there be more disagreement between human judgments about the appropriate assignment of these abstract annotations?
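As an illustration, the random-guessing baseline described in Question 2 can be sketched in a few lines. The function name and use of Python's `random.sample` are our own choices; the 60-keyword vocabulary and five keywords per image follow the experimental setup of Section 4:

```python
import random

def random_baseline(vocabulary, n_keywords=5, seed=None):
    """Baseline annotator: draw n_keywords distinct keywords uniformly
    at random from the controlled vocabulary, ignoring image content."""
    rng = random.Random(seed)
    return rng.sample(vocabulary, n_keywords)

# e.g. with a 60-keyword vocabulary, as in the experiments of Section 4:
vocab = ["kw%d" % i for i in range(60)]
annotation = random_baseline(vocab, n_keywords=5, seed=0)
```

Fixing the seed makes the baseline reproducible across evaluation runs.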
3.1.2. Research questions for Type II evaluation
• Question 1: are the human annotations consistent?
The first research question for Type II evaluation, 'results comparison', is to determine whether and to what
extent different human judges, and the same human judges on different occasions, produce the same annotations for a set of images.
![Page 6: Qualitative evaluation of automatic assignment of keywords to images](https://reader031.vdocuments.mx/reader031/viewer/2022020604/575073871a28abdd2e8fea2c/html5/thumbnails/6.jpg)
C.-F. Tsai et al. / Information Processing and Management 42 (2006) 136–154 141
• Question 2: do the annotation results of the system show compatibility with the benchmark?
The benchmark is based on the human annotations. This research question asks whether the system performs similarly to the benchmark, i.e. whether the system assigns similar annotations to those of humans.
• Question 3: which class(es) obtain most agreement between the benchmark and the system performance, and which show little agreement if any?

The question here is whether there are some classes (perhaps concrete keywords) for which the system has high levels of agreement with the benchmark annotations, and others (perhaps abstract ones) in which such high levels of agreement are not achieved.

To answer these questions, some null hypotheses can be formulated which are the reverse
of what we believe. Then, some statistical measures can be used such as the t-test to test the hypotheses for
data analysis. Section 3.3 describes this issue in detail.
3.2. Data collection
3.2.1. The judges
To evaluate image annotation systems, we believe that any judges who can recognise the association between image contents and keywords are qualified. That is, when looking at an image, acceptable subjects
only need to reliably link its content to some relevant keywords to judge the relevance of the keywords assigned by the system. However, cultural issues and the subject's background and expertise may affect the
judgments. Therefore, we consider using native speaker subjects who are not experts in image indexing
as the focus group for evaluation.
The appropriate number of human subjects, however, may vary with the experiments' goals. As our goal is to evaluate image annotation results rather than, for example, users' behaviour in searching (Schamber, 1994) or the level of satisfaction with retrieval results (Applegate, 1993), our initial study focuses on a small group of controlled human subjects. This makes it easier to maintain a consistent experimental setting and procedure.
Clearly, a larger sample would be needed if we wished to make general claims about user behaviour.
3.2.2. The test set
The size of the test set for qualitative evaluation should not be so large as to make the manual annotation
and assessment work excessively labour intensive. For example, people may be unwilling to assess a very large
number of images to determine whether those images have relevant associated keywords. In addition, their
judgments or annotations may be affected by long assessment or annotation sessions (Black Jr., Fahmy, & Panchanathan, 2002). On the other hand, a very small test set, say five images, used to test a 100-category classification system may not be adequate, since it is very difficult to ensure it is a representative sample
of the full test set, or that there are enough judgments or annotation decisions to assess statistical significance, unless
a large group of subjects is used. For example, the Type II method of Wang et al. (2003) suffers from this problem of an overly small sample size, making it difficult to reliably draw conclusions from the experimental results.
Although it depends on the task, we think that the range from 50 to 100 images should be appropriate
for an initial study while making the human judgments or annotations reasonably consistent. If these numbers were too large, our analysis would show inconsistent results, although clearly one would need to undertake experiments with different numbers of images to determine whether an overly small sample is the primary cause of the inconsistency, and this has not been done in the work reported here.
3.2.3. The tools and rules for human judgments and annotations
Once the test set is chosen, a tool or interface needs to be provided for the human subjects to assess the
system's outputs and annotate images for the Type I and II methods respectively. It should be as simple as
possible to avoid compounding factors from interface issues (for example ambiguous command button
labelling, confusing screen layout and so on). In addition, how to assess the results and annotate images
should be defined before collecting the data. That is, ‘‘what keywords can be thought of as relevant (Type
I)?’’ and ‘‘which and how many relevant keywords can be assigned to images (Type II)?’’
3.2.4. Data representation and quantisation
As the data are collected qualitatively, i.e. relevant or perhaps irrelevant keywords associated with their
images are selected (Type I) or images are annotated (Type II), the next step is to consider the data quantisation for qualitative data representations.
• For the Type I evaluation method, the rate of classification accuracy is measured by

  Σ m_i / Σ N_i

  where Σ m_i and Σ N_i denote the total number of assignments of keyword i selected as relevant and the total number of assignments of keyword i made by the system, respectively. For example, if the system assigns the keyword grass to 10 images from the test set and the judge selects 4 of those 10 assignments as relevant, then the system has 40% annotation accuracy for grass.
• For the Type II evaluation method (i.e. results comparison), we consider a simple and direct data quantisation method. Suppose the annotator assigns two relevant keywords (e.g. keywords i and j) to an image, and at least one of the keywords assigned to the image by the system (e.g. keywords i, k, l, m, n) is the same as one of the two keywords assigned by the annotator, e.g. keyword i. Then the system has 100% annotation accuracy for keyword i, but 0% for keywords k, l, m, n for this image. Note that in practice it is unlikely that a system scores 100% accuracy, because we will be assigning only a very small
proportion of the available keywords to each image. In our example evaluation discussed in the next sec-
tion, we constrain the system to assign five keywords out of 150 controlled keywords for each image (cf. Section 4.1.2) and the human subjects to assign between two and five keywords out of these 150 to each image (cf. Section 4.2.3).
3.3. Data analysis
Next we need to statistically analyse the collected data. The aim is to measure the correlation between
different human judgments or annotations and to measure the level of significance of these results. Analysis
of inter-informant agreement and levels of significance is generally lacking in related work, and a systematic approach to this problem in the context of content-based image retrieval has not previously been reported.
Two statistical tools are used to answer the research questions of Section 3.1. They are Pearson Product-
Moment Correlation Coefficient and the t-test (Pagano, 2001). The Pearson product-moment correlation
coefficient is the most widely used measure of correlation. It measures the degree of relationship between two variables, i.e. whether one variable can be predicted from the other. The t-test is typically used to compare the means of two samples and determine whether the corresponding population means are significantly different.
Therefore, the correlation coefficient can measure the consistency of human judgments and annotations, which can answer Research Question 1 of both evaluation methods. That is, the average result of correlation for each pair of the judgments and/or annotations can show the reliability and consistency of user-centred evaluation.
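For illustration, the Pearson coefficient between two judges' per-keyword scores needs nothing beyond the standard library; the variable names here are ours:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length
    lists of scores (e.g. two judges' per-keyword accuracies)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Perfectly consistent judges yield r = 1.0.
assert abs(pearson_r([0.2, 0.5, 0.9], [0.2, 0.5, 0.9]) - 1.0) < 1e-12
```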
For answering Research Question 2, the t-test can be used to assess the results of the two different approaches (Type I: the annotation system vs. random guessing; Type II: the annotation system vs. human annotations) at a given confidence level. The following hypotheses correspond to
the research questions of Types I and II evaluations. Note that for Research Question 3 of both evaluation
![Page 8: Qualitative evaluation of automatic assignment of keywords to images](https://reader031.vdocuments.mx/reader031/viewer/2022020604/575073871a28abdd2e8fea2c/html5/thumbnails/8.jpg)
C.-F. Tsai et al. / Information Processing and Management 42 (2006) 136–154 143
methods, a null hypothesis is not made because we do not assume which classes obtain most agreement
between two systems in this study.
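A minimal sketch of the two-sample t statistic used for Research Question 2, in its pooled-variance form (looking up the critical value at the chosen confidence level is left to a t table or a statistics package):

```python
import math

def t_statistic(sample_a, sample_b):
    """Independent two-sample t statistic with pooled variance, e.g. for
    comparing the system's per-keyword accuracies against a baseline's."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))

# Identical samples give t = 0; a clearly better system gives t > 0.
assert abs(t_statistic([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])) < 1e-12
assert t_statistic([0.7, 0.8, 0.9], [0.1, 0.2, 0.3]) > 0
```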
3.3.1. Hypothesis testing for Type I evaluation
• Null hypothesis 1 for Question 1 (Type I): the judgments are not correlated. That is, we intend first of all to test, by correlation coefficient analysis, whether the human judgments are consistent or correlated to a certain degree. We believe that the judgments are in fact correlated with a high level of significance.
• Null hypothesis 2 for Question 2 (Type I): the system does not outperform the random guessing approach. That is, we expect that the results of our system outperform the random guessing approach with a high level of significance.
3.3.2. Hypothesis testing for Type II evaluation
• Null hypothesis 1 for Question 1 (Type II): the human annotations are not correlated. To test this hypothesis, we expect that the human annotations are in fact correlated with a high level of significance, via the correlation coefficient analysis.
• Null hypothesis 2 for Question 2 (Type II): the annotation results of the system are not compatible with the human annotations. That is, we hypothesise that the system annotations are similar to the human annotations with a high level of significance.
4. An experimental example
4.1. Type I evaluation: human assessments for the annotation results
4.1.1. The judges
We asked five judges (PhD research students) who are not experts in image indexing and retrieval to decide whether the keywords assigned by our system and by the random guessing approach are relevant to each image. There were three male and two female judges, all of whom were first-language English speakers.
4.1.2. The test set
We considered two datasets. One was the Corel image collection and the other one was supplied by
Washington University.1 Our prototype system, CLAIRE (CLAssifying Images for REtrieval), is implemented based on a two-level learning framework. Colour and texture classifiers are used for low-level classification as the first-level learning machines, and a high-level concept classifier which learns from the outputs of the first-level classifiers is used for the final decisions (image annotation) as the second-level learning device (Tsai, McGarry, & Tait, 2004). In addition, each image is first resized to 128 × 128 pixel resolution
and then partitioned into five equal-sized patches based on the tiling scheme shown in Fig. 2. That is, each image contains four tiles corresponding to the four quadrants of the image and one tile for the centre subimage. This scheme was adopted because of the expectation that one of the major subjects of interest in a
1 Available at: http://www.cs.washington.edu/research/imagedatabase/groundtruth/.
Fig. 2. The tiling scheme.
photograph is usually placed at or close to the centre of the image. Each tile subimage is run through the
learning machines separately, so each image has five keywords assigned by CLAIRE.
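The tiling scheme can be sketched as follows, assuming the 128 × 128 resizing described above; the (left, top, right, bottom) coordinate convention is our own choice:

```python
def five_tiles(width=128, height=128):
    """Bounding boxes for the five-tile scheme: the four quadrants of the
    image plus an equally sized tile centred on the image."""
    hw, hh = width // 2, height // 2
    quadrants = [
        (0, 0, hw, hh),           # top-left
        (hw, 0, width, hh),       # top-right
        (0, hh, hw, height),      # bottom-left
        (hw, hh, width, height),  # bottom-right
    ]
    centre = (hw // 2, hh // 2, hw // 2 + hw, hh // 2 + hh)
    return quadrants + [centre]

# For a 128 x 128 image the centre tile covers (32, 32) to (96, 96).
assert five_tiles()[-1] == (32, 32, 96, 96)
assert len(five_tiles()) == 5
```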
In these experiments CLAIRE is trained on an extract from Corel to assign 60 keywords to unseen
images, such as sky, tree, grass, building, etc. In particular, the trained learning machines assign one of
the 60 keywords to each of the five tiles in the unseen image. Note that we did not consider the pre-defined conceptual categories of Corel for training and testing in this test. The testing images are outside of the
training examples and composed of the two datasets. At the beginning, we manually selected 800 images
whose contents reflect the 60 keywords, in which 650 images are from the first dataset (Corel) and 150
images from the second (Washington). Each of the 60 keywords is assigned to at least 10 images out of
the 800. Then, we randomly selected 60 images from the 800 images as the test set for human judgments.
Next, 40 images out of these 60 were annotated with keywords by CLAIRE and 20 were assigned keywords
by the random guessing approach. Note that the five judges did not know which images were processed by which approach.

CLAIRE has been developed as a step towards an integrated content-based image retrieval system in
which users are able to query an image database using a combination of analytic keyword queries, similarity searching and browsing. An assumption of our framework is that users of such a system will be happy with automatic indexing strategies which focus on achieving high recall in response to (initial) keyword queries. This is because, compared to text, dealing with irrelevant images (we posit) requires a low cognitive and interaction load if the system provides an interface incorporating thumbnail presentation of images, the use of spatial metaphors, query by (multiple) image example and so on. An exploration of these assumptions goes well beyond this paper and forms part of our future research programme.

However, because we are trying to obtain high recall through automatic indexing, it is more important to
annotate an image with a relevant keyword than it is to assign an irrelevant keyword to that image. We
therefore report as successful any annotation of any image with a relevant keyword, regardless of the num-
ber of irrelevant keywords assigned. Of course this approach cannot be taken to the limit: one could, for
example, in principle assign all the keywords in the system's vocabulary to every image.
The application of our evaluation framework here never assigns more than five keywords to any image,
so even given the limited scale of vocabulary used in the experiments reported in this paper, only a small
proportion of keywords are assigned.

In a sense the restriction on the numbers of keywords assigned is acting as a balancing precision-enhancing measure. However the use of evaluation frameworks which fully balance precision and recall, like
reporting at the 11 standard points of recall (Hersh, Buckley, Leone, & Hickam, 1994), precision at 10
(Xu &amp; Croft, 1998), van Rijsbergen's F-measure and so on (as are commonly adopted in text retrieval)
go well beyond any evaluations reported in Section 2, and would require further development of our eval-
uation framework. We will return to this issue in the conclusion.
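For readers unfamiliar with the balanced measures just mentioned, van Rijsbergen's F-measure is the weighted harmonic mean of precision and recall. A minimal sketch (the function name and example values are ours, purely illustrative):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """van Rijsbergen's F-measure: the weighted harmonic mean of
    precision and recall (beta = 1 weights the two equally)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A recall-oriented indexer of the kind discussed above:
# high recall, modest precision.
print(round(f_measure(0.4, 0.9), 3))  # -> 0.554
```

The harmonic mean penalises an imbalance between the two components, which is exactly why a purely recall-oriented strategy scores modestly here.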
It is worth noting that in this test CLAIRE only assigns 47 of the possible 60 keywords to any image in
the experimental set.

One further point needs to be made about our approach to indexing images by automatically annotating
them with keywords. We have formulated the problem of accurately assigning keywords as the task of
correctly identifying that class of images to which a human indexer is likely to find a particular keyword
C.-F. Tsai et al. / Information Processing and Management 42 (2006) 136–154 145
relevant. Having formulated the task in this way it is often convenient to discuss our experiments and their
results using the language and metrics of automatic classification. We are conscious of sometimes sliding
between the language of automatic classification and that of automatic indexing where one or the other
seems to aid comprehension or readability. We hope the reader will find that acceptable.
4.1.3. The assessment tool and rules
We designed a simple interface shown in Fig. 3, a Microsoft Access database form, for the judges to eval-
uate whether the images have any relevant keywords assigned. The judges were asked to tick one or more of
the five selection boxes corresponding to five keywords per image if the keyword(s) are relevant to their
associated images; otherwise, they are asked to move on to the next image. They can also go back and change their selections at any time. As described above, our system assigns five keywords to an image,
which correspond to each of the five tiles. Therefore, if some of the five keywords were duplicated and rel-
evant, they were asked to select all of them. For example, in Fig. 3 if any one of the five judges thinks that grass is relevant to the image, the third and fifth selection boxes corresponding to grass will be selected.
That is, if the two grass keywords are selected, the system is thought to be able to assign 2 relevant key-
words out of 5 (i.e. 40% accuracy) in this case, but 20% (1 relevant keyword out of 5) if only one occurrence
of the keyword was considered. In retrospect it might have been useful to adopt other approaches to nor-
malising for the effects of duplicated system generated keywords. However, in practice the alternatives seem
unlikely to substantially change the results. To complete this assessment, each judge spent about 10–15 min.
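The accuracy rule just described (each occurrence of a duplicated relevant keyword counts separately) can be sketched as follows; the tick pattern is a hypothetical example, not data from the study:

```python
def image_accuracy(ticks):
    """Fraction of the five assigned keywords ticked as relevant.
    Duplicated keywords count once per occurrence, as in the rule above."""
    assert len(ticks) == 5
    return sum(ticks) / len(ticks)

# Hypothetical judgment: 'grass' appears twice among the five keywords
# and both occurrences are ticked, so 2 of 5 are relevant -> 40%.
print(image_accuracy([False, False, True, False, True]))  # -> 0.4
```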
4.1.4. Results
As described in Section 3.2.4, the collected data/materials are quantised for rates of classification accu-
racy or precision. For example, in Fig. 3 if a judge selects sky, grass, mountain, and grass (from the second
to fifth keywords) as relevant to the image, then the system assigns 4 relevant keywords out of 5 to the im-
age. The following shows the results of human judgments.
4.1.4.1. Human judgments for the number of relevant keywords assigned. Fig. 4 shows the number of auto-
matically assigned relevant keywords for each image of the test set as assessed by the judges. This result
shows something about the subjectivity and variability of human judgments. For example, Judge 1 is
Fig. 3. The interface for the judges to select relevant keyword(s) if any.
consistently more likely to judge a keyword as relevant than Judge 5. This illustrates the necessity of mea-
suring the correlation between the judgments as part of our claim for the Research Question 1/Null hypoth-
esis 1.
4.1.4.2. Result comparisons. Fig. 5 shows the results of our system and the random guessing approach from the judgments. The results show that CLAIRE not only has lower error in the context of assigning five irrel-
evant keywords to an image, but also higher accuracies for assigning 1–5 relevant keywords to an image
than the random guessing approach.
4.1.4.3. Annotation accuracy and disagreement between different judges. Fig. 6 shows the rate of classification
accuracy for those classes or keywords for which there is an especially high level of variation between
the assessments from different human judges. In other words the keywords or classes for which selection(s)
(i.e. classification rates) are well correlated are omitted. As CLAIRE assigns 47 different keywords to the test set, there are 13 keywords about which the five judges have different opinions. Although human
judgments for image annotation are subjective, under the scale of 60-category classification these general
(non-professional) users agree with most of the keywords assigned to the images. The following statistical
Fig. 4. The number of relevant keyword(s) of the test set selected by the judges. (Chart: number of images, 0–30, against number of relevant keywords, 0–5, one series per judge.)
Fig. 5. Average numbers of images which have 0, 1, 2, 3, 4, and 5 relevant keywords assigned. (Chart: relevant/irrelevant rate, 0–90%, against 0, >=1, >=2, >=3, >=4, and 5 relevant keywords; data labels for CLAIRE: 8/40, 32/40, 20/40, 9/40, 4/40, 1/40; for Guessing: 10/20, 10/20.)
Fig. 6. The keywords which have particularly different selections from the judges. (Chart: precision, 0–1, per judge for 13 keywords: tree, train, street, sky, sailing, ocean, mountain, ground, grass, clouds, cityscape, building, boats.)
data analysis will measure the extent of (dis)agreements between different judges and the automatic
indexing system.
4.1.5. Data analysis
In this section we present two results from our analysis of the behaviour of the human judges in the Type I experiments. They are, first, the number of keywords assigned by each judge on the test set as shown in
Fig. 4 and, second, the actual keywords associated with each image, measured as classification accuracy or
precision, partly shown in Fig. 6. For the judgments about the number of relevant keywords over the test
set, we obtained r = 0.8385 across the five judges, which is significant at the 0.05 level. This shows there is a sig-
nificant (p > 0.95) level of correlation between the number of automatically assigned keywords judged as
relevant by different judges. For the particular keywords associated with each image, we obtained r = 0.9468 across the five judges, which is significant at the 0.01 level. This demonstrates that there is a very significant (p > 0.99) degree of correlation between the automatically assigned keywords judged as relevant by different judges. Therefore, the human judgments are highly correlated and thus consistent and reliable. As
a result, the Null hypothesis 1 (that the judgments are not correlated) is rejected.
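The paper does not state exactly how a single r was obtained over five judges; one common choice is the mean pairwise Pearson correlation of the judges' per-image counts, which can be sketched as follows (the per-image counts shown are hypothetical):

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_r(judges):
    """Average Pearson r over all pairs of judges' relevance counts."""
    pairs = list(combinations(judges, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical counts of relevant keywords per image from three judges.
j1 = [3, 4, 2, 5, 1, 0, 4, 3]
j2 = [2, 4, 2, 4, 1, 1, 5, 3]
j3 = [3, 5, 1, 5, 0, 0, 4, 2]
print(round(mean_pairwise_r([j1, j2, j3]), 3))
```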
The difference in performance of CLAIRE and the random guessing approach was assessed by using the
t-test. We obtained t = 2.77 > 2.306 (α = 0.05 and df = 4). This rejects the Null hypothesis 2, that there would be no difference in the human judges' assessment of the keywords assigned by CLAIRE and the random guessing approach. Therefore, there is statistically significant evidence that our system improves on or
outperforms the random guessing approach.
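Given df = 4, the comparison was presumably a paired t-test over the five judges. A sketch under that assumption, with hypothetical per-judge accuracy figures standing in for the real scores:

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t statistic over matched samples (df = len(xs) - 1)."""
    n = len(xs)
    d = [a - b for a, b in zip(xs, ys)]
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d / sqrt(var_d / n)

# Hypothetical mean per-judge accuracy for CLAIRE vs. random guessing.
claire = [0.62, 0.58, 0.55, 0.60, 0.64]
guessing = [0.31, 0.35, 0.28, 0.33, 0.30]
t = paired_t(claire, guessing)
print(t > 2.306)  # compare against the critical value for alpha=0.05, df=4 -> True
```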
For Question 3, the qualitative data show that for most of the 60 keywords the five judges agree on whether or not a particular keyword is assigned to relevant images. (Remember, however, that in practice
CLAIRE only ever assigned 47 of the 60 keywords, but the random guessing approach may assign any of
the 60.) However, there are 13 keywords, shown in Fig. 6, for which there is significant variation between
the behaviour of the five judges. In addition, over half of the judges have different selections for the clouds,
grass, ground, tree classes. That is, we obtained only r = 0.4965 for these four classes, but r = 0.8938 for the other
nine classes.
Although human judgments are subjective, the results imply that they are consistent under the scale of
the 60 categories. That is, (untrained) people do not have widely varying opinions about the relevance of the 60 keywords to their images. Therefore, this evaluation validates that our system improves upon or outperforms the random guessing approach.
Table 2
The 150 keywords
100 Concrete classes agates coast homes pills tall ship
antelope cuisines horses plants texture
antique dessert jewellery polo things
balloon dogs lighthouse predator tools
beaches dogsled machinery primates train
bobsled doors mammals pub signs tulips
bonsai drinks men puma valley
botany everglade marble pyramids vegetable
beads fabric masks race car volcano
building firearms minerals reptile war plane
buses firework monuments road water fall
butterfly flags mountain rock form waves
cactus flora mushroom rodeo wildcats
cars flower old dish roses wild bird
cards flower bed old doll sail wild fish
castles foliage orchids sculpture wild goat
cats fractals owls shells wild nests
children fruit palaces stamps whale
churches graffiti penguin steam engine work ship
clothing hawk perennial subsea women
50 Abstract classes architecture dawn gardens night space
autumn desert glamour office sports
aviation estate golf old works summer
ballet farm harbours parades sunsets
barbecue fashion industry park surfing
barnyard festival interior pastoral tropical
battles fitness kitchen rafting vineyard
com. tech. forests leisure ruins waterway
couples fountain market rural wet sport
cruise game nature scene winter
This discussion also shows how a Type I evaluation method can provide a powerful means of assessing
image retrieval performance and hence help enhance the level of image retrieval system effectiveness.
4.2. Type II evaluation: annotation comparison between the system and humans
4.2.1. The annotators
To assess consistency of evaluation, the five human subjects who participated in the Type I evaluation were asked
to annotate a given set of images. Note that these two types of evaluation were conducted at different times.
4.2.2. The test set
In this test the five human subjects were asked to annotate a given number of images with the 150 key-
words listed in Table 2. The selection of these 150 keywords is based on the corresponding categories on the
Corel dataset (CD1, 7, and 8). One third of the 150 categories were selected to be abstract concepts, such as
festival, parades, tropical, etc., and two thirds of the 150 categories were concrete concepts, such as car, but-
terfly, antelope, etc. We follow WordNet, 2 taking concrete concepts to be a physical object or entity and
abstract to be marked as an abstraction, human activity, or an assemblage of multiple physical objects or
entities. This gives a comparable but specific definition of different levels of high-levels concepts as used in
2 Available at: http://www.cogsci.princeton.edu/~wn/.
previous work, like internal category structure (i.e. levels of categorisation) (Rosch, 1973), generic/specific/
abstract levels (Jorgensen, Jaimes, Benitez, & Chang, 2001) and Level 2 and 3 (Eakins, 2002) where Level 2
involves some degree of inference about the identity of objects and Level 3 involves complex reasoning
about the significance of the objects or scene depicted.
For the experiment with the human subjects we manually selected 3000 unseen images, 20 for each of the 150 keywords (or, really, here, categories). Then 30 keywords, 24 reflecting concrete concepts and 6 reflecting abstract concepts, were randomly chosen from the 150. Next,
two images indexed by each of the 30 keywords were randomly selected to be the test set which therefore
contains 60 images.
For system annotation, CLAIRE was first trained using the 150 categories, so each category has 30
training examples. Note that these categories do not include the Corel classes for which Muller, Marchand-Maillet, and Pun (2002) would predict problematically good performance. Next, the chosen test set
containing the 60 images is used to test the trained system for annotation.
4.2.3. The annotation tool and rules
Similar to the assessment tool, Fig. 7 shows the interface for human annotations. During annotation, the
150 keywords shown in Table 2 were provided on paper to help the subjects to annotate the 60 images. Five
blanks associated with each image allow the subjects to annotate images with this controlled set of key-
words. The annotation requirement is to assign at least two relevant keywords to each image; the maximum is five keywords per image. Subjects can go back and change their annotations at any time; otherwise, they move on to the next image. On average, every annotator took about 40 min to finish this task, spending about 45 s per image.
4.2.4. Results
The collected data and material, i.e. the human annotations, are used as a set of ground truths: the annotations of each subject differ somewhat and are treated individually. The keyword annotation re-
sults of CLAIRE were compared with the annotations of each subject in turn. Following the quantisation
method for this comparison described in Section 3.2.4, Fig. 8 shows the annotation result of the system
compared with the five sets of human annotation. On average, our system produces 22.2% annotation accuracy under the scale of 150 categories, which is composed of 21.6% and 20.83% annotation accuracy for
concrete and abstract keywords respectively.
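A minimal sketch of the comparison against one annotator's ground truth, following the quantisation rule of Section 3.2.4; the keywords and sets below are hypothetical, not taken from the study:

```python
def annotation_accuracy(system_kw, human_kw):
    """Fraction of the system's assigned keywords that also appear in one
    annotator's ground-truth set for the same image."""
    hits = sum(1 for kw in system_kw if kw in human_kw)
    return hits / len(system_kw)

# Hypothetical image: the system assigns five keywords (one duplicated),
# one annotator assigned three; 'sky' matches twice and 'mountain' once,
# so 3 of the 5 system keywords are hits.
system = ["mountain", "sky", "valley", "rock form", "sky"]
human = {"mountain", "sky", "sunsets"}
print(annotation_accuracy(system, human))  # -> 0.6
```

Averaging this per-image score over the test set and over the five annotators' ground truths yields overall figures of the kind reported above.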
Fig. 7. The interface for subjects to annotate images.
Fig. 8. Annotation accuracy of only/at least one relevant keyword(s) per image. (Chart: annotation accuracy, 0–30%, per judge 1–5.)
This graph also shows that each general user has his/her own opinion about the relevance of some key-
words to particular images. Further, these human annotations must differ somewhat from the ground truth of the Corel dataset, since our human subjects were asked to provide multiple keywords, whereas Corel provides (in effect) only one per image. Furthermore, in some cases the human subjects did not select the Corel category at all. For example, there is an image which belongs to the game category of Corel,
but four of the five judges assign parades and three judges out of the five assign children and others to the
image.
On average, the human subjects assigned 3.07 keywords to an image. When we examine the agreement
between the human annotations, most judges (i.e. at least three) assign the same 0.93 abstract keywords and
1.4 concrete keywords to each of the 60 images. It is interesting that the proportion of randomly chosen abstract to concrete keywords in the test set is 1:4 (i.e. 6 abstract vs. 24 concrete keywords) from the Corel dataset shown in Table 2, but the proportion of abstract to concrete keywords assigned by at least three subjects is 17:37 (i.e. nearly 1:2), which is very different from the Corel ground truth of 1:4. This also shows that human subjects may not agree with the pre-defined categories of Corel.
For many practical image retrieval applications users' views about the relevance or otherwise of a particular keyword to a particular image are more important than the perhaps differing views of professional
indexers. The existence of a difference is illustrated by the gap between the assessments of our users and the
Corel categories just noted. A general user may not find the Corel categories especially useful since they
appear to poorly reflect intuitive notions of relevance. This is of great importance in automatic image indexing research, since it suggests that a system which poorly mimics Corel in terms of categorisation behaviour might in fact do a good job of reflecting more intuitive notions of the relevance of keywords.
Therefore this implies that using some chosen ground truth dataset, or certainly at least one in which
each image is associated with a single keyword or category (like Corel) is insufficient to validate image
annotation systems as it does not adequately reflect human indexing behaviour. This leads to the need
for additional user-centred evaluation to fully assess and understand the performance of these systems,
which is our claim for the proposed evaluation methodology in this paper.
4.2.5. Data analysis
According to the comparison results shown in Fig. 8, for the correlation coefficient when considering only/at least one relevant keyword assigned by the system, we obtained r = 0.678 across the human annotations, which is significant at the 0.01 level (p > 0.99). (When considering the relevance of all five assigned keywords per image, we obtained r = 0.709, which is also significant at the 0.01 level, p > 0.99.) The results show that the human annotations are moderately correlated when using the 150 con-
trolled keywords.
As Fig. 8 shows, judged against the human annotations as ground truth, our system does not match human annotation performance when using a controlled vocabulary of 150 keywords. That is, the difference in performance between CLAIRE and the human annotations was assessed using the t-test. We obtained t = −39.575 (p = 0.00 and df = 4). Therefore, we retain Null Hypothesis 2.
For Question 3, considering inter-annotator agreement amongst the human annotators: there are 20 categories or keywords that all five judges applied to precisely the same set of images. There were
54 keywords for which there was a clear majority view (i.e. three out of five judges) about the images to
which they were relevant. Therefore, for our controlled set of 150 keywords, there was broad agreement
about the images to which they applied.
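One plausible formalisation of these agreement counts (the paper does not give the exact procedure) is the following sketch, where `annotations` maps each judge to the keywords they assigned per image; the data shown are hypothetical:

```python
from collections import defaultdict

def agreement_counts(annotations, n_judges=5):
    """Count keywords with unanimous agreement (every judge applied the
    keyword to exactly the same images) and with a majority view (at
    least one image received the keyword from >= 3 of 5 judges)."""
    # keyword -> image -> number of judges who applied it
    votes = defaultdict(lambda: defaultdict(int))
    for judge in annotations:
        for image, keywords in judge.items():
            for kw in keywords:
                votes[kw][image] += 1
    unanimous = sum(1 for kw in votes
                    if all(v == n_judges for v in votes[kw].values()))
    majority = sum(1 for kw in votes
                   if any(v >= n_judges // 2 + 1 for v in votes[kw].values()))
    return unanimous, majority

# Hypothetical annotations: one dict per judge, image -> set of keywords.
j = [
    {"img1": {"horses"}, "img2": {"sunsets"}},
    {"img1": {"horses"}, "img2": {"sunsets"}},
    {"img1": {"horses"}, "img2": {"sunsets", "coast"}},
    {"img1": {"horses"}, "img2": {"sunsets"}},
    {"img1": {"horses"}, "img2": {"coast"}},
]
print(agreement_counts(j))  # -> (1, 2)
```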
5. Discussion
These results, especially those of Section 4.2.5 show that even a state of the art automatic image indexing
system like CLAIRE cannot match the performance of human annotators in terms of annotation accuracy,
especially when the classification scale (i.e. number of words in the indexing vocabulary) increases. How-
ever a closer inspection of these results indicates that the performance actually achieved may be useful
in the context of building practical image retrieval systems which can take initial keyword queries. For
example, presenting 20 thumbnail images on an initial query result screen appears not to be too many.
On average, around four relevant images should be presented, and this would be a useful basis on which
to perform relevance feedback as a query refinement process.

Looking over the complete set of results obtained from our two types of evaluation, some other interest-
ing issues come to light. By examining the correlation results of Type I evaluation, the scale of the keyword
vocabulary and the evaluation methodology are the key factors affecting the correlation between human
judgments of relevance and (more or less) spontaneous annotations.
First, as the number of keywords becomes larger, human subjects may have to make more difficult
choices in deciding the annotations to pick for an image. This could decrease the correlation level between
human judgments as well as annotations. In other words, the more fine-grained the human annotators' decisions have to be in assigning a fixed number of keywords to a given image, the harder it is to replicate the set of keywords chosen (whether automatically or manually) and the more likely it is that a subsequent
judge will disagree with the relevance of a particular keyword for a particular image.
Second, and on the other hand, the Type I evaluation method, i.e. directly assessing the system's outputs for the relevance of the keywords assigned, could increase the assessed effectiveness of an image annotation
system compared to a Type II evaluation, as happened here. This is because an automatic image indexing
system cannot match increasingly variable human keyword selections as the number of available keywords
increases, whilst the keywords actually assigned may still be judged relevant to the image, even when they are not the keywords which the judge would have selected.

This last point is extremely important. It may be that the current apparent ceiling in automatic image
indexing system performance, in which keyword vocabularies or numbers of categories can only be in-
creased with corresponding decreases in assignment accuracy, may be an artefact of the evaluation task.
6. Conclusion
Evaluation is a critical issue for information retrieval, and to fully understand the performance of IR systems it is necessary to consider both system- and user-centred evaluations. Image retrieval has become
an active research area, and in image retrieval, much of the current research effort is focused on automat-
ically annotating or indexing images to facilitate search in image databases. Most of the existing automatic
image annotation systems are evaluated against their annotation or classification accuracy based on some
chosen ground truth data set, such as Corel, which are not necessarily ideal test collections. So, as currently
there is no standard image dataset for evaluation, full human-centred assessments are required, but these
are difficult to design and expensive to administer.
We have presented a qualitative evaluation methodology for systems which automatically assign keywords to images. It is composed of two evaluation methods: the first is based on human assessment of annotation accuracy, and the second on the construction of a comparable pre-defined ground truth for
further evaluation. Most systems or approaches reported in the literature use only one of these two meth-
ods, making it difficult to assess the extent to which the results can be extrapolated to real annotation and
retrieval situations. In this paper, we show that this two stage human-centred evaluation methodology pro-
vides deeper understanding about not only the performance of an image annotation system but also the
consistency of human judgments about the relevance of high-level concept terms to images. The combination allows us to draw well-founded conclusions from relatively simple and modest-scale human-centred evaluations.
Turning to the CLAIRE system we used as our example for evaluation, according to the first evaluation
method (results assessment by humans), the system performance of CLAIRE is promising. Under the second method, with a fixed vocabulary of 150 index terms, the system assigns rather different keywords from the human annotators.
However, this is the assessment of indexing results instead of retrieval ones. Using the user assessment
of the results of image annotation (reported in Section 4.1.4) we have indications that the CLAIRE system
was sufficiently accurate at assigning keywords to be potentially useful for practical image retrieval, pro-
vided it was combined with other techniques like thumbnail browsing, relevance feedback, or other querying and browsing techniques. Our evaluation methodology allowed us to draw this conclusion despite the fact
that the retrieval performance did not successfully emulate human indexing performance.
The application of the evaluation methodology reported in this paper must be regarded as an initial step
towards a more systematic and comprehensive evaluation of image retrieval systems. Amongst areas which
require further exploration are: precision of automatic indexing versus precision in querying; appropriate
baselines against which to measure performance; comfort and usability of automatic indexing strategies in
the context of integrated image retrieval systems; differences between individual searcher behaviour and
preferences according to task and context; and the production of standardised test collections and judged query sets against which indexing strategies may be assessed in isolation.
At present we are working to embed the CLAIRE indexing engine in a complete image retrieval system
to explore some of these issues.
This study indicates that combining different evaluation strategies can produce new results and a deeper
understanding of system performance. We hope that future systems will be assessed robustly by following a detailed evaluation methodology such as that proposed in this paper.
Acknowledgement
The authors would like to thank Chris Stokoe, James Malone, Sheila Garfield, Mark Elshaw, and Jean
Davison for participating in the system evaluation.
References
Applegate, R. (1993). Models of user satisfaction: understanding false positives. Reference Quarterly, 32(4), 525–539.
Armitage, L., & Enser, P. G. B. (1997). Analysis of user need in image archives. Journal of Information Science, 23(4), 287–299.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. England: Addison Wesley.
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of
Machine Learning Research, 3, 1107–1135.
Barnard, K., & Shirahatti, N. V. (2003). A method for comparing content based image retrieval methods. In Proceedings of SPIE
Internet imaging IV (vol. 5018). Santa Clara, California, USA.
Belkin, N. J., Cool, C., Kelly, D., Lin, S.-J., Park, S. Y., Perez-Carballo, J., & Sikora, C. (2001). Iterative exploration, design and
evaluation of support for query reformulation in interactive information retrieval. Information Processing and Management, 37(3),
403–434.
Black Jr., J. A., Fahmy, G., & Panchanathan, S. (2002). A method for evaluating the performance of content-based image retrieval
system based on subjectivity determined similarity between images. In Proceedings of the international conference on image and video
retrieval (pp. 356–366). London, UK.
Chen, H.-L. (2001). An analysis of image retrieval tasks in the field of art history. Information Processing and Management, 37(5),
701–720.
Choi, Y., &amp; Rasmussen, E. M. (2002). Users' relevance criteria in image retrieval in American history. Information Processing and
Management, 38, 695–726.
Ciocca, G., & Schettini, R. (1999). A relevance feedback mechanism for content-based image retrieval. Information Processing and
Management, 35(5), 605–632.
Conniss, L. R., Ashford, A. J., & Graham, M. E. (2000). Information seeking behaviour in image retrieval: VISOR I final report.
Technical Report, Institute for Image Data Research, University of Northumbria at Newcastle.
Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T. V., & Yianilos, P. N. (2000). The Bayesian image retrieval system, PicHunter:
theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20–37.
Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC 2003 web track. In Proceedings of the 12th text
retrieval conference (TREC 2003).
Dunlop, M. (2000). Reflections on Mira: interactive evaluation in information retrieval. Journal of the American Society for
Information Science, 51(14), 1269–1274.
Eakins, J. P. (2002). Towards intelligent image retrieval. Pattern Recognition, 35, 3–14.
Efthimiadis, E. N., & Fidel, R. (2000). The effect of query type on subject searching behavior of image databases: an exploratory study.
In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval
(pp. 328–330). Athens, Greece.
Fidel, R. (1997). The image retrieval task: implications for the design and evaluation of image databases. New Review of Hypermedia
and Multimedia, 3, 181–199.
Goodrum, A., & Spink, A. (2001). Image searching on the Excite Web search engine. Information Processing and Management, 37(2),
295–311.
Gorkani, M. M., &amp; Picard, R. W. (1994). Texture orientation for sorting photos 'at a glance'. In Proceedings of the IEEE international
conference on pattern recognition (pp. 459–464). Jerusalem, Israel.
Gudivada, V. N., & Raghavan, V. V. (1997). Modeling and retrieving images by content. Information Processing and Management,
33(4), 427–452.
Han, K.-A., & Myaeng, S.-H. (1996). Image organization and retrieval with automatically constructed feature vectors. In Proceedings
of the 19th annual international ACM SIGIR conference on research and development in information retrieval (pp. 157–165). Zurich,
Switzerland.
Heath, M., Sarkar, S., Sanocki, T., & Bowyer, K. (1998). Comparison of edge detectors: a methodology and initial study. Computer
Vision and Image Understanding, 69(1), 38–54.
Hersh, W. R., Buckley, C., Leone, T. J., & Hickam, D. H. (1994). OHSUMED: an interactive retrieval evaluation and new large test
collection for research. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 192–
201). Dublin, Ireland, July 3–6.
Hersh, W., Turpin, A., Price, S., Kraemer, D., Olson, D., Chan, B., & Sacherek, L. (2001). Challenging conventional assumptions of
automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing and
Management, 37(3), 383–402.
Jorgensen, C. (1998). Attributes of images in describing tasks. Information Processing and Management, 34(2–3), 161–174.
Jorgensen, C., Jaimes, A., Benitez, A. B., & Chang, S.-F. (2001). A conceptual framework and research for classifying visual
descriptors. Journal of the American Society for Information Science, Special Issue on Image Access: Bridging Multiple Needs and
Multiple Perspectives, 52(11), 938–947.
Jose, J. M., Furner, J., & Harper, D. J. (1998). Spatial querying for image retrieval: a user-oriented evaluation. In Proceedings of the
21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 232–241). Melbourne,
Australia.
Kuroda, K., & Hagiwara, M. (2002). An image retrieval system by impression words and specific object names—IRIS.
Neurocomputing, 43(1–4), 259–276.
Li, J., & Wang, J. Z. (2003). Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 25(9), 1075–1088.
Liu, W., Zhang, S., Li, S., Sun, Y.-F., & Zhang, H. J. (2001). A performance evaluation protocol for content-based image retrieval
algorithms/systems. In Proceedings of the IEEE CVPR workshop on empirical evaluation methods in computer vision. Kauai, USA.
Markkula, M., & Sormunen, E. (2000). End-user searching challenges indexing practices in the digital newspaper photo archive.
Information Retrieval, 1(4), 259–285.
Markkula, M., Tico, M., Sepponen, B., Nirkkonen, K., & Sormunen, E. (2001). A test collection for the evaluation of content-based
image retrieval algorithms—a user and task-based approach. Information Retrieval, 4(3–4), 275–293.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating
segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE international conference on computer vision
(pp. 416–423). Vancouver, Canada.
McDonald, S., Lai, T.-S., & Tait, J. (2001). Evaluating a content-based image retrieval system. In Proceedings of the 24th annual
international ACM SIGIR conference on research and development in information retrieval (pp. 232–240). New Orleans, Louisiana,
USA.
McDonald, S., & Tait, J. (2003). Search strategies in content-based image retrieval. In Proceedings of the 26th annual international
ACM SIGIR conference on research and development in information retrieval (pp. 80–87). Toronto, Canada.
Mehtre, B. M., Kankanhalli, M. S., & Lee, W. F. (1998). Content-based image retrieval using a composite color-shape approach.
Information Processing and Management, 34(1), 109–120.
Minka, T. P., & Picard, R. W. (1997). Interactive learning with a "society of models". Pattern Recognition, 30(4), 565–581.
Müller, H., Marchand-Maillet, S., & Pun, T. (2002). The truth about Corel—evaluation in image retrieval. In Proceedings of the
international conference on image and video retrieval (pp. 38–49). London, UK.
Müller, H., Müller, W., & Squire, D. M. (2000). Learning feature weights from user behavior in content-based image retrieval. In
Workshop on multimedia data mining, in conjunction with the sixth ACM SIGKDD international conference on knowledge discovery &
data mining (pp. 67–72). Boston, MA, USA.
Pagano, R. R. (2001). Understanding statistics in the behavioral sciences (Sixth Edition). California: Wadsworth/Thomson Learning.
Park, S. B., Lee, J. W., & Kim, S. K. (2004). Content-based image classification using a neural network. Pattern Recognition Letters,
25(3), 287–300.
Rasmussen, E. M. (1997). Indexing images. Annual Review of Information Science and Technology, 32, 169–196.
Rodden, K., Basalaj, W., Sinclair, D., & Wood, K. (2001). Does organisation by similarity assist image browsing? In Proceedings of the
SIGCHI conference on human factors in computing systems (pp. 190–197). Seattle, Washington, USA.
Rosch, E. (1973). On the internal structure of perceptual and semantic categories. In T. E. Moore (Ed.), Cognitive development and the
acquisition of language. New York: Academic Press.
Sánchez, D., Chamorro-Martínez, J., & Vila, M. A. (2003). Modelling subjectivity in visual perception of orientation for image
retrieval. Information Processing and Management, 39(2), 251–266.
Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th annual international ACM SIGIR
conference on research and development in information retrieval (pp. 138–146). Seattle, Washington, USA.
Saracevic, T., Mokros, H., & Su, L. (1990). Nature of interaction between users and intermediaries in online searching: a qualitative
analysis. In Proceedings of the 53rd annual meeting of the American society for information science (vol. 27, pp. 47–54).
Schamber, L. (1994). Relevance and information behaviour. Annual Review of Information Science and Technology, 29, 3–48.
Shaffrey, C. W., Jermyn, I. H., & Kingsbury, N. G. (2002). Psychovisual evaluation of image segmentation algorithms. In Proceedings
of advanced concepts for intelligent vision systems. Ghent University, Belgium.
Spink, A. (2002). A user-centered approach to evaluating human interaction with Web search engines: an exploratory study.
Information Processing and Management, 38(3), 401–426.
Squire, D. M., & Pun, T. (1998). Assessing agreement between human and machine clustering of image databases. Pattern Recognition,
31(12), 1905–1919.
Tsai, C.-F., McGarry, K., & Tait, J. (2003). Image classification using hybrid neural networks. In Proceedings of the ACM SIGIR
conference on research and development in information retrieval (pp. 431–432). Toronto, Canada.
Tsai, C.-F., McGarry, K., & Tait, J. (2004). Automatic metadata annotation of images via a two-level learning framework. In
Proceedings of the ACM SIGIR workshop on semantic web (pp. 32–42). Sheffield, UK, July 25–29.
Vailaya, A., Figueiredo, M. A. T., Jain, A. K., & Zhang, H.-J. (2001). Image classification for content-based indexing. IEEE
Transactions on Image Processing, 10(1), 117–130.
Wang, J. Z., Li, J., & Lin, S. C. (2003). Evaluation strategies for automatic linguistic indexing of pictures. In Proceedings of the IEEE
international conference on image processing (pp. 617–620). Barcelona, Spain.
Wu, J. K., & Narasimhalu, A. D. (1998). Fuzzy content-based retrieval in image databases. Information Processing and Management,
34(5), 513–534.
Xie, H. I. (1997). Planned and situated aspects in interactive IR: patterns of user interactive intentions and information seeking
strategies. In Proceedings of the 60th annual meeting of the American society for information science (vol. 34, pp. 101–110).
Xu, J., & Croft, W. B. (1998). Corpus-based stemming using cooccurrence of word variants. ACM Transactions on Information
Systems, 16(1), 61–81.