Automating insect identification: exploring the limitations of a prototype system



JAE 123 (1999)

J. Appl. Ent. 123, 1–8 (1999)
© 1999, Blackwell Wissenschafts-Verlag, Berlin
ISSN 0931-2048

P. J. D. Weeks¹, M. A. O'Neill², K. J. Gaston³ and I. D. Gauld¹

¹Department of Entomology, The Natural History Museum, London, UK; ²Oxford Orthopaedic Engineering Centre, Oxford, UK and ³Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK

Abstract: Automated identification systems based on computer image analysis technology provide an attractive, but as yet unexploited potential solution to the growing burden of routine species identifications presently faced by a dwindling community of expert taxonomists. DAISY (the Digital Automated Identification SYstem) is a prototype novel automated identification system, developed to explore this possibility. In its pilot phase, the DAISY algorithms were developed to discriminate five species of parasitic wasp, based on differences in their wing structure. Here, again using wing form, the ability of DAISY to discriminate amongst an order of magnitude more species (49 species of closely related biting midges) is examined. In so doing an attempt is made to establish a set of basic 'benchmark' tests of the efficacy, and weaknesses, of such an automated identification system.

1 Introduction

Reliable species-level identification of organisms is a fundamental part of most biological work. Accurate identification underpins the control of agricultural pests (LATTIN and KNUTSON, 1982; HAWKSWORTH, 1994), is vital for formulating conservation legislation (MAY, 1990), assists with pharmaceutical prospecting (REID et al., 1993), and is essential for monitoring the spread of pollution and disease vectors (CHALMERS, 1996). Species names are the language of biology, and the label by which one accesses information about organisms. Notwithstanding this importance, the body of taxonomic expertise available to carry out reliable identifications of insect pests, pathogens and environmental indicators is being steadily eroded world-wide (GASTON and MAY, 1992) and demand for routine identifications now far outstrips the capabilities of the dwindling biosystematics community (HOLDEN, 1989; HOUSE OF LORDS, 1991). This steadily worsening situation has attracted considerable international attention, most recently from the intergovernmental Subsidiary Body for Scientific, Technical and Technological Advice (SBSTTA) to the Convention on Biological Diversity. The SBSTTA recognized that increasing taxonomic capacity is a sine qua non for the implementation of the Convention, and has recommended the strengthening of infrastructure for taxonomy in biodiversity-rich tropical countries, together with the transference from developed countries of modern technologies for taxonomy (UNEP/CBD/SBSTTA/2, 1996) in order to provide a basis for the monitoring, inventory making and sustainable utilization of biological diversity (DI CASTRI et al., 1992; JANZEN, 1993).

The problem of insufficient resources being available for the identification of arthropods is further aggravated by the taxonomic community themselves. Their efforts are often primarily focused on areas where intellectual debate can easily be conducted, such as phylogenetic reconstruction. The more mundane tasks of monographic, floristic and faunistic studies are less attractive, both to scientists, and to funding agencies. Furthermore, traditional 'applied' taxonomic products, printed keys, are often almost impossible to use without both adequate reference collections and an extensive knowledge of arcane specialist terminology. Consequently, even where the literature to identify organisms exists, many biologists cannot and do not use it (GAULD, 1986; TILLING, 1987; ALBERCH, 1993). In the jargon of the marketplace, the products of the taxonomic community are generally not appropriate for the needs of the potential user community.

U. S. Copyright Clearance Center Code Statement: 0931–2048/99/2301–0001 $ 14.00/0

In attempts to rectify this situation and overcome the resulting 'taxonomic impediment', traditional taxonomic products are beginning to be augmented by the use of computerized multi-access keys, beginning with text-based keys (e.g. PANKHURST, 1978) and culminating recently in multimedia works such as CABIKEY (WHITE and SCOTT, 1994). Whilst undoubtedly an advance over hard copy works, computerized keys still rely on the ability of users to compare pictorial information with specimens. Such skills are honed by years of practice in taxonomists, but other biologists often experience great difficulty in appreciating the subtle differences in shape and form which discriminate taxa, particularly among many groups of invertebrates. Using computers to present taxonomic characters, while relying on users to compare specimens to images or illustrations, represents a failure fully to utilize the immense potential offered by information technology.

Image analysis techniques and technology, in particular, have seen tremendous advances in recent years, raising the possibility of automating, or at least semi-automating, much of the process of routine taxonomic identification. However, such an approach has to date only been used in a very limited fashion (WEEKS and GASTON, 1997). Thus, for example: (i) DALY et al. (1982) used image analysis to measure 25 morphometric characters of honey bees; the origin of the bees, either European or African, was determined by discriminant analysis; (ii) ZHOU et al. (1985) used image analysis methods to describe the venation of mosquito wings by fitting the coefficients of polynomials to each vein; and (iii) YU et al. (1992) measured the wings of ichneumonids by semi-automated image analysis; discriminant analysis was used to identify five species, on the basis of differences in their wings, achieving 100% accuracy. Such studies encourage the belief that image analysis techniques may represent a way forward towards a large-scale taxonomic identification system based on computer vision, but as yet no such system exists.

In an attempt to begin to bridge this gap, we have developed a prototype novel automated identification system (DAISY, the Digital Automated Identification SYstem), the technical details of which have been fully described elsewhere (WEEKS et al., 1998), which built on recent developments in human face detection (TURK and PENTLAND, 1991). Published tests of the system thus far have revealed that it can discriminate between five species of ichneumonid wasps, on the basis of wing structure alone, with 94% accuracy in correct identifications and a reasonably predictable pattern of errors (WEEKS et al., 1997). In this paper two things are attempted. First, the ability of DAISY to discriminate amongst an order of magnitude more species of an entirely different group of organisms, biting midges, again based on differences in wing structure, is examined. Second, in so doing an attempt is made to establish a set of basic 'benchmark' tests of the efficacy, and weaknesses, of such an automated identification system.

2 Materials and methods

2.1 The study organisms

The study was based on wings of specimens of biting midges of the genera Culicoides and Forcipomyia (Dipt., Ceratopogonidae). Wing pattern is very important in the taxonomy of the genus Culicoides. However, a quantitative representation of wing pattern has only rarely been achieved (LANE, 1981). Forty-nine species were selected: Culicoides achrayi Kettle & Lawson, C. aitkeni Wirth & Blanton, C. brevipalpis Delfinado, C. brevifrontis Smatov & Isimbek., C. brosseti Vattier & Adam, C. brucei Austen, C. brunnicans Edwards, C. cataneii Clastrier, C. circumscriptus Kieffer, C. citroneus Carter, Ingram & Macfie, C. confusus Carter, Ingram & Macfie, C. cornutus de Meillon, C. cubitalis Edwards, C. dekeyseri Clastrier, C. delta Edwards, C. distinctipennis Austen, C. duddingstoni Kettle & Lawson, C. dzhafarovi Remm, C. engubandei de Meillon, C. exspectator Clastrier, C. fascipennis Staeger, C. fulvithorax Austen, C. furcillatus Call., Kremer & P., C. furens Poey, C. gambiae Clastrier & Wirth, C. gejgelensis Dzhafarov, C. grahamii Austen, C. grisescens Edwards, C. hortensis Khamala & Kettle, C. imicola Kieffer, C. impunctatus Goetghebuer, C. kingi Austen, C. krameri Clastrier, C. kurensis Dzhafarov, C. lailae Khalaf, C. langeroni Kieffer, C. odibilis Austen, C. pallidicornis Kieffer, C. praetermissus Carter, Ingram & Macfie, Forcipomyia biannulata Ingram & Macfie, F. bipunctata Linnaeus, F. castanea Walker, F. nigra Winnertz, F. phlebotomoides Bangerter, F. psilonota Kieffer, F. pulchrithorax Edwards, F. radicicola Edwards, F. sphagnophila Kieffer, F. titillans Winnertz.

2.2 Data acquisition

Microscope slides of the right fore wings of 20 specimens (a mixture of males and females) of each of the test species were placed on the transmitted light stage of a Zeiss Stemi SV11 Apo stereomicroscope with planachromat S 1.0× objective. A Kontron ProgRes 3000 colour CCD camera was mounted on the microscope's TV camera tube with integral 0.5× objective. A video sample from the camera (viewed on an adjacent monitor) allowed the scale and orientation of the wing image to be manipulated prior to image capture. The size of the wing was altered using the zoom control on the microscope, such that the wing image almost filled a rectangle of size 540 × 320 pixels. The wing was orientated by adjusting the microscope slide until the anterior margin (which was assumed to be straight) was parallel with the x-axis. Once focused, an image was captured and stored to disk on a personal computer running an image analysis software package (KS400; Kontron Elektronik GmbH, Munich, Germany). The red and blue components of the captured colour image were immediately discarded, while the green component, which when isolated is transposed to a greyscale image, was corrected for shading using a previously captured shading reference image. Greyscale images comprise integer pixel values in the range 0–255. The light intensity and other optical settings remained constant throughout. The stored images were cropped such that the wings completely filled a boundary rectangle. Images were then reduced in size, maintaining the original aspect ratio, to 150 × 100 pixels. The images were reduced in order to make the computation of principal components tractable and they were then preprocessed to caricature the venation and pigmentation patterns on the wings (WEEKS et al., 1997).

2.3 Principal component imagery

Digitized wing images were rearranged into column vectors consisting of concatenated rows of pixel intensities. Thus, any particular image k was represented by a column vector $a_k$ of dimensionality $I \times 1$ (where I equals the width times the height of the wing image in pixels). A set of K wing images of the same species were arranged into an $I \times K$ matrix A, such that $a_k$ was the kth column of A. The average wing image of the set, $\bar{a}$, was subtracted from each image in A. Matrix A was then subject to principal components analysis (PCA), which computes a set of orthogonal eigenfunctions $p_1, p_2, \ldots, p_i$ which characterize the modes of variation of the wing images in A. Eigenfunctions are ordered such that the first eigenfunction $p_1$ accounts for the largest amount of variation, the second eigenfunction $p_2$ for the second largest and so on.

Any image in A may be reconstructed exactly as a linear combination of the I eigenfunctions and eigenvalues of A. Furthermore, images in A may approximately be reconstructed using only the eigenfunctions with the largest eigenvalues, say the first K eigenfunctions

$$a_k \approx \hat{a} = \bar{a} + P b_k \qquad (1)$$

where $\hat{a}$ is an estimate of $a_k$, P is an $I \times K$ matrix of eigenfunctions of A, and $b_k$ is a vector of eigenfunction weights which describe the contribution of each eigenfunction in representing the image $a_k$. Eigenfunction weights are calculated from the scalar product of the image $a_k$ and each eigenfunction as follows:

$$b_k = P^{T} (a_k - \bar{a}) \qquad (2)$$

Since the first K eigenfunctions account for the most variance within matrix A, the error between the original and approximated image is minimized.
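The projection and reconstruction steps of Eqs (1) and (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the paper does not say how the principal components were computed, so the use of a singular value decomposition, the function names and the random stand-in data are all assumptions.

```python
import numpy as np

def fit_eigenfunctions(A, n_components):
    """Return (mean image, I x n_components eigenfunction matrix P).

    A is an I x K array whose columns are flattened wing images.
    """
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=1)
    centred = A - mean[:, None]
    # Thin SVD of the centred data: the columns of U are orthonormal
    # eigenfunctions ordered by decreasing singular value.
    U, _, _ = np.linalg.svd(centred, full_matrices=False)
    return mean, U[:, :n_components]

def weights(a, mean, P):
    """Eq. (2): eigenfunction weights b_k = P^T (a_k - mean)."""
    return P.T @ (a - mean)

def reconstruct(a, mean, P):
    """Eq. (1): estimate of a_k as mean + P b_k."""
    return mean + P @ weights(a, mean, P)

# Hypothetical stand-in data: 11 training "images" of 15 x 10 pixels.
rng = np.random.default_rng(0)
A = rng.random((150, 11))
mean, P = fit_eigenfunctions(A, n_components=5)
error = np.linalg.norm(A[:, 0] - reconstruct(A[:, 0], mean, P))
```

Keeping more eigenfunctions lowers the reconstruction error for images in A; with 11 mean-centred training images, 10 eigenfunctions are already enough to reproduce each training image to within rounding error.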

2.4 Image reconstruction metric

Consider an ensemble of images of the right fore wings of species a. Applying PCA to these images, as described above, yields a set of principal eigenfunctions which best describe the variation within this ensemble. If variation due to rotation, translation and scaling is accounted for, then the set of eigenfunctions generated by PCA may be regarded as a basis set with which to describe the wing morphology of a. Now, if an input image, $a_1$, of a specimen of a is reconstructed in terms of this basis set of eigenfunctions, then the approximate reconstruction will be visually very similar to the original input image ($a_1$). However, if an input image $a_2$, of a specimen of a different species is reconstructed in terms of these eigenfunctions, then the image will be reconstructed poorly, since its form is not well described by the basis set of a. It follows therefore that the difference between the reconstruction of $a_1$ and the original image $a_1$ will be small, while the difference between the reconstruction of $a_2$ and the original image $a_2$ will be larger.

The Kendall-τ statistic (PRESS et al., 1994) and a simple vector-difference metric were used as nonparametric tests of the difference between reconstructed and original images. The measures return a value of 1.0 for perfect correlation and 0.0 for no correlation. The Kendall-τ statistic is:

$$c_a = K_\tau\left(a_i^{\mathrm{reconstructed}},\; a_i^{\mathrm{original}}\right) \qquad (3)$$

where $K_\tau$ is the Kendall-τ rank difference metric, $c_a$ is a measure of the correlation of image $a_i$ with the images of wings of species a, and $a_i$ is the ith pixel.

The vector-difference metric is:

$$c_a = \left| \left(a_i^{\mathrm{reconstructed}}\right)^2 - \left(a_i^{\mathrm{original}}\right)^2 \right| \qquad (4)$$

where both reconstructed and original images are vector normalized prior to this calculation.
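As a rough illustration of the two measures, the sketch below computes a Kendall rank correlation and a vector-difference score between an original image vector and a reconstruction. Several points are assumptions rather than details taken from the text: the tau-a variant of Kendall's statistic is used, the per-pixel differences of Eq. (4) are summed over all pixels, and the sum is rescaled so that identical images score 1.0. The function names are hypothetical.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall tau-a between two equal-length vectors.

    Builds n x n comparison matrices, so it is only practical for
    small inputs; it is here to show the definition, not for speed.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    n = x.size
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # The full matrix counts every unordered pair twice, so dividing by
    # n * (n - 1) gives (concordant - discordant) / number of pairs.
    return float((sx * sy).sum() / (n * (n - 1)))

def vector_difference_similarity(x, y):
    """Assumed 0-to-1 rescaling of the summed Eq. (4) difference."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Squared entries of each unit vector sum to 1, so the summed
    # absolute difference lies between 0 and 2.
    return float(1.0 - 0.5 * np.abs(x**2 - y**2).sum())

# Hypothetical pixel vectors: a close reconstruction and a poor one.
original = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
good = original + np.array([0.5, -0.5, 0.2, -0.2, 0.1])
poor = original[::-1]
```

For full 150 × 100 pixel images the pairwise matrices above would be far too large; an O(n log n) implementation such as `scipy.stats.kendalltau` (which computes the tau-b variant by default) would be the more practical choice.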

Essentially, a species classifier may be built from the first five to 10 principal eigenfunctions of a training set of wing images of a particular species. Input wing images of the same species as those used to train a classifier return high correlation coefficients when compared with that classifier. Input wing images of a different species return lower correlation coefficients. Generating multiple species classifiers and comparing an input wing image with each of them facilitates identification; the classifier to which the input image correlates most strongly, i.e. that producing the highest correlation coefficient, is predicted as the 'correct' species.

This is an acceptable identification scheme providing all the input wing images are from species upon which the classifiers have been trained. Wing images of species for which no trained classifier exists still generate correlation coefficients, although these are generally low. These images may be discriminated by setting a threshold correlation coefficient below which images may be regarded as unknowns. The setting of threshold values has been discussed elsewhere (WEEKS et al., 1997).
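The identification scheme, including the threshold for unknowns, might be sketched as follows. This is not DAISY's implementation: the class and function names are hypothetical, Pearson correlation stands in for the Kendall-τ and vector-difference measures, the threshold of 0.5 is arbitrary, and the "species" are synthetic vectors generated only to exercise the code.

```python
import numpy as np

class SpeciesClassifier:
    """Mean image plus leading eigenfunctions of one species' training set."""

    def __init__(self, training_images, n_components=5):
        A = np.asarray(training_images, dtype=float).T   # I x K
        self.mean = A.mean(axis=1)
        U, _, _ = np.linalg.svd(A - self.mean[:, None], full_matrices=False)
        self.P = U[:, :n_components]

    def score(self, image):
        """Correlation between an image and its reconstruction."""
        recon = self.mean + self.P @ (self.P.T @ (image - self.mean))
        # Pearson correlation is a stand-in for the paper's metrics.
        return float(np.corrcoef(image, recon)[0, 1])

def identify(image, classifiers, threshold):
    """Best-scoring species name, or None if no score reaches threshold."""
    scores = {name: c.score(image) for name, c in classifiers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def synthetic_species(rng, n_images, n_pixels=400):
    """Synthetic 'wing images': a base pattern plus three modes of variation."""
    base = rng.normal(size=n_pixels)
    modes = rng.normal(size=(n_pixels, 3))
    coeffs = 0.3 * rng.normal(size=(n_images, 3))
    noise = 0.01 * rng.normal(size=(n_images, n_pixels))
    return base + coeffs @ modes.T + noise

rng = np.random.default_rng(0)
species_a = synthetic_species(rng, 12)
species_b = synthetic_species(rng, 12)
classifiers = {
    "species A": SpeciesClassifier(species_a[:11]),   # 11 training images
    "species B": SpeciesClassifier(species_b[:11]),
}
known = identify(species_a[11], classifiers, threshold=0.5)
unknown = identify(rng.normal(size=400), classifiers, threshold=0.5)
```

Here the held-out image of species A is assigned to its own classifier, while a random vector that belongs to neither species falls below the threshold and is returned as `None`, i.e. treated as an unknown.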

2.5 Species identification using image reconstruction

The viability of this approach to species-level identification of biting midges was assessed using the database of wing images. For each species, 11 of its 20 specimen images were designated 'training' images with the remainder designated 'test' images. Species classifiers were generated for each of the 49 species using randomly chosen training images of the same species. The test images were correlated with each of the classifiers, with the highest correlation predicted as the correct identification.

The effect on the proficiency of identification of using different specimens to train species classifiers was assessed by using randomly picked training images to populate the training sets. The size of image training sets was also varied. The effect on identification of the number of species included was assessed by incrementing the number of classifiers (2–49) and reprocessing the test images.

In addition to this 'first-past-the-post' analysis, DAISY's identification performance was monitored in alternative ways. First, the test image was only deemed to have been identified correctly if its correlation with the 'correct' species classifier was a certain magnitude greater than its correlation with any other classifier. The number of correct identifications will tend to decrease when this disparity is larger, but confidence in a predicted identification will increase. Second, the test image was only deemed to have been identified correctly if its correlation with the 'correct' species classifier was ranked in at least the top 10 correlation values. Using this method, the number of correct identifications will increase, while the identity of a test image will have been narrowed down from 49 possible species to up to 10 species.
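The three ways of scoring an identification can be written as small decision rules over one test image's correlation coefficients. The sketch below is illustrative only; the function names and the example scores are hypothetical and are not results from the study.

```python
def first_past_the_post(scores):
    """Species whose classifier gives the highest correlation."""
    return max(scores, key=scores.get)

def wins_by_margin(scores, margin):
    """Winner only if it beats the runner-up by at least `margin`, else None."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best if scores[best] - scores[runner_up] >= margin else None

def in_top_k(scores, true_species, k):
    """True if the correct species is among the k highest correlations."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_species in ranked[:k]

# Hypothetical correlation coefficients for one test image.
scores = {"C. imicola": 0.91, "C. furens": 0.88, "F. nigra": 0.52}
winner = first_past_the_post(scores)
strict = wins_by_margin(scores, margin=0.05)
shortlisted = in_top_k(scores, "C. furens", k=2)
```

With these scores the plain winner is "C. imicola", but the 0.05 margin rule rejects the identification because the runner-up is only 0.03 behind, while the top-2 rule would accept "C. furens" as a shortlisted candidate. This mirrors the trade-off described above: a stricter margin lowers the number of accepted identifications, and a longer shortlist raises it.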

Finally, the mean correlation of each test image with the 10 species classifiers of the genus Forcipomyia and with the 39 species classifiers of the genus Culicoides was determined. The mean of these means was calculated with test images of the same species. This gives an indication of the degree of clustering of each species with the two genera. This may allow images to be discriminated into their respective genera prior to being identified to species.
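The genus-level summary amounts to two averaging steps: first over the classifiers of each genus for every test image, then over the test images of each species. A minimal sketch, assuming the correlations are held in a matrix with one row per test image and one column per classifier (the function name, labels and numbers are hypothetical):

```python
import numpy as np

def genus_clustering(scores, classifier_genus, image_species):
    """Mean-of-means correlation of each species with each genus.

    scores[i, j] is the correlation of test image i with classifier j.
    """
    scores = np.asarray(scores, dtype=float)
    genus_of = np.asarray(classifier_genus)
    species_of = np.asarray(image_species)
    result = {}
    for species in sorted(set(image_species)):
        rows = scores[species_of == species]
        result[species] = {}
        for genus in sorted(set(classifier_genus)):
            # Step 1: mean over this genus's classifiers, per image.
            per_image = rows[:, genus_of == genus].mean(axis=1)
            # Step 2: mean of those means over the species' test images.
            result[species][genus] = float(per_image.mean())
    return result

# Hypothetical example: three test images, three classifiers.
scores = [[0.9, 0.8, 0.3],
          [0.7, 0.9, 0.2],
          [0.2, 0.3, 0.8]]
classifier_genus = ["Culicoides", "Culicoides", "Forcipomyia"]
image_species = ["C. imicola", "C. imicola", "F. nigra"]
summary = genus_clustering(scores, classifier_genus, image_species)
```

A species whose value is clearly higher for its own genus than for the other could be routed to that genus's classifiers before species-level identification is attempted.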

3 Results

The highest level of accurate identification obtained for the biting midges was about 86% (fig. 1). This was achieved when each of the 49 species classifiers was trained on the largest sized training set used, 11 images. In general, accuracy was found to increase as classifiers were trained on progressively larger numbers of training images, although the magnitude of the improvement was not directly proportional to the increase in training set size (fig. 1). However, even with training sets of only three specimens more than 70% of identifications were correct.

Accuracy also depended on the precise composition of the training set, particularly for smaller sets, for which the standard deviation in the proportion of correct identifications becomes quite marked (fig. 1; note, with 11 images in the training set there is no standard deviation as all training set images are in use).

Increasing the number of species to be discriminated resulted in a decrease in the proportion of correct identifications (fig. 2). With just two classifiers, more than 98% of test images of those species were correctly identified, whilst with all 49 classifiers, the proportion declined to 86%. The form of the relationship between the proportion of correct identifications and the number of species classifiers implies that with yet more classifiers this decline would continue (fig. 2).

The identification results thus far were based on an image being correlated to each of up to 49 species classifiers, with the classifier to which the test images correlate most strongly being deemed the species to which a specimen belongs. This took little account of exactly how well a test image correlated with a particular classifier. Figure 3 shows how the accuracy of identification changed when a winning margin between first and second place classifier correlation coefficients was stipulated. Identification accuracy dropped from the high of 86% to approximately 60% when a winning margin of 0.05 was stipulated, confirming that the wings are extremely similar.

Fig. 1. DAISY identifies correctly a greater proportion of specimens as the number of images used to train the species classifiers increases. Series plotted: Kendall-τ metric; vector-difference metric

Fig. 2. DAISY identifies correctly a greater proportion of specimens as the number of species classifiers to which specimens belong decreases. Series plotted: Kendall-τ metric; vector-difference metric

Fig. 3. DAISY identifies correctly a greater proportion of specimens as the magnitude of the winning margin stipulated between the winning and second place correlation coefficient decreases. Series plotted: Kendall-τ metric; vector-difference metric

Figure 4 shows the effect on identification accuracy of accepting an identification as correct provided the correlation coefficient with the 'correct' classifier is ranked first, in the first two, first three and so on. Identification accuracy increased to more than 90% when one specified 'correct' as being one of three possible species. This was of practical importance as it shows there is considerable potential for using DAISY to reduce a set of possible identities from 49 to a very few.

Figures 5 and 6 show the degree of correlation of the test images, grouped within their species, with the classifiers representing the genera Culicoides and Forcipomyia. Both figures demonstrate that images of species within the genus Culicoides are highly correlated with classifiers representing that genus, while images of species within the genus Forcipomyia are highly correlated with classifiers representing that genus.

Throughout the above analyses, the proportion of correct identifications was slightly higher using the Kendall-τ metric rather than the vector-difference metric (figs 1, 2, 3, 4). However, the latter was substantially faster to compute.

Fig. 4. DAISY identifies correctly a greater proportion of specimens when the rank of the correlation of a specimen with its 'correct' species classifier is considered. If the correlation is ranked at least first, second or third and so on, a specimen is considered to have been correctly identified. Series plotted: Kendall-τ metric; vector-difference metric

4 Discussion

These results provide some support for the notion that the approach to automated identification of organisms embodied in DAISY is a useful one. The overall level of 86% successful identification of the 441 specimens of 49 species of biting midges is encouragingly high. This is particularly so given that whilst wing pattern has been used extensively in the taxonomy of the genus Culicoides, it has not provided an absolute means of distinguishing between species (LANE, 1981). Furthermore, if the data are divided into their respective genera and reprocessed, the 39 species classifiers representing the genus Culicoides achieve 89% successful identification (higher than predicted: fig. 2). Thus, expanding DAISY to include species of a visually similar genus served only to reduce the efficacy of identification.

This level of successful identification was achieved with species classifiers trained on only 11 specimens of each species, far fewer than the number of specimens usually available for many species where there is a high demand for identification. Using more specimens would better represent the phenotypic variation present in a species. These results suggest that the level of accuracy would increase were classifiers to be trained on more specimens (fig. 1), although the improvements may not necessarily be dramatic. Whilst the training of species classifiers takes longer with more training images, there is no time penalty associated with the actual identification phase. Thus, if more training images are available they should be used. Equally notable is the relatively high frequency of correct identifications achieved even with species classifiers trained on only three specimens (>70%; fig. 1). This is far fewer than the number of specimens which many taxonomists feel confident about using as a basis for discriminating species.

Fig. 5. Using the Kendall-τ metric, images of specimens within the genus Culicoides are highly correlated with classifiers representing that genus, while images of specimens within the genus Forcipomyia are highly correlated with classifiers representing that genus. Series plotted: species of the genus Culicoides; species of the genus Forcipomyia

Fig. 6. Using the vector-difference metric, images of specimens within the genus Culicoides are highly correlated with classifiers representing that genus, while images of specimens within the genus Forcipomyia are highly correlated with classifiers representing that genus. Series plotted: species of the genus Culicoides; species of the genus Forcipomyia

Moreover, these levels of correct identification can be achieved relatively quickly. The slide mounting of sufficient wings to establish a single species' training set (11 specimens) takes approximately 30 min, while imaging those wings takes less than 10 min (and generates a lasting data set). Training of a species classifier takes a few seconds. One of the significant modifications that has been made in recent versions of DAISY is in the speed of analysis. Once a specimen's wing has been detached, mounted, imaged and processed (approximately 4 min), the predicted specific identity may be determined from up to 49 species classifiers in a total of less than 3 s.

An overall figure of 86% successful identifications is obviously too low in itself to be of great practical value. Significant additional problems would also appear to be posed by two factors. First, the decline in the proportion of correct identifications as the number of species classifiers is increased (fig. 2) implies that the system has limits. This decrease in the proportion of correct identifications with increasing number of species to be discriminated results from an increasing overlap of characters. Unfortunately, the effective separation of visually very similar objects using a visual identification system will not always be successful.

Second, the narrowness of the 'winning margin' for correct identifications (fig. 3) suggests that the difference between a correct and an incorrect identification is typically very small. Whilst a reflection of the high level of similarity of many such closely related species, this is an undesirable property for an identification system to possess. However, on the positive side, the system provides great promise for very accurately reducing the set of possible identities of any test specimen, from all, to one of a very few species (fig. 4).

The challenge in developing an automated identification system is not to correctly identify specimens more often than not. Rather, it is to attain frequencies of correct identification that mean that economic pests can be recognized during quarantine inspection, that a given species is recognized by a single unique epithet throughout its range, and that nonspecialists can identify many of the most common components of their local faunas, thus contributing to knowledge of patterns of biodiversity (WHITTEN, 1996). To many systematists, little short of 100% accuracy is acceptable, but in reality, at species level, this is unlikely to be achieved, as perusal of specimens identified in the past and preserved in museum collections will show! Even when working with a well-known fauna, expert taxonomists do not attain such levels of accuracy, as deformed, undersized or probable hybrid individuals cause problems. Nor is a 100% accurate species identification necessarily what a user requires. A quarantine officer will certainly want to know if a fly infesting a cargo is a New World Screwworm, but if it cannot be identified to species, it is just as important for her or him to know it is not this pest but a member of a genus of dungflies. It is perhaps most realistic to argue that what is important is to avoid incorrect determination. Thus if a specimen to be identified closely resembles three extremely similar species, it is better to say it is a member of this species-complex than it is to wrongly assign it to one species of the three. It is noteworthy that, as DAISY presently uses only one character set, a wing, the use of other character sets may offer ways of resolving such problems. Thus DAISY may have a practical application as part of an identification system, eliminating large numbers of highly improbable species and reducing the final identification to a choice between three or four species, which may then be discriminated by the user examining other features, such as genitalia. Whatever the case, using the one feature, wings, DAISY has discriminatory powers that are as good as or better than those of many expert taxonomists. In blind tests, one of the authors (I.D.G.) with considerable experience of the taxa to be identified (GAULD, 1991), and working only with wing slides of five species of pimpline ichneumonid, achieved a lower rate of accurate identification than the 94% achieved by DAISY (WEEKS et al., 1997).
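The preference for naming a species-complex over risking a wrong species can be expressed as a simple decision rule. The sketch below is an assumption made for illustration, not a description of how DAISY reports its results: the tolerance value, the function names and the scores are all hypothetical.

```python
from typing import Dict, List


def candidates_within(scores: Dict[str, float], tolerance: float) -> List[str]:
    """Species whose score lies within `tolerance` of the best score, best first."""
    if not scores:
        raise ValueError("no classifier scores supplied")
    best = max(scores.values())
    close = [name for name, value in scores.items() if best - value <= tolerance]
    return sorted(close, key=scores.get, reverse=True)


def report(scores: Dict[str, float], tolerance: float = 0.05) -> str:
    """Name one species only when no other species scores almost as well."""
    close = candidates_within(scores, tolerance)
    if len(close) == 1:
        return "species: " + close[0]
    return "species complex: " + ", ".join(close)


# Made-up scores (illustration only).
ambiguous = {"sp_a": 0.90, "sp_b": 0.88, "sp_c": 0.87, "sp_d": 0.30}
clear_cut = {"sp_a": 0.90, "sp_d": 0.30}
ambiguous_result = report(ambiguous)
clear_cut_result = report(clear_cut)
```

With the invented scores, the first specimen is reported as a complex of three species for the user to resolve on other characters, while the second is named outright.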

5 Future directions

Accepting that, although not without problems, DAISY potentially offers a way of circumventing the taxonomic impediment, the question arises, how might the level of correct identifications provided by DAISY be markedly improved? Several methods suggest themselves:

(i) One possible way to discriminate very similar specimens is to use local feature analysis. This method attempts to find differences between images which emerge at the local level, such as subtle differences in the shape or pigmentation of the pterostigma. Local feature analysis would identify specimens in the same way as DAISY presently does, but instead of using PCA components which are holistic, emergent local feature maps could be generated from the PCA components in the manner described by PENIO and ATICK (1996).

(ii) An alternative or perhaps complementary method would be to train a neural network on the correlation coefficients produced by specimens of the same species. In this way, even if a specimen does not produce its highest correlation with the classifier for its own species, the pattern of its correlations with the other classifiers may indicate its species.

(iii) The likelihood of successful identification is strongly influenced by the quality of the images of specimens captured at the outset (see also WEEKS et al., 1997). Careful re-imaging of specimens which have previously been incorrectly identified can often subsequently yield correct identifications. Obtaining images when specimens are appropriately orientated and illuminated, in particular, is important. An objective method of capturing images in a more consistent fashion would potentially improve overall levels of correct identification markedly. One such method involves automatically extracting whole wings from captured images using active contour snakes (CURWEN et al., 1991). Once a wing is demarcated in this way, its rotation, orientation and scale may be readily recorded and standardized, thereby dispensing with the difficulties associated with manually aligning wings. Removing the necessity to precisely align specimen slides will move the process a step closer to developing an identification system of practical value.

(iv) The decline in the level of correct identifications with greater numbers of species classifiers (fig. 2), combined with the tendency of specimens to be more strongly correlated with classifiers for the genus to which they belong than to ones for a genus to which they don't (figs 5 and 6), suggests that a structured hierarchical approach to identification may be more appropriate. If classifiers were trained on a selection of images representing species of the same genus, it may be possible to identify specimens to genera, using genus classifiers, and then, using only the appropriate species classifiers, to species. Depending on its success, this scheme could be extended to many taxonomic levels. Of course, as with traditional keys, a specimen misidentified at, say, family level would stand no chance of being correctly identified.

(v) As a last resort, if a specimen can only be narrowed down to, for example, one of three species, the user may always refer to the original specimen where a convenient or even obvious character may be used to discriminate between species. Whilst this is perhaps an undesirable end-point, it may be deemed acceptable provided that DAISY gives the user sufficient taxonomic information to finalize an identification.
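The hierarchical scheme suggested in (iv) can be sketched as a two-stage lookup. This is a minimal illustration under stated assumptions, not DAISY's implementation: each classifier is modelled as a hypothetical function returning a correlation score, and the genus names, species names and scores are invented.

```python
from typing import Callable, Dict, Tuple

# A classifier is modelled here as any function mapping a specimen to a score.
Scorer = Callable[[dict], float]


def make_scorer(name: str) -> Scorer:
    """Toy classifier: looks up a precomputed score stored with the specimen."""
    return lambda specimen: specimen[name]


def identify_hierarchically(
    specimen: dict,
    genus_classifiers: Dict[str, Scorer],
    species_classifiers: Dict[str, Dict[str, Scorer]],
) -> Tuple[str, str]:
    """Choose the best genus first, then the best species within that genus only."""
    genus = max(genus_classifiers, key=lambda g: genus_classifiers[g](specimen))
    within = species_classifiers[genus]
    species = max(within, key=lambda s: within[s](specimen))
    return genus, species


genus_classifiers = {g: make_scorer(g) for g in ("G1", "G2")}
species_classifiers = {
    "G1": {s: make_scorer(s) for s in ("G1 a", "G1 b")},
    "G2": {s: make_scorer(s) for s in ("G2 c",)},
}

# Invented scores: "G2 c" has the highest species score but is never consulted,
# because the genus stage selects G1.
specimen = {"G1": 0.8, "G2": 0.3, "G1 a": 0.6, "G1 b": 0.7, "G2 c": 0.95}
result = identify_hierarchically(specimen, genus_classifiers, species_classifiers)
```

The example also shows the caveat noted in (iv): once the genus stage has made its choice, species outside that genus cannot be recovered, so an error at the higher level propagates to the final identification.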

The discrimination of closely related species of most groups of organisms is not a simple task, even for expert taxonomists, and it would be foolish to expect to develop a 'perfect' automated system without many iterations of testing and modification. However, as a first step down this road, the results obtained from DAISY provide an encouraging start.

Acknowledgements

We are grateful to J. BOORMAN for the use of the slides of biting midges. This work was supported by AFRC grant 40/A01764. K. J. GASTON is a Royal Society University Research Fellow.

References

ALBERCH, P., 1993: Museums, collections and biodiversity inventories. Trends Ecol. Evol. 8, 372–375.

CHALMERS, N. R., 1996: Monitoring and inventorying biodiversity: collections, data and training. In: Biodiversity, science and development. Towards a new partnership. Ed. by DI CASTRI, F.; YOUNÈS, T. Wallingford: CAB International, 171–179.

CURWEN, R. M.; BLAKE, A.; CIPOLLA, R., 1991: Parallel implementation of Lagrangian dynamics for real-time snakes. In: Proc. Brit. Mach. Vis. Conf. (Glasgow). Ed. by MOUFORTH, P. London: Springer Verlag, 29–35.

DALY, H. V.; HOELMER, K.; NORMAN, P.; ALLEN, T., 1982: Computer-assisted measurement and identification of honey bees (Hymenoptera: Apidae). Ann. Entomol. Soc. Am. 75, 591–594.

DI CASTRI, F.; ROBERTSON VERNHES, J.; YOUNÈS, T., 1992: Inventorying and monitoring biodiversity. Biol. Int., Special Issue 27, 1–28.

GASTON, K. J.; MAY, R. M., 1992: Taxonomy of taxonomists. Nature 356, 281–282.

GAULD, I. D., 1986: Taxonomy, its limitations and its role in understanding parasitoid biology. In: Insect parasitoids. Ed. by WAAGE, J.; GREATHEAD, D. London: Academic Press, 1–21.

GAULD, I. D., 1991: The Ichneumonidae of Costa Rica, 1. The subfamilies Rhyssinae, Pimplinae, Poemeniinae, Acaenitinae and Cylloceriinae. Mem. Am. Ent. Inst. 47, 1–589.

HAWKSWORTH, D. L., 1994: The identification and characterisation of pest organisms. Wallingford: CAB International. 486 pp.

HOLDEN, C., 1989: Entomologists wane as insects wax. Science 246, 754–756.

HOUSE OF LORDS, 1991: Systematic biology research. Report of the Select Committee on Science and Technology. London: HMSO. 106 pp.

JANZEN, D. H., 1993: Taxonomy: universal and essential infrastructure for development and management of tropical wildland biodiversity. In: Proc. Norway/UNEP Expert Conference on Biodiversity, Trondheim, Norway. Ed. by SANDLUND, O. T.; SCHEI, P. J. Trondheim: NINA, 100–113.

LANE, P. L., 1981: A quantitative analysis of wing pattern in the Culicoides pulicaris species group (Diptera, Ceratopogonidae). Zool. J. Linn. Soc. 72, 21–41.

LATTIN, J. D.; KNUTSON, L., 1982: Taxonomic information and services on arthropods of importance to human welfare in Central and South America. FAO Plant Protection Bull. 30, 92–95.

MAY, R. M., 1990: Taxonomy as destiny. Nature 347, 129–130.

PANKHURST, R. J., 1978: Biological Identification. London: Arnold. 104 pp.

PENIO, P. S.; ATICK, J. J., 1996: Local feature analysis: a general statistical theory for object representation. Network: Computation Neural Systems 7 (3), 477–500.

PRESS, W. H.; TEUKOLSKY, S. A.; VETTERLING, W. T.; FLANNERY, B. P., 1994: Numerical recipes in C. Cambridge: Cambridge University Press.

REID, W. V.; LAIRD, S. A.; MEYER, C. A.; GÁMEZ, R.; SITTENFELD, A.; JANZEN, D. H.; GOLLIN, M. A.; JUMA, C., 1993: Biodiversity prospecting: using genetic resources for sustainable development. Washington, DC: World Resources Institute.

TILLING, S. M., 1987: Education and taxonomy: the role of the Field Studies Council and AIDGAP. In: Nature, natural history and ecology. Ed. by BERRY, R. J.; CROTHERS, J. H. London: Academic Press, 87–96.

TURK, M.; PENTLAND, A., 1991: Eigenfaces for recognition. J. Cogn. Neuroscience 3, 71–86.

UNEP/CBD/SBSTTA/2 . . . , 1996: Report of the Subsidiary Body on Scientific, Technical and Technological Advice on the work of its second meeting. Montreal: Secretariat to the Convention on Biological Diversity.

WEEKS, P. J. D.; GASTON, K. J., 1997: Image analysis, neural networks, and the taxonomic impediment to biodiversity studies. Biodiver. Conserv. 6, 263–274.

WEEKS, P. J. D.; GAULD, I. D.; GASTON, K. J.; O'NEILL, M. A., 1997: Automating the identification of insects: a new solution to an old problem. Bull. Entomol. Res. 87, 203–211.

WEEKS, P. J. D.; O'NEILL, M. A.; GASTON, K. J.; GAULD, I. D., 1998: Species-identification of wasps using principal component associative memories. Image and Vision Computing.

WHITE, I. M.; SCOTT, P. R., 1994: Computer information resources for pest identification: a review. In: The identification and characterisation of pest organisms. Ed. by HAWKSWORTH, D. L. Wallingford: CAB International, 129–137.

WHITTEN, A., 1996: Field guides: useful tools in environmental planning and management. World Bank Environ. Department, Dissemination Notes. Washington DC, World Bank, 51, 1–4.

YU, D. S.; KOKKO, E. G.; BARRON, J. R.; SCHAALJE, G. B.; GOWEN, B. E., 1992: Identification of ichneumonid wasps using image analysis of wings. Syst. Ent. 17, 389–395.

ZHOU, Y. H.; LING, L. B.; ROHLF, F. J., 1985: Automatic description of the venation of mosquito wings from digitized images. Syst. Zool. 34, 346–358.

Authors' addresses: Dr I. D. GAULD (corresponding author), Dr P. J. D. WEEKS, Department of Entomology, The Natural History Museum, Cromwell Road, London SW7 5BD, UK; Dr M. A. O'NEILL, Oxford Orthopaedic Engineering Centre, Nuffield NHS Trust, Windmill Road, Oxford OX3 7LD, UK; Dr K. J. GASTON, Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK