A laboratory information management system for DNA barcoding workflows



This journal is © The Royal Society of Chemistry 2012 Integr. Biol.

Cite this: DOI: 10.1039/c2ib00146b

A laboratory information management system for DNA barcoding workflows†‡

Thuy Duong Vu,*a Ursula Eberhardt,b Szániszló Szöke,a Marizeth Groenewaldb and Vincent Roberta

Received 19th October 2011, Accepted 30th December 2011

DOI: 10.1039/c2ib00146b

This paper presents a laboratory information management system for DNA sequences (LIMS)

created and based on the needs of a DNA barcoding project at the CBS-KNAW Fungal

Biodiversity Centre (Utrecht, the Netherlands). DNA barcoding is a global initiative for species

identification through simple DNA sequence markers. We aim at generating barcode data for all

strains (or specimens) included in the collection (currently ca. 80 k). The LIMS has been

developed to better manage large amounts of sequence data and to keep track of the whole

experimental procedure. The system has allowed us to classify strains more efficiently as the

quality of sequence data has improved, and as a result, up-to-date taxonomic names have been

given to strains and more accurate correlation analyses have been carried out.

1 Introduction

DNA barcoding is a global initiative, which aims at streamlining

species identification through simple DNA sequence markers.19

The leading principles of the approach are (i) general agreement on one (or a few) marker regions; (ii) usage of vouchered strains; and (iii) the assembly of sequence data and the specimen metadata in public databases.38 The vision is that DNA barcoding will

open the field of species identification to non-experts and

therefore will enhance biological and medical progress at large.

This is no overstatement, because in many fields of biology,

agriculture, medicine, and even commerce, species form the key

to describing biological interactions.11,28

However, species are theoretical constructs, irrespective of

the species concept applied.11 Even putting aside the question

of the species concept adopted, the assignment of organisms to

species is often a matter of opinion, especially in microbiology.

This becomes obvious if contradictory interpretations of either

species descriptions or observations on the organism are possible,

or if the characters available for identification are ambiguous. In

addition, taxonomy is a dynamic system in which new insights

are constantly incorporated, so that identifications that were

correct according to the state of the art 10 years ago could be

wrong today. This may rarely be the case for enigmatic organisms, but it is very common in small or inconspicuous life forms (e.g. ref. 8, 16 and 22).

To actually gain new insights from species lists, it is often

necessary to investigate physiological properties of species.

The most direct avenue is the experimental approach. Notably, it is not species but only representatives of species that can be experimentally challenged. Another possibility is to record occurrences and the conditions under which species have been observed, and to draw conclusions from many such records (e.g. ref. 28). Genomic (transcriptomic, proteomic, metabolomic) data can be used to predict physiological features of species,41 though to date only a few eukaryote taxa, and often only a few individuals per taxon, have been investigated. In addition, in many cases there is only circumstantial evidence, i.e. evidence based on similarity. Most experimental and observational data reside in a disparate literature.

a Bioinformatics Group, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands. E-mail: [email protected]

b Collection Group, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands

† Published as part of an iBiology themed issue entitled "Computational Integrative Biology". Guest Editor: Prof. Jan Baumbach.

‡ Electronic supplementary information (ESI) available. See DOI: 10.1039/c2ib00146b

Insight, innovation, integration

This paper describes the results of research where the

association of molecular and a large amount of physiological

data has been analyzed (integration) in order to evaluate a

new and advanced algorithm to cluster very large amounts of

data (innovation). The latter algorithm has been included in

new software capable of managing any biological data

(innovation). This software and its capacity to manage,

cluster and classify data (insight) properly have been studied.

Integrative Biology Dynamic Article Links

www.rsc.org/ibiology PAPER

Downloaded by Yale University Library on 06 March 2012. Published on 17 February 2012 on http://pubs.rsc.org | doi:10.1039/C2IB00146B


Whatever the source of information, a lot of human input is

needed to draw conclusions about particular traits of species.

Keeping track of which taxonomic concept was applied at a

given time is impossible if no vouchers are mentioned (which is

the case for many studies).

In the CBS fungal collection we have an almost unique

situation in that many of the requirements necessary for

linking species with physiological and other properties are

already in place. Moreover, the data are easily accessible. The

CBS-KNAW fungal collection is a culture collection of fungal

strains. The organisms remain available for testing physiological

traits. A BioloMICS-based34 collection database containing

many kinds of metadata is in operation and publicly accessible

(http://www.cbs.knaw.nl/collections). Being the main source of

information on the collection, the database is continuously

curated and will keep being curated. The CBS DNA-Barcoding

project is aiming at generating barcode data for all strains

included in the collection (currently ca. 80 k). Another project,

the Fungal Growth Database,12 is assembling further ecologically

relevant data, using the same database system.

Fungi, the test-case group of this paper, are also a group of organisms in which taxonomy and identification practice have been drastically altered through the availability of PCR and sequence-based methods.2,4,20,36 Only a fraction (ca. 74 K) of

the estimated 1.5 million fungal species is formally described.18

Molecular data have revealed numerous undescribed taxa

contained in described taxa, many of which are currently

considered as so-called cryptic taxa. On the other hand, hitherto

distinguished taxa have been shown to be the same. While the

latter case is less problematic, the discovery of new taxa poses a

huge challenge for taxonomy and the management of species

metadata.

The nuclear ribosomal genes are the DNA region(s) commonly

used for fungal identification.2 The ITS (internal transcribed

spacer) is most often applied. It is likely to become the designated

fungal barcode region.35 Currently, ITS data for about 1.5% of the

estimated total of fungal species have been published.28 Taxonomic

progress and the usual sources of error like mix-ups and

misidentification have led to a situation in which an estimated

20% of the ITS data published for fungi is wrongly labelled.29

Curated databases containing reference sequence data are

disparate and exist only for very limited groups of fungi.2

With the CBS DNA barcoding project we ultimately aim to

publish a sizable reference data set of ITS and LSU (large

subunit of the nuclear ribosomal genes) that is taxonomically

validated. The curators of the CBS collection database of about

80 000 strains are constantly working on keeping identifications

up to date. However, a large proportion of the strains were last identified at the time of their accession (the oldest strains were deposited in 1907), with the methods that were then the state of the art. Our expectation is that we will find, even

among the identified strains, a considerable portion of potentially

undescribed diversity (see also ref. 6). This implies that all data

produced must be validated, and, if initially there is no suitable

reference data available, possibly re-validated at regular intervals.

In this paper, we describe two tools that were added to the

BioloMICS software34 to facilitate the DNA barcoding process.

Faced with the logistics of the CBS Barcoding project, we created the LIMS software module for generating and managing sequence data. The DNA barcoding workflow is in principle the same for all kinds of Sanger sequencing projects in which a small set of loci is amplified for a large set of DNA extracts.

The second tool described in this paper is a very versatile clustering mechanism for sequence data that automatically assigns names to groups of sequences. Many applications are possible with this module. It could be used as an identification tool to assign taxon names from a reference data set to unidentified sequences. Likewise, it could be used for species discovery, if a reference set with data for all relevant known relatives is available. Within the barcoding project of CBS, the foremost function of this tool will be taxonomic validation: validation of sequence data, but also validation of the taxonomic assignment of strains in the collection. Moreover, the similarity of the strains (read: individuals) comprised in the same clusters can be calculated for other properties recorded in the database, including physiological traits.

2 LIMS software description

The DNA barcoding lab workflow is as follows: DNA is

extracted from 94 biological samples in the 96 well format

(leaving room for controls in the PCR reactions). As a

standard, two PCR reactions for two different loci (ITS and

LSU) are carried out for each plate of extracts. Positive PCR

products are assembled in 96 well plates for bidirectional

Sanger sequencing.13 In the LIMS workflow, lists of biological

samples to be extracted, in this case strains of fungi from the CBS

collection, are imported manually into the system as tabulated

text files. Details of methods, primers, cycler programmes, etc.

are entered into the system and linked with the microplates,

extracts, PCR and cycle sequencing samples as appropriate

during their passage through the LIMS. Lists of trace-file names

are exported from the LIMS, which later facilitate automated

import of trace-files into the system. During trace-file import,

contigs of forward and reverse files are automatically assembled

and saved in the system, either in existing or newly created

sequence records, which are again automatically linked with all

appropriate records in other tables. Quality values for traces

and contigs are computed and saved. The system recognizes

primer names and can thus create different links for different loci

if the system has been set up in this way. Technical (sequence

editing) and taxonomic (BLAST) validations are carried out

manually within the system. Automated sequence validation

is also implemented (see Fig. 1).

2.1 Data structure

The data structure of the LIMS is designed based on ten tables

as illustrated in Fig. 2. Of these, only three, the strain/specimen,

DNA extraction and sequence tables, are user defined, while the

other seven tables are implemented as default in the system.

However, additional fields can be added to all tables as

required. In our application, the strain table is the strain table of the fungal collection of CBS. The DNA extraction, PCR and sequencing tables contain a unique record, with associated information, for each extract, PCR or cycle sequencing reaction. Records of the microplate table are linked to all sample records (DNA extracts, PCR or sequencing) assembled in one plate. The protocol and primer tables contain records for each


protocol and primer. They are linked as appropriate to records

in the extract, PCR and sequencing tables. Files (trace files, contigs, work instructions, gel images, etc.) are stored in the file table. In the result table, a record is created for each contig that is created or modified during automated trace file import.

The arrows in Fig. 2 illustrate the links between the tables in

the database. These links allow end-users to track data related

to a given record through all linked tables. For instance, one

can easily find all information about DNA extraction, PCR

and cycle sequencing reactions for a given sequence (strain),

starting from the strain table.
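The way such links can be followed from one table to the next may be pictured with a short sketch. The Python below is only an illustration under assumptions: the record types `Extract`, `Pcr` and `SeqReaction`, their fields, and the function `history_for_strain` are hypothetical and much simpler than the actual BioloMICS tables, and all identifiers in the example data are made up.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified records; the real LIMS tables hold many more fields.
@dataclass(frozen=True)
class Extract:
    extract_id: int
    strain_id: int

@dataclass(frozen=True)
class Pcr:
    pcr_id: int
    extract_id: int
    locus: str  # e.g. "ITS" or "LSU"

@dataclass(frozen=True)
class SeqReaction:
    seq_id: int
    pcr_id: int

def history_for_strain(strain_id, extracts, pcrs, seq_reactions):
    """Follow the links strain -> extracts -> PCRs -> cycle sequencing reactions."""
    extract_ids = {e.extract_id for e in extracts if e.strain_id == strain_id}
    pcr_ids = {p.pcr_id for p in pcrs if p.extract_id in extract_ids}
    seq_ids = {s.seq_id for s in seq_reactions if s.pcr_id in pcr_ids}
    return {
        "extracts": sorted(extract_ids),
        "pcrs": sorted(pcr_ids),
        "sequencing": sorted(seq_ids),
    }

# Made-up example data: strain 1 has one extract, two PCRs and two sequencing reactions.
extracts = [Extract(10, strain_id=1), Extract(11, strain_id=2)]
pcrs = [Pcr(100, 10, "ITS"), Pcr(101, 10, "LSU"), Pcr(102, 11, "ITS")]
seq_reactions = [SeqReaction(1000, 100), SeqReaction(1001, 100), SeqReaction(1002, 102)]

history = history_for_strain(1, extracts, pcrs, seq_reactions)
print(history)
```

In a relational database the same traversal would normally be a join over foreign keys; the in-memory version is used here only to keep the sketch self-contained.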

2.2 Implementation

The LIMS module has been implemented and embedded into BioloMICS,33 a software package for the management, identification, classification and statistical analysis of biological data. The software was originally developed to perform physiological tests of yeast strains.32,33 The GUI (Graphical User Interface) and middleware of the LIMS have been implemented in Visual Basic .NET and Visual C++ .NET. Two versions of the software are available: a Windows version and a web version using MySQL (or Oracle) for data storage. Another database management system, MongoDB,27 can also be used as an alternative to MySQL for the management and analysis of very large datasets. In our installation, the software has been installed on a central server, and end-users access the server remotely using a web browser and Citrix XenApp 6.0.39

2.3 Tools of the LIMS module

The LIMS module consists of eight (sub-)modules: data storage, sample tracking, complex query building, generation of reports, automatic importation of trace files, trace file edition, sequence identification and automated annotation of sequences. Together they manage the data produced by DNA barcoding workflows, from the creation of DNA extractions to the identification of the obtained sequences.

Data storage. This module allows end-users to save all information of every step in a Sanger sequencing workflow with a minimum of manual data entry or sample selection. For the initial assembly of samples in a plate, the user is presented with a virtual plate (see Fig. 3) to which samples can be transferred from a list of available samples, i.e. strains for DNA extraction, DNA extracts for PCR or (positive) PCR products for cycle sequencing. A number of routines are incorporated into the system to limit these lists to the appropriate candidate samples for the action at hand, e.g. completed DNA extracts (as opposed to ones that are in preparation) for PCR, or positive PCR products (as opposed to unsuccessful PCR reactions) for cycle sequencing. As long as a plate is not marked as processed (done) in the system, changes in the set-up of the plate in the lab can be incorporated using the unfill option, which returns a sample to its old position. Once samples are assembled in a plate, the sample ids and the sample positions can be duplicated for the following step, e.g. from DNA extraction plates to PCR plates, from PCR plates to PCR plates (for instance, if extracts that were produced independently are assembled for a series of different PCR reactions), or from PCR plates to cycle sequencing plates.

Details that apply to all samples in a plate, e.g. processing date or methods, can be filled in with a single action. The system administrator can determine which information needs to be entered into the system before a sample is allowed to move on to the next step in the virtual lab workflow.
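The behaviour of the virtual plate described above can be illustrated with a minimal sketch. It is written under assumptions and is not the LIMS implementation: the class `VirtualPlate`, its methods and the plate labels are hypothetical, wells are named A1 to H12, and `unfill` is modelled simply as emptying a well and handing the sample id back to the caller.

```python
ROWS = "ABCDEFGH"
COLUMNS = range(1, 13)

class VirtualPlate:
    """Hypothetical 96-well plate; each well holds a sample id or None."""

    def __init__(self, label):
        self.label = label
        self.done = False
        self.wells = {f"{row}{col}": None for row in ROWS for col in COLUMNS}

    def _check_editable(self):
        if self.done:
            raise ValueError(f"plate {self.label} is marked as done")

    def fill(self, well, sample_id):
        self._check_editable()
        if self.wells[well] is not None:
            raise ValueError(f"well {well} is already occupied")
        self.wells[well] = sample_id

    def unfill(self, well):
        """Empty a well and return the sample id so it can be made available again."""
        self._check_editable()
        sample_id, self.wells[well] = self.wells[well], None
        return sample_id

    def duplicate(self, new_label):
        """Copy sample ids and positions to a new plate for the following step."""
        copy = VirtualPlate(new_label)
        copy.wells = dict(self.wells)
        return copy

# 94 samples per plate, leaving the last two wells free for PCR controls.
extraction_plate = VirtualPlate("EXT-001")
for number, well in enumerate(list(extraction_plate.wells)[:94], start=1):
    extraction_plate.fill(well, f"sample-{number}")

pcr_plate = extraction_plate.duplicate("PCR-001")
removed = extraction_plate.unfill("A1")
```

Which two wells are reserved for controls is a choice made for this example only; the text states just that 94 of the 96 wells are used for samples.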

Fig. 1 Sequencing workflow.

Fig. 2 LIMS data structure.


All tables include automatically created serial numbers as

unique identifiers. Record labels are (semi-) automatically

created for all tables. Labels of microplates and DNA extracts

should be unique, but can be changed manually, for example

for accommodating barcode labels.

Sample tracking and visualization. Any field of the LIMS

tables can be searched. In addition, there is a graphical tool that visualizes the history of one sample, or of a number of samples at a time, showing from which extraction, PCR or cycle sequencing plate each sample comes.

The plates themselves are visualized as shown in Fig. 3. Any

information contained in the sample tables (DNA extraction,

PCR or sequencing) can be displayed in the fields of the virtual

sample plate. Plates are shown in the condition in which they

currently are. For example, if half of the samples have been

removed, the respective fields of the virtual plate will be

displayed empty. With the option show original samples the

original sample assembly is shown in the virtual plate. Fields

can be displayed in colour for indicating the success of PCR

reactions or the quality of trace files for cycle sequencing

reactions. To help manual rearrangement of samples in plates,

the fields can also be coloured according to their original

source plate.

The picture gallery function of BioloMICS34 can be used to

display gel images side by side for comparison.

Complex query building. Queries on data stored in a single table, however complex, can be run with the query interface of BioloMICS. With the help of the programming tool in BioloMICS, even more complex queries can be built that combine queries of different tables. In this way, questions spanning several tables can be answered. For instance, a list can be assembled of DNA extracts that have not produced a single successful PCR reaction or that have only produced low-quality sequence data. The programming tool can also be used to create formalized reports.
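As an illustration of a query that combines two tables, the first example mentioned above (extracts without a single successful PCR) might look as follows. This is a sketch under assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the MySQL or Oracle back end, and the table names, column names and sample rows are hypothetical rather than the actual LIMS schema.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE extract (extract_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE pcr (
        pcr_id INTEGER PRIMARY KEY,
        extract_id INTEGER REFERENCES extract(extract_id),
        locus TEXT,
        success INTEGER  -- 1 = positive PCR product, 0 = failed
    );
    INSERT INTO extract VALUES (1, 'EX-1'), (2, 'EX-2'), (3, 'EX-3');
    INSERT INTO pcr VALUES
        (1, 1, 'ITS', 1), (2, 1, 'LSU', 0),
        (3, 2, 'ITS', 0), (4, 2, 'LSU', 0);
    """
)

# Extracts that were used in at least one PCR but never gave a positive product.
query = """
    SELECT e.label
    FROM extract AS e
    WHERE EXISTS (SELECT 1 FROM pcr AS p WHERE p.extract_id = e.extract_id)
      AND NOT EXISTS (
          SELECT 1 FROM pcr AS p
          WHERE p.extract_id = e.extract_id AND p.success = 1
      )
    ORDER BY e.label
"""
failed_extracts = [row[0] for row in conn.execute(query)]
print(failed_extracts)
```

Extract EX-3 is deliberately excluded because it has not yet been used in any PCR; whether such extracts should be listed is a design decision that the text leaves open.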

Generation of reports. Currently, three kinds of reports have been implemented in the system that can be viewed and printed by the LIMS without employing third party software:

• The protocol and sample sheet are created to aid the technicians in the lab. These reports display a table representing the sample plate, and the protocols used. If appropriate, pipetting instructions are included, calculated specifically for the number of samples at hand.

• A list of trace file labels is created to be exported to the ABI 3730XL sequencer software, fulfilling the requirements of the ABI software and providing the necessary information to later import the trace files automatically into the LIMS.

• A notification is sent to the submitter of a sequencing plate when the trace files are imported into the LIMS. The report summarizes the results obtained from the trace files such as the quality of the trace files, the quality of the contigs, the consensus sequences and, if required, a summary of the BLAST1 results.

Automatic importation of trace files. Trace file importation can commence according to a pre-set schedule or can be started manually. When a trace file is imported into the file table, the software searches for a sequence record of the same DNA region and extract that has not previously been edited. If such a record is found, the trace file is linked to the existing record, the trace is added to the existing contig and the new consensus is saved. If no such record is found, a new sequence table record is created and linked to a new contig, and its consensus is saved into the record along with other relevant information and links to records of other tables.
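The record lookup during import amounts to a find-or-create step, which can be sketched as follows. The sketch is illustrative only: `SequenceRecord`, its fields and `import_trace` are hypothetical names, trace files are represented by their file names, and contig assembly and consensus calling are left out.

```python
from dataclasses import dataclass, field

@dataclass
class SequenceRecord:
    """Hypothetical sequence table record: one contig per extract and locus."""
    extract_id: int
    locus: str
    edited: bool = False
    traces: list = field(default_factory=list)

def import_trace(records, extract_id, locus, trace_name):
    """Attach a trace to an unedited record of the same extract and locus,
    or create a new record if none exists."""
    for record in records:
        if (record.extract_id == extract_id
                and record.locus == locus
                and not record.edited):
            record.traces.append(trace_name)
            return record
    record = SequenceRecord(extract_id, locus, traces=[trace_name])
    records.append(record)
    return record

# Made-up file names: forward and reverse reads join the same record,
# while a read arriving after manual editing starts a new record.
records = []
first = import_trace(records, 10, "ITS", "plate1_A01_ITS_fwd.ab1")
second = import_trace(records, 10, "ITS", "plate1_A01_ITS_rev.ab1")
first.edited = True
third = import_trace(records, 10, "ITS", "plate7_C05_ITS_fwd.ab1")
```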

The consensus sequence is automatically compared against a local reference sequence database and/or against GenBank.10 A report (see above) is then saved in the result table and an email is sent to the owner of the task as an identification and quality report. Note that the quality (or confidence) of a trace file is computed as the sum of the confidence values of all its bases. The quality of the contig is computed as the percentage of identical bases over the length of the alignment.
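The two quality values can be written down in a few lines. The following sketch assumes that per-base confidence values are available as a list of numbers and that the forward and reverse reads are given as two aligned strings of equal length with `-` for gaps; the function names, and the choice to count a gap column as a mismatch, are assumptions made for this illustration.

```python
def trace_quality(base_confidences):
    """Quality of a trace file: the sum of the confidence values of all bases."""
    return sum(base_confidences)

def contig_quality(aligned_forward, aligned_reverse):
    """Quality of a contig: percentage of identical bases over the alignment length."""
    if len(aligned_forward) != len(aligned_reverse):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_forward:
        return 0.0
    identical = sum(
        1
        for a, b in zip(aligned_forward, aligned_reverse)
        if a == b and a != "-"  # assumption: gap columns never count as identical
    )
    return 100.0 * identical / len(aligned_forward)

print(trace_quality([40, 30, 20]))       # sum of three per-base confidences
print(contig_quality("ACGT", "ACGA"))    # three of four columns agree
```

Because the trace quality is a sum, it grows with read length, so values are only directly comparable between traces of similar length.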

Trace file edition. A Sanger sequence-editing module is integrated in the BioloMICS software. Aligned trace files are displayed synchronously with the respective positions in the consensus sequence. Signal quality of trace files25 can optionally be displayed on the peaks. Base edits can be made for each

Fig. 3 LIMS virtual microplate.


base separately or in the consensus for all aligned sequences at once. Conflicts are visually tagged and can be searched for automatically. A simple base caller using the relative peak heights of the four traces allows for the (reversible) replacement of ambiguous base calls. Further editing aids include peak magnification options in x and y directions and the possibility to selectively hide each of the four traces (Fig. 4).
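How a base caller based on relative peak heights might work is sketched below. The paper does not specify the rule used in the LIMS, so everything here is an assumption: the function name `call_base`, the input format (one peak height per channel at a given position) and the cut-off `min_fraction` are hypothetical.

```python
def call_base(peak_heights, min_fraction=0.6):
    """Return the base with the highest peak if it accounts for at least
    `min_fraction` of the total signal at this position, otherwise 'N'."""
    total = sum(peak_heights.values())
    if total <= 0:
        return "N"
    base, height = max(peak_heights.items(), key=lambda item: item[1])
    return base if height / total >= min_fraction else "N"

# A clean peak is called; two peaks of similar height stay ambiguous.
clear_call = call_base({"A": 900, "C": 50, "G": 30, "T": 20})
ambiguous_call = call_base({"A": 500, "C": 480, "G": 10, "T": 10})
```

Keeping the original call alongside the replacement would make the substitution reversible, as the text requires.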

The consensus sequence is automatically saved into a text

field in the sequence table. From there it can easily be exported

to a Fasta file if required. The main advantage of including

this editing module in the system is that it is thus a one-stop

solution for DNA barcoding needs. Raw data, editing and

storage of edited data are all included in a single system. Loss

of data and work effort or involuntary duplication of efforts

through multiple users or (delayed) data transfer between

different applications can be effectively avoided. BioloMICS

also provides additional downstream DNA sequence data

analysis options,34 and an export function to GenBank.10

Sequence validation and identification. Manual sequence validation and identification can be achieved through two mechanisms. The first is direct comparison with selected sequences (or sequences of selected strains) in multiple alignments. The second is a BLAST tool1 that allows us to search for the closest matches of single or batch sequences against a local reference sequence database and/or against GenBank.10 The taxonomic conclusions drawn from the pairwise alignments or BLAST results are left to the user (in our case, to the curators of the collections).

Automatic annotation of sequences. Despite the sophistication of currently available sequence similarity search tools such as BLAST,1 with the rapid growth of biological data it remains quite difficult, and sometimes impossible, to classify and identify the very large numbers of sequences produced on a daily basis. The aim is to have a system that is able to divide a huge number of sequences into groups or clusters, and to predict taxonomic names for the sequences of each cluster, in order to have up-to-date taxonomic names associated with the sequenced material. Recently, a tool called Multilevel clustering40 has been developed to cluster massive databases. Based on this clustering tool, a new module for the LIMS has been implemented to taxonomically annotate strains and sequences automatically. In the following sections, this module is detailed and tested.

3 Evaluation of automated annotations

The goal of Multilevel clustering, like that of other clustering tools,5,14,15,21,23,30,31,42,43 is to classify homologous sequences into groups. Two sequences are considered homologous if their similarity, obtained by a variant of the Smith–Waterman pairwise local alignment function,37 is higher than a given threshold. A major difference between Multilevel clustering and other methods is that it does not compute all pairwise similarities between the objects to be classified. Instead, it only compares objects when it is necessary (i.e. when they are relatively closely related). It therefore reduces the total runtime for clustering significantly and limits memory problems. Multilevel clustering gives results similar to those of Transitivity clustering.43 However, it is able to deal with hundreds of thousands of sequences in a relatively short amount of time.40 The system allows alien, low-quality or bad-quality sequences to be flagged and further analysed by the curators. This saves the curators a lot of time, because they can concentrate on the problem cases only and need not review, on a daily basis, thousands of sequences with high quality scores.

The material and methods as well as the results of the

evaluation of the automated taxonomic annotation of strains

and sequences by Multilevel clustering are described in this

section. In order to demonstrate the ability of the tool to

properly cluster data, a well curated reference dataset of yeast

strains has been used to compare clustering results obtained on

the basis of:

1. ITS sequences only

2. LSU sequences only

3. ITS and LSU sequences combined

The correlation of the clustering results obtained using the three

datasets and physiological properties of the strains belonging to

the obtained clusters has been studied. This has been done in order

Fig. 4 LIMS trace file edition module.


to evaluate the ability of Multilevel clustering to produce

homogeneous groupings.

3.1 Material and methods

Predicting an optimal threshold to cluster sequences of a given locus. To place sequences in the right taxonomic groups and automatically annotate them, an optimal similarity threshold (OST) has to be established. Ideally, this threshold places all sequences of the same species into the same cluster and excludes sequences representing other species. To achieve this goal, a gold standard dataset containing sequences checked and validated by experts has been created for each kind of sequence. This dataset is then clustered with different thresholds. The OST is the threshold that gives the best clustering quality, as computed by the F-measure function proposed by Paccanaro et al. (ref. 30; see also below). Sequences from different loci can have different OSTs.

Quality of clustering. In order to evaluate the quality of

clustering, the result of clustering needs to be compared

against some gold standard dataset. This dataset consists of

pre-clustered sequences containing well-known and curated

information, created, updated and checked by experts. To do

this, the F-measure function proposed by Paccanaro et al.30

has been used and is described below.

Let us consider a set $V$ of sequences. Let $C = (C_1, \ldots, C_l)$ be the gold standard partition of $V$, and $K = (K_1, \ldots, K_m)$ the partition obtained by clustering objects of $V$. The F-measure function $F(K, C)$ is defined as follows:

$$F(K, C) = \frac{1}{n} \sum_{j=1}^{l} n^{j} \cdot \max_{1 \le i \le m} \left( \frac{2\, n_i^{j}}{n_i + n^{j}} \right)$$

where $n$ is the number of sequences in $V$, $n_i$ is the number of sequences in $K_i$, $n^{j}$ is the number of sequences in $C_j$, and $n_i^{j}$ is the number of sequences in $K_i \cap C_j$ for $1 \le i \le m$ and $1 \le j \le l$.

The value of F(K, C) lies between 0 and 1. Clustering results are considered very good when they closely match the gold standard dataset, i.e. when the F-measure is close or equal to 1, and very bad when the F-measure is close to 0. The more reliable the gold standard dataset is, the better the evaluation of the clustering will be.
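The F-measure defined above translates almost directly into code. The following Python sketch is a minimal illustration, assuming both partitions are given as lists of sets of sequence identifiers; the function name and the toy partitions are invented for the example.

```python
def f_measure(clusters, gold):
    """F(K, C) for a clustering K (`clusters`) against a gold standard C (`gold`).

    Both arguments are lists of sets that partition the same sequence ids.
    """
    n = sum(len(c_j) for c_j in gold)  # number of sequences in V
    if n == 0:
        raise ValueError("the gold standard must contain at least one sequence")
    total = 0.0
    for c_j in gold:
        n_j = len(c_j)
        # Best harmonic overlap 2*|K_i & C_j| / (|K_i| + |C_j|) over all clusters K_i.
        best = max(
            (2 * len(k_i & c_j) / (len(k_i) + n_j) for k_i in clusters),
            default=0.0,
        )
        total += n_j * best
    return total / n


# Invented toy partitions of five sequences.
gold = [{"a", "b", "c"}, {"d", "e"}]
clusters = [{"a", "b"}, {"c", "d", "e"}]
score = f_measure(clusters, gold)      # each gold group is matched with overlap 0.8
perfect = f_measure(gold, gold)        # identical partitions
```

In the toy example one sequence is placed in the wrong group, so both gold groups reach a best overlap of 0.8 and the overall value is about 0.8, whereas identical partitions give exactly 1.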

Annotation of strains using only one locus. Once an OST has been predicted for each locus, all the sequences of the ITS and LSU datasets are clustered with their respective OSTs. Based on the obtained groups, one can already attribute taxon names to the sequences: all sequences of a given cluster receive the same taxon name, namely the most frequent name amongst the sequences of the cluster. Since sequences of the same strain must have the same taxon name, a more accurate prediction method is proposed to attribute taxon names to the sequences, as follows.

First, it must be ensured that sequences from different clusters cannot be given the same taxonomic name. If a sequence belonging to a cluster has a taxonomic name that appears in another cluster, the index of the cluster is appended to this name in order to distinguish it from the previous name. A candidate taxonomic name (CTN) of a cluster is the taxonomic name of one of the sequences of the cluster. The quality of a CTN of a cluster is the proportion of the sequences in the cluster that have this name. However, if the CTN is given to a validated sequence of the cluster that is linked to a type strain, its quality is set to 1.
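The two cluster-level steps, name disambiguation and CTN quality, can be sketched as follows. This is an illustration under assumptions rather than the LIMS code: the data layout (a cluster as a list of `(name, is_validated_type)` tuples), the function names, the choice that the first cluster keeps the plain name, and the "name index" format are all hypothetical, and the example names are toy data.

```python
from collections import Counter


def disambiguate(clusters):
    """Append the (1-based) cluster index to any name already used by an
    earlier cluster, so that no two clusters share a taxonomic name.
    Assumption: the first cluster in which a name occurs keeps it unchanged."""
    first_seen = {}  # name -> index of the first cluster where it occurred
    renamed = []
    for idx, cluster in enumerate(clusters, start=1):
        new_cluster = []
        for name, is_type in cluster:
            if first_seen.setdefault(name, idx) != idx:
                name = f"{name} {idx}"
            new_cluster.append((name, is_type))
        renamed.append(new_cluster)
    return renamed


def cluster_ctn_qualities(cluster):
    """Return {candidate taxonomic name: quality} for one cluster.
    Quality is the share of sequences carrying the name, or 1.0 when the
    name belongs to a validated sequence linked to a type strain."""
    counts = Counter(name for name, _ in cluster)
    qualities = {name: count / len(cluster) for name, count in counts.items()}
    for name, is_type in cluster:
        if is_type:
            qualities[name] = 1.0
    return qualities


# Invented toy clusters; the boolean marks a validated type-strain sequence.
cluster_1 = [("Candida albicans", False)] * 3 + [("Candida dubliniensis", False)]
cluster_2 = [("Candida albicans", False), ("Candida tropicalis", True),
             ("Candida tropicalis", False), ("Candida tropicalis", False)]
renamed = disambiguate([cluster_1, cluster_2])
q1 = cluster_ctn_qualities(renamed[0])
q2 = cluster_ctn_qualities(renamed[1])
```

In the toy data, the stray "Candida albicans" sequence in the second cluster becomes "Candida albicans 2" with quality 0.25, while "Candida tropicalis" is raised from 0.75 to 1 because of its type-strain sequence.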

We define a CTN and its quality for a given strain as follows. Every CTN of a sequence of a strain is also a CTN of the strain. Let T be a CTN of a strain s, and Q the quality of T. Let n be the number of sequences of s having the CTN T, and N the number of all sequences of s. The quality of T with respect to s is given by $q = \frac{n \cdot Q}{N}$. The taxonomic name attributed to s is then the CTN of s with the highest quality value. All sequences of s are also annotated with this taxonomic name.
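As a worked illustration of the strain-level rule q = n·Q/N, the following Python sketch scores every CTN of a strain and returns the best one. It assumes each sequence of the strain carries one CTN together with the cluster-level quality Q of that name; the function name, the input layout, the tie-breaking rule and the example values are hypothetical.

```python
def annotate_strain(sequence_ctns):
    """Pick the taxonomic name of a strain from its sequences.

    sequence_ctns: list of (ctn, Q) pairs, one per sequence of the strain,
    where Q is the cluster-level quality of that candidate name.
    Returns (best_name, q) with q = n * Q / N.
    Assumption: ties go to the name encountered first.
    """
    total = len(sequence_ctns)  # N: all sequences of the strain
    if total == 0:
        raise ValueError("a strain needs at least one sequence")
    counts, quality = {}, {}
    for name, q_cluster in sequence_ctns:
        counts[name] = counts.get(name, 0) + 1  # n: sequences carrying this CTN
        quality[name] = q_cluster
    scores = {name: counts[name] * quality[name] / total for name in counts}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented example: three sequences of one strain, two candidate names.
name, q = annotate_strain([
    ("Debaryomyces hansenii", 0.9),
    ("Debaryomyces hansenii", 0.9),
    ("Debaryomyces fabryi", 1.0),
])
```

Here the first name scores 2 × 0.9 / 3 = 0.6 and the second 1 × 1.0 / 3 ≈ 0.33, so the strain and all three of its sequences would be annotated with the first name.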

Annotation of strains using several loci. According to ref. 35, there is agreement that there should be only one DNA barcode locus for fungi, namely the ITS. Although the use of ITS as an official fungal barcode represents a solution that will satisfy many fungal groups, it does not resolve them all (ref. 36). In addition, the aim of this study is to develop a system that can be applied not only to fungi but to other organisms as well, where a second or even more barcoding markers will be necessary for precise species identification. Therefore, an algorithm that combines different loci to annotate strains and sequences has been implemented in LIMS. For each locus, this algorithm predicts the taxonomic name of a strain together with the quality of that name. The final taxonomic name of a strain is the one with the highest quality among those produced by the individual loci. Experiments show that this combination gives better annotations than the single-locus approach (see results below).
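The combination rule amounts to taking, for each strain, the per-locus prediction with the highest quality. A minimal Python sketch, assuming the per-locus results are already available as `(name, quality)` pairs; the function name, the input layout, the tie-breaking rule and the example values are hypothetical.

```python
def combine_loci(per_locus):
    """Choose the final name of a strain from its per-locus predictions.

    per_locus: dict mapping a locus (e.g. "ITS", "LSU") to (name, quality).
    Returns (locus, name, quality) of the prediction with the highest quality.
    Assumption: on a tie, the locus listed first wins.
    """
    if not per_locus:
        raise ValueError("at least one locus is required")
    locus = max(per_locus, key=lambda key: per_locus[key][1])
    name, quality = per_locus[locus]
    return locus, name, quality


# Invented example: the LSU prediction has the higher quality and is kept.
final = combine_loci({
    "ITS": ("Debaryomyces hansenii", 0.6),
    "LSU": ("Debaryomyces fabryi", 0.9),
})
```

Because the rule only compares quality values, further loci can be added to the dictionary without changing the function.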

Comparisons of gold standard datasets. The gold standard datasets used comprise yeast strains from the CBS collection (see Table S3 in ESI‡) for which DNA sequences of the ITS and/or LSU loci are available, and that have been validated by the curators of the CBS collection (ref. 9) on the basis of their morphological, sexual, physiological and molecular properties. The ITS dataset consists of 1907 sequences from 653 strains, while the LSU dataset consists of 1999 sequences from 683 strains. In both datasets, the strains belong to 344 different species.

Comparisons of physiological properties. Eighty-eight physiological characteristics (2-keto-D-gluconate, 5-keto-D-gluconate, α,α-trehalose, acetic acid 1%, arbutin, butane 2,3-diol, cadaverine, cellobiose, citrate, creatine, creatinine, cycloheximide 0.01%, cycloheximide 0.1%, D-arabinose, D-galactonate, D-galactose, D-galacturonate, D-glucarate, D-glucitol, D-gluconate, D-glucono-1,5-lactone, D-glucosamine, D-glucose, D-glucuronate, DL-lactate, D-mannitol, D-proline, D-ribose, D-tartaric acid, D-tryptophan, D-xylose, erythritol, ethylamine, ethylene glycol, fluconazole, galactaric acid, galactitol, gentiobiose, glucosamine, glycerol, growth at pH = 3, growth at pH = 9.5, growth on 10% NaCl, growth on 16% NaCl, growth w/o biotin, growth w/o biotin & thiamin, growth w/o myo-inositol, growth w/o niacin, growth w/o PABA, growth w/o pantothenate, growth w/o pyridoxine, growth w/o pyridoxine & thiamin, growth w/o thiamin, growth w/o vitamins, imidazole, inulin, lactose, L-arabinitol, L-arabinose, levulinate, L-lysine, L-malic acid, L-rhamnose, L-sorbose, L-tartaric acid, maltose, Me α-D-glucoside, melezitose, melibiose, meso-tartaric acid, myo-inositol, nitrate, nitrite,


palatinose, propane 1,2 diol, putrescine, quinic acid, raffinose, ribitol, salicin, starch, succinate, sucrose, Tween 40, Tween 60, Tween 80, uric acid, xylitol) have been used to compare the set of strains used in both ITS and LSU datasets listed in Table 2. All physiological characteristics of the strains studied are available from the CBS database (http://www.cbs.knaw.nl/collections). The results of physiological observations can be represented by four possible states (Table 1).

To measure the physiological similarity between two strains, the method introduced by Robert et al. (ref. 32) has been used. All strains have been compared with each other using all available physiological characteristics. The global physiological similarity between two strains is the average of the similarities obtained for each of the physiological tests available for the two strains. The physiological similarity of an individual to its cluster is its average physiological similarity with all the members of the cluster to which it belongs. The physiological similarity of a cluster is represented by the minimum physiological similarity of its members.
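The three levels of physiological similarity (pair of strains, individual to cluster, whole cluster) can be sketched as below. The per-test scoring is an assumption made only for illustration, since the method of Robert et al. is not detailed here: identical states score 1, a variable state against a different state scores 0.5, positive against negative scores 0, and tests unknown for either strain are skipped. Treating single-strain clusters as having similarity 1 and averaging over the other members only are further assumptions, and all names and profiles are invented.

```python
def state_similarity(a, b):
    """Assumed per-test score for two states among '+', '-' and 'V'."""
    if a == b:
        return 1.0
    if "V" in (a, b):
        return 0.5
    return 0.0


def strain_similarity(p, q):
    """Average per-test similarity over the tests performed on both strains.
    p, q: dicts mapping a test name to '+', '-', 'V' or '?' (unknown)."""
    shared = [t for t in p if p[t] != "?" and q.get(t, "?") != "?"]
    if not shared:
        return None  # no test is available for both strains
    return sum(state_similarity(p[t], q[t]) for t in shared) / len(shared)


def member_similarity(name, cluster, profiles):
    """Average similarity of one strain to the other members of its cluster.
    Assumption: a strain alone in its cluster gets a similarity of 1."""
    scores = [strain_similarity(profiles[name], profiles[other])
              for other in cluster if other != name]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 1.0


def cluster_similarity(cluster, profiles):
    """Similarity of a cluster: the minimum over its members."""
    return min(member_similarity(name, cluster, profiles) for name in cluster)


# Invented profiles for three strains over four tests.
profiles = {
    "A": {"D-glucose": "+", "lactose": "-", "maltose": "+", "nitrate": "?"},
    "B": {"D-glucose": "+", "lactose": "-", "maltose": "V", "nitrate": "-"},
    "C": {"D-glucose": "+", "lactose": "+", "maltose": "+", "nitrate": "-"},
}
sim_abc = cluster_similarity(["A", "B", "C"], profiles)
sim_single = cluster_similarity(["A"], profiles)
```

Taking the minimum over members means that a single atypical strain lowers the similarity of the whole cluster, which is what makes the value usable as a flag for possibly misgrouped strains.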

3.2 Results and discussion

In this section, the results obtained from the automatic

annotation method on the two gold standard datasets (ITS

and LSU) are explained. The correspondence between physio-

logical and molecular (ITS and LSU) data is discussed.

The computed OSTs for ITS and LSU sequences are 0.9897

and 0.9968 respectively (see Fig. 5). The corresponding coefficients

of quality (F-measure) of clustering are 0.8988 and 0.874.

Based on the above OSTs, a summary of the automated annotation procedure can be found in Table S3 (ESI‡), where the columns contain the following information: (1) the CBS strain numbers, (2) the species names of the strains given by the curators, (3) the average percentage of physiological similarity between the strains belonging to the same species names provided by the curators, (4) the number of ITS sequences per strain, (5) the automated species name annotation using the ITS locus only, (6) the average percentage of physiological similarity between the strains belonging to the same species names provided by the automated annotation using the ITS locus only, (7) the number of LSU sequences per strain, (8) the automated species name annotation using the LSU locus only, (9) the average percentage of physiological similarity between the strains belonging to the same species names provided by the automated annotation using the LSU locus only, (10) the automated species name annotation using the ITS and LSU loci, (11) the average percentage of physiological similarity between the strains belonging to the same species names provided by the automated annotation using the ITS and LSU loci.

The automated annotation using the ITS dataset suggests

that 341 different taxonomic names or clusters are present.

When using the LSU dataset alone, 366 taxonomic names or

clusters are proposed. When combining both ITS and LSU

datasets, 362 taxonomic names or clusters are found. The

percentage of correctly identified strains is 90.5% for the ITS

locus alone, 85.94% for the LSU locus alone and 92.19%

when combining the ITS and LSU loci.

Table 2 shows the list of 62 strains (about 9.5% of the total

number of strains tested) that are not clustered in complete

agreement with the classification provided by the curators

using ITS sequences. Among them, 21 strains (3.2%) have a

predicted name containing a number, meaning that the given

species are divided into sub-clusters because of the high OST

used for clustering. The remaining 41 strains (6.3%) belonging

to 23 different species are clustered into 13 other groups.

Therefore, physiological data have been used to confirm or refute the results of the automated clustering system.
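The percentages quoted for the ITS dataset can be re-derived from the strain counts given in the text (653 strains, 62 of them not in complete agreement, 21 of those in sub-clusters). The short Python check below is only a consistency check on those reported figures; the variable and function names are invented.

```python
total_strains = 653                      # strains in the ITS gold standard dataset
not_in_agreement = 62                    # strains listed in Table 2
sub_clustered = 21                       # predicted name carries a number
regrouped = not_in_agreement - sub_clustered  # strains placed in other groups


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_not_in_agreement = percent(not_in_agreement, total_strains)
share_sub_clustered = percent(sub_clustered, total_strains)
share_regrouped = percent(regrouped, total_strains)
share_correct = percent(total_strains - not_in_agreement, total_strains)
```

The four shares evaluate to 9.5, 3.2, 6.3 and 90.5 per cent, matching the values reported for the ITS locus.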

Physiological similarities of the clusters provided by the

curators and the ones obtained by the annotation method have

been computed to study the physiological diversity of strains

within a species. The minimum and average physiological

similarities within species are 0.8017 and 0.9901, respectively.

These numbers are 0.7929 and 0.9896 for the clusters when

using the ITS dataset for automated annotation. It should be

noted that the average physiological similarity is very high in

both cases, partly because more than half of the clusters contain

only one strain. There are only 12 species (3.5%: Candida ernobii, Candida tropicalis, Candida versatilis, Cryptococcus vishniacii, Cyberlindnera jadinii, Debaryomyces hansenii, Kluyveromyces lactis var. drosophilarum, Kluyveromyces lactis var. lactis, Kluyveromyces marxianus, Rhodotorula mucilaginosa, Torulaspora delbrueckii, Trichosporon asahii) in which the minimum physiological similarity is less than 93% and which could therefore be considered as problematic. Debaryomyces hansenii and Torulaspora delbrueckii have been reinvestigated by the curators and it appears that they have been properly annotated, even if their internal average physiological similarities are relatively low (0.8839 and 0.8927). For Candida ernobii, Cyberlindnera jadinii and Kluyveromyces lactis var. drosophilarum, physiological similarities are even lower, at 0.8017, 0.85 and 0.8579 respectively.

It is known that physiological features can vary significantly

among strains that belong to the same species. Although this is

not an uncommon observation, low physiological similarity

values provide an indication of possible problems. In the case

Table 1 Possible states for physiological testing results

State            | Meaning
? (unknown)      | The test has not been performed
- (negative)     | The strain could not grow on the test
+ (positive)     | The strain could grow on the test
V (variable)     | The strain could sometimes grow or not on the test

Fig. 5 The qualities (F-measure values) when clustering the ITS and

LSU datasets with different thresholds. Here the thresholds are

increased by 1/1000 from 0.98 to 1. The OST for clustering a

dataset alone is the one producing the best quality.


Table 2 List of incorrectly identified strains, associated species names given by the curators, automatic annotation using ITS and physiological similarities of strains within species names predicted by ITS. A name contains a number if it is a sub-cluster of a given species

No. | CBS number | Species name by curation | Species name by ITS | Phy. sim. by ITS
1 | CBS 2103 | Kluyveromyces lactis (Dombrowski) Van der Walt var. drosophilarum (El-Tabey Shehata et al.) Siderburg & Lachance | Kluyveromyces lactis (Dombrowski) Van der Walt var. lactis | 0.7929
2 | CBS 2105 | Kluyveromyces lactis (Dombrowski) Van der Walt var. drosophilarum (El-Tabey Shehata et al.) Siderburg & Lachance | Kluyveromyces lactis (Dombrowski) Van der Walt var. lactis | 0.7929
3 | CBS 8883 | Kluyveromyces lactis (Dombrowski) Van der Walt var. drosophilarum (El-Tabey Shehata et al.) Siderburg & Lachance | Kluyveromyces lactis (Dombrowski) Van der Walt var. lactis | 0.7929
4 | CBS 945 | Cryptococcus albidus (Saito) C.E. Skinner var. albidus | Cryptococcus vishniacii Vishniac & Hempfling var. vishniacii | 0.8288
5 | CBS 142 | Cryptococcus albidus (Saito) C.E. Skinner var. albidus | Cryptococcus vishniacii Vishniac & Hempfling var. vishniacii | 0.8288
6 | CBS 1926 | Cryptococcus albidus (Saito) Skinner var. kuetzingii (Fell & Phaff) Fonseca, Scorzetti & Fell | Cryptococcus vishniacii Vishniac & Hempfling var. vishniacii | 0.8288
7 | CBS 8351 | Cryptococcus adeliensis Scorzetti, Petruscu, Yarrow & Fell | Cryptococcus vishniacii Vishniac & Hempfling var. vishniacii | 0.8288
8 | CBS 789 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
9 | CBS 796 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
10 | CBS 6066 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
11 | CBS 5230 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
12 | CBS 5138 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
13 | CBS 1793 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
14 | CBS 2330 | Debaryomyces fabryi M. Ota | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
15 | CBS 792 | Debaryomyces subglobosus (Zach) Lodder & Kreger-van Rij | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
16 | CBS 1796 | Debaryomyces subglobosus (Zach) Lodder & Kreger-van Rij | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
17 | CBS 5921 | Debaryomyces nepalensis S. Goto & Sugiyama | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
18 | CBS 11845 | Debaryomyces psychrosporus | Debaryomyces hansenii (Zopf) Lodder & Kreger-van Rij | 0.8772
19 | CBS 8821 | Hanseniaspora clermontiae N. Cadez, Poot, Raspor & M.Th. Smith | Hanseniaspora meyeri N. Cadez, Poot, Raspor & M.Th. Smith | 0.8846
20 | CBS 2219 | Candida oleophila Montrocher | Candida zeylanoides (Castellani) Langeron & Guerra var. zeylanoides | 0.9011
21 | CBS 4515 | Candida santamariae Montrocher var. santamariae | Candida zeylanoides (Castellani) Langeron & Guerra var. zeylanoides | 0.9011
22 | CBS 4261 | Candida santamariae Montrocher var. santamariae | Candida zeylanoides (Castellani) Langeron & Guerra var. zeylanoides | 0.9011
23 | CBS 5838 | Candida santamariae Montrocher var. membranifaciens Montrocher | Candida zeylanoides (Castellani) Langeron & Guerra var. zeylanoides | 0.9011
24 | CBS 7623 | Trichosporon asteroides (Rischin) M. Ota | Trichosporon asahii Akagi ex Sugita, Nishikawa & Shinoda var. asahii | 0.903
25 | CBS 255 | Cyberlindnera suaveolens (Klocker) Kurtzman, Robnett & Basehoar-Powers | Cyberlindnera saturnus (Klocker) Kurtzman, Robnett & Basehoar-Powers | 0.9114
26 | CBS 1707 | Cyberlindnera mrakii (Wickerham) Kurtzman, Robnett & Basehoar-Powers | Cyberlindnera saturnus (Klocker) Kurtzman, Robnett & Basehoar-Powers | 0.9114
27 | CBS 2169 | Schwanniomyces occidentalis Klocker var. persoonii (Van der Walt) Kurtzman & Robnett | Schwanniomyces occidentalis Klocker var. occidentalis | 0.9147
28 | CBS 5674 | Meyerozyma caribbica (Vaughan-Martini, Kurtzman, S.A. Meyer & O'Neill) Kurtzman & Suzuki | Candida carpophila (Phaff & M.W. Miller) Vaughan-Martini, Kurtzman, S.A. Meyer & O'Neill | 0.9152
29 | CBS 9966 | Meyerozyma caribbica (Vaughan-Martini, Kurtzman, S.A. Meyer & O'Neill) Kurtzman & Suzuki | Candida carpophila (Phaff & M.W. Miller) Vaughan-Martini, Kurtzman, S.A. Meyer & O'Neill | 0.9152
30 | CBS 883 | Filobasidiella bacillispora | Filobasidiella neoformans | 0.9248
31 | CBS 919 | Filobasidiella bacillispora | Filobasidiella neoformans | 0.9248
32 | CBS 6289 | Filobasidiella bacillispora | Filobasidiella neoformans | 0.9248
33 | CBS 8273 | Filobasidiella bacillispora | Filobasidiella neoformans | 0.9248
34 | CBS 6955 | Filobasidiella bacillispora | Filobasidiella neoformans | 0.9248
35 | CBS 2008 | Schwanniomyces pseudopolymorphus (C. Ramírez & Boidin) C.W. Price & Phaff | Schwanniomyces polymorphus (Klocker) C.W. Price & Phaff var. africanus Van der Walt, Nakase & Suzuki | 0.9507
36 | CBS 3024 | Schwanniomyces vanrijiae (Van der Walt & Tscheuschner) M. Suzuki & Kurtzman var. vanrijiae | Schwanniomyces polymorphus (Klocker) C.W. Price & Phaff var. africanus Van der Walt, Nakase & Suzuki | 0.9507
37 | CBS 20 | Rhodotorula glutinis (Fresenius) F.C. Harrison | Rhodotorula graminis di Menna | 0.9534
38 | CBS 6020 | Rhodosporidium babjevae Golubev | Rhodotorula graminis di Menna | 0.9534
39 | CBS 7809 | Rhodosporidium babjevae Golubev | Rhodotorula graminis di Menna | 0.9534
40 | CBS 5001 | Sporidiobolus ruineniae Holzschu, Tredick & Phaff var. ruineniae | Sporidiobolus ruineniae Holzschu, Tredick & Phaff var. coprophilus | 0.9655
41 | CBS 1752 | Candida versatilis (Etchells & T.A. Bell) S.A. Meyer & Yarrow | Candida versatilis (Etchells & T.A. Bell) S.A. Meyer & Yarrow 117 | 0.98
42 | CBS 1731 | Candida versatilis (Etchells & T.A. Bell) S.A. Meyer & Yarrow | Candida versatilis (Etchells & T.A. Bell) S.A. Meyer & Yarrow 117 | 0.98
43 | CBS 1924 | Candida viswanathii Viswanathan & H.S. Randhawa ex R.S. Sandhu & H.S. Randhawa | Candida viswanathii Viswanathan & H.S. Randhawa ex R.S. Sandhu & H.S. Randhawa 131 | 1
44 | CBS 9494 | Candida zemplinina Sipiczki | Candida zemplinina Sipiczki 114 | 1
45 | CBS 8176 | Candida maritima (Siepmann) van Uden & H.R. Buckley ex S.A. Meyer & Yarrow | Candida maritima (Siepmann) van Uden & H.R. Buckley ex S.A. Meyer & Yarrow 204 | 1
46 | CBS 6897 | Candida humilis (E.E. Nel & Van der Walt) S.A. Meyer & Yarrow | Candida humilis (E.E. Nel & Van der Walt) S.A. Meyer & Yarrow 173 | 1
47 | CBS 4736 | Filobasidium capsuligenum (Fell, Statzell, I.L. Hunter & Phaff) Rodrigues de Miranda | Filobasidium capsuligenum (Fell, Statzell, I.L. Hunter & Phaff) Rodrigues de Miranda 126 | 1
48 | CBS 277 | Hanseniaspora vineae Van der Walt & Tscheuschner | Hanseniaspora vineae Van der Walt & Tscheuschner 41 | 1
49 | CBS 281 | Hanseniaspora valbyensis Klocker | Hanseniaspora valbyensis Klocker 43 | 1
50 | CBS 4550 | Pichia fermentans Lodder | Pichia fermentans Lodder 37 | 1
51 | CBS 244 | Pichia membranifaciens (E.C. Hansen) E.C. Hansen | Pichia membranifaciens (E.C. Hansen) E.C. Hansen 16 | 1
52 | CBS 6985 | Rhodosporidium sphaerocarpum S.Y. Newell & Fell | Rhodosporidium sphaerocarpum S.Y. Newell & Fell 233 | 1
53 | CBS 349 | Rhodosporidium toruloides Banno | Rhodosporidium toruloides Banno 1 | 1
54 | CBS 2382 | Rhodotorula mucilaginosa (A. Jorgensen) F.C. Harrison var. mucilaginosa | Rhodotorula mucilaginosa (A. Jorgensen) F.C. Harrison var. mucilaginosa 3 | 1
55 | CBS 2247 | Saccharomyces cerevisiae Meyen ex E.C. Hansen var. cerevisiae | Saccharomyces cerevisiae Meyen ex E.C. Hansen var. cerevisiae 60 | 1
56 | CBS 1479 | Saccharomyces cerevisiae or paradoxus? | Saccharomyces cerevisiae or paradoxus? 66 | 1
57 | CBS 1489 | Saccharomyces cerevisiae or paradoxus? | Saccharomyces cerevisiae or paradoxus? 67 | 1
58 | CBS 1194 | Saccharomyces cerevisiae or paradoxus? | Saccharomyces cerevisiae or paradoxus? 68 | 1
59 | CBS 2702 | Candida albicans (Robin) Berkhout var. albicans | Candida albicans (Robin) Berkhout var. albicans 75 | 1
60 | CBS 1522 | Sporidiobolus johnsonii Nyland | Sporidiobolus johnsonii Nyland 110 | 1
61 | CBS 5800 | Wickerhamomyces silvicola (Wickerham) Kurtzman, Robnett & Basehoar-Powers | Wickerhamomyces silvicola (Wickerham) Kurtzman, Robnett & Basehoar-Powers 112 | 1
62 | CBS 6736 | Zygoascus hellenicus M.Th. Smith | Zygoascus hellenicus M.Th. Smith 226 | 1


of the three remaining ‘‘problematic’’ species further extensive

investigations based on additional molecular and sexual data

have been done and the original grouping made by the

curators has been confirmed.

After investigating the physiological similarity of species in

the ITS gold standard dataset, the physiological similarity of

clusters obtained by the annotation method is looked at to

point out incorrectly-grouped clusters. There are 16 clusters

(4.7%: Candida carpophila, Candida ernobii, Candida tropicalis,

Candida zeylanoides, Cryptococcus vishniacii, Cyberlindnera jadinii,

Cyberlindnera saturnus, Debaryomyces hansenii, Filobasidiella

neoformans, Hanseniaspora meyeri, Kluyveromyces lactis var.

lactis, Kluyveromyces marxianus, Rhodotorula mucilaginosa,

Schwanniomyces occidentalis, Torulaspora delbrueckii, Trichosporon

asahii) having physiological similarities lower than 93%. It is

interesting to see that among them, five clusters (Candida ernobii,

Candida tropicalis, Cyberlindnera jadinii, Kluyveromyces marxianus

and Torulaspora delbrueckii) with intra-specific physiological

similarities of 0.8017, 0.9216, 0.85, 0.9149 and 0.8927 (respectively)

are the same as five of the species of the gold standard dataset. In

other words, they are clustered in complete agreement with the

classification provided by the curators. The cluster with the

predicted taxon name Rhodotorula mucilaginosa with an intra-

specific physiological similarity of 0.9145 is a sub-cluster of the

given species because of the high OST used for clustering. The

10 remaining clusters (Candida carpophila, Candida zeylanoides,

Cryptococcus vishniacii, Cyberlindnera saturnus, Debaryomyces

hansenii, Filobasidiella neoformans, Hanseniaspora meyeri,

Kluyveromyces lactis var. lactis, Schwanniomyces occidentalis,

Trichosporon asahii) have a problem of different species sharing

the same ITS sequences. As an example, the cluster with

predicted taxon name Debaryomyces hansenii contains strains

of five different species: Debaryomyces hansenii, Debaryomyces

fabryi M. Ota, Debaryomyces subglobosus (Zach) Lodder &

Kreger-van Rij, Debaryomyces nepalensis S. Goto & Sugiyama,

and Debaryomyces psychrosporus. These species have highly similar ITS sequences and therefore cannot be distinguished by ITS sequences, but they can be distinguished by other methods such as RAPD-PCR analysis, PCR-RFLP or PCR fingerprinting (ref. 24). Note that the ten clusters above are the clusters predicted by ITS that contain incorrectly identified strains in Table 2. The Schwanniomyces polymorphus, Rhodotorula graminis and Sporidiobolus ruineniae var. coprophilus clusters remaining from Table 2, with physiological similarities of 0.9507, 0.9534 and 0.9655, face the same problem. To resolve the problem of different species sharing the same ITS sequences, one or more additional markers or loci would be needed in order to obtain a correct species identification. This also shows that the ITS locus is not discriminative enough in some taxonomic groups.

The next experiment is the annotation of strains and sequences of the LSU dataset in the same way as for the ITS dataset. The optimal threshold predicted for sequence clustering of this dataset is 0.9968, with a quality of 0.874. The percentage of correctly identified strains in this case is 85.94%. Again, the physiological similarity within clusters is very high, both for the species of the LSU gold standard dataset and for the clusters obtained by clustering. Using the LSU dataset alone, grouping results very similar to those for the ITS dataset have been obtained (see Table S3 in ESI‡).

When combining the ITS and LSU loci, the percentage of correctly identified strains rises to 92.19%, indicating that the quality of the identification process improves when more markers are used. Using both loci, only seven ‘‘problematic’’ clusters remain (Kluyveromyces lactis var. lactis, Debaryomyces hansenii, Cyberlindnera saturnus, Candida carpophila, Schwanniomyces occidentalis, Rhodotorula graminis and Candida zeylanoides) with low intra-specific physiological similarities (0.7929, 0.8836, 0.9114, 0.9146, 0.9147, 0.9492 and 0.9692).

In conclusion for this section, biologists often identify species in order to predict their physiological features. Our experiments show a strong correspondence between molecular species identification and physiological features. Therefore, the physiological properties of strains belonging to a given taxonomic group can to some extent be predicted on the basis of molecular identification. Conversely, physiological properties together with molecular markers can be used to detect wrongly identified strains. It must be noted that, although representatives of the same species are physiologically quite similar in the data obtained by this study, the identification of species based on physiological features alone is not reliable, since many different species can share the same physiological features. In addition, there are also species whose strains can vary significantly in their physiological features, as can be seen in some of the clades obtained in the analyses done during this study.

One can also use Transitivity clustering (ref. 43) to cluster the sequences of the ITS and LSU datasets. The optimal thresholds predicted for them are 0.9826 and 0.9901, with F-measures of 0.8945 and 0.8959. The percentages of correctly identified strains using Transitivity clustering are 87.44% and 86.53% for the two datasets alone, and 91.42% when combining them. Thus, Transitivity clustering and Multilevel clustering both perform well on small datasets. However, on large sequence datasets (i.e. >500 000) Transitivity clustering can hardly be used, while Multilevel clustering scales well (ref. 40). It also has to be taken into account that the markers have to be of good quality and that the signal they give should be meaningful in the identification process.

4 Conclusion and future work

The described LIMS has been used by the barcoding group of

the CBS-KNAW Fungal Biodiversity Centre since January

2011 (for about 9 months). During this period of time, more

than 11 k DNA extractions, more than 29 k PCR reactions

and almost 42 k cycle sequencing reactions have been carried out.

More than 17 k sequences have been edited within the system.

Owing to other changes implemented around the same period in

the lab work, the effect of the LIMS on the productivity cannot

be easily determined. The LIMS has improved the quality of the

obtained data by avoiding many errors or making the remaining

ones traceable. Within the CBS project, the system does not

require any typing of sample ids; therefore, the risk of confusing

sample ids in the electronic system is minimized. The LIMS

makes the task of lab managers more efficient. The latter can

easily monitor progress, success rates of protocols, decide about

(redo) strategies, and track the fate of individual samples. In terms

of work organisation, one of the greatest advantages of the


system is that generally all data are available and accessible through the same interface, which allows the division of labour by process (and not by sample). This signifies a major step from cottage-style to industry-style production.

The LIMS has been created for the particular purpose of

DNA barcode data production, based on the needs of the

CBS-KNAW Barcoding project. One of the leading principles has been to implement procedures that are commonly used in DNA Sanger sequencing projects in the code and to use the aforementioned BioloMICS programming tool for creating routines that are more specific, i.e. for scripts that manage the communication with automated lab equipment. Scripts can also be used to create regular progress reports and to generate lists of samples destined to enter specific workflows. Automated validation is implemented in the barcoding process.

Not all possible applications of the automated clustering and naming procedure have been explored yet. Within the DNA barcoding project we experience that, after implementation of the LIMS and 96-well based methods, sequence validation is the most severe bottleneck, even before venturing into the necessary task of repeated validation as more (validated) data accumulate. The current lack of validated data is generally perceived as a setback for fungal research and its applications (ref. 2, 3, 7, 28 and 36). With the designation of the fungal barcode region (expected by the end of 2011; ref. 35), it is expected that more validated fungal (ITS) sequences will become available. Even though the number of strains in the CBS collection is only a fraction of the total number of fungal species, and represents only the culturable biodiversity, it is still going to be a very valuable dataset for the mycological community at large, not least because of its ca. 8000 (ex) type and authentic strains (ref. 26).

New sequencing technologies will boost the availability of sequence data in many areas of mycology. While experimental approaches often reveal a considerable variability in complex traits of fungi, simple characters like fermentation skills in yeasts or, for example, extrolites in Aspergillus (e.g. ref. 17 and 24) tend to be species specific. The described tool can be applied either for checking assumptions about species' properties or for exploring them. Even if traits are not strictly species specific, for example sensitivity to certain antimycotics, clustering results could still give an indication about cures if a pathogen has been identified to species level.

Acknowledgements

We thank Janneke Bloem and Nathalie van de Wiele for

testing the LIMS.

References

1 S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang,W. Miller and D. J. Lipman, Gapped blast and psi-blast: a newgeneration of protein database search programs, Nucleic AcidsRes., 1997, 25(17), 3389–3402.

2 D. Begerow, H. Nilsson, M. Unterseher and W. Maier, Currentstate and perspectives of fungal DNA barcoding and rapididentification procedures, Appl. Microbiol. Biotechnol., 2010, 87,99–108.

3 M. I. Bidartondo and 256 other authors, Preserving accuracy ingenbank, Science, 2008, 319, 1616.

4 M. Blackwell, D. S. Hibbett, J. W. Taylor and J. W. Spatafora,Research coordination networks: a phylogeny for kingdom fungi(deep hypha), Mycologia, 2006, 98, 829–837.

5 E. Bolten, A. Schliep, S. Schneckener, D. Schomburg andR. Schrader, Clustering protein sequences-structure prediction bytransitive homology, Bioinformatics, 2001, 17, 935–941.

6 P.M. Brock, H. Doring andM. I. Bidartondo, How to know unknownfungi: the role of a herbarium, New Phytol., 2009, 181(3), 719–724.

7 T. D. Bruns, A. E. Arnold and K. W. Hughes, Fungal networksmade of humans: unite, fesin, and frontiers in fungal ecology,New Phytol., 2008, 177, 586–588.

8 E. N. Cianciola, T. R. Popolizio, C. W. Schneider and C. E. Lane,Using molecular-assisted alpha taxonomy to better understand redalgal biodiversity in bermuda, Diversity, 2010, 2, 946–958.

9 CBS databases. http://www.cbs.knaw.nl/databases/.

10 NCBI databases. http://www.ncbi.nlm.nih.gov/.

11 K. de Queiroz, Species concepts and species delimitation, Syst. Biol., 2007, 56(6), 879–886.

12 R. P. de Vries, A. Wiebenga and V. Robert, Fungal growth database: linking growth to genome, http://www.fung-growth.org.

13 U. Eberhardt, Methods for DNA barcoding fungi, in DNA Barcodes: Methods and Protocols, ed. J. W. Kress and D. L. Erickson, Humana Press.

14 A. J. Enright, S. Van Dongen and C. A. Ouzounis, An efficient algorithm for large-scale detection of protein families, Nucleic Acids Res., 2002, 30(7), 1575–1584.

15 A. J. Enright, S. Van Dongen and C. A. Ouzounis, Protein families and tribes in genome sequence space, Nucleic Acids Res., 2003, 31(15), 4632–4638.

16 D. Erpenbeck and G. Woheide, On the molecular phylogeny of sponges (Porifera), Zootaxa, 2007, 1668, 107–126.

17 D. M. Geiser, M. A. Klich, J. C. Frisvad, S. W. Peterson, J. Varga and R. A. Samson, The current status of species recognition and identification in Aspergillus studies, Mycology, 2007, 59, 1–10.

18 D. L. Hawksworth, The magnitude of fungal diversity: the 1.5 million species estimate revisited, Mycol. Res., 2001, 105(12), 1422–1432.

19 P. D. Hebert, A. Cywinska, S. L. Ball and J. R. de Waard, Biological identifications through DNA barcodes, Proc. R. Soc. London, Ser. B, 2003, 270, 313–321.

20 T. Y. James, F. Kauff, C. L. Schoch and 70 co-authors, Reconstructing the early evolution of fungi using a six-gene phylogeny, Nature, 2006, 443, 818–822.

21 S. Kim and J. Lee, Bag: a graph theoretic sequence clustering algorithm, Int. J. Data Min. Bioinf., 2006, 1(2), 178–200.

22 T. K. Konstantinos and J. M. Tiedje, Genomic insights that advance the species definition for prokaryotes, Proc. Natl. Acad. Sci. U. S. A., 2005, 102(7), 2567–2572.

23 A. Krause, J. Stoye and M. Vingron, Large scale hierarchical-clustering of protein sequences, BMC Bioinformatics, 2005, 6, 15.

24 The yeasts, a taxonomic study, ed. C. P. Kurtzman, J. W. Fell and T. Boekhout, 2011.

25 C. B. Lawrence and V. V. Solovyev, Assignment of position-specific error probability to primary DNA sequence data, Nucleic Acids Res., 1994, 22(7), 1272–1280.

26 S. E. Miller, DNA barcoding and the renaissance of taxonomy, Proc. Natl. Acad. Sci. U. S. A., 2007, 104(12), 4775–4776.

27 MongoDB. http://www.mongodb.org/.

28 R. H. Nilsson, K. Abarenkov, K. H. Larsson and U. Koljalg, Molecular identification of fungi: rationale, philosophical concerns, and the UNITE database, Open Appl. Inf. J., 2011, 5, 81–86.

29 R. H. Nilsson, E. Kristiansson, M. Ryberg, K. Abarenkov, K. H. Larsson, U. Koljalg and C. Fairhead, Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective, PLoS One, 2006, 1(1), e59.

30 P. Paccanaro, J. A. Casbon and M. A. Saqi, Spectral clustering of protein sequences, Nucleic Acids Res., 2006, 34(5), 1571.

31 S. Rahmann, T. Wittkop, J. Baumbach and M. Martin, Exact and heuristic algorithms for weighted cluster editing, Comput. Syst. Bioinf. Conf., 2007, 6, 391–401.

32 V. Robert, J. E. de Bien and G. L. Hennebert, Allev, a new program for computer-assisted identification of yeasts, Taxon, 1994, 43, 433–439.

33 V. Robert, P. Evrard and G. L. Hennebert, Bccm/allev 2.0 an automated system for identification of yeasts, Mycotaxon, 1997, 64, 455–463.

34 V. Robert, S. Szoke, J. Jabas, T. D. Vu, O. Chouchen, E. Blom and G. Cardinali, Biolomics software: biological data management, identification, classification and statistic, Open Appl. Inf. J., 2011, 5, 87–98.

35 C. Schoch, et al. The nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi, Proc. Natl. Acad. Sci. U. S. A., 2011.

36 K. A. Seifert, Progress towards DNA barcoding of fungi, Mol. Ecol. Resour., 2009, 9, 83–89.

37 T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, J. Mol. Biol., 1981, 147, 195–197.

38 M. Y. Stoeckle and P. D. Hebert, Barcode of life, Sci. Am., 2008, 299, 82–86.

39 Citrix system. http://www.citrix.com/.

40 T. D. Vu, S. Szoke, C. Wiwie, J. Baumbach and V. Robert, Multilevel clustering for curation of massive biological data, 2011. In preparation.

41 R. A. Wilson and N. J. Talbot, Fungal physiology a future perspective, Microbiology, 2009, 155, 3810–3815.

42 T. Wittkop, J. Baumbach, F. P. Lobo and S. Rahmann, Large scale clustering of protein sequences with force - a layout based heuristic for weighted clustering editing, BMC Bioinformatics, 2007, 8(1), 396.

43 T. Wittkop, D. Emig, S. Lange, S. Rahmann, M. Albrecht, J. Morris, S. Boker, J. Stoye and J. Baumbach, Partitioning biological data with transitivity clustering, Nat. Methods, 2010, 7, 419–420.
