
Published online 8 May 2009. Nucleic Acids Research, 2009, Vol. 37, No. 11, e78. doi:10.1093/nar/gkp295

TARGeT: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences

Yujun Han, James M. Burnette III and Susan R. Wessler

Department of Plant Biology, University of Georgia, Athens, GA 30602, USA

Received March 7, 2009; Revised and Accepted April 15, 2009

ABSTRACT

Gene families compose a large proportion of eukaryotic genomes. The rapidly expanding genomic sequence database provides a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases, where data often lag behind the genomic sequence data. Here we report a user-friendly web-based pipeline named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid 'seed' query to (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum, and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and has functionality that is not offered elsewhere.

INTRODUCTION

A major discovery of eukaryote genome projects is that unexpectedly large numbers of genes are members of gene families. Gene families comprise 49% of the genes in Caenorhabditis elegans, 41% in Drosophila melanogaster, 38% in Homo sapiens, 65% in Arabidopsis thaliana and 77% in Oryza sativa L. ssp. japonica (1–5). Variation in the sizes of gene families among closely related species indicates that gene duplication and gene family diversification are ongoing processes (6–8).

Duplicate genes arise in several ways, including whole-genome duplication (9–11) and segmental duplication (12,13). Segmental duplication events can be further classified into tandem and interspersed (14). A tandem duplication event can result from either homologous (15) or nonhomologous recombination mechanisms (16), while interspersed duplication events are mainly caused by the activity of transposable elements (TEs) (17–20).

Gene family members can be detected by clustering genes based on their similarity (21,22), and new members can be identified through similarity comparison to known members. Many gene family databases have been established, including Pfam (23), TreeFam (24) and PANTHER (25). While these gene family databases are useful resources, they are not updated at the same rapid pace as newly generated genomic sequences. Researchers interested in particular gene families often have to perform their own searches to obtain the most current collection of sequences.

The identification of gene family members using sequence similarity searches is often complicated by the detection of homologs from other gene families. Phylogenetic analysis is a powerful tool to identify homologs of interest and to provide additional information about gene function and evolution. To this end, researchers can perform manual searches using publicly available programs such as BLAT (26), Wise2 (27), BLAST (28), FASTA (29) and HMMER (30), followed by sequence alignment and phylogenetic analysis. However, these procedures can be complicated, as they often require extensive manual curation, particularly if homologous regions need to be extracted from genomic sequences. While this is a manageable problem for a small gene family, it can be a tedious and time-consuming process when the target gene family is large. More significantly, the quality of the results often suffers.

In addition to the more traditional gene families, TEs can also be viewed as members of 'special' gene families that are able to duplicate themselves by the activity of element-encoded proteins. TEs often constitute the largest component of eukaryotic genomes, and their identification and classification are essential to accurate genome annotation (31,32). However, as with large gene families, the very high copy numbers of some TEs make their retrieval from genomic sequence and their characterization an extremely difficult task. The increasing pace of genomic sequencing projects demands a computer-assisted pipeline that can rapidly and accurately identify and characterize gene families.

Several automated pipelines have been developed to ease homolog identification, but most are limited to protein or expressed sequence tag (EST) databases. For example, PhyloBLAST (33), Pyphy (34), HoSeqI (35), PhyloGena (36) and TRIBE-MCL (37) perform BLASTP searches and retrieve data from protein databases. SimESTs uses TBLASTN to search EST databases (38). Because these programs only compare protein-coding sequences, they will miss any mutational events that occur within noncoding regions.

TARGeT (Tree Analysis of Related Genes and Transposons) is a program to streamline the process of retrieving, annotating and analyzing both gene families and TE families from a genomic database. The core of the TARGeT pipeline is an algorithm called putative homolog identifier (PHI) that uses a series of steps to predict gene structure using BLAST results. From the predicted gene structure, PHI extracts the amino acid sequences of putative homologs for use in subsequent phylogenetic analysis. We have compared TARGeT with two pipelines, FGF and GFScan, which can also be used to retrieve gene families from genomic databases. Results are presented showing that TARGeT significantly outperforms both programs and adds several layers of functionality not present in existing programs. To make it easier for users, especially nonspecialists, TARGeT was implemented as a user-friendly web-based pipeline (http://target.iplantcollaborative.org). All initial input for TARGeT is organized on a web form and the results are presented in the browser. All results and supporting files are documented and are available for download. TARGeT provides several points where results can be inspected and analyses can be repeated.

To whom correspondence should be addressed. Tel: +1 706-542-1810; Fax: +1 706-542-1805; Email: sue@plantbio.uga.edu

2009 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

METHODS

TARGeT can use either protein or DNA sequence as the query. BLASTN searches are used for DNA queries, while TBLASTN is used for protein queries. The pipeline that uses TBLASTN is the focus of this article because it is more complex and may have wider application. TARGeT uses Muscle (39) to calculate the multiple alignment and TreeBest (24) to generate the phylogenetic tree of the putative homologs with the neighbor-joining method (40). The other functions of TARGeT are carried out by several Perl scripts developed by the authors.

Rice genomic data were obtained from GenBank (41,42), with accession numbers from NC_008394 to NC_008405. Maize genomic data were downloaded from the Maize Genome Sequencing Project (http://www.maizesequence.org, version Dec 2008). Sorghum genomic data were from the Sorghum Bicolor Genome Project (http://www.jgi.doe.gov, version 2008, Sorbi1 assembly).

There are five main steps in the TARGeT pipeline, with a checkpoint at the end of each: (i) preparation of the query when multiple sequences are to be submitted, (ii) BLAST search (either BLASTN or TBLASTN), (iii) homolog prediction, (iv) multiple alignment and (v) phylogenetic tree estimation (Figure 1). Details of each step are presented in the 'Results' section using the ascorbate peroxidase (APx) gene family as an example.

TARGeT can be accessed on a web server, where all data used and generated by TARGeT are entered in a log file. TARGeT output is presented in a single webpage that uses nested tabs to organize the data, images and re-submission forms for each TARGeT run during a session. There is a final tab for each run, called Provenance, where the user can view the parameters used by TARGeT in a log file and also download an archive that includes all files and images for offline viewing and analysis. The output includes the XML log file, BLAST results in image and text format, PHI results in image and text format, multiple alignments in FASTA format, and the phylogenetic tree in Newick and JPEG formats.

RESULTS

Searching for APx gene family in rice

Rice and Arabidopsis serve as model plant monocot and dicot species, respectively. They diverged from a common ancestor 120–200 million years ago (43) and their genomes are fully sequenced (1244). Thus, they provide excellent opportunities to evaluate the cross-species searching ability of TARGeT. We searched the rice APx gene family using the Arabidopsis APx protein sequence as query and compared the results generated by TARGeT to the published data. The goal of this exercise was to see how well TARGeT would perform at predicting the rice APx family members. We chose APx because it is a small but important gene family that has been well annotated in both Arabidopsis and rice. Based on the literature, there are as many as nine APx family members in Arabidopsis (45) and eight in rice (46) (Table 1). The APx family shares sequence similarity with several other peroxidase families (47) and, as such, is a good dataset to test the ability of TARGeT to discriminate between closely related protein families.

BLAST search. To improve the chances of finding target gene family members, multiple queries can be submitted as long as they are homologs. An optional multiple alignment step is provided for users to select sequences from conserved regions (Figure 1A). As an example, for the APx gene family we selected as query the sequences from the well-aligned (boxed) region in Figure 2.

To aid users in viewing the BLAST result, TARGeT produces an image showing a rough estimation of BLAST high-scoring pair (HSP) numbers and conserved regions along the length of each query sequence (Figures 1B and 3). This is helpful for a quick overall view, especially when the BLAST output is large. In this way, the user can see the information used by TARGeT and, if necessary, modify the query in a subsequent BLAST search. For example, TAPX, which is one of the Arabidopsis APx genes, is 426 amino acids long. Using this full-length sequence as the query, low copy regions can be detected at the beginning and at the end (Figure 3A). Readers should note that the number of HSPs (up to 50) is much larger than the number of known APx genes in the rice genome. This inconsistency is largely due to the existence of other gene families that share sequence similarity with the APx gene family. As shown in later steps, true homologs belonging to the APx gene family will be discerned from those of other families. Using the full-length TAPX sequence as the query, only three APx homologs were found in rice (data not shown). However, five APx homologs were found when the sequence from the boxed region (see Figure 2) was used as the new query (Figure 3B).

Figure 1. Map of the five main steps of the TARGeT pipeline. Users are able to inspect the results of each step before going on to the next step. (A) Preparation of the query when more than one sequence is being used. This is an optional step and its output is shown in Figure 2. (B) BLAST search. Results are shown in Figure 3. (C) Homolog identification by PHI. The algorithm is explained in Figure 4 and the result of this step is shown in Figure 5. (D) Multiple alignment. (E) Tree building.

Table 1. The APx gene family homologs of Arabidopsis, rice, maize and sorghum

Arabidopsis                  Rice                                                                              Maize         Sorghum
Gene name(a)  Accession no.  TARGeT ID(b)  Gene name(a)  Accession no.  Missed rate (%)(c)  Error rate (%)(d)  TARGeT ID(b)  TARGeT ID(b)
APX1          AT1G07890      TOAPx_1       –             Os09g0538600   –                   –                  TZAPx_1       TSAPx_1
APX2          AT3G09640      TOAPx_2       OsAPx4        Os08g0549100   0.41                0.41               TZAPx_2       TSAPx_2
APX3          AT4G35000      TOAPx_3       OsAPx7        Os04g0434800   0                   0                  TZAPx_3       TSAPx_3
APX4          AT4G09010      TOAPx_4       OsAPx6        Os12g0178100   4.03                0.37               TZAPx_4       TSAPx_4
APX5          AT4G35970      TOAPx_5       OsAPx1        Os03g0285700   1.62                0.4                TZAPx_5       TSAPx_5
APX6          AT4G32320      TOAPx_6       OsAPx5        Os12g0178200   0                   1.1                TZAPx_6       TSAPx_6
APX7          AT1G33660      TOAPx_7       OsAPx3        Os04g0223300   1.63                1.22               TZAPx_7       TSAPx_7
SAPX          AT1G77490      TOAPx_8       OsAPx8        Os02g0553200   0                   0                  TZAPx_8       TSAPx_8
TAPX          AT4G08390      TOAPx_9       –             Os08g0522400   –                   –                  TZAPx_9       TSAPx_9
                             TOAPx_10      OsAPx2        Os07g0694700   1.22                0.41               TZAPx_10
                             TOAPx_11      –             Os04g0602100   –                   –                  TZAPx_11

(a) Names used for previously identified APx genes.
(b) Names assigned by TARGeT to predicted APx homologs.
(c) The 'missed' rate is calculated by dividing the number of missed amino acid residues that are at the ends of the sequence by the length of the query.
(d) The 'error' rate is calculated by dividing the number of incorrect amino acid assignments by the length of the corresponding region in the previously published rice APx protein sequence.

Putative homolog identification. Several factors make it difficult to identify reliable homologs from BLAST output and result in a high false positive rate (48–50). Lack of explicit treatment of frameshifts and introns is also a disadvantage of TBLASTN (51). To solve these problems, we developed a program called PHI, which takes into account the e-value (default 0.01) as well as a second parameter called the minimal match percentage (MMP, defaults to 70%) to find reliable homologs. The two main stages in PHI (grouping and refinement) are explained below.

Grouping. Introns or low-similarity regions can break a complete alignment into smaller HSPs. In addition, when a frameshift occurs, TBLASTN produces separate HSPs. To retrieve the intact sequence of each homolog or pseudogene, PHI sorts the HSPs based on position and strand in the genomic sequence. In this step, HSPs that are from the same homolog are grouped together by the sequence position of query and subject (Figure 4A, top part). Two HSPs are assumed to belong to different groups if they are separated by a distance greater than the maximum intron length (a parameter adjustable by the user, defaults to 8000 nt) or if they are on different strands. When there is more than one way to connect the HSPs (which can happen when there are repetitive domains in the query), PHI uses an overall HSP score to determine the correct order. A match percentage is calculated by dividing the sum length of the matches in each group by the length of the query. If this number is greater than the MMP (see Figure 2), the group is sent to the refinement stage. HSPs that fail to satisfy the MMP are available to interested users as a record file.
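The grouping rule can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the `HSP` record, the function names and the example coordinates are hypothetical, the sketch does not resolve alternative HSP orderings by score, and the match percentage is computed as the fraction of query positions covered (an assumption made here so that overlapping HSPs are not counted twice).

```python
from dataclasses import dataclass

@dataclass
class HSP:
    q_start: int  # query start (amino acids, 1-based, inclusive)
    q_end: int    # query end (inclusive)
    s_start: int  # leftmost subject coordinate (nt)
    s_end: int    # rightmost subject coordinate (nt)
    strand: str   # '+' or '-'

def group_hsps(hsps, max_gap_nt=8000):
    """Chain HSPs that share a strand and lie within max_gap_nt of each other."""
    groups = []
    for hsp in sorted(hsps, key=lambda h: (h.strand, h.s_start)):
        last = groups[-1][-1] if groups else None
        same_locus = (
            last is not None
            and last.strand == hsp.strand
            and hsp.s_start - last.s_end <= max_gap_nt
        )
        if same_locus:
            groups[-1].append(hsp)
        else:
            groups.append([hsp])
    return groups

def match_percentage(group, query_len):
    """Fraction of query positions covered by at least one HSP of the group."""
    covered = set()
    for h in group:
        covered.update(range(h.q_start, h.q_end + 1))
    return len(covered) / query_len

# Hypothetical HSPs for a 100-residue query.
hsps = [
    HSP(1, 40, 1000, 1120, "+"),
    HSP(41, 100, 1500, 1680, "+"),
    HSP(1, 30, 50000, 50090, "+"),   # too far from the first two: new group
    HSP(10, 90, 2000, 2243, "-"),    # other strand: new group
]
groups = group_hsps(hsps)
kept = [g for g in groups if match_percentage(g, query_len=100) > 0.7]
```

With these made-up coordinates, three groups are formed and the group covering only 30 query residues is set aside, mirroring the MMP filter described above.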

Refinement. After the grouping step, several potential problems often remain in the HSPs of each group. A demonstration figure to illustrate some problems is shown in the lower part of Figure 4A. First, there is an overlap (indicated by a red triangle) between HSP 1 and HSP 2; second, there is an intron (darker area) that has been falsely translated and included in HSP 3; finally, there is a small area in the query (pink region) that has no HSP in the BLAST result, which reflects the failure to detect a small exon due to its insignificant e-value. In the refinement stage, the most likely split position is detected within the overlapping region. Introns in the HSPs are removed and a second-round BLAST search is performed to find the missing exons. Several result files will be generated after this step, including the homologous sequences in both DNA and protein FASTA formats.

Figure 2. Multiple alignment of Arabidopsis APx protein sequences. Sequences in the boxed region were extracted to form the query sequences. APx7 was not included because it aligns poorly.

Resolving the boundary between two overlapping HSPs. In TBLASTN outputs, two successive HSPs often overlap due to coincident similarity beyond the true boundaries, resulting in misalignment between the query and the subject. An example is shown in Figure 4B (boxed regions). In this example, the end of HSP 1 overlaps the beginning of HSP 2 by five amino acids, corresponding to amino acids 32–36 of the query (red residues GLDDK), and to 90212–90216 (red residues GLDMQ) and 90138–90142 (blue residues GVEDK) of the subject. PHI determines the most likely correct boundary by choosing the alignment that has the highest alignment score from all of the possible alignments within the overlapping region. For the example shown in Figure 4B, there are six possible alignments. A score is calculated for each alignment using the BLOSUM62 matrix, and any amino acid that aligns to a gap or a stop codon will be penalized 12 points. The third alignment in Figure 4B has the highest score, and thus PHI assumes that the true boundary in the subject is between the two aspartic acid residues. After the true boundary is located, additional amino acids will be trimmed off the HSPs (MQ in HSP 1 and GVE in HSP 2). For the rice APx gene family, this step trimmed 21 amino acid residues on average from each homolog.
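The boundary search on the Figure 4B example can be sketched in a few lines of Python. This is an illustrative reconstruction rather than PHI's code: the function names are hypothetical, and only the BLOSUM62 entries needed for the residue pairs in this example are included.

```python
# Subset of BLOSUM62 limited to the residue pairs that occur in this example.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("L", "L"): 4, ("D", "D"): 6, ("K", "K"): 5,
    ("D", "M"): -3, ("K", "Q"): 1, ("L", "V"): 1, ("D", "E"): 2,
}
GAP_OR_STOP_PENALTY = -12  # penalty for a query residue facing a gap or stop

def pair_score(q, s):
    """Score one aligned query/subject residue pair."""
    if s in "-*":
        return GAP_OR_STOP_PENALTY
    if (q, s) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(q, s)]
    return BLOSUM62_SUBSET[(s, q)]  # the matrix is symmetric

def best_split(query, left_subject, right_subject):
    """Try every split k: columns [0, k) come from the left HSP, the rest
    from the right HSP. Return the best k and its score."""
    n = len(query)
    best_k, best_score = 0, None
    for k in range(n + 1):
        score = sum(pair_score(query[i], left_subject[i]) for i in range(k))
        score += sum(pair_score(query[i], right_subject[i]) for i in range(k, n))
        if best_score is None or score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Overlap from Figure 4B: query GLDDK, HSP 1 ends GLDMQ, HSP 2 starts GVEDK.
split, score = best_split("GLDDK", "GLDMQ", "GVEDK")
```

A five-residue overlap gives six candidate splits (k = 0 to 5). Here the best split is k = 3, that is, between the two aspartic acid residues of the query, so MQ is trimmed from HSP 1 and GVE from HSP 2, in agreement with the text.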

Identifying small introns. The function of this step is to identify and remove introns that appear as gaps within the HSPs. Any gap in the subject that has a length greater than the minimum intron length parameter (user-adjustable parameter, default 60 nt) is identified as an intron and will be removed, resulting in two (smaller) new HSPs (Figure 4C and D). For each rice APx homolog, TARGeT identified on average 1.3 introns, corresponding to 41.9 falsely translated amino acids.
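As a rough illustration of this step, the sketch below splits one gapped alignment row wherever a run of gap characters corresponds to more than the minimum intron length. The function name and the toy alignment row are hypothetical, and the sketch assumes that each alignment column stands for three nucleotides and that the caller passes whichever row of the TBLASTN alignment carries the gap characters.

```python
import re

def split_at_introns(gapped_row, min_intron_nt=60):
    """Return (start, end) column ranges that remain after long gap runs
    (putative introns) are cut out of an alignment row."""
    pieces, start = [], 0
    for m in re.finditer(r"-+", gapped_row):
        if len(m.group()) * 3 > min_intron_nt:  # 3 nt per alignment column
            if m.start() > start:
                pieces.append((start, m.start()))
            start = m.end()
    if start < len(gapped_row):
        pieces.append((start, len(gapped_row)))
    return pieces

# Toy row: a 25-column gap (75 nt, treated as an intron) and a
# 5-column gap (15 nt, kept as an ordinary alignment gap).
row = "A" * 30 + "-" * 25 + "C" * 40 + "-" * 5 + "D" * 10
pieces = split_at_introns(row)
```

The short gap stays inside the second piece, so the single HSP becomes two new HSPs, as in Figure 4C and D.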

Identifying small exons. Small exons will be missed by BLAST searches when their alignments do not meet the e-value cut-off. Such small exons may be found by increasing the e-value. However, for a large database, simply increasing the e-value could increase the computational burden of TARGeT, and there is no guarantee that all exons will be identified because the suitable e-value is unknown. To improve the prediction of small exons, PHI can perform a second-round BLAST search using a small database containing only the sequences of putative homologs (including the predicted intronic and flanking regions). Because e-value calculation is dependent in part on the size of the database, short alignments to the original query sequence(s) may now be significant (Figure 4D). For each rice APx homolog, this second round of BLAST identified on average 1.6 additional exons and 33.4 amino acids.
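The dependence of the e-value on database size can be made concrete with the standard bit-score relation E = m · n · 2^(−S′), where m is the query length, n the database length and S′ the bit score. The sketch below ignores BLAST's effective-length corrections, and the query length, database sizes and bit score are hypothetical values chosen only for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate BLAST e-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

CUTOFF = 0.01          # default e-value cut-off quoted in the text
QUERY_LEN = 250        # hypothetical query length (amino acids)
GENOME_DB = 4e8        # hypothetical whole-genome database size
HOMOLOG_DB = 2e5       # hypothetical database of putative homologs only

short_exon_score = 33  # hypothetical bit score of a short exon alignment
e_genome = expected_hits(short_exon_score, QUERY_LEN, GENOME_DB)
e_small = expected_hits(short_exon_score, QUERY_LEN, HOMOLOG_DB)
```

Under these assumptions the same alignment fails the 0.01 cut-off against the whole genome but passes it against the small second-round database, which is the effect the second BLAST round relies on.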

Illustration of PHI output. After the refinement stage, an image is generated that provides a view of the predicted gene structure for each putative homolog (Figure 1C and Supplementary Figure 1). Features of this image include the similarity between each putative homolog and its query, and the locations of exons, introns, premature stop codons (represented by asterisks in the BLAST output) and frameshifts. Frameshifts are identified by comparing HSPs that are close to each other (less than 5 amino acids apart by default) and are on the same strand but in different reading frames. In the demonstration figure, putative pseudogenes may be genes with premature stop codons or frameshifts, which are marked with red or blue dots, respectively (Supplementary Figures 7 and 8).

Using default parameters, 46 putative rice APx homologs were identified and clustered into two groups based on their gene structures (Figure 5 and Supplementary Figure 1). There are 11 homologs in the small group, among which TOAPx_2-8 and TOAPx_10 were found to correspond to the known rice APx genes OsAPx1-OsAPx8 (Table 1). For the remaining 35 putative homologs, comparison of their sequences and gene structures revealed that they are not APx homologs but are instead from other peroxidase gene families (data not shown).

To assess the accuracy of homolog sequences retrieved by TARGeT, we considered two situations (this is not a step of TARGeT). One situation might occur at the ends of the query-target alignment, where the program failed to identify some amino acids at the end. We refer to this as 'missing'; it can occur when the ends of homolog sequences are not as well conserved as the sequences within. By comparing the homolog sequence to the query sequence, the numbers of 'missing' amino acids were counted manually. For example, if the query is 100 amino acids and the alignment is from 5 to 97, the missing number of this homolog is 4 + 3 = 7. The 'missed' rate is calculated by dividing the number of missed amino acids by the length of the query (7% in the above example). In contrast, we refer to an 'error' as a situation where the program incorrectly predicts amino acids within a homolog sequence. By comparing the homolog sequence to the previously published rice APx protein sequence, mismatched amino acids were counted manually as the 'error' number of this homolog. The 'error' rate is calculated by dividing the number of incorrect amino acid assignments by the length of the corresponding region in the previously published rice APx protein sequence. The missed and error rates may vary for each predicted homolog sequence because they depend on the level of conservation between the homolog and the query sequences. For the rice APx example above, the average missed rate is 1.11% and the average error rate is 0.49% (Table 1).

Figure 3. TARGeT output provides a rough visualization of the BLAST result. The X-axis is the length of the query; the Y-axis is the number of BLAST HSPs. The gray gradient shows the similarity, which is calculated by dividing the sum of identities and similarities by the number of the aligned amino acids along the HSP. Darker represents higher similarity at that position.

Figure 4. The sorting and refinement stages of the PHI program. See the text for details. (A) In the grouping stage, alignments are sorted and grouped. Dark bars are queries and colored bars are homologs. Each group corresponds to one putative homolog. The green group is shown in detail to illustrate potential problems. (B) Two overlapping HSPs together with six possible alternative positions are shown. The separation that produces the highest score in the overlapping region is noted with a red check. (C) An HSP that includes an intron. The intron is detected and cut out by PHI, resulting in two separated HSPs. Red asterisks represent premature stop codons. (D) Figure presentation of the result after the refinement stage. There is no overlap between HSPs 1 and 2. HSP 3 in (C) is separated by the small intron into new HSP 30 and 40. An additional exon (5) was found and is shown in pink.
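The two accuracy measures can be written down as small Python functions. The authors counted these values manually; the function names and the toy sequences below are hypothetical, and the error-rate function assumes that the predicted and published sequences have already been aligned to equal length over the corresponding region.

```python
def missed_rate(query_len, aln_start, aln_end):
    """Percentage of query residues left uncovered at the two ends of the
    alignment (coordinates are 1-based and inclusive)."""
    missed = (aln_start - 1) + (query_len - aln_end)
    return 100.0 * missed / query_len

def error_rate(predicted, published):
    """Percentage of positions where the predicted residue differs from the
    published one over the corresponding, equal-length region."""
    if len(predicted) != len(published):
        raise ValueError("sequences must cover the same region")
    errors = sum(p != r for p, r in zip(predicted, published))
    return 100.0 * errors / len(published)

# Example from the text: query of 100 residues aligned from 5 to 97.
example_missed = missed_rate(100, 5, 97)      # (4 + 3) / 100
# Toy sequences with one mismatch in five positions.
example_error = error_rate("MKTAY", "MKSAY")
```

For the worked example in the text, the missed count is 4 + 3 = 7 residues, which gives a missed rate of 7%.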

Multiple alignment and tree estimation. If users are satisfied with the putative homologs found by TARGeT, they can either download the sequences in FASTA format or let TARGeT use the data to generate a phylogenetic tree. Users also have the option to employ other tree estimation methods by downloading the alignment and using the software of their choice. The phylogenetic tree and the figure showing the tree are generated by TreeBest. When there are many homologs, names on the figure will be difficult to read because the figure size cannot be varied. To solve this problem, users can download the Newick file and draw their own tree using software such as TreeView (52). We have also provided two more solutions on the server. The first is to use Jalview (53) and the second is to copy the Newick format tree file and submit it to PhyloWidget (54), which is a powerful web-based tree viewer.

From the TARGeT-generated tree of APx homologs (shaded region in Figure 6), it is clear that the known APx family homologs are separated from the other putative homologs. Consideration of both gene structures (Figure 5 and Supplementary Figure 1) and positions in the phylogenetic tree (Figure 6) led to the identification of three putative new rice APx genes (TOAPx_1, TOAPx_9 and TOAPx_11) that have high similarity to Arabidopsis APx3 (identities = 80%, positives = 92%), APx6 (identities = 62%, positives = 77%) and APx4 (identities = 71%, positives = 82%), respectively. To provide evidence that these are real genes, these sequences were used as queries against the rice cDNA database in GenBank. Each gene matched several cDNAs (data not shown).

Figure 5. TARGeT output of the gene structure of rice APx family members. (A) Exon-intron structure of 11 reliable rice APx homologs detected by TARGeT. All 46 putative homologs are in Supplementary Figure 1. (B) A larger figure of TOAPx_9 from (A). Query and subject names are shown on the left. '+' or '−' indicates the strand of the hit. Unmatched query regions at the ends of each homolog are in blue. Black or gray gradient bars represent the exons. Darker represents higher similarity. Numbers flanking each gene structure are positions of the subject, while numbers above and below the exons are the positions of the query. Red numbers indicate discontinuous predicted exons. Putative new APx homologs are specially marked.

Figure 6. An unrooted phylogenetic tree of all rice APx family members predicted by TARGeT. Previously characterized APx gene names are in brackets. The shaded region contains the true rice APx homologs. Bootstrap values greater than 70 are shown.

Searching for APx gene family members in maize and sorghum

To further evaluate the cross-species search ability of TARGeT, we searched for APx gene families in maize and sorghum using the same query that was used to search for rice APx genes. The reasons for choosing maize and sorghum are as follows. First, at the time of the final analysis for this study, the available maize and sorghum sequences were incomplete. Maize is being sequenced using a BAC-by-BAC approach, while sorghum was sequenced using a whole-genome shotgun approach. As such, they are more representative of the available genomic databases than the complete rice sequence. Second, search results of maize and sorghum can be compared with the rice and Arabidopsis output. Finally, the APx gene families in maize and sorghum have not as yet been characterized.

We identified 11 APx homologs in maize and 9 in sorghum (Supplementary Figures 2–5). To get a comprehensive view of the APx family in plants, we produced a phylogenetic tree with MEGA (55) using the published APx data from Arabidopsis and the data predicted by TARGeT for rice, maize and sorghum (Figure 7). APx gene homologs are clustered into five main clades (labeled A–E) with members from all species. These data suggest that ancient duplications preceded species divergence. The putative new rice APx homologs TOAPx_1, TOAPx_9 and TOAPx_11 are in clades B, D and E, respectively. Except for two maize homologs in clade D, there is only one representative for each species in clades D and E. This may be due to the effect of gene dosage balance on these two clades (56–58). In addition to the main clades, there are several putative orphan clades that are missing genes from one or more species. This may be due to either gene loss or insufficient sequence data.

Figure 7. An unrooted phylogenetic tree of the APx homologs of rice, maize, sorghum and Arabidopsis. This tree was generated with MEGA version 4 using the neighbor-joining method with pairwise deletion and p-distance. Five main clades are labeled from A to E. A main clade is defined as a minimal group of homologs that can be found in all species. The remaining homologs are classified into orphan clades O1–O3. Bootstrap values higher than 70 are shown.

Searching DNA TE families in rice

TARGeT is a powerful tool for rapid TE identification, characterization and phylogenetic analysis. We have illustrated this by using TARGeT to search for TEs in the rice genome, using as query conserved transposase sequences from five DNA TE superfamilies. The queries were constructed from known TE protein sequences that were downloaded from Repbase (59) and additional sequences annotated as part of another study (data not shown). Here we focus on the TARGeT results for the Tc1/mariner superfamily because it has been well annotated and characterized in rice.

The Tc1/mariner superfamily is widespread in plant and animal genomes (60). A previous study (60) annotated 34 coding mariner-like elements (MLEs) from two partially sequenced rice genomes (14 from the indica database and 20 from the japonica database). Here we used TARGeT to search the complete japonica database and in 1 min generated a phylogenetic tree that was consistent with that of Feschotte and Wessler (60). TARGeT successfully retrieved the 20 MLEs reported in the previous study and, in addition, detected 27 new MLEs (Figure 8).

Evaluating the speed of TARGeT

Many factors can affect the speed of TARGeT, such as the number and length of the query sequences, the gene/TE family size, the database size and the number of exons. Other issues that affect the run time include the server hardware and current usage. In addition, because TARGeT is entirely web based, upload and download times vary from user to user. For the gene or TE families that were analyzed in this study, we calculated the time for each search as an average of 10 independent runs. For example, TARGeT took 1.2, 2.5 and 6.8 min to complete the searches of the APx gene family in rice, sorghum and maize, respectively. The search of the rice Tc1/mariner superfamily took 1 min to complete.

Comparison of TARGeT with similar programs

Two other pipelines, GFScan (50) and FGF (61), can also retrieve and characterize gene families from genomic databases. GFScan searches for gene family members with the representative genomic DNA motif, while FGF performs a TBLASTN search followed by GeneWise and phylogenetic analysis. Here we briefly compare the features and performance of TARGeT with these two pipelines.

TARGeT versus GFScan. The cross-species searching ability of GFScan was previously tested by using a human query sequence to retrieve the carbonic anhydrase (CA) family from the mouse genome (50). GFScan was able to identify only 5 of the 11 known CA genes, along with two putative new CA genes, in the available mouse genome sequence. The authors stated that this discrepancy was due to the large difference between the human and mouse motifs. We did a similar search using TARGeT for CA genes in the mouse. Because there is no record of the version of the mouse genomic database used in the GFScan paper, we chose the latest version of the reference data (18 October 2006) from GenBank. A query composed of 14 protein sequences from 14 known human CA genes was constructed. Using default parameters, except that the maximum intron length was set to 80 000 nt, TARGeT found 14 out of 16 known CA genes (data in 2008) in mouse, and the remaining two were identified, together with a putative new CA homolog, after the MMP cut-off was reduced from 0.7 to 0.5.

TARGeT versus FGF. Direct comparison between the results of TARGeT and FGF proved difficult. First, the FGF server is often not available. Second, TARGeT and FGF use different local databases. We ran TARGeT with the queries that were used in the paper describing FGF. Using a peptidylprolyl isomerase Cyp2 gene (AK061894, GI: 115443875) as query to search against the rice database with default parameters, TARGeT found six more putative homologs than FGF (Supplementary Figure 6). We also found one possible mistake in the result of FGF: it identified two overlapping homologs, AK061894_chr06 and AK061894_chr06, while there is no such overlap in the result of TARGeT. Using Hsp90 (GI: 40254816) as the query to search against the human database, both FGF and TARGeT found 15 homologs (Supplementary Figure 7).

DISCUSSION

To date, most gene family search programs can only retrieve homologs from protein sequence databases. More commonly, BLAST is used to search genomic sequence databases. However, manual retrieval of homolog sequences from BLAST outputs requires a great deal of time. This is especially true for large gene or TE families. TARGeT is particularly useful if one wants to quickly retrieve and characterize gene families from DNA databases, especially when a newly sequenced genome is available. TARGeT uses a Perl program named PHI that automatically retrieves homolog sequences from BLAST outputs. In addition, TARGeT can do multiple alignment and phylogenetic analysis with the retrieved homolog sequences. Speed is another major advantage of TARGeT. As demonstrated in this report, TARGeT can routinely retrieve and characterize gene family homologs, including TEs, from plant and animal genome sequences on the order of minutes.

Although TARGeT shares similarity with homology-based TE annotation tools like RepeatMasker (62), there are some important differences. First, instead of showing each fragmented match as RepeatMasker does, TARGeT tries to identify homologs that are long enough for phylogenetic tree estimation. A fragmented TE can be identified as long as the sum length of its fragments satisfies the MMP to the query. As such, using the same query and databases, the number of homologs identified by TARGeT is usually lower than the hit number found by RepeatMasker. Second, when there are no repeat libraries available for a particular species, RepeatMasker gives the user the option of performing a BLASTX search to annotate coding regions of TEs in the submitted sequences. In contrast, TARGeT uses a TBLASTN search to identify coding regions from the whole genomic database. Finally, RepeatMasker lacks most of the functionality that is provided by TARGeT, including the generation of phylogenetic tree and gene structure figures.
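The fragment-length criterion mentioned above can be illustrated with a short sketch. It assumes that the MMP is applied as a minimum fraction of the query length covered by the combined fragments, and that overlapping fragments are counted once; both points, along with the function names, are assumptions made for illustration rather than a description of the TARGeT source code. The default of 0.7 mirrors the cut-off value quoted in the comparison with GFScan.

```python
def merged_length(intervals):
    """Number of query positions covered by 1-based inclusive intervals.

    Overlapping or directly adjacent intervals are merged so that no
    position is counted twice.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def passes_mmp(fragments, query_length, mmp=0.7):
    """True if the fragments together cover at least `mmp` of the query."""
    return merged_length(fragments) / query_length >= mmp

# Invented example: three fragments of one element matched to a 200-residue query.
fragments = [(1, 40), (30, 80), (120, 150)]
print(merged_length(fragments))                      # covered query positions
print(passes_mmp(fragments, 200, mmp=0.7))           # stricter cut-off
print(passes_mmp(fragments, 200, mmp=0.5))           # relaxed cut-off
```

In this example the fragments cover 111 of 200 query positions, so the element is rejected at 0.7 but accepted at 0.5, which is the kind of effect seen when the cut-off is lowered.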

When used to search genomic databases, protein sequence queries can efficiently detect distantly related homologs even when their DNA sequences cannot be aligned. Based on our experience, TBLASTN can detect sequences with identities as low as 25% to the query (data not shown). Comparison of the results of TARGeT, FGF and GFScan shows that TARGeT retrieved more homologs. To further improve TARGeT's ability to identify distantly related homologs, we are planning to optimize matrix and BLAST parameters (such as gap penalties).

Using multiple queries can also increase the chances of finding additional gene family homologs. TARGeT can accept multiple queries at one time. Although more than one query may hit one homolog, a unique feature of TARGeT is that it can select the one that has the best match to the homolog.

Figure 8. A rooted phylogenetic tree of predicted rice Tc1/mariner transposases. Three clades (A, B and C) are defined using the phylogenetic tree generated by Feschotte and Wessler (60). Elements denoted by an asterisk are new transposases predicted by TARGeT. Soymar1 was used as an outgroup and the tree was rooted manually using TreeView. Bootstrap values greater than 70 are shown.

When there is too much sequence divergence between a homolog sequence and the query, the homolog may not be found by TARGeT. However, TARGeT may still provide a clue for users to find it. Most homologs whose HSPs are inadequate to meet the MMP cut-off value still have short matches to the query in conserved regions. In this case, the file containing the BLAST HSPs that do not meet the qualified homolog cut-off would be valuable. Inspecting this file may show users why TARGeT failed to detect some homologs and help them design new queries to find additional homologs.

TARGeT uses two approaches to separate closely related gene families. Because there is no absolute similarity cut-off among genes that are within or between families, closely related gene families may be retrieved under certain circumstances with the target gene family. This is often the case when the query is short, such as a domain sequence. An efficient way to separate closely related gene families is phylogenetic analysis, because homologs from the same family tend to cluster on a phylogenetic tree into the same clade (Figures 6–8). However, it may not be obvious which clade represents the homologs of interest. In other situations, the phylogenetic relationships between the homologs may be ambiguous when the root is unknown.

To overcome these limitations, TARGeT displays the gene structure of each homolog and its sequence similarity to the queries. Because different gene families often have distinct gene structures, homologs that have high sequence similarity to the queries and also have similar gene structures can be easily identified as members of the target gene family. For example, the homologs in the shaded clade in Figure 6 have higher sequence similarity to the query sequence than the homologs in the other clade, indicating that they are APx homologs. A determination of whether TOAPx_9 and TOAPx_11 belong to the shaded clade requires the gene structure comparison provided by TARGeT (Figure S1), because the (unrooted) phylogenetic tree alone does not provide sufficient information.
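TARGeT presents gene structure and similarity visually and leaves the judgement to the user. For readers who want to apply the same reasoning to downloaded results, the sketch below shows one possible triage rule that combines the two signals. The record fields, the use of exon count as a proxy for gene structure, the thresholds and the homolog names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Homolog:
    name: str
    similarity: float  # similarity to the best-matching query, as a fraction (0-1)
    exon_count: int    # number of exons in the predicted gene structure

def likely_family_members(homologs, reference_exons,
                          min_similarity=0.5, exon_tolerance=1):
    """Return names of homologs that pass both the similarity and structure checks.

    min_similarity and exon_tolerance are illustrative values, not TARGeT defaults.
    """
    keep = []
    for h in homologs:
        similar_enough = h.similarity >= min_similarity
        structure_ok = abs(h.exon_count - reference_exons) <= exon_tolerance
        if similar_enough and structure_ok:
            keep.append(h.name)
    return keep

# Invented records for illustration only.
candidates = [
    Homolog("hom_a", 0.82, 9),
    Homolog("hom_b", 0.78, 4),   # similar sequence, different structure
    Homolog("hom_c", 0.35, 9),   # similar structure, weak similarity
    Homolog("hom_d", 0.66, 10),
]
print(likely_family_members(candidates, reference_exons=9))
```

Requiring both conditions reflects the argument in the text: neither similarity nor tree position alone is decisive when closely related families are retrieved together.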

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors would like to thank The iPlant Collaborative for hosting services on its cyber infrastructure and providing systems administration for the TARGeT web server. In addition, the authors would like to thank Drs Hongyan Shan, Jim Leebens-Mack, Russell Malmberg and Yaowu Yuan for critical reading of the manuscript and for valuable discussions.

FUNDING

National Science Foundation (Grant DBI-0607123); Howard Hughes Medical Institute (Grant 52005731 to S.R.W.). Funding for open access charge: Howard Hughes Medical Institute.

Conflict of interest statement. None declared.

REFERENCES

1. Yu,J., Hu,S., Wang,J., Wong,G.K., Li,S., Liu,B., Deng,Y., Dai,L., Zhou,Y., Zhang,X. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296, 79–92.

2. Goff,S.A., Ricke,D., Lan,T.H., Presting,G., Wang,R., Dunn,M., Glazebrook,J., Sessions,A., Oeller,P., Varma,H. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science, 296, 92–100.

3. Li,W.H., Gu,Z., Wang,H. and Nekrutenko,A. (2001) Evolutionary analyses of the human genome. Nature, 409, 847–849.

4. Rubin,G.M., Yandell,M.D., Wortman,J.R., Gabor Miklos,G.L., Nelson,C.R., Hariharan,I.K., Fortini,M.E., Li,P.W., Apweiler,R., Fleischmann,W. et al. (2000) Comparative genomics of the eukaryotes. Science, 287, 2204–2215.

5. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.

6. Lespinet,O., Wolf,Y.I., Koonin,E.V. and Aravind,L. (2002) The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res., 12, 1048–1059.

7. Lyckegaard,E.M. and Clark,A.G. (1989) Ribosomal DNA and Stellate gene copy number variation on the Y chromosome of Drosophila melanogaster. Proc. Natl Acad. Sci. USA, 86, 1944–1948.

8. Neitz,M. and Neitz,J. (1995) Numbers and ratios of visual pigment genes for normal red-green color vision. Science, 267, 1013–1016.

9. Wendel,J.F. (2000) Genome evolution in polyploids. Plant Mol. Biol., 42, 225–249.

10. Wolfe,K.H. and Shields,D.C. (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature, 387, 708–713.

11. Dehal,P. and Boore,J.L. (2005) Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol., 3, e314.

12. Cheung,J., Wilson,M.D., Zhang,J., Khaja,R., MacDonald,J.R., Heng,H.H., Koop,B.F. and Scherer,S.W. (2003) Recent segmental and gene duplications in the mouse genome. Genome Biol., 4, R47.

13. Eichler,E.E. (2001) Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet., 17, 661–669.

14. Hurles,M. (2004) Gene duplication: the genomic trade in spare parts. PLoS Biol., 2, E206.

15. Bailey,J.A., Liu,G. and Eichler,E.E. (2003) An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet., 73, 823–834.

16. Koszul,R., Caburet,S., Dujon,B. and Fischer,G. (2004) Eucaryotic genome evolution through the spontaneous duplication of large chromosomal segments. EMBO J., 23, 234–243.

17. Morgante,M., Brunner,S., Pea,G., Fengler,K., Zuccolo,A. and Rafalski,A. (2005) Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat. Genet., 37, 997–1002.

18. Jiang,N., Bao,Z., Zhang,X., Eddy,S.R. and Wessler,S.R. (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature, 431, 569–573.

19. Tchenio,T., Segal-Bendirdjian,E. and Heidmann,T. (1993) Generation of processed pseudogenes in murine cells. EMBO J., 12, 1487–1497.

20. Vanin,E.F. (1985) Processed pseudogenes: characteristics and evolution. Annu. Rev. Genet., 19, 253–272.

21. Dayhoff,M.O. (1976) The origin and evolution of protein superfamilies. Fed. Proc., 35, 2132–2138.


22. Heger,A. and Holm,L. (2000) Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol., 73, 321–337.

23. Finn,R.D., Tate,J., Mistry,J., Coggill,P.C., Sammut,S.J., Hotz,H.R., Ceric,G., Forslund,K., Eddy,S.R., Sonnhammer,E.L. et al. (2008) The Pfam protein families database. Nucleic Acids Res., 36, D281–D288.

24. Li,H., Coghlan,A., Ruan,J., Coin,L.J., Heriche,J.K., Osmotherly,L., Li,R., Liu,T., Zhang,Z., Bolund,L. et al. (2006) TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res., 34, D572–D580.

25. Mi,H., Lazareva-Ulitsky,B., Loo,R., Kejariwal,A., Vandergriff,J., Rabkin,S., Guo,N., Muruganujan,A., Doremieux,O., Campbell,M.J. et al. (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res., 33, D284–D288.

26. Kent,W.J. (2002) BLAT–the BLAST-like alignment tool. Genome Res., 12, 656–664.

27. Birney,E., Clamp,M. and Durbin,R. (2004) GeneWise and genomewise. Genome Res., 14, 988–995.

28. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

29. Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.

30. Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

31. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

32. Meyers,B.C., Tingey,S.V. and Morgante,M. (2001) Abundance, distribution and transcriptional activity of repetitive elements in the maize genome. Genome Res., 11, 1660–1676.

33. Brinkman,F.S., Wan,I., Hancock,R.E., Rose,A.M. and Jones,S.J. (2001) PhyloBLAST: facilitating phylogenetic analysis of BLAST results. Bioinformatics, 17, 385–387.

34. Sicheritz-Ponten,T. and Andersson,S.G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res., 29, 545–552.

35. Arigon,A.M., Perriere,G. and Gouy,M. (2006) HoSeqI: automated homologous sequence identification in gene family databases. Bioinformatics, 22, 1786–1787.

36. Hanekamp,K., Bohnebeck,U., Beszteri,B. and Valentin,K. (2007) PhyloGena–a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics, 23, 793–801.

37. Enright,A.J., Van Dongen,S. and Ouzounis,C.A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.

38. Frank,R.L., Mane,A. and Ercal,F. (2006) An automated method for rapid identification of putative gene family members in plants. BMC Bioinformatics, 7 (Suppl. 2), S19.

39. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

40. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.

41. Karsch-Mizrachi,I. and Ouellette,B.F. (2001) The GenBank sequence database. Methods Biochem. Anal., 43, 45–63.

42. Burks,C., Fickett,J.W., Goad,W.B., Kanehisa,M., Lewitter,F.I., Rindone,W.P., Swindell,C.D., Tung,C.S. and Bilofsky,H.S. (1985) The GenBank nucleic acid sequence database. Comput. Appl. Biosci., 1, 225–233.

43. Salse,J., Piegu,B., Cooke,R. and Delseny,M. (2002) Synteny between Arabidopsis thaliana and rice at the genome level: a tool to identify conservation in the ongoing rice genome sequencing project. Nucleic Acids Res., 30, 2316–2328.

44. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.

45. Mittler,R., Vanderauwera,S., Gollery,M. and Van Breusegem,F. (2004) Reactive oxygen gene network of plants. Trends Plant Sci., 9, 490–498.

46. Teixeira,F.K., Menezes-Benavente,L., Margis,R. and Margis-Pinheiro,M. (2004) Analysis of the molecular evolutionary history of the ascorbate peroxidase gene family: inferences from the rice genome. J. Mol. Evol., 59, 761–770.

47. Passardi,F., Theiler,G., Zamocky,M., Cosio,C., Rouhier,N., Teixera,F., Margis-Pinheiro,M., Ioannidis,V., Penel,C., Falquet,L. et al. (2007) PeroxiBase: the peroxidase database. Phytochemistry, 68, 1605–1611.

48. Frickey,T. and Lupas,A.N. (2004) PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res., 32, 5231–5238.

49. Koski,L.B. and Golding,G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol., 52, 540–542.

50. Xuan,Z., McCombie,W.R. and Zhang,M.Q. (2002) GFScan: a gene family search tool at genomic DNA level. Genome Res., 12, 1142–1149.

51. Gertz,E.M., Yu,Y.K., Agarwala,R., Schaffer,A.A. and Altschul,S.F. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol., 4, 41.

52. Page,R.D. (1996) TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci., 12, 357–358.

53. Waterhouse,A.M., Procter,J.B., Martin,D.M.A., Clamp,M. and Barton,G.J. (2009) Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics, 25, 1189–1191.

54. Jordan,G.E. and Piel,W.H. (2008) PhyloWidget: web-based visualizations for the tree of life. Bioinformatics, 24, 1641–1642.

55. Tamura,K., Dudley,J., Nei,M. and Kumar,S. (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol., 24, 1596–1599.

56. Qian,W. and Zhang,J. (2008) Gene dosage and gene duplicability. Genetics, 179, 2319–2324.

57. Liang,H., Plazonic,K.R., Chen,J., Li,W.H. and Fernandez,A. (2008) Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet., 4, e11.

58. Papp,B., Pal,C. and Hurst,L.D. (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature, 424, 194–197.

59. Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O. and Walichiewicz,J. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res., 110, 462–467.

60. Feschotte,C. and Wessler,S.R. (2002) Mariner-like transposases are widespread and diverse in flowering plants. Proc. Natl Acad. Sci. USA, 99, 280–285.

61. Zheng,H., Shi,J., Fang,X., Li,Y., Vang,S., Fan,W., Wang,J., Zhang,Z., Wang,W. and Kristiansen,K. (2007) FGF: a web tool for Fishing Gene Family in a whole genome database. Nucleic Acids Res., 35, W121–W125.

62. Smit,A.F.A., Hubley,R. and Green,P. (2004) RepeatMasker Open-3.0. http://www.repeatmasker.org.


and classification are essential to accurate genome annotation (31,32). However, as with large gene families, the very high copy numbers of some TEs make their retrieval from genomic sequence and characterization an extremely difficult task. The increasing pace of genomic sequencing projects demands a computer-assisted pipeline that can rapidly and accurately identify and characterize gene families.

Several automated pipelines have been developed to ease homolog identification, and most are limited to protein or expressed sequence tag (EST) databases. For example, PhyloBLAST (33), Pyphy (34), HoSeqI (35), PhyloGena (36) and TRIBE-MCL (37) perform BLASTP searches and retrieve data from protein databases. SimESTs uses TBLASTN to search EST databases (38). Because these programs only compare protein-coding sequences, they will miss any mutational events that occur within noncoding regions.

TARGeT (Tree Analysis of Related Genes and Transposons) is a program to streamline the process of retrieving, annotating and analyzing both gene families and TE families from a genomic database. The core of the TARGeT pipeline is an algorithm called putative homolog identifier (PHI) that uses a series of steps to predict gene structure using BLAST results. From the predicted gene structure, PHI extracts the amino acid sequences of putative homologs for use in subsequent phylogenetic analysis. We have compared TARGeT with two pipelines, FGF and GFScan, which can also be used to retrieve gene families from genomic databases. Results are presented showing that TARGeT significantly outperforms both programs and adds several layers of functionality not present in existing programs. To make it easier for users, especially nonspecialists, TARGeT was implemented as a user-friendly web-based pipeline (http://target.iplantcollaborative.org). All initial input for TARGeT is organized on a web form and the results are presented in the browser. All results and supporting files are documented and are available for download. TARGeT provides several points where results can be inspected and analyses can be repeated.

METHODS

TARGeT can use either protein or DNA sequence as the query. BLASTN searches are used for DNA queries, while TBLASTN is used for protein queries. The pipeline that uses TBLASTN is the focus of this article because it is more complex and may have wider application. TARGeT uses Muscle (39) to calculate the multiple alignment and TreeBest (24) to generate the phylogenetic tree of the putative homologs with the neighbor-joining method (40). The other functions of TARGeT are carried out by several Perl scripts developed by the authors.

Rice genomic data were obtained from GenBank (41,42), with accession numbers from NC_008394 to NC_008405. Maize genomic data were downloaded from the Maize Genome Sequencing Project (http://www.maizesequence.org, version Dec. 2008). Sorghum genomic data were from the Sorghum Bicolor Genome Project (http://www.jgi.doe.gov, version 2008, Sorbi1 assembly).
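The rule stated in this section (BLASTN for DNA queries, TBLASTN for protein queries) implies that the query type must be known before the search is launched. The following sketch shows one simple way such a choice could be automated by inspecting the query letters. The 0.9 threshold and the function name are assumptions; the text does not say whether TARGeT detects the query type or asks the user for it.

```python
def choose_blast_program(sequence, threshold=0.9):
    """Guess the search program from the composition of the query.

    A query whose letters are almost all nucleotide symbols (A, C, G, T, U, N)
    is treated as DNA and searched with blastn; anything else is treated as
    protein and searched with tblastn. The threshold is illustrative.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("query sequence contains no letters")
    nucleotide_like = sum(1 for c in residues if c in "ACGTUN")
    fraction = nucleotide_like / len(residues)
    return "blastn" if fraction >= threshold else "tblastn"

# Invented example queries.
dna_query = "ATGGCGTTAGCNNATCG"
protein_query = "MGKSYPTVSADYQKAVEKAKKK"
print(choose_blast_program(dna_query))
print(choose_blast_program(protein_query))
```

A composition test of this kind can misclassify short peptides that happen to be rich in A, C, G and T, so an explicit user choice remains the safer option for short queries.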

There are five main steps in the TARGeT pipeline, with a checkpoint at the end of each: (i) preparation of the query when multiple sequences are to be submitted, (ii) BLAST search (either BLASTN or TBLASTN), (iii) homolog prediction, (iv) multiple alignment and (v) phylogenetic tree estimation (Figure 1). Details of each step are presented in the 'Results' section, using the ascorbate peroxidase (APx) gene family as an example.
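The five-step structure with a checkpoint after each step can be sketched as a small driver loop. The step functions below are empty placeholders that only record that they ran, and the `approve` callback stands in for the user's decision at each checkpoint; none of these names come from the TARGeT code.

```python
def run_pipeline(steps, data, approve=lambda step_name, result: True):
    """Run steps in order and stop after the first checkpoint that is not approved."""
    completed = []
    for step_name, func in steps:
        data = func(data)
        completed.append(step_name)
        if not approve(step_name, data):
            break  # the user would adjust parameters and resubmit from here
    return completed, data

# Placeholder steps: each one just adds a marker to the shared state.
def prepare_query(state):
    return {**state, "query": "prepared"}

def blast_search(state):
    return {**state, "hsps": "found"}

def predict_homologs(state):
    return {**state, "homologs": "predicted"}

def align(state):
    return {**state, "alignment": "built"}

def estimate_tree(state):
    return {**state, "tree": "estimated"}

STEPS = [
    ("query preparation", prepare_query),
    ("BLAST search", blast_search),
    ("homolog prediction", predict_homologs),
    ("multiple alignment", align),
    ("tree estimation", estimate_tree),
]

# A full run, and a run halted at the BLAST checkpoint.
completed, state = run_pipeline(STEPS, {"seed": "hypothetical_seed"})
stopped, partial = run_pipeline(
    STEPS, {"seed": "hypothetical_seed"},
    approve=lambda name, result: name != "BLAST search",
)
print(completed)
print(stopped)
```

Keeping the checkpoint decision outside the step functions means a step can be rerun with new parameters without touching the steps that precede it.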

TARGeT can be accessed on a web server where all data used and generated by TARGeT are entered in a log file. TARGeT output is presented in a single webpage that uses nested tabs to organize the data, images and re-submission forms for each TARGeT run during a session. There is a final tab for each run, called Provenance, where the user can view the parameters used by TARGeT in a log file and also download an archive that includes all files and images for offline viewing and analysis. The output includes the XML log file, BLAST results in image and text format, PHI results in image and text format, multiple alignments in FASTA format, and the phylogenetic tree in Newick and jpeg formats.
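Because the alignment is delivered in FASTA format, it can be loaded for offline analysis with a few lines of code. The sketch below is a generic FASTA reader with a basic consistency check; the sequence names and residues in the example are invented, and nothing here depends on TARGeT-specific file naming.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def is_aligned(records):
    """True if all sequences have the same length, as expected in an alignment."""
    return len({len(seq) for seq in records.values()}) <= 1

# Invented two-sequence alignment, wrapped across lines.
alignment_text = """>homolog_1 example
MK-L
AA
>homolog_2
MKQLAA
"""
alignment = read_fasta(alignment_text)
print(alignment)
print(is_aligned(alignment))
```

Checking that all rows have equal length is a quick way to confirm that a downloaded file is an alignment rather than a set of unaligned sequences.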

RESULTS

Searching for APx gene family in rice

Rice and Arabidopsis serve as model plant monocot and dicot species, respectively. They diverged from a common ancestor 120–200 million years ago (43), and their genomes are fully sequenced (1,2,44). Thus, they provide excellent opportunities to evaluate the cross-species searching ability of TARGeT. We searched the rice APx gene family using the Arabidopsis APx protein sequence as query and compared the results generated by TARGeT to the published data. The goal of this exercise was to see how well TARGeT would perform at predicting the rice APx family members. We chose APx because it is a small but important gene family that has been well annotated in both Arabidopsis and rice. Based on the literature, there are as many as nine APx family members in Arabidopsis (45) and eight in rice (46) (Table 1). The APx family shares sequence similarity with several other peroxidase families (47) and, as such, is a good dataset to test the ability of TARGeT to discriminate between closely related protein families.

BLAST search. To improve the chances of finding target gene family members, multiple queries can be submitted as long as they are homologs. An optional multiple alignment step is provided for users to select sequences from conserved regions (Figure 1A). As an example, for the APx gene family we selected as query the sequences from the well-aligned (boxed) region in Figure 2.

To aid users in viewing the BLAST result, TARGeT produces an image showing a rough estimation of BLAST high scoring pair (HSP) numbers and conserved regions along the length of each query sequence (Figures 1B and 3). This is helpful for a quick overall view, especially when the BLAST output is large. In this way, the user can see the information used by TARGeT and, if necessary, modify the query in a subsequent







