
Page 1: [IEEE 2011 Fourth International Workshop on Advanced Computational Intelligence (IWACI) - Wuhan, China (2011.10.19-2011.10.21)] The Fourth International Workshop on Advanced Computational

Abstract: In this paper, an effective promoter identification algorithm is proposed. The algorithm is based on two features of promoters: (I) Promoter regions include binding sites where RNA polymerase binds and where transcription starts. These binding sites include core-promoter elements such as the TATA-box and GC-box. However, the structure of binding sites is not always consistent: binding sites of the same kind often differ in structure because of nucleotide variation. (II) The positions of binding sites in the gene are not fixed; instead, they fluctuate within an approximate region. Based on these two features, we first disregard structural differences between binding sites that are caused by nucleotide variation. In other words, binding motifs that are similar in structure but appear in different forms because of nucleotide variation are treated as one binding motif. Second, we divide promoter regions into equal-length intervals and calculate the occurrence probability of binding sites in each interval. We introduce a new concept, the "Interval Weight Matrix (IWM)", to reflect the relationship between an interval and the occurrence probability of binding sites, and we propose a new promoter identification system based on it. Tests on large genomic sequences and comparisons with other well-known systems show that the new algorithm reduces false positives (FP) more effectively than those systems.

I. INTRODUCTION

Promoter identification is important in gene analysis because promoters are responsible for the initiation and regulation of DNA transcription. With the completion of the human genome draft [1], [2], finding promoters and transcriptional signals in large sequences can help to deduce at least the start of transcription and to delineate the end of a gene as well. Although much effort has been devoted to this research field, the transcriptional process is still incompletely understood, and a major deficiency of previous promoter prediction systems, a high rate of false positives (FP), persists and remains difficult to address [3], [4]. Although many different prediction systems have been proposed to address this problem, no single system achieves both sensitivity and specificity above 60% on general DNA sequences [20].

Many promoter prediction systems already exist. In CpG-island-based systems such as CpG-Promoter [6] and CpG-Prod [10], promoters are predicted by searching for a particular "signal", the CpG island. Another kind of system is based on content analysis. A typical example is PromoterInspector [5], which analyzes word groups rather than transcription elements. In PromoterInspector [5], each word group is defined by a set of oligonucleotides and a number of undefined base pairs (wildcards, "N"). Later, more advanced promoter prediction systems were designed to reduce FP further. For instance, Dragon Promoter Finder (DPF) [8] was found to perform much better, especially over a broad sensitivity range [21]. DPF [8] finds promoters by identifying the transcription start site (TSS). The same holds for Eponine [9], which analyzes the well-known TATA-box around position -25. Although other promoter prediction systems have been shown to perform better than PromoterInspector [5] on their own test sets, comparative work [11] has indicated that this better performance depends on the particular problem each system addresses. Thus, PromoterInspector [5] and DPF [7] are better suited to general DNA sequences because they do not depend on any specific promoter signal. In 2007, we presented a eukaryotic promoter recognition system [19] based on relative entropy [15] and positional information.

Manuscript received July 15, 2011. This work was supported in part by a grant from the National Natural Science Foundation of China (Project No. 61070118) and by the Science and Technology Planning Project of Shandong Provincial Education Department (Project No. J10LG27).

Rongxin Fang is with the School of Computer Science & Technology, Yantai University, 264005, Yantai, China (e-mail: fangrongxinwilliam@gmail.com).

Shuanhu Wu is the corresponding author and is with the School of Computer Science & Technology, Yantai University, 264005, Yantai, China (e-mail: wushuanhu@163.com).

According to the strategies they use, previous prediction systems can be classified into three categories: (1) search by signal, (2) search by CpG island, and (3) search by content. Although each has its own advantages, their inherent disadvantages mean that none of them achieves high specificity. Search-by-signal systems predict promoters by identifying specific putative transcriptional signals such as the TATA-box or CAAT-box. However, these signal patterns also appear in other DNA subsequences. For instance, one study [12] found that applying Bucher's TATA-box weight matrix to mammalian nonpromoter DNA sequences gives poor results, an average of one predicted TATA-box every 120 bp. Search by CpG island is popular because it performs well on sequences rich in CpG islands. However, only about half of mammalian promoters are associated with CpG islands, and only ~60% of human promoters are [14]. A system that relies on CpG islands alone therefore cannot achieve sensitivity above 60% on human DNA sequences; conversely, its FP must exceed 40%. Search by content rests on the assumption that the presence of transcriptional signals, such as binding motifs in promoter regions, produces differences in local base and word composition. This idea can be exploited by analyzing the most frequent hexamers [13], other variable-length motifs, and other short words [5], [8].

A New Algorithm of Promoter Prediction and Identification
Rongxin Fang, Shuanhu Wu, Wenyan Zhang, Qicheng Liu, and Yibin Song

Fourth International Workshop on Advanced Computational Intelligence, Wuhan, Hubei, China; October 19-21, 2011

978-1-61284-375-9/11/$26.00 ©2011 IEEE

In this paper, we present a new system that belongs to the search-by-content category. Based on an analysis of promoter structure, the system relies on two features of promoters: (I) binding sites of the same kind in promoter regions often differ in structure because of nucleotide variation; (II) the position of a binding site in the gene is not fixed but fluctuates within an approximate region. Based on these two features, we introduce the concept of the Interval Weight Matrix (IWM) and a new promoter identification system. Test results on large general DNA sequences indicate that the new system achieves both high sensitivity and high specificity. The details of the system are described in the following sections.

II. PROMOTER FEATURE AND PROMOTER MODEL

A. Promoter Feature

A promoter is a DNA subsequence that is recognized by RNA polymerase and signals the start of transcription. It can perform this function because promoter regions contain binding sites that RNA polymerase recognizes. These binding sites, which include the TATA-box, GC-box, and CAAT-box, are known for their similar structure and are referred to as core-promoters or consensus sequences. Although core-promoters are similar in structure, their structures are not always identical. Because of nucleotide variation, core-promoters mostly appear in different forms. For instance, the TATA-box has the common form TATANAN, where 'N' is a wildcard that can be either 'T' or 'A'. Thus, the words TATAAAA, TATAAAT, TATATAA, and TATATAT are all forms of the TATA-box. Another feature of core-promoters is the variation of their positions. In other words, a core-promoter is more likely to appear within an approximate region than at a fixed position. For instance, the TATA-box is commonly described as appearing at position -25; in fact, it is more likely to appear anywhere in the region from -35 to -20. We encode these two features in our system to obtain a more precise promoter model.
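As a minimal sketch of how one degenerate consensus stands for several concrete words, the Python snippet below expands the form TATANAN under the convention stated above, where 'N' is either 'A' or 'T'. The function name `expand_motif` and its parameters are illustrative assumptions and are not part of the original system.

```python
from itertools import product


def expand_motif(pattern, wildcard="N", choices="AT"):
    """List every concrete word matched by a degenerate pattern.

    Assumption for illustration: the wildcard stands for exactly the
    bases in `choices` ('A' or 'T'), following the TATA-box description.
    """
    # Each position contributes either its fixed base or the allowed choices.
    slots = [choices if base == wildcard else base for base in pattern]
    return ["".join(word) for word in product(*slots)]


variants = expand_motif("TATANAN")
print(variants)
```

Treating all of these variants as a single motif is the idea that the promoter model builds on.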

Many promoter identification algorithms have been proposed in the past. Most of them search for consensus motifs, and some look for other features such as the transcription initiation site (INR) or CpG islands. The most serious problem with these algorithms is that the features they use also appear frequently in nonpromoter regions. In addition, algorithms of this kind ignore the positional information of consensus motifs.

Another strategy is to calculate K-word frequencies, where a K-word is a short subsequence of K nucleotides. Research has shown that the distribution of K-words in a DNA sequence has biological significance. For instance, 4-word frequencies are useful in quantifying the differences between E. coli promoter sequences and "average" genomic DNA [16]. Pentamer (5-word) and hexamer (6-word) distributions play a significant role in discriminating coding from noncoding DNA. One refinement of this strategy is the Positional Weight Matrix (PWM), which reflects the positional distribution of K-words. Although the PWM strategy can reveal the relationship between the occurrence probability of a core-promoter and its position, it ignores the two features of promoter regions discussed above. Thus, two problems remain with the PWM strategy: (I) the structures of binding motifs often differ because of nucleotide variation; (II) the position of a binding motif is not strictly fixed. Because of these two inherent disadvantages, the PWM strategy cannot precisely reflect the relationship between binding motifs and their positions. In this paper, after analyzing the disadvantages of the PWM strategy and the two features of promoter regions, we present a new promoter model. The details of the model are explained in the next subsection.
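The K-word statistic described above can be illustrated with a short Python sketch. It counts overlapping K-words in one sequence and normalizes the counts to relative frequencies; the function name `kword_frequencies` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter


def kword_frequencies(seq, k):
    """Relative frequency of every overlapping K-word in `seq`."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


freqs = kword_frequencies("TATATA", 4)
print(freqs)
```

A positional variant would keep one such table per position instead of pooling all positions, which is the information a PWM over K-words records.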

B. Promoter Model

In this work, we attempt to overcome the disadvantages of the PWM strategy. To obtain a more precise promoter model, we build solutions to the two problems above into our new system. The first problem is that binding motifs of the same kind differ in structure because of nucleotide variation. To address it, we keep the first four bases of each K-word exact but permit nucleotide variation after the fourth base. In other words, each of the first four bases can be any of 'A', 'T', 'C', 'G', whereas after the fourth base only two kinds of bases are distinguished: 'A' or 'T', and 'C' or 'G'. This is inspired by the TATA-box, which has the form TATANAN, where 'N' is a wildcard that can be either 'T' or 'A'. The total number of K-words (K>4) is then $N_W = 4^4 \times 2^{K-4}$, so the total number of words is effectively reduced. For instance, for word length K=7, the total number of words in a PWM system is $4^7 = 16{,}384$, whereas in our promoter model the total number of 7-words is only $4^4 \times 2^3 = 2{,}048$. The number of words shrinks by a factor of $2^{K-4}$, and the computational burden is reduced accordingly. The second problem is that the position of a binding motif is variable. To address it, we introduce a new concept, the Interval Weight Matrix (IWM). We divide promoter regions into equal-length intervals of length IL. The interval probability of a word equals the sum of the probabilities of this word occurring at the different positions in the interval. The probability of each word appearing at each position is obtained from the training sets. The position of a word is taken to be the position of its first nucleotide. For instance, saying that the TATA-box is at position -30 means that its first 'T' appears at position -30. Thus, the probability of word W in the interval [-j ~ -(j+IL)] of the IWM is as follows:

$$\mathrm{IWM}_{-j}^{-(j+IL)}(W) = \mathrm{IWM}_{-j}(W) + \mathrm{IWM}_{-(j+1)}(W) + \dots + \mathrm{IWM}_{-(j+IL)}(W) = \sum_{i=-(j+IL)}^{-j} \mathrm{IWM}_{i}(W) \tag{1}$$

where $\mathrm{IWM}_{i}(W)$ is the probability of W appearing at the ith position in the IWM, and IL is the interval length of the IWM.

In summary, the interval probability of each word W equals the sum of the probabilities of W appearing at the different positions in the interval.

Having analyzed the features of promoter regions and the disadvantages of the PWM strategy, we build solutions to the problems of the PWM strategy into our promoter model. The result is a new promoter prediction model. The details of the promoter prediction system are given in the next section.
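The two ideas of the model, collapsing words after the fourth base and summing per-position probabilities over an interval, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: positions are 0-based offsets from the start of each training sequence instead of TSS-relative coordinates, intervals are half-open blocks of `interval_len` positions, the class labels "W" (A or T) and "S" (C or G) are chosen here for convenience, and the names `collapse_word` and `interval_weight_matrix` are hypothetical.

```python
from collections import defaultdict

# Assumed labels: after the 4th base, A/T become "W" and C/G become "S".
COLLAPSE = {"A": "W", "T": "W", "C": "S", "G": "S"}


def collapse_word(word, keep=4):
    """Keep the first `keep` bases exact; reduce the rest to two classes."""
    return word[:keep] + "".join(COLLAPSE[base] for base in word[keep:])


def interval_weight_matrix(sequences, k, interval_len):
    """Return {interval_index: {collapsed_word: interval_probability}}.

    The per-position probability of a word is its count at that position
    divided by the number of training sequences; the interval probability
    is the sum of those per-position probabilities inside the interval.
    """
    n_seq = len(sequences)
    iwm = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for pos in range(len(seq) - k + 1):
            word = collapse_word(seq[pos:pos + k])
            iwm[pos // interval_len][word] += 1.0 / n_seq
    return iwm


training = ["TATAAATGC", "TATATATGC"]  # toy data, not from the paper
iwm = interval_weight_matrix(training, k=7, interval_len=3)
print(dict(iwm[0]))
print(4 ** 4 * 2 ** (7 - 4))  # number of distinct collapsed 7-words
```

In the toy data the two sequences start with different TATA-box variants, yet both contribute to the same collapsed word in the first interval.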

III. WORD STATISTICS AND PROMOTER PREDICTION

A. Word Statistics

In the last section, we introduced the concept of the IWM. To make the IWM as discriminative as possible between promoter regions and nonpromoter regions, we select the most informative words, that is, the words that have a stronger affinity with promoter regions than other words. In other words, the occurrence probability of these words in promoter regions is greater than that of other words.

Consider the posterior probability of I given $W_i$, $P(I \mid W_i)$, where I is an indicator that equals 1 when $W_i$ indicates a promoter region and 0 otherwise. If $P(I=1 \mid W_i) > P(I=0 \mid W_i)$, we consider $W_i$ to be a core-promoter word that indicates promoter regions. We define

$$\Delta = \log P(I=1 \mid W_i) - \log P(I=0 \mid W_i), \quad i = 1, 2, \dots, n \tag{2}$$

and then compute the value of Δ for each word. According to Bayes' theorem, we have

$$P(I=1 \mid W_i) = \frac{P(W_i \mid I=1)\,P(I=1)}{P(W_i)}, \quad i = 1, 2, \dots, n \tag{3}$$

and

$$P(I=0 \mid W_i) = \frac{P(W_i \mid I=0)\,P(I=0)}{P(W_i)}, \quad i = 1, 2, \dots, n. \tag{4}$$

From (2) to (4), we obtain

$$\Delta = \log\frac{P(I=1 \mid W_i)}{P(I=0 \mid W_i)} = \log\frac{P(W_i \mid I=1)}{P(W_i \mid I=0)} + \log\frac{P(I=1)}{P(I=0)}, \quad i = 1, 2, \dots, n. \tag{5}$$

Assuming that P(I=1) and P(I=0) are constant, we define Δ as in Eq. (6).

TABLE I
DESCRIPTION OF LARGE GENOMIC SEQUENCES IN EVALUATION SET

| Accession number | Description | Length (bp) | Number of TSS |
| --- | --- | --- | --- |
| AC002397 | Complete sequence of mouse chromosome 6 BAC-284H12 | 227538 | 17 |
| L44140 | Homo sapiens chromosome X region from filamin (FLN) gene to glucose-6-phosphate dehydrogenase (G6PD) gene. There are 13 known and six candidate genes in the sequence | 219447 | 11 |
| D87675 | Homo sapiens DNA for amyloid precursor protein | 301692 | 1 |
| AF017257 | Homo sapiens chromosome 21-derived BAC containing erythroblastosis virus oncogene homolog 2 protein (ets-2) gene | 101569 | 1 |
| AF146793 | Mouse protein B, Clock, PFT27 and HSAR gene | 204625 | 4 |
| AC002368 | Homo sapiens Xq28 BAC PAC and cosmid clones containing FMR2 gene | 324816 | 1 |
| TOTAL | | 1.38 Mb | 35 |

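TABLE II
RESULTS OF LARGE GENOMIC SEQUENCE ANALYSIS

| Accession number | Method | TP | FP | Sp (%) | Se (%) |
| --- | --- | --- | --- | --- | --- |
| AC002397 | PromoterInspector | 4 | 1 | 80 | 23.5 |
| AC002397 | FirstEF (p=0.98) | 7 | 3 | 70 | 41.1 |
| AC002397 | Eponine (t=0.995) | 8 | 1 | 88.8 | 47 |
| AC002397 | DPF (s=0.45) | 6 | 4 | 60 | 35.2 |
| AC002397 | New System | 7 | 0 | 100 | 41.1 |
| AF017257 | PromoterInspector | 1 | 0 | 100 | 100 |
| AF017257 | FirstEF (p=0.98) | 1 | 0 | 100 | 100 |
| AF017257 | Eponine (t=0.995) | 1 | 3 | 25 | 100 |
| AF017257 | DPF (s=0.45) | 1 | 1 | 50 | 100 |
| AF017257 | New System | 1 | 0 | 100 | 100 |
| L44140 | PromoterInspector | 6 | 14 | 30 | 54.5 |
| L44140 | FirstEF (p=0.98) | 6 | 11 | 35.2 | 54.5 |
| L44140 | Eponine (t=0.995) | 6 | 12 | 33.3 | 54.5 |
| L44140 | DPF (s=0.45) | 8 | 15 | 38 | 72.7 |
| L44140 | New System | 8 | 11 | 38 | 72.7 |
| D87675 | PromoterInspector | 1 | 2 | 33.3 | 100 |
| D87675 | FirstEF (p=0.98) | 1 | 0 | 100 | 100 |
| D87675 | Eponine (t=0.995) | 1 | 1 | 50 | 100 |
| D87675 | DPF (s=0.45) | 1 | 3 | 25 | 100 |
| D87675 | New System | 1 | 1 | 50 | 100 |
| AF146793 | PromoterInspector | 1 | 2 | 33.3 | 25 |
| AF146793 | FirstEF (p=0.98) | 1 | 3 | 25 | 25 |
| AF146793 | Eponine (t=0.995) | 1 | 3 | 25 | 25 |
| AF146793 | DPF (s=0.45) | 1 | 4 | 20 | 25 |
| AF146793 | New System | 1 | 2 | 33.3 | 25 |
| AC002368 | PromoterInspector | 1 | 1 | 50 | 100 |
| AC002368 | FirstEF (p=0.98) | 1 | 1 | 50 | 100 |
| AC002368 | Eponine (t=0.995) | 1 | 0 | 100 | 100 |
| AC002368 | DPF (s=0.45) | 1 | 3 | 25 | 100 |
| AC002368 | New System | 1 | 0 | 100 | 100 |

TP: true positive; FP: false positive; FN: false negative; Se = TP/(TP+FN); Sp = TP/(TP+FP). The new system is run at its optimized setting, K=5, IL=6.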


$$\Delta = \log\frac{P(W_i \mid I=1)}{P(W_i \mid I=0)} + c, \quad i = 1, 2, \dots, n \tag{6}$$

where c is a constant equal to $\log\frac{P(I=1)}{P(I=0)}$. To handle the case in which either $P(W_i \mid I=0)$ or $P(W_i \mid I=1)$ equals zero, we finally define Δ as follows:

$$\Delta = \log\frac{P(W_i \mid I=1) + 1}{P(W_i \mid I=0) + 1} + c, \quad i = 1, 2, \dots, n. \tag{7}$$

The words are then ranked according to their Δ values. The word group G with the highest Δ values is obtained as follows:

$$\{G\} = \left\{ W_i \;:\; \log\frac{P(W_i \mid I=1) + 1}{P(W_i \mid I=0) + 1} > \beta \right\}, \quad i = 1, 2, \dots, n \tag{8}$$

where {G} is the word group consisting of the words whose Δ values are greater than β (0.002), and β is a threshold chosen so that the selected words discriminate as well as possible between promoters and nonpromoters.
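The selection rule can be sketched in a few lines of Python. The sketch assumes that the word probabilities in promoter and nonpromoter training sets are already available as dictionaries, sets the constant c to zero (equal priors) by default, and uses the hypothetical function name `select_informative_words`; none of these details come from the original system.

```python
import math


def select_informative_words(p_promoter, p_nonpromoter, beta=0.002, c=0.0):
    """Rank words by the smoothed log ratio and keep those above beta.

    p_promoter[w] plays the role of P(w | I=1) and p_nonpromoter[w] the
    role of P(w | I=0); a word missing from a dictionary has probability 0.
    """
    delta = {}
    for word in set(p_promoter) | set(p_nonpromoter):
        p1 = p_promoter.get(word, 0.0)
        p0 = p_nonpromoter.get(word, 0.0)
        # Adding 1 to both terms keeps the ratio finite when either is zero.
        delta[word] = math.log((p1 + 1.0) / (p0 + 1.0)) + c
    ranked = sorted(delta, key=delta.get, reverse=True)
    return [word for word in ranked if delta[word] > beta]


# Toy probabilities, invented for illustration only.
p_prom = {"TATAW": 0.02, "GCGCS": 0.01, "AAAAW": 0.001}
p_non = {"TATAW": 0.005, "GCGCS": 0.01, "CCCCS": 0.02}
group = select_informative_words(p_prom, p_non)
print(group)
```

Only words that are clearly more frequent in promoters than in nonpromoters pass the threshold; words with equal or lower promoter frequency are dropped.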

The sequences used to obtain the IWMs and the most discriminative words come from the training sets at http://hsc.utoledo.edu/bioinfo/eid/index.html. In this way, we obtain the most informative word group {G} for predicting promoters. Promoter prediction itself is described in the next subsection.

B. Promoter Prediction

In the sections above, we obtained the IWM and its corresponding word group {G}. In this section, we present a new algorithm to discriminate promoters from nonpromoters. Our system consists of two classifiers: a Promoter-Exon classifier and a Promoter-Intron classifier. Each classifier attempts to discriminate promoters from its corresponding class of nonpromoters, exons or introns. An unknown subsequence is assigned to the promoter class only if both classifiers vote for it. An unknown DNA sequence Q of length QL can be written as a series of K-words ($W_1, W_2, \dots, W_n$):

$$Q = W_1 W_2 \dots W_i \dots W_n \tag{9}$$

where $W_i$ is the K-word at the ith position in Q and n equals QL-K+1. For each word $W_i$ in the unknown DNA sequence, two scores, $\mathrm{IWM}_{p}\{W_i\}$ and $\mathrm{IWM}_{np}\{W_i\}$, can be calculated from $\mathrm{IWM}_{promoter}$ and $\mathrm{IWM}_{nonpromoter}$, respectively. From these, two scores of the DNA sequence, $S_{promoter}$ and $S_{nonpromoter}$, can be obtained. To express that all the words in Q appear together in a fixed order, we use the product of events. Because the appearance of each word is treated as an independent event,

$$S(Q) = \prod_{i=1}^{n} \mathrm{IWM}_{-j}^{-(j+IL)}\{W_i\} \tag{10}$$

where $\mathrm{IWM}_{-j}^{-(j+IL)}\{W_i\}$ is the probability of word $W_i$ appearing in the interval [-j, -(j+IL)], j equals $((i-1) \backslash IL) \times IL$, i is the position of word $W_i$ in the unknown sequence Q, K is the word length, and IL is the interval length of the IWM.

If any word $W_i$ in Q does not appear in the IWM, the scores of the unknown DNA sequence are defined as follows:

$$S_{promoter}(Q) = \left| \sum_{i=1}^{n} \log\left( \mathrm{IWM}_{p,\,-j}^{-(j+IL)}\{W_i\} + eps \right) \right|, \qquad S_{nonpromoter}(Q) = \left| \sum_{i=1}^{n} \log\left( \mathrm{IWM}_{np,\,-j}^{-(j+IL)}\{W_i\} + eps \right) \right| \tag{11}$$

where eps equals $10^{-6}$ and guards against the case in which either $\mathrm{IWM}_{p,\,-j}^{-(j+IL)}\{W_i\}$ or $\mathrm{IWM}_{np,\,-j}^{-(j+IL)}\{W_i\}$ is nil, and $j = ((i-1) \backslash IL) \times IL$.

Since the word group {G} in the IWM is selected by discriminating promoters from nonpromoters, we can expect a smaller score if the sequence belongs to promoter regions and a greater score if it does not. Based on this principle, an unknown sequence is assigned to the promoter class as long as it satisfies the following conditions simultaneously:
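A minimal Python sketch of the log-based score follows. It assumes an IWM stored as a nested dictionary `{interval_index: {word: probability}}`, 0-based positions counted from the start of the window, and half-open intervals of `interval_len` positions; these conventions and the name `sequence_score` are assumptions made for the example, not details from the paper.

```python
import math

EPS = 1e-6  # the small constant added before taking the logarithm


def sequence_score(seq, iwm, k, interval_len, eps=EPS):
    """Absolute value of the summed log interval probabilities of all K-words.

    Words that are missing from the IWM contribute log(eps), so a sequence
    made of unfamiliar words receives a large score.
    """
    total = 0.0
    for pos in range(len(seq) - k + 1):
        word = seq[pos:pos + k]
        prob = iwm.get(pos // interval_len, {}).get(word, 0.0)
        total += math.log(prob + eps)
    return abs(total)


toy_iwm = {0: {"TATA": 0.5}}  # toy matrix, invented for illustration
known = sequence_score("TATA", toy_iwm, k=4, interval_len=2)
unknown = sequence_score("GGGG", toy_iwm, k=4, interval_len=2)
print(known, unknown)
```

Because probabilities are at most about 1, the summed logarithm is negative and its absolute value is small when the words are likely, which matches the expectation that promoter sequences score lower on the promoter IWM.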

$$\frac{S_{nonpromoter} - S_{promoter}}{S_{promoter}} > T_1, \qquad T_{down} < S_{promoter} < T_{up} \tag{12}$$

where $S_{promoter}$ and $S_{nonpromoter}$ are calculated by Eq. (11), and $T_1$, $T_{up}$, and $T_{down}$ are thresholds whose initial values are obtained from the training sets by Eq. (13).

TABLE III
COMPARISON OF PREDICTION SYSTEMS ON THE EVALUATION SETS SHOWN IN TABLE II

| Method | TP | FP | Se (%)a | Sp (%)b |
| --- | --- | --- | --- | --- |
| PromoterInspector | 14 | 20 | 40.0 | 41.2 |
| FirstEF (P=0.98) | 17 | 18 | 48.6 | 48.5 |
| Eponine (t=0.9975) | 18 | 20 | 51.4 | 47.3 |
| DPF (se=0.45) | 16 | 28 | 45.7 | 36.4 |
| New system | 19 | 14 | 54.3 | 57.6 |

a Sensitivity: Se = TP/(TP+FN); b Specificity: Sp = TP/(TP+FP); FN: false negative.
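The two measures in the footnote can be checked with a few lines of Python. The sketch below recomputes the row for the new system from its TP and FP counts, using the 35 known TSSs of the evaluation set as TP+FN; the helper names are illustrative.

```python
def sensitivity(tp, n_known):
    """Se = TP / (TP + FN), where TP + FN is the number of known TSSs."""
    return 100.0 * tp / n_known


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined in the table footnote."""
    return 100.0 * tp / (tp + fp)


N_KNOWN_TSS = 35  # total TSSs in the evaluation set
se_new = round(sensitivity(19, N_KNOWN_TSS), 1)
sp_new = round(specificity(19, 14), 1)
print(se_new, sp_new)
```

Note that Sp as defined here is the fraction of predictions that are correct, which is often called precision or positive predictive value elsewhere.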


$$T_1 = \frac{\left| Savg_{promoter} - Savg_{nonpromoter} \right|}{Savg_{promoter}} \tag{13}$$

[Equation (13) also defines the initial values of $T_{up}$ and $T_{down}$ in terms of $Savg_{promoter}$, $Savg_{nonpromoter}$, and s; these two expressions are not legible in this transcript.]

Here $Savg_{promoter}$ and $Savg_{nonpromoter}$ are the average scores of all training sets calculated on $\mathrm{IWM}_{promoter}$ and $\mathrm{IWM}_{nonpromoter}$, respectively, and s is the predefined sensitivity, s=0.5. $T_1$, $T_{up}$, and $T_{down}$ are adjusted in small increments so that the sensitivity approaches the predefined value while specificity is optimized as well. An unknown sequence is assigned to the promoter class only if it satisfies condition (12) for both the Promoter-Exon classifier and the Promoter-Intron classifier.
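The voting rule of the two classifiers can be sketched as follows. The sketch assumes that condition (12) consists of a relative-difference test against T1 and a range test on the promoter score, and every threshold and score value in the example is invented for illustration; the function names are hypothetical.

```python
def passes_condition(s_promoter, s_nonpromoter, t1, t_down, t_up):
    """Check both parts of the threshold condition for one classifier.

    Assumes s_promoter > 0, which holds for any non-empty window scored
    with a positive eps.
    """
    ratio_ok = (s_nonpromoter - s_promoter) / s_promoter > t1
    range_ok = t_down < s_promoter < t_up
    return ratio_ok and range_ok


def is_promoter(exon_scores, intron_scores, exon_thresholds, intron_thresholds):
    """A window is called a promoter only if both classifiers accept it.

    Each *_scores argument is (s_promoter, s_nonpromoter); each
    *_thresholds argument is (t1, t_down, t_up).
    """
    return (passes_condition(*exon_scores, *exon_thresholds)
            and passes_condition(*intron_scores, *intron_thresholds))


# Hypothetical thresholds and scores, for illustration only.
thresholds = (0.2, 50.0, 120.0)
accepted = is_promoter((100.0, 150.0), (100.0, 130.0), thresholds, thresholds)
rejected = is_promoter((100.0, 150.0), (100.0, 110.0), thresholds, thresholds)
print(accepted, rejected)
```

Requiring agreement between the Promoter-Exon and Promoter-Intron classifiers trades some sensitivity for fewer false positives, which is the stated goal of the system.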

IV. EXPERIMENTAL RESULTS

A. Training Sequence Sets

The vertebrate promoter sequences we used come from the database of transcription start sites (DBTSS) [17]; all training promoter sequences are human. For each sequence, a section is taken from 200 bp upstream to 100 bp downstream of the TSS. We downloaded exon/intron data from http://hsc.utoledo.edu/bioinfo/eid/index.html and extracted vertebrate exon and intron sequences from them. All training sequences are thus 300 bp long and nonoverlapping. After removing redundant sequences with the program CleanUp, the training sets consist of 565 promoter sequences, 890 exon sequences, and 4345 intron sequences. Finally, to obtain an unbiased and credible result, we also removed from the training sets all promoter sequences that appear in the test sets: 313 promoters in DBTSS belong to human chromosome 22, 11 to chromosome X, and 1 to chromosome 21.

B. System Setting Optimization

Prediction results differ for different word lengths K and interval lengths IL. To optimize the system settings, we ran many experiments in which K and IL were varied separately to determine how each affects the prediction results. First, we examined the relationship between K and prediction performance. We varied K from 4 to 7. Performance improves as K increases up to K=5 and degrades as K increases beyond 5. This confirms that the system with K=5 performs best on our test sets. Second, we examined the relationship between IL and prediction performance. With K=5 fixed, IL was varied from 4 to 8. The results show that performance degrades when the interval length is either increased or decreased from IL=6. Thus, the system with IL=6 performs best on our test sets. In conclusion, with K=5 and IL=6 the system is at its best setting and achieves high sensitivity and its best specificity.

Because our system consists of two classifiers, the Promoter-Exon classifier and the Promoter-Intron classifier, we optimize the parameters of each classifier separately. In each classifier, the three parameters $T_1$, $T_{up}$, and $T_{down}$ are initialized by Eq. (13) and adjusted to meet the predefined sensitivity s=0.5. In this process, we try to reduce false positives while preserving sensitivity. Finally, we obtain optimized parameters for both classifiers in our system.

After setting the system to its best configuration (K=5, IL=6), we test it on the evaluation sets. The evaluation sets are described in TABLE I and the test results are given in TABLE II. Comparisons of the different prediction systems are shown in TABLE III and TABLE IV.

C. Large Genomic Sequence Analysis and Comparisons

Promoter regions in large genomic sequences are identified with a sliding-window approach. The window moves over the sequence and the content of the window is classified. In our system, the window length is 300 bp (the length of the training sequences) and the step is IL bp (the interval length). When one or more predictions fall in the region [-2000, +2000] relative to the reference promoter location, the respective gene is counted as a TP.
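The sliding-window scan and the TP count can be sketched in Python as below. The window length of 300 and the step of 6 follow the settings mentioned in the text (step equal to IL); the coordinates in the example are invented, and `sliding_windows` and `count_true_positives` are hypothetical helper names rather than parts of the original system.

```python
def sliding_windows(seq, window=300, step=6):
    """Yield (start, subsequence) for every full window along `seq`."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def count_true_positives(predicted_positions, reference_tss, tolerance=2000):
    """Count reference TSSs with at least one prediction within +/- tolerance."""
    return sum(
        any(abs(pred - tss) <= tolerance for pred in predicted_positions)
        for tss in reference_tss
    )


starts = [start for start, _ in sliding_windows("A" * 312)]
print(starts)

# Invented coordinates, for illustration only.
tp = count_true_positives([1000, 50000], [2500, 10000, 51000])
print(tp)
```

Each window returned by `sliding_windows` would be passed to the two classifiers, and the start positions of accepted windows would serve as the predicted positions.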

To further verify the effectiveness of the new system, we compare it with four other prediction systems: PromoterInspector [5], FirstEF, Eponine [9], and Dragon Promoter Finder [8]. These four systems were selected because they have been shown to be among the best-performing prediction systems. The evaluation set is currently a standard for evaluating the performance of promoter recognition systems. It consists of six GenBank genomic sequences with a total length of 1.38 Mb and 35 known TSSs. The details of the evaluation set are shown in TABLE I. We adopt the same evaluation criterion used by PromoterInspector [3]: a predicted region is counted as correct if a TSS is located within the region or if a region boundary is within 200 bp 5' of such a TSS. The experimental results on the evaluation sets are given in TABLE II, and comparisons between the five prediction systems are shown in TABLE III. The test results of the five prediction systems on human chromosome 22 are shown in TABLE IV.

In these experiments, PromoterInspector and our system are used with default settings; for a fair comparison, DPF is run with se=0.45, FirstEF with P=0.98, and Eponine with t=0.9975. Comparing the experimental results of the other four systems with those of our new system confirms that our method performs best in both sensitivity and specificity.

V. CONCLUSION

Eukaryotic promoter identification is an important issue in DNA analysis, as promoters indicate the start of transcription. However, it is also a very difficult problem. Although considerable effort has been devoted to solving it, the most serious problem, a high FP rate, still exists. In this paper, we focus on two features of promoter regions: (I) the variation in the structure of core-promoters; (II) the variation in the position of binding motifs. These two features mean that the previous PWM system is not sufficient to reflect the relationship between core-promoters and their positions. To reflect these two features and to set up a better relationship between words and their positions, we embody both features in our new system and optimize the system settings. After testing on the evaluation sets and other testing sets, the experimental results confirm that our new system is effective in reducing false positives. In the future, we will integrate our features with other effective features and employ machine learning [18] to further improve prediction accuracy.

REFERENCES

[1] E. S. Lander, “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, Feb. 2001.

[2] J. C. Venter et al., “The Sequence of the Human Genome,” Science, vol. 291, no. 5507, pp. 1304-1351, Feb. 2001.

[3] J. W. Fickett and A. G. Hatzigeorgiou, “Eukaryotic Promoter Recognition,” Genome Res., vol. 7, pp. 861-878, Sep. 1997.

[4] D. S. Prestridge, “Computer Software for Eukaryotic Promoter Analysis,” Methods Mol. Biol., vol. 130, pp. 265-295, Oct. 2000.

[5] M. Scherf, A. Klingenhoff, and T. Werner, “Highly Specific Localization of Promoter Regions in Large Genome Sequences by PromoterInspector: A Novel Context Analysis Approach,” J. Mol. Biol., vol. 297, pp. 599-606, Mar. 2000.

[6] I. P. Ioshikhes and M. Q. Zhang, “Large-Scale Human Promoter Mapping Using CpG Island,” Nat. Genet., vol. 26, pp. 61-63, Sep. 2000.

[7] R. V. Davuluri, I. Grosse, and M. Q. Zhang, “Computational Identification of Promoters and First Exons in the Human Genome,” Nat. Genet., vol. 29, pp. 412-417, Nov. 2001.

[8] V. B. Bajic, S. H. Seah, A. Chong, S. P. T. Krishnan, J. L. Y. Koh, and V. Brusic, “Computer Model for Recognition of Functional Transcription Start Sites in Polymerase Promoters of Vertebrates,” J. Mol. Graphics Model., vol. 21, pp. 323-332, Mar. 2003.

[9] T. A. Down and T. J. Hubbard, “Computational Detection and Location of Transcription Start Sites in Mammalian Genomic DNA,” Genome Res., vol. 12, no. 3, pp. 458-461, Dec. 2002.

[10] L. Ponger and D. Mouchiroud, “CpGProD: Identifying CpG Islands Associated with Transcription Start Sites in Large Genomic Mammalian Sequences,” Bioinformatics, vol. 18, no. 4, pp. 631-633, Nov. 2002.

[11] T. Werner, “The State of the Art of Mammalian Promoter Recognition,” Briefings Bioinf., vol. 4, pp. 22-30, Mar. 2003.

[12] D. S. Prestridge and C. Burks, “The Density of Transcriptional Elements in Promoter and non-Promoter Sequences,” Hum. Mol. Genet., vol. 2, no. 9, pp. 1449-1453, June 1993.

[13] G. B. Hutchinson, “The Prediction of Vertebrate Promoter Regions Using Differential Hexamer Frequency Analysis,” Comput. Appl. Biosci., vol. 12, pp. 391-398, Oct. 1996.

[14] S. H. Cross, V. H. Clark, and A. P. Bird, “A Cluster of Four Receptor-like Genes Resides in the Vf Locus that Confers Resistance to Apple Scab Disease,” Nucleic Acids Res., vol. 27, pp. 2099-2107, May 1999.

[15] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, Wiley, 1991.

[16] R.C. Deonier, S. Tavaré, and M.S. Waterman, Computational Genome Analysis: An Introduction. New York, Springer, 2005.

[17] Y. Suzuki, “Diverse Transcriptional Initiation Revealed by Fine Large-Scale Mapping of mRNA Start Sites,” EMBO Rep., vol. 2, no. 5, pp. 388-393, 2001.

[18] X. Xie, S. Wu, K. M. Lam, and H. Yan, “Promoter Explorer: An effective Promoter Identification Method Based on the AdaBoost Algorithm,” Bioinformatics, vol. 22, pp. 2722-2728, Nov. 2006.

[19] S. Wu, X. Xie, Alan Wee-Chun Liew, and H. Yan, “Eukaryotic Promoter Prediction Based on Relative Entropy and Positional Information,” Physical Review, vol. E75, pp. 1-7, April 2007.

[20] V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano, “Promoter Prediction Analysis on the Whole Human Genome,” Nature Biotechnology, vol. 22, no. 11, Nov. 2004.

[21] V. B. Bajic, S. H. Seah, A. Chong, G. Zhang, J. L. Y. Koh, and V. Brusic, “Dragon Promoter Finder Recognition of Vertebrate RNA Polymerase Promoters,” Bioinformatics Application Note, vol. 18, pp. 198-199, Aug. 2011.

TABLE IV
RESULTS AND COMPARISONS OF PREDICTION SYSTEMS ON HUMAN CHROMOSOME 22

Method               TP    FP    Se(%)a   Sp(%)b
PromoterInspector    239   274   60.8     46.6
FirstEF (P=0.98)     242   270   61.5     47.2
Eponine (t=0.9975)   247   248   62.8     49.9
DPF (se=0.45)        241   482   61.3     33.3
New system           245   157   62.3     60.9

a Sensitivity: Se = TP/(TP+FN); b Specificity: Sp = TP/(TP+FP); FN: false negatives.
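The two measures defined in the footnote of TABLE IV can be computed directly from the counts. The short sketch below recomputes the Sp column from the reported TP and FP values; the function and variable names are illustrative. Note that Sp as defined here, TP/(TP+FP), is the quantity often called precision or positive predictive value elsewhere. Sensitivity additionally requires FN, which the table does not list, so it is shown only as a function.

```python
def sensitivity(tp, fn):
    """Se = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined in the footnote of TABLE IV."""
    return tp / (tp + fp)


# (TP, FP, reported Sp in %) copied from TABLE IV.
TABLE_IV = {
    "PromoterInspector": (239, 274, 46.6),
    "FirstEF (P=0.98)": (242, 270, 47.2),
    "Eponine (t=0.9975)": (247, 248, 49.9),
    "DPF (se=0.45)": (241, 482, 33.3),
    "New system": (245, 157, 60.9),
}

if __name__ == "__main__":
    for method, (tp, fp, reported) in TABLE_IV.items():
        computed = 100 * specificity(tp, fp)
        print(f"{method}: computed Sp = {computed:.2f}%, reported = {reported}%")
```

The recomputed values agree with the reported Sp column to within 0.1 percentage point; small differences in the last digit may come from rounding in the original table.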