

Analysing gene expression patterns using the Dinucleotide Properties Genome Browser (DiProGB)

Maik Friedel, Swetlana Nikolajewa, Thomas Wilhelm and Jürgen Sühnel

Leibniz Graduate School on Ageing

FLI

Introduction

New approaches to motif discovery in nucleotide sequences are still urgently required. Here we present an analysis of the sequence regions around gene start and stop positions for highly and lowly expressed genes of the Escherichia coli K12 MG1655 genome, carried out with the DiProGB genome browser (http://diprogb.fli-leibniz.de). Through a statistical analysis of genes superimposed at these positions, we found significant differences between the respective gene groups.
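The superposition of genes at a common anchor position can be illustrated with a minimal sketch. It assumes that equal-length windows around the anchor have already been cut out and oriented 5' to 3'; the function name `position_frequencies` and the toy windows are hypothetical and are not part of DiProGB.

```python
from collections import Counter


def position_frequencies(windows, alphabet="ATGC"):
    """Percentage of each base at every column of equal-length, aligned windows."""
    if not windows:
        return []
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    # zip(*windows) yields one tuple per aligned column
    for column in zip(*windows):
        counts = Counter(column)
        profile.append(
            {base: 100.0 * counts[base] / len(windows) for base in alphabet}
        )
    return profile


# Toy windows, made up for illustration (not E. coli data)
toy_windows = ["ATGA", "ATGC", "TTGA", "ATGG"]
profile = position_frequencies(toy_windows)
print(profile[0])  # base percentages at the first aligned column
```

Plotting each base's percentage against the column index, shifted so that the anchor sits at position 0, gives curves of the kind shown in the position-specific panels.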

In addition to the GenBank data (NC_000913), we used operon information from EcoCyc. The expression data were taken from the Many Microbe Microarrays Database (M3D), which provides 380 uniformly normalized Affymetrix microarrays from different experiments for all known putative E. coli genes (4298). From these data we extracted the 400 most highly expressed and the 400 most lowly expressed genes, using the mean expression over all experiments as the ranking criterion.
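The selection of the two gene groups can be sketched as follows. This is only an illustration of ranking by mean expression; the function name `split_by_mean_expression`, the dictionary layout and the toy values are assumptions and do not reflect the M3D file format.

```python
from statistics import mean


def split_by_mean_expression(expression, n):
    """Return (n most highly expressed, n most lowly expressed) gene names.

    expression: dict mapping gene name -> list of expression values,
                one value per experiment (assumed layout).
    """
    if n < 1 or 2 * n > len(expression):
        raise ValueError("n must be at least 1 and at most half the gene count")
    # Rank genes by their mean over all experiments, highest first
    ranked = sorted(expression, key=lambda g: mean(expression[g]), reverse=True)
    return ranked[:n], ranked[-n:]


# Toy values, made up for illustration
toy_expression = {
    "geneA": [9, 10, 11],
    "geneB": [2, 3, 1],
    "geneC": [6, 5, 7],
    "geneD": [12, 13, 11],
}
highest, lowest = split_by_mean_expression(toy_expression, 1)
print(highest, lowest)
```

With the real data, `n` would be 400 and each list would hold 380 values, one per microarray.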

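Basic Statistics

Position Specific Statistics

[Figure layout: rows for lowly expressed genes, all genes and highly expressed genes; each row runs from the 5' side across the gene start and the gene end to the 3' side.]

Motifs sorted by occurrence in lowly expressed genes, all genes and highly expressed genes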

Conclusion

DiProGB is a powerful tool for analyzing differences between gene groups. Position-specific statistics in combination with feature selection make it possible to find motifs that are significantly over- or underrepresented in each group and to determine their positions. All analyses can be done both on the sequence level and on sequences encoded by physical dinucleotide properties.

| Sequence | Total length (nt) | Number of sequences | Average gene length | GC content | Y content | Keto content | A content | T content | G content | C content |
|---|---|---|---|---|---|---|---|---|---|---|
| Complete genome | 4639675 | 1 | - | 0.508 | 0.5 | 0.5 | 0.246 | 0.246 | 0.254 | 0.254 |
| All_start_+-100 | 897000 | 4485 | 829.5 (±655) | 0.470 (±0.059) | 0.490 (±0.043) | 0.506 (±0.042) | 0.267 (±0.045) | 0.263 (±0.044) | 0.243 (±0.040) | 0.227 (±0.039) |
| Lowly_start_+-100 | 80000 | 400 | 696.5 (±724) | 0.409 (±0.068) | 0.491 (±0.045) | 0.507 (±0.043) | 0.296 (±0.051) | 0.295 (±0.048) | 0.213 (±0.042) | 0.196 (±0.043) |
| Highly_start_+-100 | 80000 | 400 | 546.5 (±618) | 0.474 (±0.044) | 0.475 (±0.036) | 0.502 (±0.040) | 0.274 (±0.039) | 0.251 (±0.037) | 0.251 (±0.032) | 0.223 (±0.031) |
| All_end_+-100 | 896925 | 4485 | - | 0.485 (±0.058) | 0.488 (±0.042) | 0.507 (±0.041) | 0.260 (±0.045) | 0.255 (±0.042) | 0.252 (±0.040) | 0.233 (±0.038) |
| Lowly_end_+-100 | 80000 | 400 | - | 0.423 (±0.062) | 0.491 (±0.045) | 0.501 (±0.043) | 0.292 (±0.050) | 0.284 (±0.046) | 0.216 (±0.038) | 0.207 (±0.041) |
| Highly_end_+-100 | 80000 | 400 | - | 0.493 (±0.041) | 0.480 (±0.040) | 0.498 (±0.036) | 0.264 (±0.037) | 0.243 (±0.035) | 0.256 (±0.031) | 0.238 (±0.030) |
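The composition columns can be reproduced with a short sketch. It assumes the standard IUPAC groupings, Y (pyrimidine) = C + T and keto = G + T, which agree with the tabulated values (for All_start_+-100: 0.263 + 0.227 = 0.490 and 0.243 + 0.263 = 0.506). The function name `composition` and the toy sequence are hypothetical.

```python
def composition(seq):
    """Fractions of A, T, G, C plus GC, Y (C+T) and keto (G+T) content."""
    seq = seq.upper()
    # Only unambiguous bases contribute to the total
    counts = {base: seq.count(base) for base in "ATGC"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, T, G or C")
    frac = {base: counts[base] / total for base in "ATGC"}
    frac["GC"] = frac["G"] + frac["C"]
    frac["Y"] = frac["C"] + frac["T"]      # pyrimidines
    frac["keto"] = frac["G"] + frac["T"]   # keto bases
    return frac


# Toy sequence, made up for illustration
stats = composition("ATGCGT")
print(stats)
```

For a gene group, the per-window values would be averaged to give the tabulated figures.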

Legend: 2 – 4 %, 5 – 9 %, > 9 %

[Figure: position-specific nucleotide content in % (A, T, G, C; axis range 0 to 70) plotted against the relative sequence position (nt; ticks from -100 to 93). Six panels: start of all 4485 genes, start of the 400 highest expressed genes, start of the 400 lowest expressed genes, end of all 4485 genes, end of the 400 highest expressed genes, end of the 400 lowest expressed genes.]

T-rich region

A-rich at position 5 (>40%), C = 21%

A = T = ~30%, G = C = ~20%

G and T content clearly depend on the codon position; A and C content do not

dominant purine stretch upstream

increased G content, decreased T content

C (~37%) is preferred over A (~29%) in position 5

A = T = G = C = ~25%

The motif TTT is primarily responsible for the increased T content at the start region of the lowly expressed genes.

The motifs GAAT and GAAAA are frequently found in lowly expressed genes starting at the 3rd base downstream and are thus responsible for the A-richness at position 5.

The motifs GAGC and GAG are frequently found in highly expressed genes starting at the 3rd base downstream and are thus responsible for the frequent G at position 5.

CGT is very frequent in highly expressed genes leading to their high C and G content

Different purine-rich motifs lead to the significant purine peak 10 nt upstream: AAGG, GAG, AGG, AGA, AGGA, GGA

A-rich region (~40%) between 6-10 bases after the stop position

again A = T = 30% and G = C = 20%

increased probability for T between 30-70 bases downstream

A-rich region (~35%) 15-25 bases after the stop position

again A = T = G = C = 25%

TTT and TTTT are frequent in high and low genes between 30-70 bases downstream of the stop. This might explain the elevated T content in this region for all genes.

A/T-rich motifs such as ATT, ATA and TTT are very frequent in and near lowly expressed genes and therefore lead to the increased A/T content.

AAT, TAA occur frequently in lowly expressed genes in the region 3 – 10 nt after the stop. This may explain the A peak in that region.

AAT is very frequent in highly expressed genes in the 15 – 20 nt region after the stop. This may explain the A peak of highly expressed genes in that region.

The higher frequency of GC-rich motifs (GCG, CGT, CCG) in highly expressed genes leads to an elevated GC content.
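The motif observations above rest on counting where a motif starts across the superimposed windows of a gene group. A minimal sketch of such a count is given here; the function name `motif_start_counts` and the toy windows are hypothetical, overlapping occurrences are counted, and indices are 0-based within the window rather than relative to the gene start or stop.

```python
from collections import Counter


def motif_start_counts(windows, motif):
    """Count, per window index, how many windows have the motif starting there."""
    counts = Counter()
    for window in windows:
        start = window.find(motif)
        while start != -1:
            counts[start] += 1
            # Advance by one so overlapping occurrences are also found
            start = window.find(motif, start + 1)
    return counts


# Toy windows, made up for illustration (not E. coli data)
toy_windows = ["ATGAGCTTT", "ATGAGGTTT", "ATGCGTTTT"]
gag = motif_start_counts(toy_windows, "GAG")
ttt = motif_start_counts(toy_windows, "TTT")
print(dict(gag), dict(ttt))
```

Comparing such counts between the lowly and the highly expressed group, position by position, points to motifs that are enriched at a given location in one group.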