regulatory sequences (basics) alexander kel senior vice president of genome informatics, biobase...

Regulatory Sequences

(Basics)

Alexander Kel

Senior Vice President of Genome Informatics, BIOBASE GmbH, Halchtersche Strasse 33, D-38304 Wolfenbuettel, Germany, www.biobase.de

http://www.biobase.de/



TRANSCompel

TRANSFAC

TRANSPATH

Patho DBS/MARt DB

- mechanistic- semantic

Match Patch

Catch

Pathway builder Array analyser

Cytomer TRANSGenome TRANSPLORER

CMFinder

BIOBASE customers*

* not complete

TRANSFAC: Syngenta, Celera, Monsanto, Pfizer, Merck Sharp & Dohme, Amgen, Takeda, Novartis, GlaxoSmithKline

TRANSPATH: Vertex

Both: Aventis, Eli Lilly, Schering Plough, Hoffmann La Roche, Akzo Nobel

More than 200 academic labs including:

Harvard Stanford Tokyo University Riken Labs Max Planck

More than 7000 registered users on our portal

gene-regulation.com

Same blocks - different structures

LEGO system

Concepts of gene regulation

DNA

RNA

protein

transcription

translation

amplification, methylation,chromatin structure

splicing, degradation

modification, degradation

information carrier 1

transformation

carrier organization

information carrier 2

TRANSFAC: Gene structure

[Figure labels: contig, gene, regulatory elements, transcription, primary transcript, splicing, alternative exon, splice variants (mRNA), 5'-UTR, CDS, 3'-UTR.]

[Figure labels: enhancer 1, enhancer 2, promoter, TSS, TATA box, initiator (Inr), boxes A, A', A'', B, C, D, D', E, F, G, composite element; cis and trans.]

General schema of the modular hierarchical structure of transcription regulatory regions of eukaryotic genes.

Human genes: positions and sequences of AP-1 binding sites

Gene                           Position
glutathione P-transferase      enhancer at -2500
hemoglobin, epsilon            -80 bp
Akt-2                          -100 bp
IFN-                           -89 bp
Apo AII                        -792 bp
Melanotransferrin              -2013 bp
Collagenase                    -72 bp
proto-oncogene c-myc           -335 bp
porphobilinogen deaminase      -162 bp
GM-CSF                         enhancer at -3500

Site sequences (slide order): TGACTTT, TGACATC, TGTCACC, TGACTCA, TGAGTCA, TGAGTCA, TGATTTA, TGACTCA, TGACTCA

What is a transcription factor?

A transcription factor is a protein that regulates transcription after nuclear translocation by specific interaction with DNA, or by stoichiometric interaction with a protein that can be assembled into a sequence-specific DNA-protein complex.

Transcription factors

Sequence-specific DNA binding

Non-DNA binding

TF1 TF2 TF3 TF4

adapter

Co-activator

HAT

DNA

Layer I

Layer III

Layer II

Structure of transcription factors

USF-1, dimer

DNA binding domain

Activation domain

oligomerization domain

Ligand- binding domain

Protein-protein interaction domain

Structure of transcription factors

N   Gene                                   Positions of the CE       Factors          TRANSCompel accession number
1   Scavenger receptor, Homo sapiens       enhancer -4500/-4100      AP-1 / Ets       C00080
2   GM-CSF, Mus musculus                   -53, -40                  AP-1 / Ets       C00081
3   Collagenase, Homo sapiens              -89, -82, -72, -66        AP-1 / Ets       C00083
4   IgH, Mus musculus                      enhancer at 3' flank      AP-1 / Ets       C00133
5   Interleukin 2, Homo sapiens            -283, -268                AP-1 / NFAT      C00109
6   Interleukin 2, Homo sapiens            -167, -142                AP-1 / NF-κB     C00165
7   Interleukin 2, Mus musculus            -167, -142                AP-1 / Oct-2     C00158
8   IgH, Homo sapiens                      (not given)               Ets / CBF        C00173
9   Serum amyloid A1, Rattus norvegicus    -117, -73                 NF-κB / C/EBP    C00101
10  IRF-1, Mus musculus                    -123, -113, -49, -40      NF-κB / STAT-1   C00192

Ternary complex NFATp - AP1 - DNA

[Figure: factors F1 and F2; panels labelled "Low level of transcription" (twice) and "Synergistic activation of transcription".]

Composite elements

Minimal functional units where both protein-DNA and protein-protein interactions contribute to a highly specific pattern of gene expression and provide cross-coupling of different signal transduction pathways.

Integration of signals. Cross-coupling of signal transduction pathways

[Figure labels: membrane receptor, Src, SH2, SH3, adaptors, Ras (GDP/GTP), PLC, PI3-K, PKB/Akt, phosphorylation, IP3, Ca2+, Ca2+-dependent channel, calcineurin, ERK, JNK, p38 MAPK, NFATp, c-Fos, c-Jun, ATF-2, composite element, IL-2; compartments: cytoplasm, nucleus.]

Mechanisms of functioning of synergistic composite elements

1) Cooperative binding to DNA and ternary complex formation
2) A new protein surface for DNA recognition could be formed
3) Simultaneous interaction of activation domains with the components of the basal complex
4) Forming a new protein surface for interaction with the basal complex
5) Relief of autoinhibition as a result of protein-protein interactions
6) DNA bending by one of the transcription factors
7) DNA wrapping around a nucleosome allows transcription factors to interact
8) Recruitment of a HAT complex by one of the transcription factors

Mechanisms of functioning of antagonistic composite elements

1) Mutually exclusive binding of factor F1 (activator) and F2 (repressor); the figure shows a HAT complex and an HDAC complex
2) Binding of F2 (repressor) results in conformational changes of F1 (activator)

Mouse IL-4 promoter

[Figure: promoter map up to +1 (ST), with marks at -249, -180, -150, -114, -88, -60 and -28. Labelled sites: AP-1, NFAT, HMG Y, STAT 6, NF-Y, c-MAF, TATA; composite elements (CE) are marked.]

GM-CSF, Homo sapiens

[Figure: T-cell specific inducible enhancer at -3500 bp and the promoter up to +1 (ST), with marks at -114, -88 and -54. Labelled sites: AP-1 / NFAT composite elements (CE), NF-κB p50/p65, NF-κB c-Rel/p65, HMG Y(I), CD28 response element, CBF, TATTT.]

Recruitment of CIITA to MHC-II promoters. A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the W, X, X2, and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind cooperatively to these sequences and assemble into a stable higher order nucleoprotein complex referred to here as the MHC-II enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the W, X, X2, and Y-binding factors. The octamer site found in the HLA-DRA promoter (O), and its cognate activators (Oct and OBF-1) are not required for recruitment of CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation domains (AD), which contact the RNA polymerase II basal transcription machinery.

Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66

Enhanceosome

[Figure labels: site-specific TF, co-activator p300/CBP, acetylase PCAF, acetylation, closed nucleosomes, RNA pol II, TFIIA, TFIIB, TFIID, TFIIE, TFIIF, TFIIH.]

Scaffold/matrix attached regions (S/MARs) are regions of the DNA strand that are found at the base of chromatin loops. They anchor the DNA to the proteinaceous nuclear matrix.

Each loop is considered to be a functional domain.

S/MARs may act as border elements and thus, protect gene expression from position effects.

S/MARs

[Figure (J. Bode / E. Wingender 1993): chromatin loops attached to the nuclear scaffold. Labels: S/MARs (SAR), genes, residual DNA, enhancer, promoter, gene (transcribed region), LCR, open chromatin (regulated), compact chromatin.]

Databases on gene regulation

• Clear identification of where you are (which species and which protein).

• Tabular presentation of controlled-vocabulary terms.

• Annotations linked to PubMed references.

• Clear paths of navigation between protein reports, within a species and between species.

• Links to ‘public domain’ databases.

BKL: collected information is displayed in a ‘one page per protein’ format = Protein Reports

Databases containing gene regulation information (N. Database: URL)

1. EMBL Nucleotide sequence database: http://www.ebi.ac.uk/embl.html
2. GenBank: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
3. SWISS-PROT: http://www.expasy.ch
4. PIR (Protein Information Resource): http://www-nbrf.georgetown.edu/pir
5. PDB: http://www.pdb.bnl.gov/
6. EPD (Eukaryotic promoter database): http://www.epd.isb-sib.ch
7. TRANSFAC: http://transfac.gbf.de/TRANSFAC
8. TRRD: http://www.bionet.nsc.ru/trrd/
9. COMPEL: http://compel.bionet.nsc.ru/
10. TFD (Transcription factor database): http://www.ifti.org/
11. RegulonDB: http://www.cifn.unam.mx/Computational_Biology/regulondb
12. SCPD (The Promoter Database of Saccharomyces cerevisiae): http://cgsigma.cshl.org/jian/
13. Muscle-Specific Regulation of Transcription (A Catalogue of Regulatory Elements): http://agave.humgen.upenn.edu/MTIR/HomePage.html
14. EpoDB (Database of genes that relate to vertebrate red blood cells): http://agave.hum-gen.upenn.edu/epodb/
15. GENET: http://www.iephb.ru/~spirov/genet00.html
16. PlantCARE: http://sphinx.rug.ac.be:8080/PlantCARE/
17. PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE/
18. DBTSS: http://dbtss.hgc.jp/

EMBL data library

Feature gene

Definition region of biological interest identified as a gene and for which a name has been assigned;

Optional Qualifiers

/allele="text" /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /label= /map="text" /note="text" /product="text" /pseudo /phenotype="text" /standard_name="text" /usedin=accnum:feature_label

Comments the gene feature describes the interval of DNA that corresponds to a genetic trait or phenotype; the feature is, by definition, not strictly bound to its positions at the ends; it is meant to represent a region where the gene is located.

EMBL data library

Feature promoter

Definition region on a DNA molecule involved in RNA polymerase binding to initiate transcription;

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /phenotype="text" /pseudo /standard_name="text" /usedin=accnum:feature_label

Molecule Scope DNA

or look for: (start of) mRNA, or precursor_RNA, or prim_transcript, or exon /number=1, ...

EMBL data library

Feature misc_feature

Definition region of biological interest which cannot be described by any other feature key; a new or rare feature;

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /number=unquoted /phenotype="text" /product="text" /pseudo /standard_name="text" /usedin=accnum:feature_label

Comments this key should not be used when the need is merely to mark a region in order to comment on it or to use it in another feature's location; use the '-' pseudo-key instead.

e.g.:
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"

http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-e+-id+1L7rH1ItVCC+%5BEMBL_features-id:AB010492_4%5D


EMBL data library

Feature enhancer

Definition a cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter;

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /label=feature_label /gene="text" /map="text" /note="text" /standard_name="text" /usedin=accnum:feature_label

Organism Scope eukaryotes and eukaryotic viruses

EMBL data library

Feature protein_bind

Definition non-covalent protein binding site on nucleic acid;

Mandatory Qualifiers /bound_moiety="text"

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /standard_name="text" /usedin=accnum:feature_label

Comments note that RBS is used for ribosome binding sites.

EMBL data library

Qualifier bound_moiety

Definition moiety bound

Value Format "text"

Example /bound_moiety="repressor"

Qualifier usedin

Definition indicates that the feature is used in a compound feature in another entry

Value Format Accession-number:feature-name or Database_name::Acc_number:feature_label

Example /usedin=X10087:proteinx

Comment database_name is an abbreviation for the name of the database in which the entry for the accession number can be found.

EMBL data library

FH   Key             Location/Qualifiers
FH
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"

These are experimentally confirmed sites, though no /evidence qualifier is given.
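The point made by this entry (binding sites annotated without an /evidence qualifier) can be checked mechanically. The following is a minimal sketch, assuming the feature table is available as plain text in the FT line layout shown on the slide; the function names `parse_features` and `sites_without_evidence` are hypothetical, and multi-line qualifier values are not handled.

```python
# Minimal sketch: parse EMBL "FT" lines and list protein_bind features
# that carry no /evidence qualifier. Function names are illustrative only.

FT_TEXT = '''\
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
'''


def parse_features(text):
    """Return a list of dicts with keys: key, location, qualifiers."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            if features:
                name, _, value = body[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
        else:
            parts = body.split(None, 1)
            if len(parts) == 2:
                features.append(
                    {"key": parts[0], "location": parts[1], "qualifiers": {}}
                )
    return features


def sites_without_evidence(features):
    """(location, bound moiety) of protein_bind features lacking /evidence."""
    return [
        (f["location"], f["qualifiers"].get("bound_moiety"))
        for f in features
        if f["key"] == "protein_bind" and "evidence" not in f["qualifiers"]
    ]


features = parse_features(FT_TEXT)
unsupported = sites_without_evidence(features)
print(unsupported)
```

Both E2F sites are reported, which matches the remark on the slide: the annotation alone does not tell a reader whether a site was confirmed experimentally.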


http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+1L7rH1ItVCC+%5Btaxonomy-ID:9606%5D+-e







EMBL data library

FH   Key             Location/Qualifiers
FH
FT   source          1..3204
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   promoter        1..3201
FT                   /note="melanocortin-1 receptor"
FT                   /gene="MC1R"
FT   misc_signal     570..575
FT                   /note="E-BOX"
...
FT   TATA_signal     922..941
FT   protein_bind    1343..1350
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-1"
...
FT   TATA_signal     1553..1559
...
FT   misc_binding    1957..1964
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2060..2067
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2069..2074
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
FT   misc_binding    2603..2608
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"

Here: misc_signal "E-BOX" and TATA_signal are identified by homology and positional reasoning, AP-1 and AP-2 binding sites are suggested by homology, Sp1 sites are confirmed by gel shift analysis.


http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+1L7rH1ItVCC+%5Btaxonomy-ID:9606%5D+-e








EMBL data library

Feature TATA_signal

Definition

TATA box; Goldberg-Hogness box; a conserved AT-rich septamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit which may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T) [1,2];

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /map="text" /note="text" /usedin=accnum:feature_label


Molecule Scope DNA

References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980) [2] Corden, J., et al. "Promoter sequences of eukaryotic protein-encoding genes" Science 209, 1406-1414 (1980)

EMBL data library

Feature CAAT_signal

Definition CAAT box; part of a conserved sequence located about 75 bp up-stream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT [1,2].

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /gene="text" /map="text" /note="text" /usedin=accnum:feature_label


Molecule Scope DNA

References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980) [2] Nevins, J.R. "The pathway of eukaryotic mRNA formation" Ann Rev Biochem 52, 441-466 (1983)

Feature GC_signal

Definition GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation; consensus=GGGCGG;

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /map="text" /note="text" /usedin=accnum:feature_label
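The three consensus strings quoted in these feature definitions can be turned directly into pattern searches. The sketch below is an illustration only: it assumes a plain DNA string as input, searches both strands with regular expressions, and uses a made-up test sequence; the names `CONSENSUS` and `find_signals` are hypothetical. A strict consensus match is a crude detector and will miss many functional sites.

```python
import re

# Consensus patterns as given in the feature definitions above.
CONSENSUS = {
    "TATA_signal": "TATA[AT]A[AT]",   # TATA(A or T)A(A or T)
    "CAAT_signal": "GG[CT]CAATCT",    # GG(C or T)CAATCT
    "GC_signal": "GGGCGG",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_signals(seq):
    """Return (name, strand, 1-based start on the + strand, matched text)."""
    seq = seq.upper()
    hits = []
    for name, pattern in CONSENSUS.items():
        for strand, target in (("+", seq), ("-", reverse_complement(seq))):
            # A lookahead is used so that overlapping matches are reported.
            for m in re.finditer("(?=(%s))" % pattern, target):
                start = m.start()
                if strand == "-":
                    # Convert the coordinate back to the + strand.
                    start = len(seq) - start - len(m.group(1))
                hits.append((name, strand, start + 1, m.group(1)))
    return sorted(hits, key=lambda h: h[2])


# Made-up sequence for illustration, not taken from any database entry.
demo_hits = find_signals("CCTATAAAAGGGGCGGAA")
print(demo_hits)
```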

EMBL data library

Feature misc_signal

Definition

any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).

Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /phenotype="text" /standard_name="text" /usedin=accnum:feature_label

EMBL data library

ID   MMIGHALP   standard; DNA; MUS; 17956 BP.
XX
AC   X96607;
...
FT   enhancer        4537..6107
FT                   /note="locus control region"
FT                   /note="alpha"
FT                   /gene="IgH"

ID   SSLCREG1   standard; DNA; MAM; 1190 BP.
XX
AC   X86793;
XX
SV   X86793.1
XX
DT   10-MAY-1995 (Rel. 43, Created)
DT   30-MAY-1995 (Rel. 43, Last updated, Version 3)
XX
DE   S.scrofa locus control region (1190 bp)
XX
KW   locus control region.
...
FT   source          1..1190
FT                   /chromosome="9"
FT                   /db_xref="taxon:9823"
FT                   /organism="Sus scrofa"
FT                   /clone_lib="clonetech"
FT                   /map="p2.4"
FT   -               5..1190
FT                   /note="locus control region (HSI)"

Eukaryotic Promoter Database (EPD)

Praz et al., Nucleic Acids Res. 30, 322-324 (2002) http://www.epd.isb-sib.ch

All EPD 4809

Vertebrate promoters 2540

Arthropod promoters 2000

Plant promoters 198

Viral 129

Nematode promoters 26


ID   HS_MYC_1   standard; single; VRT.
XX
AC   EP11146;
XX
DT   ??-APR-1987 (Rel. 11, created)
DT   10-OCT-2001 (Rel. 69, Last annotation update).
XX
DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 1, MYC gene.
OS   Homo sapiens (human).
XX
HG   Homology group 52; Mammalian c-myc proto-oncogene, promoter 1.
AP   Alternative promoter #1 of 2; exon 1; site 1.
NP   none.
XX
DR   EPD; EP11148; HS_MYC_2; alternative promoter; [+162; +].
DR   EPDEX; HS_MYC.
DR   EMBL; X00364.2; HSMYCC; [-2327, 8669]. [ EMBL; GenBank; DDBJ ]
...
DR   SWISS-PROT; P01106; MYC_HUMAN.
DR   TRANSFAC; R01157; HS$CMYC_01; [-49,-27]; by position.
...
DR   MIM; 190080.
XX
RN   [1]
RX   MEDLINE; 84026482.
RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA   Lenoir G., Leder P.;
RT   "The human c-myc oncogene: structural consequences of
RT   translocation into the IgH locus in Burkitt lymphoma";
RL   Cell 34:779-787(1983).
...
XX
ME   Nuclease protection [2].
ME   Nuclease protection; transfected or transformed cells [3].
ME   Primer extension [2].
XX
SE   aatctccgcccaccggccctttataatgcgagggtctggacggctgaggACCCCCGAGCT
XX
TX   6. Vertebrate promoters
TX   6.1. Chromosomal genes
TX   6.1.5. Hormones, growth factors, regulatory proteins
TX   6.1.5.16. Various cellular protooncogenes
XX
KW   Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
KW   Transcription regulation.
XX
FP   Hs c-myc P1 :+S EM:X00364.2 1+ 2328; 11146.052 010*1
XX
DO   Experimental evidence: 3,3#,6
DO   Expression/Regulation: +mitogen;+IL-2
RF   Cell34:779 PNAS80:6307 MCB7:1393 MCB7:2988
//

RegulonDB

Salgado et al., Nucleic Acids Res. 29, 72-74 (2001) http://www.cifn.unam.mx/Computational_Genomics/regulondb/

SCPD

Zhu & Zhang, Bioinformatics 15, 607-611 (1999) http://cgsigma.cshl.org/jian/

PlantCARE

Rombauts et al., Nucleic Acids Res. 27, 295-296 (1999) http://sphinx.rug.ac.be:8080/PlantCARE/cgi/index.html

Schematic representation of "Oligo-capping" method

TRRD

Kolchanov et al., Nucleic Acids Res. 30, 312-317 (2002) http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/

GENE

SITE FACTOR

MATRIX

encodes for

contains

binds to and regulates interacts

is used to construct

is an attribute of

TRANSFAC®

a database on gene transcription regulation

interactingfactor

coding regionregulatory region

gene

expression

SITE

FACTOR

GENE

SYNONYMS

FEATURES

CLASS SPECIES

MATRIX

SEQUENCE

METHODCELL Q

FUNCTIONAL ELEMENT

TRANSFAC structure

Manual annotation of the databases: input client

TRANSFAC: FACTOR table, protein sequence

TRANSFAC: FACTOR table, protein domains

TRANSFAC: FACTOR table, structural and functional features

TRANSFAC: FACTOR table, links to other databases

TRANSFAC: classification of transcription factors

TRANSFAC: CLASS table

TRANSFAC 8.1 (2004-03-31): number of factor entries for different species

[Bar chart. Categories: human, mouse, rat, other vertebrates, fruit fly, plants, fungi, other.]

TRANSFAC 8.1 (2004-03-31): distribution of experimentally known TFBS in 5' regions of genes.

[Histogram.]

TRANSFAC: FACTOR table, protein-DNA and protein-protein interactions

TRANSFAC: MATRIX table

TRANSFAC® : accompanying tools

Patch™: pattern search
Match™: PWM-based search

gATTGGCGCGAAGtttt
aCAGGGCGCCAAAcgcg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
GGCTGCGGCCAAAtctc
ATCTCCCGCCAGGtcag
aGTTCGCGGGCAAatgc
cTTCGGCGCGCGGtgtt
tTTTCGCGCCAAAgtca
tTTTGCCGCGAAAagac

Selection of DNA binding sites by regulatory proteins

Statistical-mechanical theory

O.G. Berg and P.H von Hippel

Mutational drift

Match

Mismatch

1) The binding affinity of the protein to DNA lies in some useful range.
2) The number of sequences is large.
3) All possible sequences are equiprobable.
4) $\varepsilon_{lB}$ expresses the decrease in binding energy when the cognate base pair at position $l$ is replaced by $B$.
5) Individual base-pair contributions are independent and therefore additive. A loss in binding affinity at one position may be compensated by a gain at another position.

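Example matrix of $\varepsilon_{lB}$ for positions $l = 1, 2, \ldots, s$ (the cognate base pair at each position has $\varepsilon_{l0} = 0$):

     1     2    ...  l  ...  s
A   0.5   0.9
T   0.0   0.1
G   1.2   0.0
C   0.1   0.8

Sites have binding affinity in a limited range $\Delta E$ around a required level $E$. In such a set of sites the local contributions $\varepsilon_{lB}$ from every position must sum to $E$.

What is the frequency $f_{lB}$ with which a certain base pair $B$ appears at a certain position $l$ in a site?

The same question is asked in statistical mechanics: given $s$ independent particles in a system and a given total energy $E$, what is the probability that particle $l$ will have the energy $\varepsilon_{lB}$?

$$f_{lB}(E) = \frac{1}{4\,q_l}\, e^{-\lambda \varepsilon_{lB}}$$

$\lambda$ is determined by the density of potential sites, i.e. by the number of possible sequence combinations that have the required discrimination energy $E$.

$$\lambda \varepsilon_{lB} = \ln\!\left(\frac{f^{obs}_{l0}}{f^{obs}_{lB}}\right)$$

For any sequence $X$ of length $s$ the actual discrimination energy is:

$$E(X) = \sum_{l=1}^{s} \varepsilon_{lB_l} = \frac{1}{\lambda} \sum_{l=1}^{s} \ln\!\left(\frac{f^{obs}_{l0}}{f^{obs}_{lB_l}}\right)$$

Small-sample effect

$$f_{lB} = \frac{n_{lB} + 1}{N + 4}, \qquad \lambda \varepsilon_{lB} = \ln\!\left(\frac{n_{l0} + 1}{n_{lB} + 1}\right)$$

where $n_{lB}$ is the number of sites with base pair $B$ at position $l$ and $N$ is the number of sites.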
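As a worked illustration of the small-sample formula, the sketch below estimates the discrimination energy in units of 1/lambda, using ln((n_l0 + 1)/(n_lB + 1)) per position, where n_l0 is the count of the most frequent base at position l. It is a sketch under stated assumptions: the four training sites are AP-1 sequences listed earlier in these slides, chosen only as a toy set, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

# Toy training set: four AP-1 site sequences from the table earlier in the slides.
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGATTTA"]


def count_matrix(sites):
    """Per-position base counts n_lB for equally long, aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must be aligned and of equal length")
    return [
        {b: sum(1 for s in sites if s[l] == b) for b in BASES}
        for l in range(length)
    ]


def lambda_epsilon(counts):
    """lambda * epsilon_lB = ln((n_l0 + 1) / (n_lB + 1)) for every position."""
    table = []
    for column in counts:
        n0 = max(column.values())  # count of the most frequent (cognate) base
        table.append({b: math.log((n0 + 1) / (n + 1)) for b, n in column.items()})
    return table


def lambda_energy(seq, table):
    """lambda * E(X): sum of the per-position contributions (additivity)."""
    return sum(table[l][b] for l, b in enumerate(seq))


TABLE = lambda_epsilon(count_matrix(SITES))
for candidate in ("TGACTCA", "TGAGTCA", "TGATTTA", "AAAAAAA"):
    print(candidate, round(lambda_energy(candidate, TABLE), 3))
```

The consensus sequence scores 0, and every substitution adds a positive term, so lower values indicate sequences closer to the training set. The pseudocount of 1 keeps bases that were never observed from receiving an infinite penalty.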

Problems:

1. Small sets of sites
2. Homology between sites
3. Specific function of nucleotides in certain positions
4. Correlations between positions (not additive effect)

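$$q = \frac{\sum_{i=1}^{L} I(i)\, f_{i,b_i} - \sum_{i=1}^{L} I(i)\, f_i^{min}}{\sum_{i=1}^{L} I(i)\, f_i^{max} - \sum_{i=1}^{L} I(i)\, f_i^{min}}$$

with: $b_i$, nucleotide $b$ found in the $i$-th position of the test sequence; $f_{i,b_i}$, frequency of nucleotide $b$ in the $i$-th position of the aligned training sequences; $f_i^{min}$, minimum frequency in position $i$; $f_i^{max}$, maximum frequency in position $i$; and

$$I(i) = \sum_{B \in \{A, T, G, C\}} f_{i,B} \ln(4 f_{i,B}), \qquad i = 1, 2, \ldots, L$$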
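A score of this kind can be sketched in a few lines. The code below assumes the normalised form q = (current - min) / (max - min), where each sum is weighted by the information vector I(i) = sum over B of f ln(4 f); this is an illustrative reading of the formula, not the Match implementation. The training sites are AP-1 sequences from earlier in the slides, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

# Toy training set (AP-1 sites listed earlier in the slides).
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGATTTA"]


def frequency_matrix(sites):
    """Per-position base frequencies f_iB of the aligned training sites."""
    n = len(sites)
    return [
        {b: sum(1 for s in sites if s[i] == b) / n for b in BASES}
        for i in range(len(sites[0]))
    ]


def information_vector(freqs):
    """I(i) = sum_B f_iB * ln(4 * f_iB), with the convention 0 * ln(0) = 0."""
    return [
        sum(f * math.log(4 * f) for f in column.values() if f > 0)
        for column in freqs
    ]


def matrix_similarity(seq, freqs):
    """q in [0, 1]: 1 for the best-matching sequence, 0 for the worst."""
    info = information_vector(freqs)
    current = sum(w * col[b] for w, col, b in zip(info, freqs, seq))
    lowest = sum(w * min(col.values()) for w, col in zip(info, freqs))
    highest = sum(w * max(col.values()) for w, col in zip(info, freqs))
    if highest == lowest:
        raise ValueError("matrix carries no information")
    return (current - lowest) / (highest - lowest)


FREQS = frequency_matrix(SITES)
for candidate in ("TGACTCA", "TGAGTCA", "AACAAAC"):
    print(candidate, round(matrix_similarity(candidate, FREQS), 3))
```

Weighting by I(i) means that a mismatch at a fully conserved position costs more than a mismatch at a variable one, which is the design choice the formula encodes.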

TFS identification

Calculating the Ci-values

$$C_i = \frac{100}{\ln 5} \left( \sum_{B \in \{A, C, G, T, gap\}} P(i,B) \ln P(i,B) + \ln 5 \right)$$

Position 1 2 3 4 5 6 7 8 9 10 11

Counts
A   1 2 0 0 4 0 0 1 5 2 2
C   1 2 0 0 1 3 0 4 0 0 2
G   1 1 0 5 0 2 0 0 0 3 1
T   1 0 5 0 0 0 5 0 0 0 0
gap 1 0 0 0 0 0 0 0 0 0 0

Relative frequencies
P(A)   0.2 0.4 0 0 0.8 0 0 0.2 1 0.4 0.4
P(C)   0.2 0.4 0 0 0.2 0.6 0 0.8 0 0 0.4
P(G)   0.2 0.2 0 1 0 0.4 0 0 0 0.6 0.2
P(T)   0.2 0 1 0 0 0 1 0 0 0 0
P(gap) 0.2 0 0 0 0 0 0 0 0 0 0

Terms P(B) ln P(B)
A   -0.32 -0.37 0 0 -0.18 0 0 -0.32 0 -0.37 -0.37
C   -0.32 -0.37 0 0 -0.32 -0.31 0 -0.18 0 0 -0.37
G   -0.32 -0.32 0 0 0 -0.37 0 0 0 -0.31 -0.32
T   -0.32 0 0 0 0 0 0 0 0 0 0
gap -0.32 0 0 0 0 0 0 0 0 0 0

Sum of P(B) ln P(B) + ln(5)
0.00 0.55 1.61 1.61 1.11 0.94 1.61 1.11 1.61 0.94 0.55

Ci 0 34 100 100 69 58 100 69 100 58 34

Position 1 2 3 4 5 6 7 8 9 10 11
A 1 2 0 0 4 0 0 1 5 2 2
C 1 2 0 0 1 3 0 4 0 0 2
G 1 1 0 5 0 2 0 0 0 3 1
T 1 0 5 0 0 0 5 0 0 0 0
- 1 0 0 0 0 0 0 0 0 0 0

Ci 0 34 100 100 69 58 100 69 100 58 34

Core: positions 3 to 7 (T G A C T), the most conserved stretch of the matrix.

To make it fast: preselection with the core, then scoring of the match.
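The Ci values and the core can be recomputed from the count table. The sketch below assumes that the core is the window of five consecutive positions with the highest summed Ci; the window width and the function names are assumptions made for this illustration.

```python
import math

# Count table from the slide (five sequences, gap counted as a fifth symbol).
COUNTS = {
    "A":   [1, 2, 0, 0, 4, 0, 0, 1, 5, 2, 2],
    "C":   [1, 2, 0, 0, 1, 3, 0, 4, 0, 0, 2],
    "G":   [1, 1, 0, 5, 0, 2, 0, 0, 0, 3, 1],
    "T":   [1, 0, 5, 0, 0, 0, 5, 0, 0, 0, 0],
    "gap": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def ci_vector(counts):
    """Ci = 100 / ln 5 * (sum_B P(i,B) ln P(i,B) + ln 5) for every position."""
    n_positions = len(next(iter(counts.values())))
    values = []
    for i in range(n_positions):
        column = [counts[symbol][i] for symbol in counts]
        total = sum(column)
        entropy_term = sum((n / total) * math.log(n / total) for n in column if n > 0)
        values.append(100 / math.log(5) * (entropy_term + math.log(5)))
    return values


def core_start(ci, width=5):
    """0-based start of the window of `width` positions with the largest Ci sum."""
    sums = [sum(ci[i:i + width]) for i in range(len(ci) - width + 1)]
    return sums.index(max(sums))


def consensus(counts, start, width):
    """Most frequent nucleotide at each position of the window."""
    return "".join(
        max("ACGT", key=lambda b: counts[b][i]) for i in range(start, start + width)
    )


CI = ci_vector(COUNTS)
START = core_start(CI)
print([round(c) for c in CI])
print("core positions", START + 1, "to", START + 5, consensus(COUNTS, START, 5))
```

Scanning a sequence for the short core first and scoring the full matrix only where the core matches is what makes the search fast.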

TRANSFAC: MatchTM tool

TRANSFAC: MatchTM output

Selection of optimal cut-offs

[Plot: underprediction error, overprediction error and error sum (0 to 100) against the cut-off (0.75 to 1). Marked cut-offs: minFN, minFP, minSUM.]

[Figure: feature table of the GenBank entry next to the corresponding hits found by Match. Columns of the Match output: matrix identifier, position, matrix similarity, core similarity, sequence, factor name.]

Example of a search using cut-offs to minimize false negative matches

In this example we searched the Homo sapiens angiotensinogen gene (5' region and exon 1) for all binding sites listed in the features of its GenBank entry. For that search we used cut-offs to minimize false negative matches, as these cut-offs are recommended to reduce the probability that Match misses a potential binding site. Corresponding hits for all of the entries in the feature table that concern a binding site could be found in the Match output.

TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences. Currently, TRANSPLORER site prediction tool uses position weight matrices (PWM) collections. It is able to use several matrix sources: the largest and most up-to-date library of matrices derived from TRANSFAC® Professional database, other matrix libraries as well as any user-developed matrix libraries. This means that it provides an opportunity to search for a great variety of different transcription factor binding sites. A search can be made using all or subsets of matrices from the libraries.

Search for most probable binding sites regulating gene expression

Search for binding sites coinciding with SNPs

•pairs of closely situated binding sites for TFs;

•cooperative functioning of transcription factors;

•direct protein-protein interactions;

•combinatorial regulation of gene transcription.

Key topics

TRANSCompel®

a database on composite regulatory elements

individual entry

Description of an evidence (experiment, cell type, two individual interactions)

Link to the TRANSFAC GENE

table

Link to the EMBL

Link to the TRANSFAC

FACTOR table

N   Gene                                   Scheme of CE (positions)    Factors
1   IgH, Mus musculus                      (not given)                 AP-1 / Ets
2   IL-2, Homo sapiens                     -283, -268                  AP-1 / NFAT
3   IL-2, Homo sapiens                     -167, -142                  AP-1 / NF-κB
4   IL-2, Mus musculus                     -167, -142                  AP-1 / Oct-2
5   IgH, Homo sapiens                      (not given)                 Ets / CBF
6   Serum amyloid A1, Rattus norvegicus    -117, -73                   NF-κB / C/EBP
7   IRF-1, Mus musculus                    -123, -113, -49, -40        NF-κB / STAT-1

TRANSCompel®

combinatorial regulation, more than 360 CEs

TRANSCompel®

functional classification of the composite elements

inducible/inducible
- Ca2+ and PKC response: NFAT / AP1
- IFN-gamma and TNF-alpha response: NF-kappaB / IRF

inducible/constitutive
- cholesterol level response: SREBP / Sp1
- acute-phase response: STAT-3 / Sp1

inducible/tissue-restricted
- TGF-beta response in B-cells: SMAD / AML

tissue-restricted/tissue-restricted
- pancreas islet beta-cells (insulin-producing): HNF3 / BETA2
- pituitary gonadotropes: Ptx1 / SF-1

tissue-restricted/ubiquitous
- macrophages: PU.1 / Sp1

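[Figure: matrix of composite element counts by functional class of the two factors (F1 x F2). Classes: tissue-specific, inducible, cell-cycle dependent, developmental stage-dependent, ubiquitous constitutive. Values as extracted, by row: tissue-specific 32; inducible 44, 119; cell-cycle dependent 1, 2; dev. stage-dependent 3; ubiquitous constitutive 39, 60, 2, 12; one further cell 2.]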

Inducible/inducible

19 CEs ETS / AP-1, providing cross-coupling of Ras/Raf- and PKC-dependent signalling pathways;

15 CEs NFATp / AP-1, providing cross-coupling of Ca2+- and PKC-dependent signalling pathways;

14 CEs NF-κB / C/EBP: NF-κB is inducible by IL-1 and TNF-alpha; C/EBP is inducible by IL-6.


Inducible/constitutive

9 CEs ETS / Sp1: ETS factors are inducible through the Ras/Raf-dependent signalling pathway;

5 CEs Smad / TEF3: Smads are inducible by TGF-beta signalling.


Inducible/tissue-restricted

CE‘s Pit-1 / AP-1 Pit1 is pituitary-restricted transcription factor whereas AP-1 and Ets are ubiquitous inducible factors;


TRANSCompel®

antagonistic type of CEs

COMPEL: C00006, chicken embryonic -globin gene
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA
Factors: Sp1, NF-Y, NF-1
Sp1 cooperatively with NF-Y activates transcription in primitive erythroid cells.
NF-1 represses transcription in adult cells.

COMPEL: C00009, human c-fos proto-oncogene
acaggaTGTCCATATTAGGacatctgcg
Factors: YY-1, SRF
SRF mediates the rapid, transient induction of the c-fos proto-oncogene by serum growth factors.
YY1 diminishes both basal and serum-induced expression of c-fos.

COMPEL: C00054, rat serum amyloid A1 gene
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg
Factors: C/EBP, NF-κB, YY-1
C/EBP and NF-κB synergistically activate transcription in liver cells during the acute phase response.
YY1 represses inducible transcription of this gene.

Antagonistic composite elements

pattern-based search for potential composite elements in DNA sequences

• All CE‘s are used as individual searching patterns;

• Several parameters are available restricting the search:

mismatches in the site 1 and site 2,

distance between two sites,

composite score

Catch®

CCACCCATTTCCTC

ACAGGAATgacctggtgcCTCGCCC

TTCCTCctgtgccttag...ctgtttttctaaCCGCCC

GAAGGGCGGGGAcagtt...aagcaaaaAAAGGGAACTGA

AAAGGGAACTGAgtggctgcgaaAGGGTGGGG

GGAAgcaaccagCCCACCA

CCGGAAGCaaccagCCCACC

aaAAGGAAGTGGGCGTGGTttaaag

ACTTCCTC...GGCTCCTCCTCC

Set of CEs

M1

qM1 > n1

M2

qM2 > n2

1. matrix rule

2. distance rule

3. orientation rule

Search for the potential CEs

rules
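The three rules can be combined into a simple pair filter over single-site hits. The sketch below is not the Catch or TRANSCompel implementation: the `Hit` record, the function name, and the thresholds are hypothetical, and the hits are assumed to come from a prior matrix search (scores q for matrices M1 and M2).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    matrix: str   # matrix or factor name
    start: int    # 1-based start on the + strand
    end: int      # 1-based end, inclusive
    strand: str   # "+" or "-"
    score: float  # matrix similarity q


def find_composite_candidates(
    hits1: List[Hit],
    hits2: List[Hit],
    n1: float,
    n2: float,
    min_dist: int,
    max_dist: int,
    same_strand: Optional[bool] = None,
) -> List[Tuple[Hit, Hit, int]]:
    """Pairs of hits that pass the matrix, distance and orientation rules."""
    pairs = []
    for h1 in hits1:
        if h1.score <= n1:            # 1. matrix rule: q_M1 > n1
            continue
        for h2 in hits2:
            if h2.score <= n2:        # 1. matrix rule: q_M2 > n2
                continue
            # 2. distance rule: gap between non-overlapping sites
            if h1.end < h2.start:
                gap = h2.start - h1.end - 1
            elif h2.end < h1.start:
                gap = h1.start - h2.end - 1
            else:
                continue              # overlapping sites are skipped here
            if not (min_dist <= gap <= max_dist):
                continue
            # 3. orientation rule (None means any orientation is accepted)
            if same_strand is not None and (h1.strand == h2.strand) != same_strand:
                continue
            pairs.append((h1, h2, gap))
    return pairs


# Made-up hits for illustration.
ap1_hits = [Hit("AP-1", 10, 16, "+", 0.95), Hit("AP-1", 100, 106, "-", 0.80)]
nfat_hits = [Hit("NFAT", 20, 25, "+", 0.90), Hit("NFAT", 300, 305, "+", 0.99)]
candidates = find_composite_candidates(ap1_hits, nfat_hits, 0.85, 0.85, 0, 10)
print(candidates)
```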

TRANSCompel®

CEs of similar structure can be used to construct models

Application of CE models for promoter analysis

Four CE types are over-represented in promoters in comparison with several biological sequences tested.

[Bar chart: occurrences per 1,000,000 bp (axis 0 to 180) of four CE types (Myb/Aml, NF-kB/Sp1, Ets/Sp1, E2F/Sp1) in six sequence sets: promoters -350/+50, exons_3d, h_chr_15 (whole), h_chr_15_Alu, h_chr_15_L1, h_chr_15_L2.]

Gene expression profiling

GENE ONTOLOGYTM

http://www.geneontology.org/


[Figure labels: genome (reference sequence), gene, transcripts, splicing variants, polypeptides, regulatory regions, composite elements, TF binding sites, repeats, S/MARs.]

TRANSGENOME provides the hierarchical structure of the most important elements of a genome in coding regions as well as in regulatory regions. This structure provides the possibility to have a unique reference sequence and to store the location of all gene regulatory and structural elements.

TRANSGENOME

Gene

pre-mRNAs (from RefSeq)

spliced mRNAs

CDS

site

5'UTR, 3'UTR

RefSeq derived potential starts of transcription (first exons)

TRANSFAC derived start of transcription (by relative site positions)

EPD derived start of transcription

DBTSS derived start of transcription

Bronchial tree and Intrapulmonary Airways

Lobar bronchus

Segmental bronchus

Bronchus

Bronchiolus

Terminal bronchiolus Alveolar sac

Respiratory bronchiolus

Pulmonary alveolus

Alveolar duct

Alveolar pore Alveolar

epithelium

Alveolar septa

PneumocytesCytomer/Content

Human body Lung Bronchial treeMain bronchus

CYTOMER structure

[Figure: relational schema. Tables and fields as labelled: Species (ID, Name); Cell (ID, Name, Description); Organ (ID, Name, Parent); System (ID, Name); HUB (ID, Cytomer_no, Organ_ID, Cell_ID, System_ID, Period_ID, Species_ID); Stage2Period (Stage_ID, Period_ID); CP (ID, TFacc, Cytomer_no, Cacc, CP); CN (ID, TFacc, Cytomer_no, Cacc, CN); Transfac Factor (ID, Acc); Period (ID, T1, T2); Stage (ID, T1, T2, description).]

human body abdominal regions acetabular artery acetabular ramus anastomatic vessel aortic plexus apocrine gland blood blood vessel body cavities bronchi cardiac thoracic nerve cartilage cartilaginous tissue of anular radial ligament central nervous system

brain brain stem meninges of brain mesencephalon

cerebral peduncle base of peduncle cerebral crus lateral groove of midbrain substantia nigra

compact part of substantia nigra lateral part of substantia nigra reticular part of substantia nigra retrorubral part of substantia nigra

tegmentum of midbrain of cerebral peduncle frenulum of superior medullary velum of midbrain interpeduncular fossa

TRANSGENOMEEST

UniGene

Gene expression group 1

Gene expressiongroup 2

Gene expressiongroup 3

CYTOMER® A database on gene expression sources

Spatio-temporal coordinates:CYTOMER

Conditional determinants:TRANSPATH

Factors controlling transcription:TRANSFAC

Expression space E

Expression space Eg of gene g

t(temporal, developmental axis)

(spatial axis:systems,

organs,cells)

x

c(conditional)

The gene expression space


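Expression matrix:
- rows representing genes
- columns representing samples (various tissues, developmental stages, ...)

$$\begin{array}{c|cccc} & E_1 & E_2 & \cdots & E_S \\ \hline g_1 & x_{1,1} & x_{2,1} & \cdots & x_{S,1} \\ g_2 & x_{1,2} & x_{2,2} & \cdots & x_{S,2} \\ \vdots & \vdots & \vdots & & \vdots \\ g_h & x_{1,h} & x_{2,h} & \cdots & x_{S,h} \end{array}$$

A row is the expression pattern of a gene (e.g. of gene g1). A column is the expression profile of an expression state (e.g. of state E1, in organ O at stage t).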
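The two ways of reading an expression matrix can be shown with a small nested list. This is a toy sketch with made-up values; the gene and state labels and the helper names are hypothetical.

```python
# Rows are genes, columns are expression states (samples).
GENES = ["g1", "g2", "g3"]
STATES = ["E1", "E2"]
X = [
    [5.0, 0.5],  # g1
    [1.2, 3.3],  # g2
    [0.0, 7.1],  # g3
]


def expression_pattern(matrix, genes, gene):
    """One row: how a single gene behaves across all states."""
    return matrix[genes.index(gene)]


def expression_profile(matrix, states, state):
    """One column: how all genes behave in a single state."""
    column = states.index(state)
    return [row[column] for row in matrix]


pattern_g1 = expression_pattern(X, GENES, "g1")
profile_e1 = expression_profile(X, STATES, "E1")
print(pattern_g1, profile_e1)
```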

