Regulatory Sequences
(Basics)
Alexander Kel
Senior Vice President of Genome Informatics, BIOBASE GmbH, Halchtersche Strasse 33, D-38304 Wolfenbuettel, Germany www.biobase.de
TRANSCompel
TRANSFAC
TRANSPATH
Patho DBS/MARt DB
- mechanistic- semantic
Match Patch
Catch
Pathway builder Array analyser
Cytomer TRANSGenome TRANSPLORER
CMFinder
BIOBASE customers*
* not complete
TRANSFAC: Syngenta, Celera, Monsanto, Pfizer, Merck Sharp & Dome, Amgen, Takeda, Novartis, GlaxoSmithKline
TRANSPATH: Vertex
Both: Aventis, Eli Lilly, Schering Plough, Hoffmann La Roche, Akzo Nobel
More than 200 academic labs including:
Harvard Stanford Tokyo University Riken Labs Max Planck
More than 7000 registered users on our portal
gene-regulation.com
Same blocks - different structures
LEGO system
Concepts of gene regulation
DNA
RNA
protein
transcription
translation
amplification, methylation, chromatin structure
splicing, degradation
modification, degradation
information carrier 1
transformation
carrier organization
information carrier 2
TRANSFAC: Gene structure
[Figure labels: Contig, Gene, Regulatory Elements, Transcription, primary transcript (5' to 3'), Splicing, altern. exon, Splice Variants, mRNA with 5'-UTR, CDS and 3'-UTR]
[Figure labels: promoter, enhancer 1, enhancer 2, TSS, TATA box, initiator (Inr), box A, box B, box C, box A', box D, box D', box E, box F, box G, box A'', composite element, cis, trans]
General schema of the modular hierarchical structure of transcription regulatory regions of eukaryotic genes.
Human genes: sequences and positions of AP-1 binding sites
Gene: position
glutathione P-transferase: enhancer at -2500
hemoglobin, epsilon: -80 bp
Akt-2: -100 bp
IFN-: -89 bp
Apo AII: -792 bp
Melanotransferrin: -2013 bp
Collagenase: -72 bp
proto-oncogene c-myc: -335 bp
porphobilinogen deaminase: -162 bp
GM-CSF: enhancer at -3500
Site sequences:
TGACTTT
TGACATC
TGTCACC
TGACTCA
TGAGTCA
TGAGTCA
TGATTTA
TGACTCA
TGACTCA
What is a transcription factor?
A transcription factor is a protein that regulates transcription
after nuclear translocation
by specific interaction with DNA
or by stoichiometric interaction with a protein that can be assembled
into a sequence-specific DNA-protein complex.
Transcription factors
Sequence-specific DNA binding
Non-DNA binding
TF1 TF2 TF3 TF4
adapter
Co-activator
HAT
DNA
Layer I
Layer III
Layer II
Structure of transcription factors
USF-1, dimer
DNA binding domain
Activation domain
oligomerization domain
Ligand- binding domain
Protein-protein interaction domain
Structure of transcription factors
N. Gene | Factors | Positions of the CE | TRANSCompel accession number
1. Scavenger receptor, Homo sapiens | AP-1, Ets | enhancer -4500/-4100 | C00080
2. GM-CSF, Mus musculus | AP-1, Ets | -53, -40 | C00081
3. Collagenase, Homo sapiens | AP-1, Ets | -89, -82, -72, -66 | C00083
4. IgH, Mus musculus | AP-1, Ets | enhancer at 3' flank | C00133
5. Interleukin 2, Homo sapiens | AP-1, NFAT | -283, -268 | C00109
6. Interleukin 2, Homo sapiens | AP-1, NF-kappaB | -167, -142 | C00165
7. Interleukin 2, Mus musculus | AP-1, Oct-2 | -167, -142 | C00158
8. IgH, Homo sapiens | Ets, CBF | | C00173
9. Serum amyloid A1, Rattus norvegicus | NF-kappaB, C/EBP | -117, -73 | C00101
10. IRF-1, Mus musculus | NF-kappaB, STAT-1 | -123, -113, -49, -40 | C00192
Ternary complex NFATp - AP1 - DNA
[Figure: three panels with factors F1 and F2; two panels are labelled "Low level of transcription", one panel is labelled "Synergistic activation of transcription"]
Composite elements
Minimal functional units where both protein-DNA and protein-protein interactions contribute to a highly specific pattern of gene expression and provide cross-coupling of different signal transduction pathways.
[Figure labels: Membrane receptor, Src, SH3, SH2, Ras (GDP, GTP), Adaptors, PLC, PI3-K, Phosphorylation, IP3, Ca2+, Ca2+-dependent channel, Calcineurin, ERK, JNK, p38 MAPK, PKB/Akt, NFATp, P, c-Fos, c-Jun, ATF-2, IL-2, Composite element, cytoplasm, nucleus]
Integration of signals. Cross-coupling of signal transduction pathways
Mechanisms of functioning of synergistic composite elements
(in the schemes, F1 and F2 are the factors and S1 and S2 their sites)
1) Cooperative binding to DNA and ternary complex formation
2) A new protein surface for DNA recognition could be formed
3) Simultaneous interaction of activation domains with the components of the basal complex
4) Forming a new protein surface for interaction with the basal complex
5) Relief of autoinhibition as a result of protein-protein interactions
6) DNA bending by one of the transcription factors
7) DNA wrapping around a nucleosome allows transcription factors to interact
8) Recruitment of a HAT complex by one of the transcription factors
Mechanisms of functioning of antagonistic composite elements
(scheme labels: HAT complex, HDAC complex)
1) Mutually exclusive binding of factor F1 (activator) and F2 (repressor)
2) Binding of F2 (repressor) results in conformational changes of F1 (activator)
Mouse IL-4 promoter
[Figure labels: AP-1, NFAT, HMG Y, STAT 6, NF-Y, c-MAF, TATA, CE, ST; positions -249, -180, -150, -114, -88, -60, -28, +1]
GM-CSF Homo sapiens
[Figure labels: T-cell specific inducible enhancer at -3500 bp, Promoter, AP-1, NFAT, CE, CD28 response element, NF-kappaB p50/p65, NF-kappaB c-Rel/p65, HMG Y(I), CBF, TATTT, ST; positions -114, -88, -54, +1]
Recruitment of CIITA to MHC-II promoters. A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the W, X, X2, and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind cooperatively to these sequences and assemble into a stable higher order nucleoprotein complex referred to here as the MHC-II enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the W, X, X2, and Y-binding factors. The octamer site found in the HLA-DRA promoter (O), and its cognate activators (Oct and OBF-1) are not required for recruitment of CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation domains (AD), which contact the RNA polymerase II basal transcription machinery.
Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66
Enhanceosome
[Figure labels: TFIIA, TFIIB, TFIID, TFIIE, TFIIF, TFIIH, RNA pol II, Site-specific TF, Co-activator p300/CBP, Acetylase PCAF, Closed nucleosomes, Acetylation]
Scaffold/matrix attached regions (S/MARs) are regions of the DNA strand that are found at the base of chromatin loops. They anchor the DNA to the proteinaceous nuclear matrix.
Each loop is considered to be a functional domain.
S/MARs may act as border elements and thus, protect gene expression from position effects.
[Figure labels: S/MARs, genes, residual DNA, enhancer, promoter, gene (transcribed region), SAR, LCR, open chromatin, compact chromatin (regulated), nuclear scaffold. J. Bode / E. Wingender 1993]
S/MARs
Databases on gene regulation
• Clear identification of where you are (which species and which protein).
• Tabular presentation of controlled-vocabulary terms.
• Annotations linked to PubMed references.
• Clear paths of navigation between protein reports, within a species and between species.
• Links to ‘public domain’ databases.
BKL: collected information is displayed in a ‘one page per protein’ format = Protein Reports
Databases containing gene regulation information
N. Database: URL
1. EMBL Nucleotide sequence database: http://www.ebi.ac.uk/embl.html
2. GenBank: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
3. SWISS-PROT: http://www.expasy.ch
4. PIR (Protein Information Resource): http://www-nbrf.georgetown.edu/pir
5. PDB: http://www.pdb.bnl.gov/
6. EPD (Eukaryotic promoter database): http://www.epd.isb-sib.ch
7. TRANSFAC: http://transfac.gbf.de/TRANSFAC
8. TRRD: http://www.bionet.nsc.ru/trrd/
9. COMPEL: http://compel.bionet.nsc.ru/
10. TFD (Transcription factor database): http://www.ifti.org/
11. RegulonDB: http://www.cifn.unam.mx/Computational_Biology/regulondb
12. SCPD (The Promoter Database of Saccharomyces cerevisiae): http://cgsigma.cshl.org/jian/
13. Muscle-Specific Regulation of Transcription (A Catalogue of Regulatory Elements): http://agave.humgen.upenn.edu/MTIR/HomePage.html
14. EpoDB (database of genes that relate to vertebrate red blood cells): http://agave.hum-gen.upenn.edu/epodb/
15. GENET: http://www.iephb.ru/~spirov/genet00.html
16. PlantCARE: http://sphinx.rug.ac.be:8080/PlantCARE/
17. PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE/
18. DBTSS: http://dbtss.hgc.jp/
EMBL data library
Feature gene
Definition region of biological interest identified as a gene and for which a name has been assigned;
Optional Qualifiers
/allele="text" /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /label= /map="text" /note="text" /product="text" /pseudo /phenotype="text" /standard_name="text" /usedin=accnum:feature_label
Comments the gene feature describes the interval of DNA that corresponds to a genetic trait or phenotype; the feature is, by definition, not strictly bound to its positions at the ends; it is meant to represent a region where the gene is located.
EMBL data library
Feature promoter
Definition region on a DNA molecule involved in RNA polymerase binding to initiate transcription;
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /phenotype="text" /pseudo /standard_name="text" /usedin=accnum:feature_label
Molecule Scope DNA
or look for: (start of) mRNA, or precursor_RNA, or prim_transcript, or exon /number=1, ...
EMBL data library
Feature misc_feature
Definition region of biological interest which cannot be described by any other feature key; a new or rare feature;
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /number=unquoted /phenotype="text" /product="text" /pseudo /standard_name="text" /usedin=accnum:feature_label
Comments this key should not be used when the need is merely to mark a region in order to comment on it or to use it in another feature's location; use the '-' pseudo-key instead.
e.g.:
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
EMBL data library
Feature enhancer
Definition a cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter;
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /label=feature_label /gene="text" /map="text" /note="text" /standard_name="text" /usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
EMBL data library
Feature protein_bind
Definition non-covalent protein binding site on nucleic acid;
Mandatory Qualifiers /bound_moiety="text"
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /standard_name="text" /usedin=accnum:feature_label
Comments note that RBS is used for ribosome binding sites.
EMBL data library
Qualifier bound_moiety
Definition moiety bound
Value Format "text"
Example /bound_moiety="repressor"
Qualifier usedin
Definition indicates that the feature is used in a compound feature in another entry
Value Format Accession-number:feature-name or Database_name::Acc_number:feature_label
Example /usedin=X10087:proteinx
Comment database_name is an abbreviation for the name of the database in which the entry for the accession number can be found.
EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
These are experimentally confirmed sites, though no /evidence qualifier is given.
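As a rough illustration of how such a feature table can be mined for binding sites, the following minimal sketch collects the `protein_bind` features and their `/bound_moiety` values from the CDC6 example. The function name `parse_features` is hypothetical, and the sketch assumes one qualifier per `FT` line, which holds for this excerpt but not for EMBL entries in general.

```python
import re

# Feature-table lines of the CDC6 example shown on the slide.
SAMPLE = '''\
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
'''


def parse_features(text):
    """Return a list of {'key', 'location', 'qualifiers'} dicts (hypothetical helper).

    Assumption: every qualifier fits on a single FT line.
    """
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            match = re.match(r"/(\w+)(?:=(.*))?$", body)
            if match and features:
                name, value = match.group(1), match.group(2)
                features[-1]["qualifiers"][name] = value.strip('"') if value else True
        else:
            # Feature line: key followed by its location.
            parts = body.split(None, 1)
            location = parts[1] if len(parts) > 1 else ""
            features.append({"key": parts[0], "location": location, "qualifiers": {}})
    return features


# Locations and bound factors of all protein_bind features.
sites = [
    (f["location"], f["qualifiers"].get("bound_moiety"))
    for f in parse_features(SAMPLE)
    if f["key"] == "protein_bind"
]
print(sites)
```

Because the `/evidence` qualifier is optional, a parser of this kind cannot tell confirmed from predicted sites unless the annotator supplied it.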
EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..3204
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   promoter        1..3201
FT                   /note="melanocortin-1 receptor"
FT                   /gene="MC1R"
FT   misc_signal     570..575
FT                   /note="E-BOX"
...
FT   TATA_signal     922..941
FT   protein_bind    1343..1350
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-1"
...
FT   TATA_signal     1553..1559
...
FT   misc_binding    1957..1964
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2060..2067
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2069..2074
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
FT   misc_binding    2603..2608
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
Here: misc_signal "E-BOX" and TATA_signal are identified by homology and positional reasoning, AP-1 and AP-2 binding sites are suggested by homology, and Sp1 sites are confirmed by gel shift analysis.
EMBL data library
Feature TATA_signal
Definition
TATA box; Goldberg-Hogness box; a conserved AT-rich septamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcript unit which may be involved in positioning the enzyme for correct initiation; consensus=TATA(A or T)A(A or T) [1,2];
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /map="text" /note="text" /usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
Molecule Scope DNA
References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980) [2] Corden, J., et al. "Promoter sequences of eukaryotic protein-encoding genes" Science 209, 1406-1414 (1980)
EMBL data library
Feature CAAT_signal
Definition CAAT box; part of a conserved sequence located about 75 bp up-stream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT [1,2].
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /gene="text" /map="text" /note="text" /usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
Molecule Scope DNA
References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980) [2] Nevins, J.R. "The pathway of eukaryotic mRNA formation" Ann Rev Biochem 52, 441-466 (1983)
Feature GC_signal
Definition GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation; consensus=GGGCGG;
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /gene="text" /label=feature_label /map="text" /note="text" /usedin=accnum:feature_label
EMBL data library
Feature misc_signal
Definition
any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).
Optional Qualifiers /citation=[number] /db_xref="<database>:<identifier>" /evidence=<evidence_value> /function="text" /gene="text" /label=feature_label /map="text" /note="text" /phenotype="text" /standard_name="text" /usedin=accnum:feature_label
EMBL data library
ID   MMIGHALP   standard; DNA; MUS; 17956 BP.
XX
AC   X96607;
...
FT   enhancer        4537..6107
FT                   /note="locus control region"
FT                   /note="alpha"
FT                   /gene="IgH"

ID   SSLCREG1   standard; DNA; MAM; 1190 BP.
XX
AC   X86793;
XX
SV   X86793.1
XX
DT   10-MAY-1995 (Rel. 43, Created)
DT   30-MAY-1995 (Rel. 43, Last updated, Version 3)
XX
DE   S.scrofa locus control region (1190 bp)
XX
KW   locus control region.
...
FT   source          1..1190
FT                   /chromosome="9"
FT                   /db_xref="taxon:9823"
FT                   /organism="Sus scrofa"
FT                   /clone_lib="clonetech"
FT                   /map="p2.4"
FT   -               5..1190
FT                   /note="locus control region (HSI)"
Eukaryotic Promoter Database (EPD)
Praz et al., Nucleic Acids Res. 30, 322-324 (2002) http://www.epd.isb-sib.ch
Number of entries:
All EPD: 4809
Vertebrate promoters: 2540
Arthropod promoters: 2000
Plant promoters: 198
Viral: 129
Nematode promoters: 26
Eukaryotic Promoter Database (EPD)
ID   HS_MYC_1   standard; single; VRT.
XX
AC   EP11146;
XX
DT   ??-APR-1987 (Rel. 11, created)
DT   10-OCT-2001 (Rel. 69, Last annotation update).
XX
DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 1, MYC gene.
OS   Homo sapiens (human).
XX
HG   Homology group 52; Mammalian c-myc proto-oncogene, promoter 1.
AP   Alternative promoter #1 of 2; exon 1; site 1.
NP   none.
XX
DR   EPD; EP11148; HS_MYC_2; alternative promoter; [+162; +].
DR   EPDEX; HS_MYC.
DR   EMBL; X00364.2; HSMYCC; [-2327, 8669]. [ EMBL; GenBank; DDBJ ]
...
DR   SWISS-PROT; P01106; MYC_HUMAN.
DR   TRANSFAC; R01157; HS$CMYC_01; [-49,-27]; by position.
...
DR   MIM; 190080.
XX
RN   [1]
RX   MEDLINE; 84026482.
RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA   Lenoir G., Leder P.;
RT   "The human c-myc oncogene: structural consequences of
RT   translocation into the IgH locus in Burkitt lymphoma";
RL   Cell 34:779-787(1983).
...
XX
ME   Nuclease protection [2].
ME   Nuclease protection; transfected or transformed cells [3].
ME   Primer extension [2].
XX
SE   aatctccgcccaccggccctttataatgcgagggtctggacggctgaggACCCCCGAGCT
XX
TX   6. Vertebrate promoters
TX   6.1. Chromosomal genes
TX   6.1.5. Hormones, growth factors, regulatory proteins
TX   6.1.5.16. Various cellular protooncogenes
XX
KW   Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
KW   Transcription regulation.
XX
FP   Hs c-myc P1 :+S EM:X00364.2 1+ 2328; 11146.052 010*1
XX
DO   Experimental evidence: 3,3#,6
DO   Expression/Regulation: +mitogen;+IL-2
RF   Cell34:779 PNAS80:6307 MCB7:1393 MCB7:2988
//
RegulonDB
Salgado et al., Nucleic Acids Res. 29, 72-74 (2001) http://www.cifn.unam.mx/Computational_Genomics/regulondb/
SCPD
Zhu & Zhang, Bioinformatics 15, 607-611 (1999) http://cgsigma.cshl.org/jian/
PlantCARE
Rombauts et al., Nucleic Acids Res. 27, 295-296 (1999) http://sphinx.rug.ac.be:8080/PlantCARE/cgi/index.html
Schematic representation of "Oligo-capping" method
TRRD
Kolchanov et al., Nucleic Acids Res. 30, 312-317 (2002) http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/
TRANSFAC®
a database on gene transcription regulation
[Figure: relations between the tables GENE, SITE, FACTOR and MATRIX. Relation labels: encodes for; contains; binds to and regulates; interacts; is used to construct; is an attribute of]
[Figure labels: gene, coding region, regulatory region, expression, interacting factor]
TRANSFAC structure
[Figure labels: SITE, FACTOR, GENE, MATRIX, CLASS, SPECIES, METHOD, CELL, SYNONYMS, FEATURES, SEQUENCE, FUNCTIONAL ELEMENT, Q]
Manual annotation of the databases: input client
TRANSFAC: FACTOR table, protein sequence
TRANSFAC: FACTOR table, protein domains
TRANSFAC: FACTOR table, structural and functional features
TRANSFAC: FACTOR table, links to other databases
TRANSFAC: classification of transcription factors
TRANSFAC: CLASS table
TRANSFAC 8.1 (2004-03-31): number of factor entries for different species
[Figure: bar chart with the categories human, mouse, rat, other vertebrates, fruit fly, plants, fungi, other]
TRANSFAC 8.1 (2004-03-31): distribution of experimentally known TFBS in 5' regions of genes.
[Figure: histogram]
TRANSFAC: FACTOR table, protein-DNA and protein-protein interactions
TRANSFAC: MATRIX table
TRANSFAC® : accompanying tools
Patch™: pattern search
Match™: PWM-based search
gATTGGCGCGAAGtttt
aCAGGGCGCCAAAcgcg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
GGCTGCGGCCAAAtctc
ATCTCCCGCCAGGtcaga
GTTCGCGGGCAAatgc
cTTCGGCGCGCGGtgtt
tTTTCGCGCCAAAgtca
tTTTGCCGCGAAAagac
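Selection of DNA binding sites by regulatory proteins
Statistical-mechanical theory
O.G. Berg and P.H. von Hippel
[Figure labels: Mutational drift, Match, Mismatch]

Assumptions:
1) The binding affinity of the protein to DNA lies in some useful range.
2) The number of sequences is large.
3) All possible sequences are equiprobable.
4) $\varepsilon_{lB}$ expresses the decrease in binding energy when the cognate base pair at position $l$ is replaced by base pair $B$.
5) Individual base-pair contributions are independent and therefore additive. The loss in binding affinity in one position may be gained in another position.

Example values of $\varepsilon_{lB}$ for the positions $1, 2, \ldots, l, \ldots, s$ of a site (first two columns shown):
     1     2    ...  l  ...  s
A   0.5   0.9
T   0.0   0.1
G   1.2   0.0
C   0.1   0.8
For the cognate base pair, $\varepsilon_{l0} = 0$.

Sites have binding affinity in a limited range $\Delta E$ around a required level $E$. In such a set of sites the local contributions $\varepsilon_{lB}$ from every position must sum to $E$.

What is the frequency $f_{lB}$ with which a certain base pair $B$ appears at a certain position $l$ in a site?

The same question is asked in statistical mechanics: given $s$ independent particles in a system and a given total energy $E$, what is the probability that a particle will have the energy $\varepsilon_{lB}$?

$$f_{lB} = \frac{e^{-\lambda \varepsilon_{lB}}}{q_l(\lambda)}, \qquad q_l(\lambda) = \sum_{B=1}^{4} e^{-\lambda \varepsilon_{lB}}$$

$\lambda$ is determined by the density of potential sites, i.e. by the number of possible sequence combinations that have the required discrimination energy $E$.

$$\lambda \varepsilon_{lB} = \ln\left(\frac{f_{l0}^{obs}}{f_{lB}^{obs}}\right)$$

For any sequence $X = B_1 B_2 \ldots B_s$ of length $s$ the actual discrimination energy is:

$$\lambda E(X) = \sum_{l=1}^{s} \lambda \varepsilon_{lB_l} = \sum_{l=1}^{s} \ln\left(\frac{f_{l0}^{obs}}{f_{lB_l}^{obs}}\right)$$

Small-sample effect
With $n_{lB}$ occurrences of base pair $B$ at position $l$ among $N$ known sites:

$$f_{lB} = \frac{n_{lB} + 1}{N + 4}, \qquad \lambda \varepsilon_{lB} = \ln\left(\frac{n_{l0} + 1}{n_{lB} + 1}\right)$$

Problems:
1. Small sets of sites
2. Homology between sites
3. Specific function of nucleotides in certain positions
4. Correlations between positions (not additive effect)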
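The small-sample form of the discrimination energy can be tried out directly. The sketch below is an illustration, not part of the original slides: it takes the nine AP-1 site sequences listed earlier in this talk, treats the most frequent base at each position as the cognate base (an assumption made here for simplicity), and evaluates ln((n_l0 + 1) / (n_lB + 1)) summed over positions. All function names are hypothetical.

```python
import math

# AP-1 site sequences from the list of human AP-1 binding sites earlier in the talk.
SITES = [
    "TGACTTT", "TGACATC", "TGTCACC", "TGACTCA", "TGAGTCA",
    "TGAGTCA", "TGATTTA", "TGACTCA", "TGACTCA",
]
BASES = "ACGT"


def count_matrix(sites):
    """n_lB: number of times base B occurs at position l."""
    counts = [{b: 0 for b in BASES} for _ in range(len(sites[0]))]
    for site in sites:
        for l, base in enumerate(site):
            counts[l][base] += 1
    return counts


def energy_matrix(counts):
    """lambda * eps_lB = ln((n_l0 + 1) / (n_lB + 1)).

    Assumption: the cognate base at each position is the most frequent one.
    """
    energies = []
    for column in counts:
        n0 = max(column.values())
        energies.append({b: math.log((n0 + 1) / (n + 1)) for b, n in column.items()})
    return energies


def discrimination_energy(seq, energies):
    """lambda * E(X): sum of the per-position contributions (additivity assumption)."""
    return sum(energies[l][b] for l, b in enumerate(seq))


ENERGIES = energy_matrix(count_matrix(SITES))
for candidate in ("TGACTCA", "TGAGTCA", "AAAAAAA"):
    print(candidate, round(discrimination_energy(candidate, ENERGIES), 3))
```

The sequence built from the most frequent bases gets energy 0 by construction, and every substitution adds a non-negative term, which mirrors the additivity assumption (point 5 above) and inherits the problems listed for it.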
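$$q = \frac{\sum_{i=1}^{L} I(i)\, f_{i,b_i} - \sum_{i=1}^{L} I(i)\, f_i^{min}}{\sum_{i=1}^{L} I(i)\, f_i^{max} - \sum_{i=1}^{L} I(i)\, f_i^{min}}$$

with:
$b_i$, nucleotide $b$ found in the $i$-th position of the test sequence,
$f_{i,b}$, frequency of nucleotide $b$ in the $i$-th position of the aligned training sequences,
$f_i^{min}$, minimum frequency in position $i$,
$f_i^{max}$, maximum frequency in position $i$,
and

$$I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln(4 f_{i,B}), \qquad i = 1, 2, \ldots, L$$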
TFS identification
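To make the score concrete, here is a minimal sketch of q computed for a small frequency matrix. The matrix `FREQS` is an invented toy example, not a TRANSFAC matrix, and the function names are hypothetical; the convention 0 · ln 0 = 0 is assumed for frequencies of zero.

```python
import math

# Toy frequency matrix (hypothetical): one dict of base frequencies per position.
FREQS = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]


def information_vector(freqs):
    """I(i) = sum over B of f_iB * ln(4 * f_iB), taking 0 * ln(0) as 0."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in freqs]


def matrix_similarity(seq, freqs):
    """q = (Current - Min) / (Max - Min), each a sum of I(i)-weighted frequencies.

    Raises ZeroDivisionError if every position is uninformative (Max == Min).
    """
    info = information_vector(freqs)
    current = sum(w * col[b] for w, col, b in zip(info, freqs, seq))
    lowest = sum(w * min(col.values()) for w, col in zip(info, freqs))
    highest = sum(w * max(col.values()) for w, col in zip(info, freqs))
    return (current - lowest) / (highest - lowest)


for candidate in ("AAAT", "CAAT", "CGGA"):
    print(candidate, round(matrix_similarity(candidate, FREQS), 3))
```

The third position of the toy matrix is uniform, so I(3) = 0 and mismatches there cost nothing, whereas a mismatch at a fully conserved position removes its whole weight from the score.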
Calculating the Ci-values

$$C_i(i) = \frac{100}{\ln 5} \left( \sum_{B \in \{A,C,G,T,gap\}} P(i,B) \ln P(i,B) + \ln 5 \right)$$

Position                      1      2      3      4      5      6      7      8      9     10     11
Counts
A                             1      2      0      0      4      0      0      1      5      2      2
C                             1      2      0      0      1      3      0      4      0      0      2
G                             1      1      0      5      0      2      0      0      0      3      1
T                             1      0      5      0      0      0      5      0      0      0      0
gap                           1      0      0      0      0      0      0      0      0      0      0
Frequencies
P(A)                        0.2    0.4      0      0    0.8      0      0    0.2      1    0.4    0.4
P(C)                        0.2    0.4      0      0    0.2    0.6      0    0.8      0      0    0.4
P(G)                        0.2    0.2      0      1      0    0.4      0      0      0    0.6    0.2
P(T)                        0.2      0      1      0      0      0      1      0      0      0      0
P(gap)                      0.2      0      0      0      0      0      0      0      0      0      0
Terms P(B)*ln P(B)
A                         -0.32  -0.37      0      0  -0.18      0      0  -0.32      0  -0.37  -0.37
C                         -0.32  -0.37      0      0  -0.32  -0.31      0  -0.18      0      0  -0.37
G                         -0.32  -0.32      0      0      0  -0.37      0      0      0  -0.31  -0.32
T                         -0.32      0      0      0      0      0      0      0      0      0      0
gap                       -0.32      0      0      0      0      0      0      0      0      0      0
Sum of P(B)*ln P(B) + ln(5)  0.00   0.55   1.61   1.61   1.11   0.94   1.61   1.11   1.61   0.94   0.55
Ci                            0     34    100    100     69     58    100     69    100     58     34
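The Ci row of this table can be recomputed from the count rows alone. The following sketch (the function name `ci_vector` is hypothetical) applies the formula above, with the gap treated as a fifth symbol and 0 · ln 0 taken as 0, and rounds to whole numbers as on the slide.

```python
import math

# Count matrix from the slide: 5 aligned sites, 11 positions, gap as fifth symbol.
COUNTS = {
    "A":   [1, 2, 0, 0, 4, 0, 0, 1, 5, 2, 2],
    "C":   [1, 2, 0, 0, 1, 3, 0, 4, 0, 0, 2],
    "G":   [1, 1, 0, 5, 0, 2, 0, 0, 0, 3, 1],
    "T":   [1, 0, 5, 0, 0, 0, 5, 0, 0, 0, 0],
    "gap": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def ci_vector(counts):
    """Ci = 100 / ln(5) * (sum_B P(B) * ln P(B) + ln(5)), rounded to integers."""
    n_positions = len(next(iter(counts.values())))
    result = []
    for i in range(n_positions):
        column = [row[i] for row in counts.values()]
        total = sum(column)
        # Symbols with zero count contribute nothing (0 * ln 0 is taken as 0).
        term = sum((n / total) * math.log(n / total) for n in column if n > 0)
        result.append(round(100 / math.log(5) * (term + math.log(5))))
    return result


ci = ci_vector(COUNTS)
print(ci)  # [0, 34, 100, 100, 69, 58, 100, 69, 100, 58, 34]
```

A Ci of 100 marks a fully conserved position and 0 a position where all five symbols are equally frequent, so the vector can serve as a per-position weight when a match is scored.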
To make it fast
Preselection with the core:
[Figure labels: core, T G A C T]
Scoring of the match
TRANSFAC: MatchTM tool
TRANSFAC: MatchTM output
Selection of optimal cut-offs
[Figure: underprediction error, overprediction error and error sum (vertical axis, 0 to 100) plotted against the cut-off (horizontal axis, 0.75 to 1). Marked cut-offs: minFN, minFP, minSUM]
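The idea of the minSUM cut-off can be sketched as follows. This is only an illustration under stated assumptions: the scores are invented, the underprediction error is taken as the percentage of known sites scoring below the cut-off, and the overprediction error as the number of background hits relative to the number at the lowest cut-off tested. The exact definitions used for the Match cut-offs are not given on the slide, and the function names are hypothetical.

```python
# Invented example scores (hypothetical): known sites and background matches.
SITE_SCORES = [0.98, 0.95, 0.93, 0.90, 0.88, 0.82]
BACKGROUND_SCORES = [0.76, 0.78, 0.79, 0.81, 0.83, 0.86, 0.91]
CUTOFFS = [0.75, 0.80, 0.85, 0.90, 0.95, 1.00]


def error_curves(site_scores, background_scores, cutoffs):
    """Return rows (cutoff, underprediction %, overprediction %, error sum)."""
    # Assumption: background hits at the lowest cut-off define 100 % overprediction.
    base_hits = sum(s >= cutoffs[0] for s in background_scores) or 1
    rows = []
    for cutoff in cutoffs:
        under = 100.0 * sum(s < cutoff for s in site_scores) / len(site_scores)
        over = 100.0 * sum(s >= cutoff for s in background_scores) / base_hits
        rows.append((cutoff, under, over, under + over))
    return rows


def min_sum_cutoff(rows):
    """Cut-off with the smallest sum of both error rates."""
    return min(rows, key=lambda row: row[3])[0]


ROWS = error_curves(SITE_SCORES, BACKGROUND_SCORES, CUTOFFS)
for cutoff, under, over, total in ROWS:
    print(f"{cutoff:.2f}  under={under:5.1f}  over={over:5.1f}  sum={total:5.1f}")
print("minSUM cut-off:", min_sum_cutoff(ROWS))
```

A minFN-style cut-off would instead be chosen from the underprediction column alone and a minFP-style cut-off from the overprediction column alone.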
Feature table of GenBank entry | Corresponding hits found by Match
Columns of the Match output: Matrix Identifier, Position, Matrix Similarity, Core Similarity, Sequence, Factor Name
Example of a search using cut-offs to minimize false negative matches
In this example we searched the Homo sapiens angiotensinogen gene (5' region and exon 1) for all binding sites listed in the features of its GenBank entry. For that search we used cut-offs to minimize false negative matches, as these cut-offs are recommended to reduce the probability that Match misses a potential binding site. Corresponding hits for all of the entries in the feature table that concern a binding site could be found in the Match output.
TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences. Currently, TRANSPLORER site prediction tool uses position weight matrices (PWM) collections. It is able to use several matrix sources: the largest and most up-to-date library of matrices derived from TRANSFAC® Professional database, other matrix libraries as well as any user-developed matrix libraries. This means that it provides an opportunity to search for a great variety of different transcription factor binding sites. A search can be made using all or subsets of matrices from the libraries.
Search for most probable binding sites regulating gene expression
Search for binding sites coinciding with SNPs
•pairs of closely situated binding sites for TFs;
•cooperative functioning of transcription factors;
•direct protein-protein interactions;
•combinatorial regulation of gene transcription.
Key topics
TRANSCompel®
a database on composite regulatory elements
individual entry
Description of an evidence (experiment, cell type, two individual interactions)
Link to the TRANSFAC GENE
table
Link to the EMBL
Link to the TRANSFAC
FACTOR table
N. Gene | Factors | Scheme of CE (positions)
1. IgH, Mus musculus | AP-1, Ets |
2. IL-2, Homo sapiens | AP-1, NFAT | -283, -268
3. IL-2, Homo sapiens | AP-1, NF-kappaB | -167, -142
4. IL-2, Mus musculus | AP-1, Oct-2 | -167, -142
5. IgH, Homo sapiens | Ets, CBF |
6. Serum amyloid A1, Rattus norv. | NF-kappaB, C/EBP | -117, -73
7. IRF-1, Mus musculus | NF-kappaB, STAT-1 | -123, -113, -49, -40
TRANSCompel®
combinatorial regulation, more than 360 CEs
TRANSCompel®
functional classification of the composite elements
inducible/inducible
- Ca2+ and PKC response: NFAT / AP1
- IFN-gamma and TNF-alpha response: NF-kappaB / IRF
inducible/constitutive
- cholesterol level response: SREBP / Sp1
- acute-phase response: STAT-3 / Sp1
inducible/tissue-restricted
- TGF-beta response in B-cells: SMAD / AML
tissue-restricted/tissue-restricted
- pancreas islet beta-cells (insulin-producing): HNF3 / BETA2
- pituitary gonadotropes: Ptx1 / SF-1
tissue-restricted/ubiquitous
- macrophages: PU.1 / Sp1
[Table: number of composite elements by functional class of factor F1 versus functional class of factor F2. Classes on both axes: tissue-specific, inducible, cell-cycle dependent, developmental stage-dependent, ubiquitous constitutive. Counts shown: 32, 44, 119, 1, 2, 3, 39, 60, 2, 12, 2]
Inducible/inducible
19 CEs: ETS / AP-1, providing cross-coupling of Ras/Raf- and PKC-dependent signalling pathways;
15 CEs: NFATp / AP-1, providing cross-coupling of Ca2+- and PKC-dependent signalling pathways;
14 CEs: NF-kappaB / C/EBP; NF-kappaB is inducible by IL-1 and TNF-alpha; C/EBP is inducible by IL-6.
Inducible/constitutive
9 CEs: ETS / Sp1; ETS factors are inducible through the Ras/Raf-dependent signalling pathway;
5 CEs: Smad / TEF3; Smads are inducible by TGF-beta signalling.
Inducible/tissue-restricted
CEs: Pit-1 / AP-1; Pit-1 is a pituitary-restricted transcription factor, whereas AP-1 and Ets are ubiquitous inducible factors;
TRANSCompel®
antagonistic type of CEs
COMPEL: C00006, chicken embryonic -globin gene
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA
Sites: Sp1, NF-Y, NF-1
Sp1 cooperatively with NF-Y activates transcription in primitive erythroid cells.
NF-1 represses transcription in adult cells.

COMPEL: C00009, human c-fos proto-oncogene
acaggaTGTCCATATTAGGacatctgcg
Sites: YY-1, SRF
SRF mediates the rapid, transient induction of the c-fos proto-oncogene by serum growth factors.
YY1 diminishes both basal and serum-induced expression of c-fos.

COMPEL: C00054, rat serum amyloid A1 gene
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg
Sites: C/EBP, NF-kappaB, YY-1
C/EBP and NF-kappaB synergistically activate transcription in liver cells during the acute phase response.
YY1 represses inducible transcription of this gene.
Antagonistic composite elements
pattern-based search for potential composite elements in DNA sequences
• All CE‘s are used as individual searching patterns;
• Several parameters are available restricting the search:
mismatches in the site 1 and site 2,
distance between two sites,
composite score
Catch®
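A pattern-based search of this kind can be sketched in a few lines. The code below is an illustration only, not the Catch implementation: function names and parameters are hypothetical, only one strand and one site order (site 1 upstream of site 2) are scanned, and the composite score is left out because the slide does not define it. The two site patterns and the spacer come from one of the example sequences shown with this tool (ACAGGAATgacctggtgcCTCGCCC).

```python
def find_site(seq, pattern, max_mismatches):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def find_composite(seq, site1, site2, mm1, mm2, min_gap, max_gap):
    """Find site1 followed by site2 with a spacer of min_gap..max_gap bases.

    Returns tuples (start of site 1, start of site 2, gap, total mismatches).
    Assumption: forward strand only, site 1 upstream of site 2.
    """
    seq = seq.upper()
    results = []
    for start1, m1 in find_site(seq, site1, mm1):
        end1 = start1 + len(site1)
        for start2, m2 in find_site(seq, site2, mm2):
            gap = start2 - end1
            if min_gap <= gap <= max_gap:
                results.append((start1, start2, gap, m1 + m2))
    return results


# One CE from the slide, embedded in two flanking bases on each side.
TEST_SEQ = "ttACAGGAATgacctggtgcCTCGCCCaa"
hits = find_composite(TEST_SEQ, "ACAGGAAT", "CTCGCCC", 0, 0, 5, 15)
print(hits)
```

Loosening `mm1` and `mm2` or widening the gap window trades specificity for sensitivity, which is why the slide lists them as the parameters restricting the search.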
CCACCCATTTCCTC
ACAGGAATgacctggtgcCTCGCCC
TTCCTCctgtgccttag...ctgtttttctaaCCGCCC
GAAGGGCGGGGAcagtt...aagcaaaaAAAGGGAACTGA
AAAGGGAACTGAgtggctgcgaaAGGGTGGGG
GGAAgcaaccagCCCACCA
CCGGAAGCaaccagCCCACC
aaAAGGAAGTGGGCGTGGTttaaag
ACTTCCTC...GGCTCCTCCTCC
Search for the potential CEs
[Figure labels: set of CEs, matrices M1 and M2]
Rules:
1. matrix rule: $q_{M1} > n_1$ and $q_{M2} > n_2$
2. distance rule
3. orientation rule
TRANSCompel®
CEs of similar structure can be used to construct models
Application of CE models for promoter analysis
Four CE types are over-represented in promoters in comparison with several biological sequences tested.
[Figure: bar chart, frequency in 100000 bp (vertical axis, 0 to 180) of the CE types Myb/Aml, NF-kB/Sp1, Ets/Sp1 and E2F/Sp1 in six sequence sets: promoters -350/+50, exons_3d, h_chr_15 (whole), h_chr_15_Alu, h_chr_15_L1, h_chr_15_L2]
Gene expression profiling
Genome - Reference Sequence
Gene
Transcripts
Splicing variants
Polypeptides
Composite elements
Regulatory regions
TF binding sites
Repeats
S/MARs
TRANSGENOME provides the hierarchical structure of the most important elements of a genome in coding regions as well as in regulatory regions. This structure provides the possibility to have a unique reference sequence and to store the location of all gene regulatory and structural elements.
TRANSGENOME
Gene
pre-mRNAs (from RefSeq)
spliced mRNAs
CDS
site
5'UTR, 3'UTR
RefSeq derived potential starts of transcription (first exons)
TRANSFAC derived start of transcription (by relative site positions)
EPD derived start of transcription
DBTSS derived start of transcription
Bronchial tree and Intrapulmonary Airways
Lobar bronchus
Segmental bronchus
Bronchus
Bronchiolus
Terminal bronchiolus Alveolar sac
Respiratory bronchiolus
Pulmonary alveolus
Alveolar duct
Alveolar pore Alveolar
epithelium
Alveolar septa
Pneumocytes
Cytomer/Content
Human body > Lung > Bronchial tree > Main bronchus
[Figure: CYTOMER structure. Tables and fields as labelled on the slide:]
Species: ID, Name
Cell: ID, Name, Description
Organ: ID, Name, Parent
System: ID, Name
HUB: ID, Cytomer_no, Organ_ID, Cell_ID, System_ID, Period_ID, Species_ID
Stage2Period: Stage_ID, Period_ID
CP: ID, TFacc, Cytomer_no, Cacc
CN: ID, TFacc, Cytomer_no, Cacc
Transfac Factor: ID, Acc
Period: ID, T1, T2
Stage: ID, T1, T2, description
CYTOMER structure
human body abdominal regions acetabular artery acetabular ramus anastomatic vessel aortic plexus apocrine gland blood blood vessel body cavities bronchi cardiac thoracic nerve cartilage cartilaginous tissue of anular radial ligament central nervous system
brain brain stem meninges of brain mesencephalon
cerebral peduncle base of peduncle cerebral crus lateral groove of midbrain substantia nigra
compact part of substantia nigra lateral part of substantia nigra reticular part of substantia nigra retrorubral part of substantia nigra
tegmentum of midbrain of cerebral peduncle frenulum of superior medullary velum of midbrain interpeduncular fossa
TRANSGENOMEEST
UniGene
Gene expression group 1
Gene expressiongroup 2
Gene expressiongroup 3
CYTOMER® A database on gene expression sources
Spatio-temporal coordinates: CYTOMER
Conditional determinants: TRANSPATH
Factors controlling transcription: TRANSFAC
[Figure: expression space E and expression space Eg of gene g. Axes: t (temporal, developmental axis), x (spatial axis: systems, organs, cells), c (conditional)]
The gene expression space
Gene expression profiling
Expression matrix:
- rows representing genes
- columns representing samples (various tissues, developmental stages, ...)

        E1        E2       ..   ES
g1    x_{1,1}   x_{2,1}    ..   x_{S,1}
g2    x_{1,2}   x_{2,2}    ..   x_{S,2}
:       :         :               :
gh    x_{1,h}   x_{2,h}    ..   x_{S,h}

Row g1: expression pattern of gene g1.
Column E1: expression profile of state E1 (e.g. in organ O at stage t).
Axes: gene (rows), expression state (columns).
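The two ways of reading the expression matrix can be shown with a small sketch. The values and names below are invented for illustration; rows are genes and columns are expression states, as on the slide.

```python
# Invented example data (hypothetical): 3 genes measured in 2 expression states.
GENES = ["g1", "g2", "g3"]
STATES = ["E1", "E2"]
MATRIX = [
    [5.0, 0.5],  # g1
    [1.2, 1.1],  # g2
    [0.0, 7.3],  # g3
]


def expression_pattern(gene):
    """Row of the matrix: one gene across all states."""
    row = MATRIX[GENES.index(gene)]
    return dict(zip(STATES, row))


def expression_profile(state):
    """Column of the matrix: all genes in one state."""
    j = STATES.index(state)
    return {gene: row[j] for gene, row in zip(GENES, MATRIX)}


print(expression_pattern("g1"))   # {'E1': 5.0, 'E2': 0.5}
print(expression_profile("E1"))   # {'g1': 5.0, 'g2': 1.2, 'g3': 0.0}
```

Comparing rows groups genes with similar expression patterns, while comparing columns groups states (tissues, stages, conditions) with similar profiles.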