Synonymous mutations - from bacterial evolution to somatic
changes in human cancer
Fran Supek
1) Lehner group, CRG/EMBL Systems Biology Unit, Barcelona
2) Division of Electronics, RBI, Zagreb, Croatia
XXI Jornades de Biologia Molecular
Barcelona, 11.6.2014
synonymous mutations = changes in the gene sequence that don’t alter the protein sequence
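A minimal sketch of this definition, assuming the standard genetic code (NCBI translation table 1). The names `translate_codon` and `is_synonymous` are illustrative and do not come from the talk.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon: str) -> str:
    """Return the one-letter amino acid (or '*' for stop) for a DNA codon."""
    return CODON_TABLE[codon.upper()]


def is_synonymous(codon: str, position: int, new_base: str) -> bool:
    """True if replacing the base at `position` (0-2) keeps the amino acid."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return translate_codon(codon) == translate_codon(mutated)


print(is_synonymous("GGT", 2, "C"))  # GGT (Gly) -> GGC (Gly): True
print(is_synonymous("GGT", 0, "A"))  # GGT (Gly) -> AGT (Ser): False
```

Most synonymous changes fall on the third codon position, as in the first call.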
Synonymous mutations
• (some) synonymous mutations are subject to evolutionary pressures
• clearly shown for many bacteria and yeasts
• likely also higher Eukarya (but weaker signal)
• how does selection for/against synonymous changes relate to gene function in (a) evolution of bacteria and (b) in carcinogenesis?
(a) evolutionary trace across ~1000 bacterial genomes: adaptation to diverse environments
(b) somatic mutations in ~4000 human cancers: malignant transformation
( plush microbes in photos are from http://www.giantmicrobes.com/ )
• In what way can the evolution of synonymous codon preferences be used to systematically infer gene function in bacteria?
• There are other, simpler (known) ways to determine gene function from genome sequences:
• commonly/systematically applied: transfer of annotation via sequence similarity (BLAST, COG, Pfam...)
• >30% of genes end up with no known function annotated. They may not have known homologs, or their homologs may have no experimentally determined function.
• known but less common: genomic context methods, such as phyletic profiling
Phyletic (or phylogenetic) profiling
Pellegrini, Marcotte et al., PNAS (1999)
one genomic context method:
examines presence/absence patterns of homologous genes across species.
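As a rough illustration of the idea, a presence/absence profile can be written as a 0/1 vector per gene family, and two families compared by the correlation of their vectors. The toy profiles and the names `pearson` and `most_similar` below are hypothetical. This is the simple pairwise-similarity flavour of phyletic profiling, not the supervised approach used later in the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 1 = homolog present in that species, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],
    "geneC": [0, 0, 1, 1, 0, 1],
    "geneD": [1, 0, 1, 0, 1, 1],
}


def most_similar(query):
    """Name of the gene whose profile correlates best with `query`."""
    others = (g for g in profiles if g != query)
    return max(others, key=lambda g: pearson(profiles[query], profiles[g]))


print(most_similar("geneA"))  # geneB has the identical profile
```

Genes with matching profiles are candidates for sharing a function; the function of an annotated gene can then be proposed for its unannotated partner.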
Kensche et al. (2008) J Royal Soc Interface. ~30 examples of success of phyletic profiling
• by 2008 -> n~=30
• by 2014 -> n~=300 (estimate)
• aim for: N > 3000
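Enriching phyletic profiles with information on orthology and paralogy
[Table schematic: rows are OMA groups (OMA 1, OMA 2, …, OMA 64051, OMA 64052); columns are Species 1, Species 2, …, Species 997, Species 998, plus a Function column. Cell values shown include 0; the legend distinguishes orthologs in cliques, orthologs outside cliques, and paralogs. Function entries shown: OMA 1: GO:001, GO:007; OMA 2: ?; OMA 64051: GO:042; OMA 64052: GO:003, GO:160.]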
groups of orthologs from OMA database: Schneider, Dessimoz and Gonnet (2007) Bioinformatics
Skunca et al. PLoS Comp Biology 2013, doi:10.1371/journal.pcbi.1002852
Accuracy of predicting GO categories strongly increases when adding paralogs
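[Figure: accuracy of predicting GO categories with clique-only profiles, compared with profiles extended by orthologs outside the clique (+ ortho) and by paralogs (+ para); bubbles are Gene Ontology categories.]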
Supervised machine learning is superior to common approaches based on pairwise distances
[Figure: AUC (area under ROC curve), methods based on correlation of profiles vs. decision trees]
Schietgat et al. 2010. BMC Bioinfo
Experimental validation of predictions made with phyletic profiling
• knockout mutants of E. coli in predicted genes
• three selected GO categories targeted by particular antibiotics:
• ‘response to DNA damage’
• ‘translation’
• ‘peptidoglycan-based cell wall biogenesis’
• predictions: 38 genes with expected precision > 60%
[Figure: survival compared to the wild type (0% to 160%) of E. coli knockout mutants (w.t., dbpA, rhlB, yhbJ, pmbA, rhlE, tldD, yidD, ynbB, envC, murE) treated with nalidixic acid (DNA damaging agent), ampicillin (inhibits cell wall synthesis) or kasugamycin (inhibits translation initiation). Example question asked for each gene: does this gene participate in ‘peptidoglycan-based cell wall biogenesis’?]
25/38 validated predictions (experimental precision = 66%; theoretically expected = 60%).
Our method is useful for prioritizing genes for experimental determination of gene function.
http://gorbi.irb.hr/
“We predict Gene Ontology annotations ... for about 1.3 million poorly annotated genes in 998 prokaryotes at a stringent threshold of 90% Precision...”
“...about 19000 of those are highly specific functions.”
published in: Skunca et al. PLoS Comp Biology 2013, doi:10.1371/journal.pcbi.1002852
• Codon usage biases are another useful source of evolutionary information
• ... complementary to gene presence/absence
• ... available from just the genome sequence
• ... with an established biological rationale
tRNA levels and codon usage biases
E. coli K-12, tRNA gene counts (proxy for tRNA levels)
codon
anticodon
Commonly used codons typically correspond to abundant tRNAs, particularly in highly expressed genes.
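To make "commonly used codons" concrete, here is a small sketch that computes relative synonymous codon usage (RSCU) for one coding sequence: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. RSCU is a standard textbook measure and is used here only as an illustration; it is not the MILC measure shown in this talk. The toy sequence and the function name `rscu` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def rscu(cds: str) -> dict:
    """RSCU per codon, for amino acids that occur in `cds` (stops skipped)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa, family in SYNONYMS.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values


toy_cds = "CTGCTGCTGCTTAAAAAG"  # Leu x4 (CTG x3, CTT x1), Lys x2 (AAA, AAG)
usage = rscu(toy_cds)
print(usage["CTG"], usage["CTT"], usage["AAA"])  # 4.5 1.5 1.0
```

Values above 1 mark codons used more often than expected under equal usage; in a real analysis the counts would be pooled over many genes rather than taken from one short sequence.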
Codon biases correlate to gene expression
[Figure B, E. coli genome: MILC (non-RP genes) plotted against MILC (ribosomal protein genes), for ribosomal protein genes, other highly expressed genes, and the rest of the genome. Figure from Supek and Vlahoviček (2005) BMC Bioinformatics, doi:10.1186/1471-2105-6-182]
• organisms adapt to the environment through changes in translation efficiency?
• Carbone A (2005) J Mol Evol – codon adaptation in metabolic pathways:
Photosynthesis genes in Synechocystis (Bacteria)
Methanogenesis genes in Methanosarcina (Archaea)
An example phenotype: oxygen requirement
• Man & Pilpel (2007) Nat Genet: 9 yeasts
[Figure: codon adaptation (low to high) of TCA cycle and glycolysis genes in aerobic vs. anaerobic yeasts]
• Based on these examples, we aimed to systematically link:
• Many environments/phenotypes, with
• evolutionary change in translation efficiency across many gene families
Measuring translation efficiency
Method from Supek et al. (2010) PLoS Genetics, doi:10.1371/journal.pgen.1001004
[Schematic, panels A and B: for each gene (gene 1, gene 2, gene 3, ...), a classifier predicts the probability of belonging to the highly expressed genes* rather than to all other protein genes, from intergenic DNA alone vs. with codon usage added. Question: does the probability increase after adding codon usage?]
* ribosome, translation elongation factors, chaperones
non-HE vs. HE: HE genes are 4-20% of the genome
Expression levels: microarrays on 19 diverse bacteria
[Figure C: log2 expression ratio (0 to 4) for OCU/non-OCU (from ref. [7]), HE/non-HE, and ribosomal proteins/all genes; the values 3.9x and 6.0x are marked.]
Correlation vs. causality?
a randomization test to control for confounding phenotypes and phylogeny
[Figures: an example association that passes the randomization test; an example that fails (association not unique); associations between phenotypes, and also with phylogeny.]
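The slides do not spell out the randomization test, so the following is only a generic sketch of one way such a control can work: phenotype labels are shuffled within strata (for example clades, or groups sharing a confounding phenotype), so that any association explained by the strata alone survives the shuffling and is not counted as significant. The function name, the test statistic and the toy data are all assumptions made for illustration.

```python
import random


def stratified_permutation_pvalue(he, phenotype, strata, n_perm=2000, seed=0):
    """One-sided p-value for 'HE status is more frequent in genomes with the
    phenotype', shuffling phenotype labels only within each stratum.
    Both phenotype classes must be present."""
    rng = random.Random(seed)

    def statistic(labels):
        with_trait = [h for h, p in zip(he, labels) if p]
        without = [h for h, p in zip(he, labels) if not p]
        return sum(with_trait) / len(with_trait) - sum(without) / len(without)

    observed = statistic(phenotype)
    groups = {}
    for index, stratum in enumerate(strata):
        groups.setdefault(stratum, []).append(index)

    hits = 0
    for _ in range(n_perm):
        shuffled = list(phenotype)
        for indices in groups.values():
            values = [phenotype[i] for i in indices]
            rng.shuffle(values)
            for i, value in zip(indices, values):
                shuffled[i] = value
        if statistic(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 12 genomes in two clades.
he = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]       # gene family is HE in genome?
aerobe = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]   # phenotype varies within clades
clade = ["A"] * 6 + ["B"] * 6
print(stratified_permutation_pvalue(he, aerobe, clade))

# If the phenotype is fully determined by clade, shuffling within clades
# changes nothing and the association cannot be called significant.
confounded = [1] * 6 + [0] * 6
print(stratified_permutation_pvalue(he, confounded, clade)[1])  # 1.0
```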
• 514 aerotolerant vs. 214 aerointolerant: 295 COGs are significantly enriched with HE genes; after control for confounders, 23 COGs
• obligate vs. facultative aerobes: 11 COGs
• thermophiles: 16 COGs
• halophiles: 6 COGs
+ 20 other phenotypes tested
Gene families linked to aerotolerance
all experiments: Anita Kriško lab (Mediterranean Institute for Life Science, Split, Croatia); published as Kriško et al, Genome Biology 2014. doi:10.1186/gb-2014-15-3-r44
[Figure, panels A-C: survival of E. coli knockout mutants, normalized to w.t. (0% to 120%), after oxidative (2.5 mM H2O2), heat and osmotic shock. Panel A also shows the NAC / no NAC survival ratio (0x to 6x) for 5 mM NAC pretreatment. Panel B mutants: clpS, oppA, tig, ssuD, nudF, pnp, typA, mngR, lsrR, yebS, rhlE, yajL, pykF, dtd, eutD, gloB, yfcA, marR, yccX, pncB, ttdB, moaA, dsbB. Panel C mutants: yjjB, flgH, cysG, mnmA, nlpE, proX. A positive control and 2 nonspecific hits are marked.]
* known antioxidant proteins in E. coli (or homologs in other organisms)
* known to be regulated in response to air or oxidative stress
ROS levels in the mutants
[Figure: heat map (scale 0 to 0.8). Columns: carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level increase, NADPH rescue. Rows: fre, sufD, rseC, sodA (positive control), w.t. (wild-type), clpA, recA, napF, lon, ybeQ, yaaU, cysD, ybhJ, gpmM, icd, lpd, yidH.]
ROS are typically not increased (except cysD, yaaU, rseC, and the positive control sodA)
Predicted functional interactions from STRING v9
Gene families whose codon biases are associated to aerobicity/aerotolerance:
Putative mechanisms of oxidative stress resistance
[Figure: heat map of mutant phenotypes (columns: carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level increase, NADPH rescue), with genes grouped by putative mechanism: NAD(P)H-related, iron-related, unknown.]
Other phenotypes: thermophilicity, halophilicity
[Figure: survival of E. coli knockout mutants, normalized to w.t., after heat, oxidative and osmotic shock.]
Knockout of candidate genes affects heat shock resistance and osmotic shock resistance.
Validation using synthetic genes with introduced suboptimal codons
[Figure, panels A-D: relative frequency distributions of codon distance (MILC) to ribosomal protein genes, for ribosomal protein genes and for all other E. coli genes, with the w.t. gene and the recoded variants marked (clpS: w.t., 15, 20, 25; yjjB: w.t., 21, 28, 35); and % survival (0% to 30%) after heat shock and osmotic shock for w.t., ΔclpS, ΔclpS + clpS_w.t., ΔclpS + clpS_15, ΔclpS + clpS_20, ΔclpS + clpS_25, and for w.t., ΔyjjB, ΔyjjB + yjjB_w.t., ΔyjjB + yjjB_21, ΔyjjB + yjjB_28, ΔyjjB + yjjB_35.]
Overall:
• 200 links between 187 different COG gene families and 24 diverse phenotypic traits, including
• spore-forming ability
• motility
• pathogenicity to plants or mammals
• affecting certain tissues/organs
• (1000s of predictions at less stringent thresholds)
• Anita Kriško - Mediterranean Institute for Life Sciences (MedILS), Split, Croatia: all experimental work shown
• Nives Škunca, ETH Zurich: phyletic profiling
Cancer
3851 cancer exomes from 11 tissues (>200 samples each)
292,405 missense and 123,193 synonymous somatic mutations
ARE THE SYNONYMOUS MUTATIONS SELECTED FOR IN CARCINOGENESIS?
from Lawrence et al (2013) Nature. Mutation rate varies widely across the genome and correlates with DNA replication time and expression level.
from Schuster-Böckler and Lehner (2012): heterochromatin correlates to SNV rates
Drivers vs. passengers
• many somatic mutations in cancer = „passengers”
• a driver = a gene whose mutation confers a selective advantage; drivers are recurrently mutated (i.e. more often than expected)
1. For missense mutations, this could be measured using dN/dS
2.
3. commonly: find background mutation frequencies for a patient from the entire exome, then see if a gene is above that background
Intronic rates as a baseline: INVEX test, Hodis et al. (Cell 2012)
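The "is a gene above background?" question can be sketched as a one-sided binomial test, under the strong simplifying assumption of one uniform per-base background rate. This is not the INVEX test nor the method used later in the talk; the function names and all numbers below are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X < k).
    Probabilities are evaluated in log space so that large n is safe.
    Requires 0 < p < 1."""
    below = 0.0
    for i in range(k):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        below += exp(log_pmf)
    return max(0.0, 1.0 - below)


def gene_excess_pvalue(observed, gene_length_bp, n_samples, rate_per_bp):
    """P-value for seeing at least `observed` mutations in a gene, if every
    base in every sample mutated independently at `rate_per_bp`."""
    trials = gene_length_bp * n_samples
    return binomial_tail(observed, trials, rate_per_bp)


# Hypothetical example: 3000 bp gene, 200 tumors, 2e-6 mutations per bp per
# tumor, so about 1.2 mutations are expected by chance.
print(gene_excess_pvalue(2, 3000, 200, 2e-6))
print(gene_excess_pvalue(8, 3000, 200, 2e-6))
```

A uniform background is exactly what the following slides argue against: mutation rates vary with replication timing, expression and chromatin state, which is why matched baselines are needed.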
[Figure, panels A-D.
PCA of gene properties (correlation to PC1, 30.4% variance, vs. correlation to PC2, 24.3%): regional mutation rates (carcinoma, 1 Mb; non-carcinoma, 1 Mb; pooled, 200 kb; liver, 200 kb; liver, 1 Mb; breast, 1 Mb), H3K9me3 (1 Mb), GC3, RepliSeq (1 Mb), and mRNA levels (hypothalamus; liver; skeletal & heart muscle; 6 tissues).
Gene sets. „Classical” cancer genes, from the Cancer Gene Census: oncogenes activated by translocation (217), missense (40) or copy number (12) changes; tumor suppressors, all mechanisms (84). Newly discovered, from cancer genomes: recurrently mutated genes (self-reported in literature), with the counts 39 (oncogenes), 38 (tumor suppressors) and 336 (others). A Venn diagram of missense-activated oncogenes and recurrently mutated oncogenes shows the counts 19, 18 and 21. Matched sets of noncancer genes: 1517 genes (for oncogenes) and 693 genes (for tumor suppressors), out of the complete set of 13219 noncancer genes.
Cumulative distributions (0 to 1) of # mutations per 200 kb (110 cancers, pooled tissues), heterochromatin (H3K9me3 levels in 1 Mb windows), replication timing (RepliSeq signal in 1 Mb windows; early to late) and mRNA levels, avg. of 6 tissues (log2 RPKM). Statistics shown: D- = 0.224, P = 0.017; D+ = 0.313, P = 0.0004; D- = 0.464, P = 2.4·10^-8; D- = 0.256, P = 0.005; D+ = 0.211, P = 0.026; D- = 0.199, P = 0.043; D+ = 0.215, P = 0.025; D- = 0.185, P = 0.061.]
Detecting positive selection on synonymous mutations in cancer
• create „matched sets” of genes closely following the oncogenes in:
• regional mutation rates, in 1 Mb and 200 kb windows
• expression levels in different tissues
• Heterochromatin, replication timing
• G+C content
How to find a good set of genes?
A genetic algorithm: an optimization technique that can (relatively) easily handle many criteria at once. Quite efficient. Many parameters.
Operators:
...crossover
...random mutation
[Figure: the distributions of regional mutation rates, heterochromatin, replication timing and mRNA levels, shown separately for oncogenes and for tumor suppressors against their matched sets of noncancer genes.]
Distributions of regional mutation rates (1Mb and 200 kb), heterochromatin, etc. in the optimized sets of non-cancer genes closely match the cancer genes. Genetic algorithm tries to minimize the K-S statistic.
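A toy version of this idea, for a single covariate: a small genetic algorithm with crossover, random mutation and elitism picks a subset of a noncancer pool whose distribution matches the cancer genes, using the two-sample K-S statistic as the quantity to minimize. Everything here (function names, population size, number of generations, the simulated data) is an assumption for illustration; the analysis in the talk matches many criteria at once.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:  # the maximum is always reached at a data point
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def fitness(subset, target, pool):
    return ks_statistic(target, [pool[i] for i in subset])


def crossover(parent1, parent2, size, rng):
    """Child draws its members from the union of both parents."""
    return rng.sample(sorted(set(parent1) | set(parent2)), size)


def mutate(subset, pool_size, rng):
    """Swap one member for a random gene that is not yet in the subset."""
    child = list(subset)
    outside = [i for i in range(pool_size) if i not in subset]
    child[rng.randrange(len(child))] = rng.choice(outside)
    return child


def match_set(target, pool, size, generations=40, pop_size=30, seed=0):
    """Return (best subset, its K-S statistic, best K-S at the start)."""
    rng = random.Random(seed)
    population = [rng.sample(range(len(pool)), size) for _ in range(pop_size)]
    initial_best = min(fitness(s, target, pool) for s in population)
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, target, pool))
        parents = population[: pop_size // 2]  # elitism: best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, size, rng), len(pool), rng))
        population = parents + children
    best = min(population, key=lambda s: fitness(s, target, pool))
    return best, fitness(best, target, pool), initial_best


# Hypothetical data: one covariate (e.g. a regional mutation rate score).
data_rng = random.Random(1)
cancer_genes = [data_rng.gauss(1.0, 1.0) for _ in range(40)]
noncancer_pool = [data_rng.gauss(0.0, 1.0) for _ in range(400)]
best, best_ks, initial_ks = match_set(cancer_genes, noncancer_pool, size=80)
print(round(initial_ks, 3), "->", round(best_ks, 3))
```

Because the best half of each generation is carried over unchanged, the best K-S statistic can only stay the same or decrease from one generation to the next.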
Expected: the oncogenes and the tumor suppressors are highly enriched with missense mutations (~1.5 - 2.5x).
However, the oncogenes are also enriched with synonymous mutations over their matched sets, ~1.2x.
Introns of oncogenes (from whole-genome sequencing) are not enriched with SNVs, compared to matched sets.
The matched sets method agrees with Invex, and with simply using neighboring genes as a baseline.
Tissue-specific oncogenes are more enriched with synonymous mutations in the corresponding tissue.
This effect is not due to mutation showers/clustered mutations, as the same cancer samples don't tend to contain both a synonymous and a missense mutation in the same gene.
Synonymous enrichment in oncogenes is detectable across cancer types.
Some oncogenes are more highly enriched with synonymous mutations than others, e.g. PDGFRA, EGFR, GATA1, ELN, NTRK1, JAK3, ALK and others (n=16).
The synonymous SNV enrichment in these genes is not paralleled by intronic SNV enrichment.
The synonymous mutations tend to cluster together to a similar extent as the missense mutations in the affected oncogenes. They also (less prominently) cluster with missense mutations.
What could the synonymous mutations do?
Use of „optimal codons”; miRNA binding sites; secondary structures in mRNA
[Figure, panels A-I, comparing actual synonymous mutations in 16 oncogenes with randomized mutation positions or with the matched set. Panels show: % of synonymous mutations leading to optimal codon gain, optimal codon loss, or no change (n.s.); mRNA folding free energy around mutated sites (kcal/mol), w.t. mRNA vs. mut. mRNA, in 50 nt and 100 nt windows; net # of gained miRNA seed sites per syn. mutation, for whole cDNA and for sites w/ phyloP > 1.0; % of mutations binned by distance at ≤30 nt, 31-70 nt and >70 nt (ratios in plot order: 1.75, 1.26, 0.45; p < 10^-4); log2 RPKM of exon vs. exon # in transcript ENST00000334286 (EDNRB gene, colorectal cancer), for 30 random samples w/o point mutations and 6 samples w/ synonymous exonic mutations; % of syn. mutations (within 30 nt of splice site) leading to enh. gain, enh. loss, sil. gain, sil. loss, using Ke et al. 2012 hexamers (ratios in plot order: 1.53, 0.83, 0.60, 1.90; p = 0.02), RESCUE-ESE (enh. gain, enh. loss: 1.90, 0.53; p = 0.003) and FAS-hex2 (sil. gain, sil. loss: 0.37, 2.73; p = 3·10^-4); position of mutated sites in protein secondary structure (α-helix parts, coil) and normalized differences (Glass' delta) in amino acid properties, t-test FDR < 10%.]
No general effect was detected in any of these cases (although they may still be important in specific examples).
Exonic Splicing Enhancer
~ and ~
Exonic Splicing Silencer
From Cartegni, Chew & Krainer. Nat Rev Genet. 2002 3(4),285-98.
AGAAGA enh, GAAGAT enh, GACGTC enh, GAAGAC enh
....
CTTTTA sil, CTTTAA sil, TAGGTA sil, TAGTAG sil
Synonymous SNVs tend to be closer to splice sites in oncogenes.
They also tend to cause gains of known exonic splicing enhancer motifs, and losses of exonic splicing silencer motifs.
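A small sketch of how motif gains and losses can be scored for one point mutation: collect all hexamers overlapping the mutated position before and after the change, then intersect with enhancer and silencer sets. The eight hexamers are the examples listed on the slide above; real analyses use much larger sets (the talk names RESCUE-ESE, FAS-hex2 and Ke et al. 2012). The function names and toy sequences are hypothetical, and the sketch does not check whether the change is synonymous.

```python
# Example hexamers taken from the slide; not a complete motif catalogue.
ENHANCERS = {"AGAAGA", "GAAGAT", "GACGTC", "GAAGAC"}
SILENCERS = {"CTTTTA", "CTTTAA", "TAGGTA", "TAGTAG"}


def hexamers_covering(seq, pos, k=6):
    """All k-mers of `seq` that overlap 0-based position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return {seq[i:i + k] for i in range(first, last + 1)}


def motif_changes(seq, pos, new_base):
    """Enhancer/silencer hexamers gained or lost by one substitution."""
    mutant = seq[:pos] + new_base + seq[pos + 1:]
    before = hexamers_covering(seq, pos)
    after = hexamers_covering(mutant, pos)
    gained, lost = after - before, before - after
    return {
        "enh_gain": sorted(gained & ENHANCERS),
        "enh_loss": sorted(lost & ENHANCERS),
        "sil_gain": sorted(gained & SILENCERS),
        "sil_loss": sorted(lost & SILENCERS),
    }


# G -> A at position 7 creates two overlapping enhancer hexamers.
print(motif_changes("CCAGAAGGCC", 7, "A")["enh_gain"])  # ['AGAAGA', 'GAAGAC']
# G -> A at position 4 destroys a silencer hexamer.
print(motif_changes("GGTAGTAGGG", 4, "A")["sil_loss"])  # ['TAGTAG']
```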
They more often affect exons with weaker (noncanonical) splice sites.
The exonic splicing enhancers created may resemble SF2/ASF motifs.
The ESS sites that are lost upon mutation sometimes resemble hnRNP A2/B1, H2 and A1 motifs.
Roughly ½ of the putatively causal synonymous mutations alter splicing, as evidenced by examining RNA-seq data from cancer.
We don't (yet) know what the other ½ is doing. One possibility may be affecting protein folding.
In yeast: Pechmann & Frydman, Nature Struct Mol Biol 2013
[Figure, panels F-H: % of actual synonymous mutations vs. randomized mutation positions in α-helix, 1st a.a.; α-helix, middle; α-helix, last a.a.; and coil (ratios in plot order: 1.43, 1.12, 0.79; p = 0.05, n.s., n.s.). For α-helix parts: middle, next to coil only, next to β-sheet (ratios in plot order: 0.97, 1.01, 2.60; p = 4·10^-5). Schematic of helix positions N’’, N’, Ncap, Ccap, C’, C’’ with the adjacent turn. Normalized difference (Glass' delta, -0.3 to 0.2) between mutated sites in oncogenes vs. matched set, FDR < 10%, for: relative preference value at C-cap; normalized frequency of turn in all-α class; α-helix indices for α-proteins; relative preference value at N'; relative preference value at N''; normalized frequency of α-helix in all-α class.]
...also in cancer: we observe an enrichment of synonymous mutations at N-termini of alpha-helices, esp. if close to beta-sheets. Suggestive of effects on folding.
known
novel
TP53 gene has a large excess of synonymous mutations, which are always near splice sites.
We found three examples of recurrent SNVs that inactivate the nearby splice site.
causes a frameshift
Dosage sensitive oncogenes have many point mutations in their 3' UTRs
Take-home messages:
• oncogenes contain an excess of synonymous mutations in human cancers
• a subset of synonymous mutations target splicing motifs
• 1/5 to 1/2 of the synonymous mutations in oncogenes reported to date are acting as driver mutations
• ~6-8% of all driver mutations due to single nucleotide changes are likely to be synonymous mutations
• TP53 has recurrent synonymous mutations that disrupt splice sites
• an excess of mutations in 3’ UTRs of dosage-sensitive genes
published in: Supek et al. (2014) Cell. http://dx.doi.org/10.1016/j.cell.2014.01.051
Thank you!