Getting genomics and proteomics data to work together - Jason Wong
DESCRIPTION
Genomics and proteomics are closely related fields of research. An understanding of one is generally required for the other, yet in many ways the methods used to study the two could not be more different. With the emergence of massively parallel sequencing, vast quantities of genomics and transcriptomics data are being generated. At the same time, improvements in mass spectrometry technologies are enabling proteins to be identified with greater specificity and sensitivity. This provides new opportunities to investigate ways to integrate genomics and proteomics data and to understand how the two can complement each other to advance biological knowledge. Using HeLa cells as a model system, we have comprehensively examined the gene models derived from genomics and transcriptomics data and integrated these with proteomics and phosphoproteomics datasets. Reanalysis of proteomics data using HeLa-specific gene models significantly increases the number of peptides and proteins identified, providing new insights into both the genome and proteome of HeLa cells. Technical challenges and methods required for integrating genomics and proteomics data will also be discussed. In summary, given that massively parallel sequencing data are now available for many popular cell lines in public data repositories, our study provides further support for the need for, and benefit of, an integrative data analysis for both genome and proteome analysis.
TRANSCRIPT
Getting genomics and proteomics data to work together
Prince of Wales Clinical School
Dr Jason Wong
State of the art in proteomics
Proteomics can now be used to identify and quantify tens of thousands of proteins in a single experiment.
Nagaraj et al Mol Sys Biol 2011
HeLa cells: 10,255 proteins identified
Zhou et al Nat. Comm. 2013
mESC: 11,352 proteins identified
Mertins et al Nat. Met. 2013
Jurkat cells: 7,897 proteins
Challenges of proteomics
• Experimental perspective
• Obtaining sufficient sample
• Sample preparation
• Dynamic range
• Computational perspective
• Risk of false positive identification
• General methods only identify known proteins
Wong et al BMC Bioinf 2007
• Annotated spectra (~30%)
• High-quality, potentially annotatable spectra (~20%)
• Non-peptide/low-quality spectra (~50%)
Genomics and transcriptomics
Analysis of DNA/RNA does not have many of the limitations of proteomics,
especially with the emergence of next-generation sequencing (NGS).
• Sample quantity is less of an issue when analysing DNA/RNA.
• Very large dynamic range with NGS.
• Relatively simple sequence-based data analysis.
Next-generation sequencing has allowed the discovery of:
1. Single nucleotide variants/indels (Exome-seq/RNA-seq)
2. Novel splice variants (RNA-seq)
3. Novel proteins (Ribosome profiling)
However, in order to understand the functional importance of coding
genes, it is still essential to study them at the protein level.
Single nucleotide variants – Jurkat cells
Datasets
Experiment                 Details                    Reference
Exome-seq                  ~150 M, 100 bp PE reads    Broad Institute, CCLE
RNA-seq                    ~100 M, 100 bp PE reads    Sheynkman et al, MCP (2013)
Proteomics deep            ~0.5 M spectra             Sheynkman et al, MCP (2013)
Proteomics ultra-deep      ~2.5 M spectra             Mertins et al, Nat Meth (2013)
Proteomics ultra-deep PTM  pSTY: ~0.85 M spectra      Mertins et al, Nat Meth (2013)
                           (ac)K: ~0.35 M spectra
                           (ubi)K: ~0.36 M spectra
Searching for peptides with SNVs
Exome/RNA-seq:
BAM → (GATK/samtools) → VCF → (ANNOVAR) → Annotated variants → (Python scripts) → Variant peptides
Proteomics:
Mass spectra, searched using MaxQuant against RefSeq + variant peptides → Annotated mass spectra
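The "Python scripts" step of this workflow is not shown in the slides. A minimal sketch of what such a step might do is given below, assuming that each annotated variant is available as a protein-level change (for example p.G4S) together with the reference protein sequence. The function names and the toy protein are hypothetical, and the digestion rule (cleave after K or R unless followed by P, no missed cleavages) is a common simplification rather than the exact setting used in the study.

```python
import re


def tryptic_digest(protein):
    """Return (start_index, peptide) pairs for a simple in silico tryptic digest.

    Assumed rule: cleave after K or R unless the next residue is P,
    with no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    return peptides


def parse_protein_change(change):
    """Parse a single amino acid substitution such as 'p.A86T'."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unsupported change: {change}")
    return match.group(1), int(match.group(2)), match.group(3)


def variant_peptide(protein, change):
    """Return the tryptic peptide of the mutated protein that carries the variant."""
    ref, pos, alt = parse_protein_change(change)
    index = pos - 1  # protein positions are 1-based
    if protein[index] != ref:
        raise ValueError("reference residue does not match the protein sequence")
    mutated = protein[:index] + alt + protein[index + 1:]
    # Digest the mutated sequence so that variants creating or removing
    # a cleavage site (K/R) are handled naturally.
    for start, peptide in tryptic_digest(mutated):
        if start <= index < start + len(peptide):
            return peptide
    raise ValueError("variant position not covered by any peptide")


# Hypothetical toy protein, for illustration only.
toy_protein = "MKAGSPLRTEWK"
toy_variant_peptide = variant_peptide(toy_protein, "p.G4S")
```

Variant peptides produced this way could then be appended to the RefSeq protein database before the MaxQuant search, which is the role the "+" plays in the workflow.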
Variants – Overlap between Exome- and RNA-seq
Non-synonymous (n.s.) variants:
Exome-seq only: 8,232
Shared: 4,584
RNA-seq only: 1,975
Almost 70% of RNA-seq n.s. variants overlap with Exome-seq.
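The overlap figures can be reproduced with simple set arithmetic. The sketch below assumes each variant call is reduced to a (chromosome, position, ref, alt) key; the toy calls are hypothetical, while the final calculation uses the counts reported on the slide.

```python
def overlap_summary(exome_variants, rnaseq_variants):
    """Summarise the overlap between two collections of variant keys."""
    exome = set(exome_variants)
    rnaseq = set(rnaseq_variants)
    shared = exome & rnaseq
    return {
        "exome_only": len(exome - rnaseq),
        "shared": len(shared),
        "rnaseq_only": len(rnaseq - exome),
        # Fraction of RNA-seq variants that are also called from Exome-seq.
        "rnaseq_fraction_shared": len(shared) / len(rnaseq) if rnaseq else 0.0,
    }


# Hypothetical toy calls keyed by (chromosome, position, ref, alt).
exome_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")]
rnaseq_calls = [("chr1", 250, "C", "T"), ("chr2", 40, "G", "A"), ("chr3", 7, "T", "C")]
summary = overlap_summary(exome_calls, rnaseq_calls)

# Counts from the slide: 4,584 shared and 1,975 RNA-seq only.
slide_fraction = 4584 / (4584 + 1975)  # about 0.70, i.e. "almost 70%"
```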
Variants – Overlap with proteomics data
Variants with peptides identified, by proteomics dataset:
Source (total variants)   Mertins only   Both datasets   Sheynkman only
RNA-seq (6,559)           638            349             99
Exome-seq (12,816)        525            290             81
• Suggests that RNA-seq may be better suited than Exome-seq for finding variants in proteomics data.
• However, this may also simply be due to data quality issues.
RNA-seq based variants validated by mass spectrometry:
Dataset     Variant peptides   Total peptides
Mertins     987                156,606
Sheynkman   448                75,878
Validation of peptide identifications
Zygosity       Variant peptides   Reference peptides
Heterozygous   673                465
Homozygous     314                4
Chr    Pos       Ref  Alt  Zygosity  Qual  Depth  Depth Alt  Func.refGene  Gene.refGene
chr10  51363659  A    G    1         222   222    200        exonic        PARG
chr10  71906150  T    C    1         44.8  2      2          exonic        TYSND1
chr11  67414492  C    T    1         43.8  2      2          exonic        ACY3
chr19  3492265   G    A    1         59    3      3          exonic        DOHH
[Figure: scatter plot of variant abundance (log10) against reference abundance (log10), Mertins dataset; r² = 0.19]
Validation of peptide identifications
[Figure: MaxQuant score for variant peptides, reference peptides and all peptides; differences marked n.s.]
[Figure: % variants by read depth (bins of 50 from 0-50 to 950-1000, plus >1000), comparing variants identified by MS with all variants]
Variants in application to PTMs
Modification              Variant peptides   Total peptides
Phosphorylation STY(p)    357                64,067
Acetylation K(Ac)         2                  5,805
Ubiquitination K(GG)      172                38,454
How does the variant affect phosphorylation?
(1) Variant residue not affecting phosphorylation site: 95% (339), e.g. EILpSPQ(W/C)Y
(2) Variant residue is phosphorylation site: 3.4% (12), e.g. EIL(A/pS)PQWY
(3) Variant residue may influence phosphorylation: 1.6% (6), e.g. EILpS(G/P)QWY
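The three categories can be expressed as a small classification rule on residue positions within a peptide. This is only a sketch: the slides do not state how "may influence" was defined, so the one-residue flank used here is an assumption chosen to match the three example peptides, and the function name is hypothetical.

```python
def classify_variant(variant_pos, phospho_pos, flank=1):
    """Classify a variant residue relative to a phosphorylation site.

    Positions are 1-based indices within the same peptide or protein.
    `flank` is an ASSUMED window; the slides do not give the criterion used.
    """
    if variant_pos == phospho_pos:
        return "variant residue is phosphorylation site"
    if abs(variant_pos - phospho_pos) <= flank:
        return "variant residue may influence phosphorylation"
    return "variant residue not affecting phosphorylation site"


# The example peptide EILSPQWY has the phosphorylated serine at position 4.
example_1 = classify_variant(variant_pos=7, phospho_pos=4)  # (W/C) at position 7
example_2 = classify_variant(variant_pos=4, phospho_pos=4)  # (A/pS) at position 4
example_3 = classify_variant(variant_pos=5, phospho_pos=4)  # (G/P) at position 5
```

A wider flank, or a motif-based rule such as checking for a proline at the +1 position, would be a reasonable alternative depending on the kinases of interest.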
Phosphorylation sites directly affected by variants
Gene      RefSeq        Variant   Peptide                          SNP ID       SIFT score  PolyPhen score
Creation of phospho-site (X → S/T):
GBF1      NM_001199378  p.G1690S  GGSPSALWEITWER                   rs11191274   0.75        0.703
LINS      NM_001040616  p.R680S   EFSLEPPSSPLVLK                   rs8451       0.36        0.023
TADA2A    NM_001166105  p.P6S     LGSFSNDPSDKPPCR                  rs7211875    1           0
TCF3      NM_001136139  p.A8S     MSPVGTDKELSDLLDFSMMFPLPVTNGK     rs147133056  0.05        0.997
USE1      NM_018467     p.L154S   TGVAGSQPVSEKQSAAELDLVLQR         rs414528     0.82        0
ATP5SL    NM_001167867  p.N40S    LGAAVAPEGSQKK                    rs2231940    0.82        0.018
TANC1     NM_001145909  p.N250S   SGSSLEWNKDGSLR                   rs12466551   1           0
SF3B1     NM_001005526  p.A86T    KPGYHTPVALLNDIPQSTEQYDPFAEHRPPK  NA           0.01        0.955
DDX27     NM_017895     p.G766S   QYRASPSFEER                      rs1130146    0.37        0.983
FAM114A2  NM_018691     p.G122S   AETSLGIPSPSEISTEVK               rs2578377    1           0
SPDL1     NM_017785     p.L586S   SHPILYVSSK                       rs3777084    0.89        0
GPANK1    NM_033177     p.A78S    IMKSPAAEAVAEGASGR                NA           0.72        0.005
Creation of MAPK motif (S/T-X → S/T-P):
PARG      NM_003631     p.L138P   LENVSQLSLDKSPTEK                 rs4412715    NA          NA
KIAA0586  NM_001244193  p.L703P   EASPPPVQTWIK                     rs1748986    0.22        0.001
DMXL2     NM_001174116  p.S1288P  FGDTEADSPNAEEAAMQDHSTFK          rs12102203   0.21        0
GGA2      NM_015044     p.A424P   NLLDLLSPQPAPCPLNYVSQK            rs1135045    0.48        0
PFAS      NM_012393     p.L621P   NGQGDAPPTPPPTPVDLELEWVLGK        rs11078738   1           0
ZNF235    NM_004234     p.H296P   SPACSTPEKDTSYSSGIPVQQSVR         rs2125579    0.02        0.001
Splicing factor 3B subunit 1 (SF3B1)
• Part of the RNA splicing machinery
• Frequently mutated in myelodysplasia (~20%) and other leukaemias
• A86T has been identified previously in one lung cancer sample from
TCGA.
• SF3B1 is a phosphoprotein and phosphorylation of SF3B1 is known to
be important for the assembly of the RNA splicing machinery.
A86T is located in exon 3
Using mass spectrometry to discover new proteins
Prediction of alternative ORFs from RNA-seq
• 86 alternative ORFs
• 57% non-AUG translation initiation
Computational prediction of alternative ORFs
• Only AUG translation initiation
• 1,259 alternative ORFs!
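The computational prediction restricted to AUG initiation can be illustrated with a simple three-frame scan of a transcript sequence. This is a minimal sketch rather than the method used in the talk: the function name, the `min_aa` length cut-off and the toy transcript are assumptions, and separating alternative ORFs from the annotated coding sequence would additionally require the annotated CDS coordinates.

```python
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_aug_orfs(transcript, min_aa=2):
    """Find ORFs that start at AUG and end at an in-frame stop codon.

    Returns sorted (start, end, frame, protein) tuples with 0-based,
    end-exclusive coordinates; `end` includes the stop codon.
    Only the first AUG upstream of each stop is used in each frame.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif CODON_TABLE.get(codon) == "*":
                # Translate from the start codon up to (not including) the stop.
                protein = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, i, 3)
                )
                if len(protein) >= min_aa:
                    orfs.append((start, i + 3, frame, protein))
                start = None
    return sorted(orfs)


# Hypothetical toy transcript containing one short ORF (ATG GCT TAA).
toy_orfs = find_aug_orfs("CCATGGCTTAAG")
```

Allowing near-cognate start codons such as TTG or GTG would mean replacing the `codon == "ATG"` test with a membership check against a set of permitted initiation codons.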
[Figure: posterior error probability (PEP, axis 0.00 to 0.06) for aORF and rORF identifications in the Serum, HeLa and HeLa < 10 kDa datasets; comparisons marked with significance asterisks]
Ribosome profiling
Ingolia et al. Cell 2011
• Sequence only mRNA protected by ribosomes
• Results in identification of mRNA that is being translated
• Use of translation inhibitors enables discovery of new ORFs
Application to HEK293T cells
Lee et al PNAS 2012
o Reported 12,814 alternative ORFs in RefSeq genes
o Using their annotation, constructed a database of proteins arising from these alternative ORFs.
o Searched publicly available HEK293T datasets against this database
• Geiger et al MCP 2012, ~0.5 million spectra
RefSeq Accession  Gene Symbol  Relative to rTIS  Annotation  Frame  ORF length  Codon  Peptide count
NM_019008         SMCR7L       -311              5'UTR       1      213         ATG    3
NM_080670         SLC35A4      -719              5'UTR       1      312         ATG    2
NM_001142726      C1orf122     -609              5'UTR       0      753         TTG    2
NM_004860         FXR2         -219              5'UTR       0      2241        GTG    2
Identified novel proteins
NM_080670_SLC35A4_10_5'UTR (chr5:139,944,429-139,946,345)
NM_019008_SMCR7L_188_5'UTR (chr22:39,900,236-39,900,445)
[Figure: percentage of regions by conservation (0.0 to 1.0) for 5'UTR and coding regions]
(11 novel ORFs in total)
Conclusion and next steps
Variants
• Analyse indels and more datasets
Novel proteins
• Develop methods to analyse ribosome profiles in unannotated genomic regions.
• Generate HEK293T peptidomics (< 10 kDa) dataset
In terms of bioinformatics:
• Automate integration of transcriptomics and proteomics data.
• Methods to visualise identified peptides on genome browsers.
• Develop new MS-based search methods for directly finding variant peptides.
Acknowledgements
Bioinformatics and Integrative genomics team
• Dr Ranjeeta Menon
• Dr Dominik Beck
• John Ng
• Jackie Huang
• Kate Guan
• Felix Ma
• Dilmi Perera
• Diego Chacon
Carnegie Institution for Science, Baltimore, USA
• Dr Nicolas Ingolia
Funding: