getting genomics and proteomics data to work together - jason wong

Getting genomics and proteomics data to work together

Prince of Wales Clinical School

Dr Jason Wong

State of the art in proteomics

Proteomics can now be used to identify and quantify tens of thousands of proteins in a single experiment.

Nagaraj et al Mol Sys Biol 2011

HeLa cells: 10,255 proteins identified

Zhou et al Nat. Comm. 2013

mESC: 11,352 proteins identified

Mertins et al Nat. Met. 2013

Jurkat cells: 7,897 proteins

Challenges of proteomics

• Experimental perspective

• Obtaining sufficient sample

• Sample preparation

• Dynamic range

• Computational perspective

• Risk of false positive identification

• General methods identify only known proteins

Wong et al BMC Bioinf 2007

• Annotated spectra (~30%)

• High quality, potentially annotatable spectra (~20%)

• Non-peptide/low quality spectra (~50%)

Genomics and transcriptomics

Analysis of DNA/RNA does not have many of the limitations of proteomics,

especially with the emergence of next-generation sequencing (NGS).

• Sample quantity is less of an issue when analysing DNA/RNA.

• Very large dynamic range with NGS.

• Relatively simple sequence-based data analysis.

Next-generation sequencing has allowed the discovery of:

1. Single nucleotide variants/indels (Exome-seq/RNA-seq)

2. Novel splice variants (RNA-seq)

3. Novel proteins (Ribosome profiling)

However, in order to understand the functional importance of coding

genes, it is still essential to study them at the protein level.

Single nucleotide variants – Jurkat cells

Datasets

Experiment                  Details                      Reference
Exome-seq                   ~150 M, 100 bp PE reads      Broad Institute, CCLE
RNA-seq                     ~100 M, 100 bp PE reads      Sheynkman et al MCP (2013)
Proteomics deep             ~0.5 M spectra               Sheynkman et al MCP (2013)
Proteomics ultra-deep       ~2.5 M spectra               Mertins et al Nat Meth (2013)
Proteomics ultra-deep PTM   pSTY: ~0.85 M spectra        Mertins et al Nat Meth (2013)
                            (ac)K: ~0.35 M spectra
                            (ubi)K: ~0.36 M spectra

Searching for peptides with SNVs

Exome/RNA-seq: BAM -> (GATK/samtools) -> VCF -> (ANNOVAR) -> annotated variants -> (Python scripts) -> variant peptides

Proteomics: mass spectra -> search using MaxQuant against RefSeq + variant peptides -> annotated mass spectra
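The slides do not describe the "Python scripts" step further. As a minimal sketch of one way an annotated amino-acid substitution could be turned into a searchable variant peptide, the code below applies the substitution and digests the mutated protein in silico. The trypsin rule (cleave after K/R, not before P), the absence of missed cleavages, and the function names `tryptic_peptides` and `variant_peptide` are all assumptions for illustration, not the author's implementation.

```python
import re


def tryptic_peptides(protein):
    """Split a protein into fully tryptic peptides.

    Assumed rule: cleave after K or R unless the next residue is P.
    Returns (start_index, peptide) tuples; missed cleavages are not modelled.
    """
    peptides, start = [], 0
    for match in re.finditer(r"[KR](?!P)", protein):
        end = match.end()
        peptides.append((start, protein[start:end]))
        start = end
    if start < len(protein):
        peptides.append((start, protein[start:]))
    return peptides


def variant_peptide(protein, ref_aa, position, alt_aa):
    """Apply a substitution (1-based position) and return the tryptic
    peptide of the mutated protein that covers the changed residue."""
    index = position - 1
    if protein[index] != ref_aa:
        raise ValueError("reference residue does not match the protein sequence")
    mutated = protein[:index] + alt_aa + protein[index + 1:]
    for start, peptide in tryptic_peptides(mutated):
        if start <= index < start + len(peptide):
            return peptide
    raise AssertionError("every residue should fall inside one peptide")


# Hypothetical toy protein, not taken from the datasets in the talk.
print(variant_peptide("MAKLGSAEKTPR", "A", 7, "T"))
```

Because the digest is run on the mutated sequence, a variant that creates or removes a K/R cleavage site changes the peptide boundaries accordingly.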

Variants – Overlap between Exome- and RNA-seq

Non-synonymous variants: Exome-seq only 8232; both 4584; RNA-seq only 1975

Almost 70% of RNA-seq non-synonymous (n.s.) variants overlap with Exome-seq.
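The "almost 70%" figure follows directly from the Venn counts. The short check below assumes the three numbers are read as Exome-seq only, shared, and RNA-seq only, which is consistent with the totals of 12816 and 6559 quoted on the next slide.

```python
# Venn counts from the slide (assumed order: Exome-seq only, shared, RNA-seq only).
exome_only, shared, rnaseq_only = 8232, 4584, 1975

exome_total = exome_only + shared      # 12816
rnaseq_total = shared + rnaseq_only    # 6559

rnaseq_overlap = shared / rnaseq_total
exome_overlap = shared / exome_total

print(f"{rnaseq_overlap:.1%} of RNA-seq variants are also in Exome-seq")  # 69.9%
print(f"{exome_overlap:.1%} of Exome-seq variants are also in RNA-seq")   # 35.8%
```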

Variants – Overlap with proteomics data

RNA-seq (total variants 6559): Mertins dataset only 638; both datasets 349; Sheynkman dataset only 99

Exome-seq (total variants 12816): Mertins dataset only 525; both datasets 290; Sheynkman dataset only 81

• Suggests that RNA-seq may be more suited for finding variants in proteomics data.

• However, this may also just be due to data quality issues.

RNA-seq based variants validated by mass spectrometry

           Variant peptides  Total peptides
Mertins    987               156,606
Sheynkman  448               75,878

Validation of peptide identifications

              Variant peptides  Reference peptides
Heterozygous  673               465
Homozygous    314               4

Chr    Pos       Ref  Alt  Zygosity  Qual  Depth  Depth Alt  Func.refGene  Gene.refGene
chr10  51363659  A    G    1         222   222    200        exonic        PARG
chr10  71906150  T    C    1         44.8  2      2          exonic        TYSND1
chr11  67414492  C    T    1         43.8  2      2          exonic        ACY3
chr19  3492265   G    A    1         59    3      3          exonic        DOHH
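For a truly homozygous variant no reference peptide is expected, so the reference-peptide column acts as an internal check on the identifications. The sketch below computes the two detection rates from the table and then flags the four listed calls, assuming they are the four homozygous variants for which a reference peptide was also seen. The `MIN_DEPTH` threshold of 10 reads and the helper name `review_call` are illustrative assumptions, not values from the talk.

```python
# Counts from the validation table: (variant peptides, reference peptides).
heterozygous = (673, 465)
homozygous = (314, 4)

het_ref_fraction = heterozygous[1] / heterozygous[0]
hom_ref_fraction = homozygous[1] / homozygous[0]
print(f"heterozygous with reference peptide: {het_ref_fraction:.1%}")  # 69.1%
print(f"homozygous with reference peptide:   {hom_ref_fraction:.1%}")  # 1.3%

# (gene, read depth, reads supporting the alternate allele) from the four rows.
calls = [("PARG", 222, 200), ("TYSND1", 2, 2), ("ACY3", 2, 2), ("DOHH", 3, 3)]

MIN_DEPTH = 10  # assumed cut-off for a trustworthy zygosity call


def review_call(depth, alt_depth, min_depth=MIN_DEPTH):
    """Give a reason to doubt a homozygous call, or 'ok' if none is found."""
    if depth < min_depth:
        return "low depth"
    if alt_depth < depth:
        return "reference reads present"
    return "ok"


flags = {gene: review_call(depth, alt) for gene, depth, alt in calls}
print(flags)
```

Under these assumptions all four calls are flagged: three have a read depth of 2 or 3, and PARG has 22 of 222 reads that do not support the alternate allele.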

[Figure: scatter plot of variant abundance (log10) against reference abundance (log10), Mertins dataset; r2 = 0.19]
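The slide does not say how r2 was calculated. A minimal sketch, assuming it is the squared Pearson correlation between log10-transformed abundances of paired reference and variant peptides, is given below. The function names and the toy intensities are hypothetical.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def log10_r_squared(reference_abundance, variant_abundance):
    """r2 after log10-transforming both abundance vectors."""
    log_ref = [math.log10(value) for value in reference_abundance]
    log_var = [math.log10(value) for value in variant_abundance]
    return r_squared(log_ref, log_var)


# Hypothetical paired intensities, not values from the Mertins dataset.
print(log10_r_squared([1e5, 3e6, 2e7, 8e8], [4e6, 2e5, 9e7, 1e7]))
```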

Validation of peptide identifications

[Figure: MaxQuant score (axis 0 to 250) for variant peptides, reference peptides and all peptides; comparisons marked n.s.]

[Figure: % variants (axis 0.0 to 0.5) by read depth, in bins of 50 from 0-50 to 950-1000 plus >1000, for variants identified by MS and for all variants]
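The read-depth comparison can be reproduced by binning variants into the same 50-read intervals and computing the fraction in each bin, once for all variants and once for those identified by MS. The sketch below assumes each bin includes its lower bound and that a depth of exactly 1000 falls in the last closed bin; the function names and the toy depths are hypothetical.

```python
from collections import Counter


def depth_bin(depth, width=50, cap=1000):
    """Label a read depth with its bin, e.g. 0-50, 50-100, ..., 950-1000, >1000."""
    if depth > cap:
        return f">{cap}"
    lower = min((depth // width) * width, cap - width)
    return f"{lower}-{lower + width}"


def depth_distribution(depths):
    """Fraction of variants falling in each read-depth bin."""
    counts = Counter(depth_bin(depth) for depth in depths)
    total = len(depths)
    return {label: count / total for label, count in counts.items()}


# Hypothetical read depths for illustration only.
all_variants = [3, 12, 48, 75, 140, 410, 1500]
ms_identified = [75, 140, 410, 1500]
print(depth_distribution(all_variants))
print(depth_distribution(ms_identified))
```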

Variants in application to PTMs

                        Variant peptides  Total peptides
Phosphorylation STY(p)  357               64,067
Acetylation K(Ac)       2                 5,805
Ubiquitination K(GG)    172               38,454

How does the variant affect phosphorylation?

(1) Variant residue not affecting phosphorylation site: 95% (339), e.g. EILpSPQ(W/C)Y

(2) Variant residue is phosphorylation site: 3.4% (12), e.g. EIL(A/pS)PQWY

(3) Variant residue may influence phosphorylation: 1.6% (6), e.g. EIL(p)S(G/P)QWY
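The three categories can be expressed as a simple rule on the distance between the variant residue and the phosphorylated residue within the peptide. The sketch below uses the example peptide EILSPQWY with the phosphosite at position 4. Treating only immediately adjacent residues as "may influence" (`flank=1`) is an assumption, as is the function name.

```python
def classify_variant(variant_pos, phospho_pos, flank=1):
    """Classify a variant by its distance (in residues) from a phosphosite.

    Positions are 1-based within the peptide. `flank` is the assumed number
    of neighbouring residues considered able to influence phosphorylation.
    """
    distance = abs(variant_pos - phospho_pos)
    if distance == 0:
        return "variant residue is phosphorylation site"
    if distance <= flank:
        return "variant residue may influence phosphorylation"
    return "variant residue not affecting phosphorylation site"


# Example peptide EILSPQWY, phosphorylated at S4.
print(classify_variant(7, 4))  # (1) W/C at position 7
print(classify_variant(4, 4))  # (2) A/S at position 4
print(classify_variant(5, 4))  # (3) G/P at position 5
```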

Phosphorylation sites directly affected by variants

Gene Refseq Variant Peptide SNP ID SIFT score Polyphen score

GBF1 NM_001199378 p.G1690S GGSPSALWEITWER rs11191274 0.75 0.703

LINS NM_001040616 p.R680S EFSLEPPSSPLVLK rs8451 0.36 0.023

TADA2A NM_001166105 p.P6S LGSFSNDPSDKPPCR rs7211875 1 0

TCF3 NM_001136139 p.A8S MSPVGTDKELSDLLDFSMMFPLPVTNGK rs147133056 0.05 0.997

USE1 NM_018467 p.L154S TGVAGSQPVSEKQSAAELDLVLQR rs414528 0.82 0

ATP5SL NM_001167867 p.N40S LGAAVAPEGSQKK rs2231940 0.82 0.018

TANC1 NM_001145909 p.N250S SGSSLEWNKDGSLR rs12466551 1 0

SF3B1 NM_001005526 p.A86T KPGYHTPVALLNDIPQSTEQYDPFAEHRPPK NA 0.01 0.955

DDX27 NM_017895 p.G766S QYRASPSFEER rs1130146 0.37 0.983

FAM114A2 NM_018691 p.G122S AETSLGIPSPSEISTEVK rs2578377 1 0

SPDL1 NM_017785 p.L586S SHPILYVSSK rs3777084 0.89 0

GPANK1 NM_033177 p.A78S IMKSPAAEAVAEGASGR NA 0.72 0.005

PARG NM_003631 p.L138P LENVSQLSLDKSPTEK rs4412715 NA NA

KIAA0586 NM_001244193 p.L703P EASPPPVQTWIK rs1748986 0.22 0.001

DMXL2 NM_001174116 p.S1288P FGDTEADSPNAEEAAMQDHSTFK rs12102203 0.21 0

GGA2 NM_015044 p.A424P NLLDLLSPQPAPCPLNYVSQK rs1135045 0.48 0

PFAS NM_012393 p.L621P NGQGDAPPTPPPTPVDLELEWVLGK rs11078738 1 0

ZNF235 NM_004234 p.H296P SPACSTPEKDTSYSSGIPVQQSVR rs2125579 0.02 0.001

First 12 rows (GBF1 to GPANK1): creation of phospho-site (X -> S/T)

Last 6 rows (PARG to ZNF235): creation of MAPK motif (S/T-X -> S/T-P)
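The two groups in the table can be separated by looking only at the substituted residue and the residue immediately before it. The sketch below is an illustrative rule under that assumption. It treats any S/T followed by a new proline as a candidate MAPK-type motif and does not model kinase specificity beyond that; the function name is hypothetical.

```python
def variant_effect(ref_aa, alt_aa, preceding_aa):
    """Describe how a substitution could create a phosphorylation site or motif.

    ref_aa / alt_aa: reference and variant residue (one-letter codes).
    preceding_aa: residue immediately N-terminal to the variant position.
    """
    acceptors = ("S", "T")
    if alt_aa in acceptors and ref_aa not in acceptors:
        return "creates phospho-site"
    if alt_aa == "P" and ref_aa != "P" and preceding_aa in acceptors:
        return "creates S/T-P motif"
    return "other"


# SF3B1 p.A86T: alanine replaced by threonine.
print(variant_effect("A", "T", "K"))
# PARG p.L138P: proline introduced after the serine in ...KSPTEK.
print(variant_effect("L", "P", "S"))
```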

Splicing factor 3B subunit 1 (SF3B1)

• Part of the RNA splicing machinery

• Frequently mutated in myelodysplasia (~20%) and other leukaemias

• A86T has been identified previously in one lung cancer sample from TCGA.

• SF3B1 is a phosphoprotein and phosphorylation of SF3B1 is known to

be important for the assembly of the RNA splicing machinery.

A86T is located in exon 3

Using mass spectrometry to discover new proteins

Prediction of alternative ORFs from RNA-seq

• 86 alternative ORFs

• 57% non-AUG translation initiation

Computational prediction of alternative ORFs

• Only AUG translation initiation

• 1,259 alternative ORFs!
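The difference between the two approaches comes down to which start codons are allowed. A minimal sketch of an ORF scan over a transcript sequence is given below: it starts at each permitted start codon and extends in frame to the first stop codon. The function name, the `min_codons` length filter and the toy sequences are assumptions; neither published method is reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, start_codons=("ATG",), min_codons=10):
    """Return (start, end) coordinates (0-based, end-exclusive) of ORFs.

    An ORF runs from a permitted start codon to the first in-frame stop
    codon, stop included. `min_codons` counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


# Toy sequences for illustration only.
print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))
# Allowing a non-AUG start (CTG) finds an ORF that an AUG-only scan misses.
print(find_orfs("GGCTGAAATAA", min_codons=2))
print(find_orfs("GGCTGAAATAA", start_codons=("ATG", "CTG"), min_codons=2))
```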

[Figure: Posterior Error Probability (PEP), axis 0.00 to 0.06, for aORF and rORF peptides in the Serum, HeLa and HeLa < 10 kDa datasets; comparisons marked with asterisks]

Ribosome profiling

Ingolia et al. Cell 2011

• Sequence only mRNA protected by ribosomes

• Results in identification of mRNA that is being translated

• Use of translation inhibitors enables discovery of new ORFs

Application to HEK293T cells

Lee et al PNAS 2012

o Reported 12,814 alternative ORFs in RefSeq genes

o Using their annotation, constructed a database of proteins arising from these alternative ORFs.

o Searched against publicly available HEK293T datasets

• Geiger et al MCP 2012, ~0.5 million spectra

Identified novel proteins

RefSeq Accession  Gene Symbol  Relative to rTIS  Annotation  Frame  ORF length  Codon  Peptide count
NM_019008         SMCR7L       -311              5'UTR       1      213         ATG    3
NM_080670         SLC35A4      -719              5'UTR       1      312         ATG    2
NM_001142726      C1orf122     -609              5'UTR       0      753         TTG    2
NM_004860         FXR2         -219              5'UTR       0      2241        GTG    2

NM_080670_SLC35A4_10_5'UTR (chr5:139,944,429-139,946,345)

NM_019008_SMCR7L_188_5'UTR (chr22:39,900,236-39,900,445)

[Figure: percentage of regions (axis 0.0 to 0.4) against conservation (axis 0.0 to 1.0), 5'UTR versus coding]

(11 novel ORFs in total)

Conclusion and next steps

Variants

• Analyse indels and more datasets

Novel proteins

• Develop methods to analyse ribosome profiles in unannotated genomic regions.

• Generate HEK293T peptidomics (< 10kDa) dataset

In terms of bioinformatics:

• Automate integration of transcriptomics and proteomics data.

• Methods to visualise identified peptides on genome browsers.

• Develop new MS-based search methods for directly finding variant peptides.

Acknowledgements

Bioinformatics and Integrative genomics team

• Dr Ranjeeta Menon

• Dr Dominik Beck

• John Ng

• Jackie Huang

• Kate Guan

• Felix Ma

• Dilmi Perera

• Diego Chacon

Carnegie Institution for Science, Baltimore, USA

• Dr Nicolas Ingolia

Funding:
