getting genomics and proteomics data to work together - jason wong

Getting genomics and proteomics data to work together Prince of Wales Clinical School Dr Jason Wong

Upload: australian-bioinformatics-network

Post on 07-Jul-2015


DESCRIPTION

Genomics and proteomics are closely related fields of research. An understanding of one is generally required for the other, yet in many ways the methods used to study the two could not be more different. With the emergence of massively parallel sequencing, vast quantities of genomics and transcriptomics data are being generated. At the same time, improvements in mass spectrometry technologies are enabling proteins to be identified with greater specificity and sensitivity. This provides new opportunities to investigate ways to integrate genomics and proteomics data and to understand how the two can complement each other to advance biological knowledge. Using HeLa cells as a model system, we have comprehensively examined the gene models derived from genomics and transcriptomics data and integrated these with proteomics and phosphoproteomics datasets. Reanalysis of proteomics data using HeLa-specific gene models enables significant increases in the number of peptides and proteins identified, providing new insights into both the genome and proteome of HeLa cells. Technical challenges and methods required for integrating genomics and proteomics data will also be discussed. In summary, given that massively parallel sequencing data are now available for many popular cell lines in public data repositories, our study provides further support for the need for, and benefit of, integrative data analysis in both genome and proteome analysis.

TRANSCRIPT

Page 1: Getting genomics and proteomics data to work together - Jason Wong

Getting genomics and proteomics data to work together

Prince of Wales Clinical School

Dr Jason Wong

Page 2: Getting genomics and proteomics data to work together - Jason Wong

State of the art in proteomics

Proteomics can now be used to identify and quantify tens of thousands of proteins in a single experiment.

Nagaraj et al Mol Sys Biol 2011

HeLa cells: 10,255 proteins identified

Zhou et al Nat. Comm. 2013

mESC: 11,352 proteins identified

Mertins et al Nat. Met. 2013

Jurkat cells: 7,897 proteins

Page 3: Getting genomics and proteomics data to work together - Jason Wong

Challenges of proteomics

• Experimental perspective

• Obtaining sufficient sample

• Sample preparation

• Dynamic range

• Computational perspective

• Risk of false positive identification

• General methods identify only known proteins

Wong et al BMC Bioinf 2007

Annotated spectra (~30%)

High quality potentially annotatable spectra (~20%)

Non-peptide/low quality spectra (~50%)

Page 4: Getting genomics and proteomics data to work together - Jason Wong

Genomics and transcriptomics

Analysis of DNA/RNA does not have many of the limitations of proteomics, especially with the emergence of next-generation sequencing (NGS).

• Sample quantity is less of an issue when analysing DNA/RNA.

• Very large dynamic range with NGS.

• Relatively simple sequence-based data analysis.

Next-generation sequencing has allowed the discovery of:

1. Single nucleotide variants/indels (Exome-seq/RNA-seq)

2. Novel splice variants (RNA-seq)

3. Novel proteins (Ribosome profiling)

However, in order to understand the functional importance of coding genes, it is still essential to study them at the protein level.

Page 5: Getting genomics and proteomics data to work together - Jason Wong

Single nucleotide variants – Jurkat cells

Datasets

Experiment                  Details                    Reference
Exome-seq                   ~150 M, 100 bp PE reads    Broad Institute, CCLE
RNA-seq                     ~100 M, 100 bp PE reads    Sheynkman et al MCP (2013)
Proteomics deep             ~0.5 M spectra             Sheynkman et al MCP (2013)
Proteomics ultra-deep       ~2.5 M spectra             Mertins et al Nat Meth (2013)
Proteomics ultra-deep PTM   pSTY: ~0.85 M spectra      Mertins et al Nat Meth (2013)
                            (ac)K: ~0.35 M spectra
                            (ubi)K: ~0.36 M spectra

Page 6: Getting genomics and proteomics data to work together - Jason Wong

Searching for peptides with SNVs

Exome/RNA-seq: BAM → VCF (GATK/samtools) → annotated variants (ANNOVAR) → variant peptides (Python scripts)

Proteomics: mass spectra searched using MaxQuant against RefSeq + variant peptides → annotated mass spectra
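The last step of the genomics branch, turning annotated non-synonymous variants into peptides that can be added to the search database, might look roughly like the sketch below. It is not the scripts used in the talk. The function names, the made-up protein sequence, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions for illustration.

```python
import re

def tryptic_peptides(protein):
    """Digest a protein in silico: cleave after K/R unless the next residue is P.

    Returns (start_index, peptide) pairs. Missed cleavages are not modelled.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    if start < len(protein):
        peptides.append((start, protein[start:]))
    return peptides

def apply_substitution(protein, change):
    """Apply a change written like 'p.W15C' (1-based position) to a protein."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unsupported change: {change}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if protein[pos - 1] != ref:
        raise ValueError(f"reference residue mismatch at position {pos}")
    return protein[:pos - 1] + alt + protein[pos:]

def variant_peptides(protein, change, min_len=6):
    """Peptides that exist only in the digest of the mutated protein."""
    reference = {pep for _, pep in tryptic_peptides(protein)}
    mutated = apply_substitution(protein, change)
    return [pep for _, pep in tryptic_peptides(mutated)
            if len(pep) >= min_len and pep not in reference]

# Toy protein (hypothetical sequence, not taken from RefSeq).
toy_protein = "MTEAGKLDNVSQLAWDKSPTER"
print(variant_peptides(toy_protein, "p.W15C"))
```

Comparing the two digests, rather than only taking the peptide that spans the changed residue, also picks up peptides whose ends move when a variant creates or removes a K/R cleavage site.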

Page 7: Getting genomics and proteomics data to work together - Jason Wong

Variants – Overlap between Exome- and RNA-seq

Non-synonymous variants:

Exome-seq only: 8232

Both Exome-seq and RNA-seq: 4584

RNA-seq only: 1975

Almost 70% of RNA-seq non-synonymous (n.s.) variants overlap with Exome-seq.
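An overlap like this can be computed by keying each call on chromosome, position, reference allele and alternate allele. The sketch below assumes tab-separated VCF-style records (CHROM, POS, ID, REF, ALT as the first five columns). The function names and the toy records are hypothetical, and only the final check uses numbers from the slide.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-style lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALT alleles with commas
            keys.add((chrom, int(pos), ref, allele))
    return keys

def overlap_summary(first, second):
    """Sizes of the three Venn regions and the share of `second` found in `first`."""
    shared = first & second
    return {
        "only_first": len(first - second),
        "shared": len(shared),
        "only_second": len(second - first),
        "fraction_of_second_shared": len(shared) / len(second) if second else 0.0,
    }

# Toy records, made up for illustration.
exome = variant_keys(["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT", "chr2\t50\t.\tG\tA"])
rna = variant_keys(["chr1\t200\t.\tC\tT", "chr2\t50\t.\tG\tA,C"])
print(overlap_summary(exome, rna))

# Check of the slide's own figure: shared / (shared + RNA-seq only).
slide_fraction = 4584 / (4584 + 1975)
print(f"{slide_fraction:.1%} of RNA-seq n.s. variants are also in Exome-seq")
```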

Page 8: Getting genomics and proteomics data to work together - Jason Wong

Variants – Overlap with proteomics data

Variants identified in each proteomics dataset:

Variant source                     Mertins only   Both datasets   Sheynkman only
RNA-seq (total variants 6559)      638            349             99
Exome-seq (total variants 12816)   525            290             81

• Suggests that RNA-seq may be more suited for finding variants in proteomics data.

• However, this may also just be due to data quality issues.

RNA-seq based variants validated by mass spectrometry:

Dataset     Variant peptides   Total peptides
Mertins     987                156,606
Sheynkman   448                75,878
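The comparison behind the first bullet can be made explicit with a little arithmetic on the slide's counts. This is a sketch under two assumptions: the Exome-seq counts are read in the same order as the RNA-seq ones (Mertins only, both, Sheynkman only), and each count is treated as one variant. The helper names are made up.

```python
def venn_totals(only_first, shared, only_second):
    """Totals for each set and their union from the three regions of a two-set Venn diagram."""
    return {
        "first": only_first + shared,
        "second": shared + only_second,
        "union": only_first + shared + only_second,
    }

# Counts from the slide, ordered as (Mertins only, both, Sheynkman only).
# The same ordering is ASSUMED for the Exome-seq row.
rna = venn_totals(638, 349, 99)
exome = venn_totals(525, 290, 81)

TOTAL_VARIANTS = {"RNA-seq": 6559, "Exome-seq": 12816}

def detected_fraction(venn, total_variants):
    """Share of called variants seen in at least one proteomics dataset."""
    return venn["union"] / total_variants

rna_fraction = detected_fraction(rna, TOTAL_VARIANTS["RNA-seq"])
exome_fraction = detected_fraction(exome, TOTAL_VARIANTS["Exome-seq"])

# Share of all identified peptides that are variant peptides (RNA-seq based).
mertins_share = 987 / 156606
sheynkman_share = 448 / 75878

print(f"RNA-seq variants seen by MS:   {rna_fraction:.1%}")
print(f"Exome-seq variants seen by MS: {exome_fraction:.1%}")
print(f"Variant peptides, Mertins:     {mertins_share:.2%}")
print(f"Variant peptides, Sheynkman:   {sheynkman_share:.2%}")
```

The RNA-seq totals reproduce the 987 and 448 variant peptides listed for the Mertins and Sheynkman datasets, which supports this reading of the diagram.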

Page 9: Getting genomics and proteomics data to work together - Jason Wong

Validation of peptide identifications

Zygosity       Variant peptides   Reference peptides
Heterozygous   673                465
Homozygous     314                4

Chr Pos Ref Alt Zygosity Qual Depth Depth Alt Func.refGene Gene.refGene

chr10 51363659 A G 1 222 222 200 exonic PARG

chr10 71906150 T C 1 44.8 2 2 exonic TYSND1

chr11 67414492 C T 1 43.8 2 2 exonic ACY3

chr19 3492265 G A 1 59 3 3 exonic DOHH

[Figure: scatter plot of variant abundance (log10) against reference abundance (log10), both axes from 4 to 9; r² = 0.19; Mertins dataset]

Page 10: Getting genomics and proteomics data to work together - Jason Wong

Validation of peptide identifications

[Figure: MaxQuant score (0 to 250) for variant peptides, reference peptides and all peptides; differences marked n.s.]

[Figure: % variants (0.0 to 0.5) by read depth bin (0-50, 50-100, ..., 950-1000, >1000), for variants identified by MS and for all variants]

Page 11: Getting genomics and proteomics data to work together - Jason Wong

Variants in application to PTMs

PTM                      Variant peptides   Total peptides
Phosphorylation STY(p)   357                64067
Acetylation K(Ac)        2                  5805
Ubiquitination K(GG)     172                38454

How does the variant affect phosphorylation?

(1) Variant residue not affecting phosphorylation site: 95% (339), e.g. EILpSPQ(W/C)Y

(2) Variant residue is phosphorylation site: 3.4% (12), e.g. EIL(A/pS)PQWY

(3) Variant residue may influence phosphorylation: 1.6% (6), e.g. EIL(p)S(G/P)QWY

Page 12: Getting genomics and proteomics data to work together - Jason Wong

Phosphorylation sites directly affected by variants

Gene Refseq Variant Peptide SNP ID SIFT score Polyphen score

GBF1 NM_001199378 p.G1690S GGSPSALWEITWER rs11191274 0.75 0.703

LINS NM_001040616 p.R680S EFSLEPPSSPLVLK rs8451 0.36 0.023

TADA2A NM_001166105 p.P6S LGSFSNDPSDKPPCR rs7211875 1 0

TCF3 NM_001136139 p.A8S MSPVGTDKELSDLLDFSMMFPLPVTNGK rs147133056 0.05 0.997

USE1 NM_018467 p.L154S TGVAGSQPVSEKQSAAELDLVLQR rs414528 0.82 0

ATP5SL NM_001167867 p.N40S LGAAVAPEGSQKK rs2231940 0.82 0.018

TANC1 NM_001145909 p.N250S SGSSLEWNKDGSLR rs12466551 1 0

SF3B1 NM_001005526 p.A86T KPGYHTPVALLNDIPQSTEQYDPFAEHRPPK NA 0.01 0.955

DDX27 NM_017895 p.G766S QYRASPSFEER rs1130146 0.37 0.983

FAM114A2 NM_018691 p.G122S AETSLGIPSPSEISTEVK rs2578377 1 0

SPDL1 NM_017785 p.L586S SHPILYVSSK rs3777084 0.89 0

GPANK1 NM_033177 p.A78S IMKSPAAEAVAEGASGR NA 0.72 0.005

PARG NM_003631 p.L138P LENVSQLSLDKSPTEK rs4412715 NA NA

KIAA0586 NM_001244193 p.L703P EASPPPVQTWIK rs1748986 0.22 0.001

DMXL2 NM_001174116 p.S1288P FGDTEADSPNAEEAAMQDHSTFK rs12102203 0.21 0

GGA2 NM_015044 p.A424P NLLDLLSPQPAPCPLNYVSQK rs1135045 0.48 0

PFAS NM_012393 p.L621P NGQGDAPPTPPPTPVDLELEWVLGK rs11078738 1 0

ZNF235 NM_004234 p.H296P SPACSTPEKDTSYSSGIPVQQSVR rs2125579 0.02 0.001

Creation of phospho-site (X → S/T): first 12 rows of the table (GBF1 to GPANK1)

Creation of MAPK motif (S/T-X → S/T-P): last 6 rows of the table (PARG to ZNF235)
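The three ways a variant can relate to a phosphorylation site can be written as a small rule set. The sketch below is one possible reading, not the classification used in the talk. In particular it assumes that "may influence phosphorylation" means a proline introduced directly after a phosphorylated S/T, as in the MAPK motif rows. The function name and the 0-based index convention are hypothetical.

```python
def classify_variant(peptide, phospho_index, variant_index):
    """Classify a variant residue relative to a phosphorylation site.

    peptide        variant peptide sequence, without modification marks
    phospho_index  0-based index of the phosphorylated residue
    variant_index  0-based index of the residue changed by the variant
    """
    if variant_index == phospho_index:
        # The substituted residue is itself phosphorylated (X -> S/T).
        return "variant residue is phosphorylation site"
    creates_sp_tp_motif = (
        variant_index == phospho_index + 1
        and peptide[phospho_index] in "ST"
        and peptide[variant_index] == "P"
    )
    if creates_sp_tp_motif:
        # ASSUMED rule: a proline at +1 creates an S/T-P motif.
        return "variant residue may influence phosphorylation"
    return "variant residue not affecting phosphorylation site"

# The three schematic peptides from the slide, written as variant sequences.
examples = [
    ("EILSPQCY", 3, 6),  # (1) EILpSPQ(W/C)Y
    ("EILSPQWY", 3, 3),  # (2) EIL(A/pS)PQWY
    ("EILSPQWY", 3, 4),  # (3) EIL(p)S(G/P)QWY
]
for peptide, phospho, variant in examples:
    print(peptide, classify_variant(peptide, phospho, variant))
```

A broader definition of "may influence", for example any change within a few residues of the site, would need a different rule and would move some category (1) peptides into category (3).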

Page 13: Getting genomics and proteomics data to work together - Jason Wong

Splicing factor 3B subunit 1 (SF3B1)

• Part of the RNA splicing machinery

• Frequently mutated in myelodysplasia (~20%) and other leukaemias

• A86T has been identified previously in one lung cancer sample from TCGA.

• SF3B1 is a phosphoprotein and phosphorylation of SF3B1 is known to be important for the assembly of the RNA splicing machinery.

A86T located in exon 3

Page 14: Getting genomics and proteomics data to work together - Jason Wong

Using mass spectrometry to discover new proteins

Prediction of alternative ORFs from RNA-seq

• 86 alternative ORFs

• 57% non-AUG translation initiation

Computational prediction of alternative ORFs

• Only AUG translation initiation

• 1,259 alternative ORFs!
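A purely computational, AUG-only prediction of this kind can be sketched as a scan of the three forward reading frames of each transcript. The code below is an illustration under assumptions, not the method used in the talk. The function names, the minimum length cut-off and the toy transcript are made up, and the first start codon in each open frame is taken as the ORF start.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(mrna, min_codons=10, starts=("ATG",)):
    """Find ORFs on the forward strand of a transcript.

    Returns sorted (start, end, frame) tuples with 0-based start and an
    exclusive end that includes the stop codon. ORFs without an in-frame
    stop codon are ignored.
    """
    mrna = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        open_start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if open_start is None and codon in starts:
                open_start = i  # first start codon since the last stop
            elif open_start is not None and codon in STOP_CODONS:
                if (i - open_start) // 3 >= min_codons:
                    orfs.append((open_start, i + 3, frame))
                open_start = None
    return sorted(orfs)

def alternative_orfs(mrna, annotated_start, **kwargs):
    """ORFs that do not begin at the annotated translation start site."""
    return [orf for orf in find_orfs(mrna, **kwargs) if orf[0] != annotated_start]

# Toy transcript: a short upstream ORF followed by the annotated ORF at index 10.
toy_mrna = "ATGAAATAG" + "C" + "ATGGCCGCCTGA"
print(alternative_orfs(toy_mrna, annotated_start=10, min_codons=2))
```

Passing additional codons through `starts` (for example `("ATG", "CTG", "GTG", "TTG")`) extends the scan to non-AUG initiation, at the cost of many more candidates.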

[Figure: posterior error probability (PEP, 0.00 to 0.06) for Serum aORF, Serum rORF, HeLa aORF, HeLa rORF, HeLa < 10 kDa aORF and HeLa < 10 kDa rORF; significance markers shown]

Page 15: Getting genomics and proteomics data to work together - Jason Wong

Ribosome profiling

Ingolia et al. Cell 2011

• Sequence only mRNA protected by ribosomes

• Results in identification of mRNA that is being translated

• Use of translation inhibitors enables discovery of new ORFs

Page 16: Getting genomics and proteomics data to work together - Jason Wong

Application to HEK293T cells

Lee et al PNAS 2012

o Reported 12,814 alternative ORFs in RefSeq genes

o Using their annotation, constructed a database of proteins arising from these alternative ORFs.

o Searched against publicly available HEK293T datasets

• Geiger et al MCP 2012, ~0.5 million spectra
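Constructing a searchable protein database from annotated alternative ORFs amounts to translating each ORF and writing it out as FASTA. The sketch below uses the standard genetic code and assumes that the first codon is read as methionine even when it is a non-AUG start such as TTG or GTG. The function names and the entry name are hypothetical, and the nucleotide sequences are toy examples.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_orf(dna):
    """Translate an ORF that begins at its start codon, stopping at the first stop codon.

    ASSUMPTION: the start codon is translated as M whatever its sequence.
    Codons containing ambiguous bases such as N raise KeyError.
    """
    dna = dna.upper()
    protein = ["M"]
    for i in range(3, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def fasta_entry(name, dna, width=60):
    """Format one translated ORF as a FASTA record."""
    protein = translate_orf(dna)
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# Toy ORF with a GTG start codon (made-up sequence and name).
print(fasta_entry("toy_transcript_5UTR_altORF", "GTGGCTAAATGA"), end="")
```

The resulting records can be appended to the reference protein FASTA file before the database search, so that reference and alternative ORF peptides compete for the same spectra.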

Page 17: Getting genomics and proteomics data to work together - Jason Wong

RefSeq Accession Gene Symbol Relative to rTIS Annotation Frame ORF length Codon Peptide count

NM_019008 SMCR7L -311 5'UTR 1 213 ATG 3

NM_080670 SLC35A4 -719 5'UTR 1 312 ATG 2

NM_001142726 C1orf122 -609 5'UTR 0 753 TTG 2

NM_004860 FXR2 -219 5'UTR 0 2241 GTG 2

Identified novel proteins

NM_080670_SLC35A4_10_5'UTR (chr5:139,944,429-139,946,345)

NM_019008_SMCR7L_188_5'UTR (chr22:39,900,236-39,900,445)

[Figure: percentage of regions (0.0 to 0.4) against conservation (0.0 to 1.0), for 5'UTR and coding regions]

(11 novel ORFs in total)

Page 18: Getting genomics and proteomics data to work together - Jason Wong

Conclusion and next steps

Variants

• Analyse indels and more datasets

Novel proteins

• Develop methods to analyse ribosome profiles in unannotated genomic regions.

• Generate HEK293T peptidomics (< 10 kDa) dataset

In terms of bioinformatics:

• Automate integration of transcriptomics and proteomics data.

• Methods to visualise identified peptides on genome browsers.

• Develop new MS-based search methods for directly finding variant peptides.

Page 19: Getting genomics and proteomics data to work together - Jason Wong

Acknowledgements

Bioinformatics and Integrative genomics team

• Dr Ranjeeta Menon

• Dr Dominik Beck

• John Ng

• Jackie Huang

• Kate Guan

• Felix Ma

• Dilmi Perera

• Diego Chacon

Carnegie Institution for Science, Baltimore, USA

• Dr Nicolas Ingolia

Funding: