Page 1

11th Sep 2014

Debasis Dash

Workflows and Pipelines for NGS analysis: Lessons from proteomics

Conference on Applying NGS in Basic research Health care and Agriculture

Page 2

ATGAAGAAGCTGTGTGCTTTCACTATTGCCTTTTTTTCCCTGAAGTTTTGTCTCATCTTGTGCAGTTTGACTGAACCCAATTGCTTTTGGAAGATAAAGAAGAGAGAAGTTAATGATGGAGATTTGCAAAATGAGTGTGGTTTTGTCCTTTTTACACTTGAGAGCCCTATTGAAGAAAATTTTTATAATCACATTATTAATTTTAGGATACCAGCAAGAAAATATGAATTTTTTCTGGTAATGTTTTTTGCTACTGATGAGATCAACAAGAATCCTTATCTTTTATCCAACATGTCTTTGATATTTTCCTTCATTTTTGGTATGTGTGAAGATACAATGGGAGTTCTGGATAAAGCATATTTACATCAAAACAACTATTTCGATCTACTTAATTATAACTGTGGAAGAAAGAAACGTTGTGATGTAAAACTTACAGGACCATCATGGAAAACTTCCTTAAAACTTTCAGTTAATTCAAGGGCACCAAAGATTTTCTTTGGACCATTTAATCCTAACCTGAGTGACCATGACCAGTTTCCCTATATCTATCAGATAGCAACCAAGGACACATATTTGCTCCATGGCATGGTCTCCTTGATGTTTCATTTTGAATGGACTTGGATAGGACTGATCATCACAGATGATGACCAAGGTATTCAGTTTCACTCAGACTTGAGAGAAGAAATGCAAAGGCATGCGATCTGTTTAGCTTTTGTGATTATGATCCCAGAAAGCATTAAGTTATACAACACAAAGTTTAAGATATATGACCAACAACTTATGACATCTTCAGCAAAGGTTACTATCATTTATGGCAAAATGATCTCCACTCTAGAACTCAACTTTGCAAGATGGACATATTTAGTTGCACGGAGAATCTGGATCACAACCTCAAAATTGGATGTCATCACATATGATAAAGATTTCAGCCTTGATTTCTTCCACGGGACTGTCATTTTTGCCCACCACCACAATGACATCGCTACATTTAGAAATTTTATGCAAATAATAAACACATCCAAGTATCCAGTAGATATTTCTCAGTCTATGGGGCAGTGGAATCATTTTAACTGTTCAATCTCAAAGAACAAGAAGAAAATGGATTTTTTTATGTTGAAAAACCCAATGGAATGGTTAACACAGCACACATTTGACATGGTCCTGAGTGAAGAAGGTTACAATTTGTATAATGCTGTGTATGCTGTGGCCCACACCTATCACGAACTCATTTTTCAACAAGTAGAGTCTCAGGAAATGGCCAAACCCAAAGGACTATTCACTGACTGTCAGCAGGTGGCTTCTTTGCTTAAAACTAGGGTATTTACTAACCCTGTTGGAGAGCTGGTGAACATGAATCATAAGGAAAATCAGTGTGCCAAGTATGACATTTTCATCATTTGGAATTTTCCAAATGGCCTTGGATTAAAAGTGAAAATAGGAAGCTATTTTCCTTGTTTGCAACAGAGTCAACATCTTCATATATCTGAAGACTGGGAGTGGGTTACAGGAGAAACATTGGTTCCCTCCTCAGTGTGTAGTGAGACATGTACTGCAGGATTCAGAAAAAGTCATCAGAAACAAACAGCCAACTGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAAT

Where are the protein-coding genes in a genome?

http://www.picgifs.com/

Genome annotation

Page 3

Genome annotation

Transcriptome

Proteome

Structural biology

Reactome

Metabolome

Interactome

Systems biology

Importance of genome annotation

Armengaud J. Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics. 2010

Page 4

Solving a puzzle when pieces are missing or broken

http://www.puzzlewarehouse.com/missing-pieces/

Page 5

How are proteins detected from samples?

A high-throughput method of protein identification

Experimental arm: Protein extraction → Protease digestion → LC-MS (MS1, MS/MS) → Experimental MS/MS spectrum

Theoretical arm: Protein database → Theoretical digestion → Peptides → Fragmentation simulation → Theoretical MS/MS spectrum

Peptide-spectrum match scorer: compares the experimental MS/MS spectrum with the theoretical MS/MS spectra
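The theoretical side of this workflow (digesting database proteins in silico, then simulating peptide fragmentation) can be sketched in a few lines. This is a minimal illustration, not the method of any particular search engine. It assumes trypsin as the protease (the slide only says "protease"), unmodified peptides, and singly charged b- and y-ions. The function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Assumed trypsin rule: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def fragment_ions(peptide):
    """Return m/z lists of singly charged b- and y-ions for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGK"):
        print(pep, fragment_ions(pep))
```

A scorer would then compare these theoretical m/z values with the peaks of each experimental MS/MS spectrum within a mass tolerance.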

Page 6

Proteomics: Challenges

A large fraction of experimental spectra remains unidentified, possibly because of:

• Unknown modifications on the peptides
• Limitations of the search algorithm
• Noisy spectra
• Spectra of non-peptidic origin
• Peptides missing from the search database

[Chart: identified vs. unidentified spectra]

Page 7

Controlling error rates through decoys

[Figure: score distributions of target and decoy matches, with a threshold score marked]

Concatenated target-decoy search*

• FDR = 2 × decoy / (target + decoy)

Separate target and decoy search**

• FDR = decoy / target

* Nature Methods 4, 207-214 (2007)
** J. Proteome Res. 2008, 7 (01), pp 29–34
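The two FDR estimates on this slide are simple ratios of hit counts above the threshold score. The sketch below only restates those formulas in code. The function names and the convention that a higher score is better are assumptions made for illustration.

```python
def count_hits(psms, threshold):
    """Count target and decoy PSMs at or above a score threshold.

    psms: iterable of (score, is_decoy) pairs; higher score is assumed better.
    """
    target = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoy = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return target, decoy


def fdr_concatenated(n_target, n_decoy):
    """Concatenated target-decoy search: FDR = 2 x decoy / (target + decoy)."""
    total = n_target + n_decoy
    return 2 * n_decoy / total if total else 0.0


def fdr_separate(n_target, n_decoy):
    """Separate target and decoy searches: FDR = decoy / target."""
    return n_decoy / n_target if n_target else 0.0


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    psms = [(42.0, False), (40.5, False), (33.1, True), (30.0, False), (12.3, True)]
    target, decoy = count_hits(psms, threshold=30.0)
    print(target, decoy, fdr_concatenated(target, decoy), fdr_separate(target, decoy))
```

In practice the threshold is lowered until the estimated FDR reaches the desired level, for example 1%.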

Page 8

MassWiz: An advanced algorithm for peptide discovery

Intensity of matching peaks

Continuity of y-ions & b-ions

Neutral losses & immonium ions

Fragment mass error sensitive scoring

Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res. 2011

Page 9

Algorithm comparison: PSMs

Data: ISB standard protein mix, http://regis-web.systemsbiology.net/PublicDatasets

Page 10

Proteomics: Challenges

A large fraction of experimental spectra remains unidentified, possibly because of:

• Unknown modifications on the peptides
• Limitations of the search algorithm
• Noisy spectra
• Spectra of non-peptidic origin
• Peptides missing from the search database

[Chart: identified vs. unidentified spectra]

Page 11

Proteogenomics: An alternate proteomic search strategy

Page 12

Proteogenomics: An alliance of Genomics and Proteomics

Genome Annotation

Known Peptides

Novel Peptides

Proteomic identifications

Novel Gene

Gene model change

Gene on different frame

Gene on opposite strand

Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol. 2009

Page 13

Genomics Proteomics

Lack of analysis pipelines and software for integrating proteomics data with genome or genomics data

Page 14

Bridging the Gap

Developing computational strategies to identify novel protein coding loci from MS data

Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism

Page 15

Proteogenomic analysis of Mycobacterium tuberculosis

Page 16

Does Mycobacterium tuberculosis need re-annotation?

• Genome size: 4.4 Mb (Cole et al. 1998)
• 3,924 ORFs annotated in the first genome draft
• 3,995 genes in the re-annotation (Camus et al. 2002)
• 3,988 protein-coding genes (NCBI RefSeq)
• 3,987 protein-coding genes (Sanger Institute)
• 3,918 protein-coding genes (TIGR/JCVI)
• 50% of the genes differ in translation initiation site (TIS) between the Sanger and TIGR annotations (deSouza et al. 2008)
• 4,012 protein-coding genes (Tuberculist R21)

Page 17

Deep proteome profiling is achieved

• 123 LC-MS runs of cell lysate and culture filtrate of Mtb H37Rv
• 3,176 of 3,988 NCBI RefSeq proteins (80% of the Mtb proteome) identified
• Translational evidence for 829 hypothetical proteins
• 233 of the 829 hypothetical proteins identified for the first time

[Pie chart: identified 59%, identified hypothetical 21%, unidentified 20%]

In collaboration with Dr. Akhilesh Pandey & IOB

Page 18

Mtb H37Rv: Novel Translations

• 41 novel protein-coding loci
• Changes in 79 existing gene models
• TIS corrected for 33 proteins and confirmed for 868 proteins

Conservation of novel proteins

Kelkar DS, Kumar D et al. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics. 2011

Page 19

Page 20

Challenges in proteogenomics

• Database creation
• Spectra processing
• Peptide assignment
• FDR estimation
• Peptide mapping to genome
• Gene coordinate comparison
• Peptide classification
• Result reporting
• Visualization

Search algorithms: MassWiz, OMSSA, X!Tandem, InsPecT

A solution with complete automation and high fidelity of results is required
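Of the steps listed on this slide, "peptide mapping to genome" is the one that links a proteomic identification back to genomic coordinates. One simple way to illustrate it is a six-frame translation search, sketched below. This is an assumption-laden illustration, not the GenoSuite implementation. It uses the standard genetic code, exact string matching, and a hypothetical `map_peptide` function. Frames on the minus strand are numbered relative to the reverse-complemented sequence.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; unknown codons become X."""
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def map_peptide(peptide, genome):
    """Return (start, end, strand, frame) hits in 1-based inclusive forward coordinates."""
    genome = genome.upper()
    n = len(genome)
    hits = []
    strands = (("+", genome), ("-", genome.translate(COMPLEMENT)[::-1]))
    for strand, dna in strands:
        for offset in range(3):
            protein = translate(dna[offset:])
            pos = protein.find(peptide)
            while pos != -1:
                s = offset + 3 * pos          # 0-based start on the scanned strand
                e = s + 3 * len(peptide)      # 0-based exclusive end
                if strand == "+":
                    hits.append((s + 1, e, "+", offset + 1))
                else:
                    hits.append((n - e + 1, n - s, "-", offset + 1))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    print(map_peptide("MKF", "CCATGAAATTTGG"))
```

The returned coordinates can then be compared with annotated gene coordinates, which is the next step in the list.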

Page 21

Page 22

Integrating results from multiple algorithms: Set theory

Peptides identified by multiple algorithms have few false positives, but this method does not allow the false discovery rate to be controlled or estimated
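The set-theory approach amounts to intersecting (or voting over) the peptide sets reported by each search algorithm. A minimal sketch follows. The function name and the example peptide strings are hypothetical. As the slide notes, nothing in this procedure yields an FDR estimate for the combined list.

```python
from collections import Counter


def consensus_peptides(results, min_algorithms=2):
    """Keep peptides reported by at least `min_algorithms` search algorithms.

    results: dict mapping an algorithm name to the set of peptides it identified.
    """
    votes = Counter(pep for peptides in results.values() for pep in set(peptides))
    return {pep for pep, n in votes.items() if n >= min_algorithms}


if __name__ == "__main__":
    # Hypothetical identifications, for illustration only.
    results = {
        "MassWiz": {"PEPTIDEK", "SAMPLER"},
        "OMSSA": {"PEPTIDEK", "ANOTHERK"},
        "X!Tandem": {"PEPTIDEK", "SAMPLER"},
    }
    print(consensus_peptides(results, min_algorithms=2))
    print(set.intersection(*results.values()))  # strict intersection
    print(set.union(*results.values()))         # everything seen by any algorithm
```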

Page 23

Integrating results from multiple algorithms: FDRscore

Metrics from multiple algorithms are not comparable:

• OMSSA: E-value
• X!Tandem: P-value
• InsPecT: P-value
• MassWiz: Score

FDR values from individual algorithms can be processed to generate a common score (Jones AR et al., Proteomics 2009)

[Figure: for each algorithm, score or P-value → FDR → q-value → FDRscore]

FDRscore-based result integration allowed statistical assessment (FDR) of the final results
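The first two conversions in this chain, from an engine-specific score to FDR and then to a q-value, can be sketched generically. The code below is only that generic part. It assumes a separate target-decoy search (FDR = decoy / target), a score where higher is better, and a hypothetical `q_values` function. It does not reproduce the final FDRscore step of Jones et al.

```python
def q_values(psms):
    """Attach a q-value to every PSM of one search engine.

    psms: list of (score, is_decoy) pairs; higher score is assumed better.
    Returns (score, is_decoy, q_value) tuples sorted from best to worst score.
    """
    ordered = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR at each rank: decoys / targets among all PSMs scoring at least this well.
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / targets) if targets else 1.0)

    # q-value: the lowest FDR at this rank or at any lower-scoring rank.
    qs, running = [], 1.0
    for fdr in reversed(fdrs):
        running = min(running, fdr)
        qs.append(running)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ordered, qs)]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    for row in q_values(example):
        print(row)
```

Because q-values are on the same scale for every engine, they give a common basis for comparing results that raw E-values, P-values, and scores do not.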

Page 24

Novel Proteome of B. japonicum

59 novel proteins identified
• 51 novel proteins with 2 or more unique peptides
• Single-peptide hits are selected if identified in at least 2 samples and after manual inspection

49 gene model changes identified
• Translation start site suggested upstream of the current annotation

TIS confirmed for 21 genes
• TIS correction for 1 gene
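The acceptance rule for novel proteins stated on this slide can be written as a small filter. The sketch assumes a hypothetical data layout (protein → peptide → set of samples). It also represents the manual inspection step as a set of protein names that a curator has approved, since that step cannot be automated here.

```python
def select_novel(candidates, inspected_ok=frozenset()):
    """Apply the slide's rule to candidate novel proteins.

    candidates: dict protein -> dict peptide -> set of samples where it was seen.
    inspected_ok: proteins whose single-peptide evidence passed manual inspection.
    """
    accepted = []
    for protein, peptides in candidates.items():
        if len(peptides) >= 2:
            # Two or more unique peptides: accepted directly.
            accepted.append(protein)
        elif len(peptides) == 1:
            (samples,) = peptides.values()
            # Single-peptide hit: needs at least 2 samples and manual approval.
            if len(samples) >= 2 and protein in inspected_ok:
                accepted.append(protein)
    return accepted


if __name__ == "__main__":
    # Hypothetical candidates, for illustration only.
    candidates = {
        "novel_1": {"PEPA": {"s1"}, "PEPB": {"s2"}},
        "novel_2": {"PEPC": {"s1", "s2"}},
        "novel_3": {"PEPD": {"s1"}},
    }
    print(select_novel(candidates, inspected_ok={"novel_2"}))
```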

Page 25

A novel protein reveals a novel operon

FgeneSB operon

• ORF length
• Codon bias
• Promoter region
• Ribosome binding site

Page 26

A gene model change

Page 27

Is there a common theme of novel identifications?

• Novel proteins are short
• TTG start codon in gene model changes
• Novel peptides are distributed throughout the genome

Most novel proteins are short proteins

Page 28

• A methylotroph: an organism able to grow on reduced carbon compounds such as methanol or methylamine
• Ecologically important: supports vegetation by producing phytohormones
• Industrial application: production of important chemicals and biomolecules on methanol feedstock
• Model organism: used to study methylotrophic metabolism
• Member of the Methylobacteriaceae family: a diverse taxonomy with many genes specific to one genome

Page 29

31 Novel protein coding genes

70 gene model changes

104 methylotrophy gene products

2,678 Proteins

Page 30

Limited conservation and low GC content of novel genes suggest lateral gene transfer as the probable mode of origin
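The GC-content comparison behind this observation is easy to reproduce. The sketch below computes the GC fraction of a sequence and the difference from a genome-wide value. The function names and the short example sequences are hypothetical, and how large a difference is meaningful is left to the analyst.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_deviation(gene_seq, genome_seq):
    """GC content of a gene minus GC content of the genome (negative means lower GC)."""
    return gc_content(gene_seq) - gc_content(genome_seq)


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    genome = "GCGCGCGGCCATGCGCGGCC"
    novel_gene = "ATGAAATTTATAGCATAA"
    print(gc_content(genome), gc_content(novel_gene), gc_deviation(novel_gene, genome))
```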

Page 31

Developing computational strategies to identify novel protein coding loci from MS data

Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism

Page 32

Exon junction peptides for detecting splice variants

1 Exon boundary peptide
2 Splice variant
3 New exon
4 A new 3' splice site
5 A new 5' splice site

Page 33

Eukaryotic Proteogenomics

[Figure: known peptides and novel peptides mapped onto a gene model]

Novel peptides:

• Junction peptide maps on an intron
• Peptides map on a different translation frame
• Peptides map on an intergenic region
• Peptides map on a non-coding gene
• Peptides map on a UTR
• Peptides map on an intron
• Peptides map on the opposite strand
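Once a peptide has genomic coordinates, assigning it to one of these classes is a coordinate comparison against the annotated gene models. The sketch below shows the idea with a deliberately simplified, hypothetical `Gene` record. It ignores junction-spanning peptides and does not check the reading frame, so peptides inside a coding exon are reported only as "coding exon".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Gene:
    strand: str                            # "+" or "-"
    exons: List[Tuple[int, int]]           # sorted (start, end), 1-based inclusive
    cds: Optional[Tuple[int, int]] = None  # coding region; None for a non-coding gene

    @property
    def start(self):
        return self.exons[0][0]

    @property
    def end(self):
        return self.exons[-1][1]


def classify_peptide(start, end, strand, genes):
    """Classify a mapped peptide by its position relative to annotated genes."""
    for gene in genes:
        if end < gene.start or start > gene.end:
            continue  # no overlap with this gene
        if strand != gene.strand:
            return "opposite strand"
        if gene.cds is None:
            return "non-coding gene"
        if not any(s <= start and end <= e for s, e in gene.exons):
            return "intron"  # simplification: not fully inside any single exon
        if end < gene.cds[0] or start > gene.cds[1]:
            return "UTR"
        return "coding exon"  # known peptide or different frame; frame check omitted
    return "intergenic"


if __name__ == "__main__":
    genes = [Gene("+", [(100, 200), (300, 400)], cds=(150, 350))]
    print(classify_peptide(220, 250, "+", genes))
```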

Page 34

Prokaryotic Proteogenomics

Proteogenomics: Prokaryotic vs. Eukaryotic

Eukaryotic Proteogenomics

Page 35

RNA-seq analysis pipeline to capture the transcriptome

1 Raw RNA-seq reads from the NCBI SRA repository
2 Read QC and processing using Trimmomatic
3 Filtered read mapping on the reference genome using the STAR aligner
4 Transcript assembly by Cufflinks
5 Assembly QC and comparison using cuffcompare and BLAST
6 FASTA of all transcripts generated using gffread
7 Theoretical translated protein database

Example transcript FASTA entries:

>TCONS_00006262 gene=XLOC_004176 loc:1|58177-58500|-
ATTTTGGAGTTGTGTAGCCAAT…
>TCONS_00006264 gene=XLOC_004177 loc:1|169401-172238|-
AAGGTTCAAGGTACAAGGTGGGGTATGCC…

Example translated protein database entries:

>TCONS00006267_420_548_3
TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC…
>TCONS00006268_769_999_1
SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF…
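The last step, turning assembled transcripts into a theoretical protein database, can be illustrated with a three-frame translation that splits each frame at stop codons. This is a sketch under assumptions, not the EuGenoSuite code. The header layout `<transcript>_<start>_<end>_<frame>` is inferred from example entries such as `>TCONS00006267_420_548_3`, with 1-based nucleotide coordinates and frame = (start - 1) mod 3 + 1. The `min_len` cutoff and the function name are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frames(transcript_id, seq, min_len=2):
    """Translate a transcript in its 3 forward frames, split at stop codons.

    Returns (header, protein) pairs; the header is
    <transcript_id>_<start>_<end>_<frame> with 1-based nucleotide coordinates.
    """
    seq = seq.upper()
    entries = []
    for offset in range(3):
        protein = "".join(
            CODONS.get(seq[i:i + 3], "X") for i in range(offset, len(seq) - 2, 3)
        )
        pos = 0  # index of the current segment's first residue within `protein`
        for segment in protein.split("*"):
            if len(segment) >= min_len:
                start = offset + 3 * pos + 1
                end = start + 3 * len(segment) - 1
                entries.append((f"{transcript_id}_{start}_{end}_{offset + 1}", segment))
            pos += len(segment) + 1  # skip the segment and the stop codon
    return entries


if __name__ == "__main__":
    for header, protein in translate_frames("T1", "ATGAAATAGGGCTGA"):
        print(f">{header}\n{protein}")
```

Writing these pairs out in FASTA format gives a database that a search engine can use in place of an annotated proteome.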

Page 36

EuGenoSuite: Integrates transcriptomics with proteomics

Steps 1 to 7: RNA-seq analysis pipeline, with the same example transcript and translated protein entries as on Page 35

8 Tandem mass spectra
9 Peptide identification (GenoSuite: OMSSA, X!Tandem)
10 Protein grouping / protein assembler

11 12 13 14

Novel / Known categorization

Page 37

Genome size and annotation comparison

Organism    Genome size (Mb)$    Annotated proteins*
Human       3,284.83             104,763
Mouse       2,796.64             52,165
Rat         2,909.70             25,725

* Ensembl release 74
$ NCBI Genome

Page 38

Case study dataset

Rattus norvegicus

9 tissues, 3 replicates each: Brain, Liver, Spleen, Testes, Kidney, Colon, Muscle, Lung, Heart

Sequencing instruments

• HiSeq 2000
• Illumina GAII

Sample 1: T1, T2
Sample 2: T1, T2
Sample 3: T1, T2

T1: Technical Replicate 1; T2: Technical Replicate 2

Page 39

Input: 400 million paired-end reads; 2 million MS/MS spectra

Processing: transcriptomic analysis pipeline; EuGenoSuite

11,725 peptides (1% FDR, identified in both T1 and T2)

• 11,413 mapped to known proteins
• 312 novel peptides (275 uniquely mapping)

Classes of novel peptides:

• 45 spliced peptides
• 145 intergenic
• 18 different frame
• 28 non-coding loci
• 25 UTR
• 14 intronic

Page 40

Discovery of a splice variant of threonyl-tRNA synthetase

Page 41

Translation of a pseudogene

Pseudogene

Paralog (PCBP2)

Page 42

Rat Analysis Summary

• 105,380 unique transcripts assembled
• ≈2,900 annotated proteins identified
• Transcripts and peptides for eight pseudogenes
• Translation of exons annotated as non-coding (15 genes)
• 45 splice variants detected

Page 43

Translation of a novel gene locus

Page 44

Summary

Part 1

• Proteomics data, when searched against a genomic background, aids novel protein discovery
• GenoSuite: a fully automated multi-algorithmic proteomics and proteogenomics analysis tool
• Comprehensive proteogenomic analysis of B. japonicum improves protein annotation of rhizobia
• N-terminal acetylation of bacterial proteins

Part 2

• Integrated analysis of RNA-seq and mass spectrometry proteomics data tracks down novel protein isoforms
• EuGenoSuite: an in-house pipeline for eukaryotic proteogenomics
• Translation of pseudogenes in rat microglia

Page 45

Conclusion

Proteomics

Genomics

Transcriptomics

Data Integration

Novel Discovery

Genome Annotation

GenoSuite

EuGenoSuite

Page 46

Acknowledgements

IGIBIT Team

IGIB friends and family

& IOB team

Page 47

Thank you