11th Sep 2014

Debasis Dash

Workflows and Pipelines for NGS analysis: Lessons from proteomics

Conference on Applying NGS in Basic research Health care and Agriculture

ATGAAGAAGCTGTGTGCTTTCACTATTGCCTTTTTTTCCCTGAAGTTTTGTCTCATCTTGTGCAGTTTGACTGAACCCAATTGCTTTTGGAAGATAAAGAAGAGAGAAGTTAATGATGGAGATTTGCAAAATGAGTGTGGTTTTGTCCTTTTTACACTTGAGAGCCCTATTGAAGAAAATTTTTATAATCACATTATTAATTTTAGGATACCAGCAAGAAAATATGAATTTTTTCTGGTAATGTTTTTTGCTACTGATGAGATCAACAAGAATCCTTATCTTTTATCCAACATGTCTTTGATATTTTCCTTCATTTTTGGTATGTGTGAAGATACAATGGGAGTTCTGGATAAAGCATATTTACATCAAAACAACTATTTCGATCTACTTAATTATAACTGTGGAAGAAAGAAACGTTGTGATGTAAAACTTACAGGACCATCATGGAAAACTTCCTTAAAACTTTCAGTTAATTCAAGGGCACCAAAGATTTTCTTTGGACCATTTAATCCTAACCTGAGTGACCATGACCAGTTTCCCTATATCTATCAGATAGCAACCAAGGACACATATTTGCTCCATGGCATGGTCTCCTTGATGTTTCATTTTGAATGGACTTGGATAGGACTGATCATCACAGATGATGACCAAGGTATTCAGTTTCACTCAGACTTGAGAGAAGAAATGCAAAGGCATGCGATCTGTTTAGCTTTTGTGATTATGATCCCAGAAAGCATTAAGTTATACAACACAAAGTTTAAGATATATGACCAACAACTTATGACATCTTCAGCAAAGGTTACTATCATTTATGGCAAAATGATCTCCACTCTAGAACTCAACTTTGCAAGATGGACATATTTAGTTGCACGGAGAATCTGGATCACAACCTCAAAATTGGATGTCATCACATATGATAAAGATTTCAGCCTTGATTTCTTCCACGGGACTGTCATTTTTGCCCACCACCACAATGACATCGCTACATTTAGAAATTTTATGCAAATAATAAACACATCCAAGTATCCAGTAGATATTTCTCAGTCTATGGGGCAGTGGAATCATTTTAACTGTTCAATCTCAAAGAACAAGAAGAAAATGGATTTTTTTATGTTGAAAAACCCAATGGAATGGTTAACACAGCACACATTTGACATGGTCCTGAGTGAAGAAGGTTACAATTTGTATAATGCTGTGTATGCTGTGGCCCACACCTATCACGAACTCATTTTTCAACAAGTAGAGTCTCAGGAAATGGCCAAACCCAAAGGACTATTCACTGACTGTCAGCAGGTGGCTTCTTTGCTTAAAACTAGGGTATTTACTAACCCTGTTGGAGAGCTGGTGAACATGAATCATAAGGAAAATCAGTGTGCCAAGTATGACATTTTCATCATTTGGAATTTTCCAAATGGCCTTGGATTAAAAGTGAAAATAGGAAGCTATTTTCCTTGTTTGCAACAGAGTCAACATCTTCATATATCTGAAGACTGGGAGTGGGTTACAGGAGAAACATTGGTTCCCTCCTCAGTGTGTAGTGAGACATGTACTGCAGGATTCAGAAAAAGTCATCAGAAACAAACAGCCAACTGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAAT

Where are the protein coding genes in a genome

http://www.picgifs.com/

Genome annotation

Transcriptome

Proteome

Structural biology

Reactome

Metabolome

Interactome

Systems biology

Importance of genome annotation

Armengaud J. Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics. 2010

Solving a puzzle when pieces are missing or broken

http://www.puzzlewarehouse.com/missing-pieces/

How are proteins detected from samples?

Protein Extraction → Protease Digestion → LCMS (MS1, MS/MS) → Experimental MS/MS Spectrum

Protein Database → Theoretical peptide digestion → Peptide fragmentation simulation → Theoretical MS/MS Spectrum

Experimental and theoretical spectra → Peptide Spectrum Match Scorer

A high-throughput method of protein identification
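
The theoretical side of this workflow (digesting the protein database, then simulating peptide fragmentation) can be sketched roughly as follows. This is a minimal illustration, not the method of any specific search engine named in the talk: it assumes trypsin as the protease, no missed cleavages, no modifications, and singly charged b- and y-ions only. The function names `tryptic_digest` and `fragment_ions` are hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to every y-ion
PROTON = 1.00728   # charge carrier for singly charged ions


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule,
    no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i] is the N-terminal fragment of length i + 1.
    y[i] is the C-terminal fragment left after removing i + 1 residues.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(masses[:i]) + PROTON)
        y.append(sum(masses[i:]) + WATER + PROTON)
    return b, y


if __name__ == "__main__":
    for pep in tryptic_digest("MKRPAAK"):
        b, y = fragment_ions(pep)
        print(pep, [round(m, 4) for m in b], [round(m, 4) for m in y])
```

A scorer would then compare these theoretical m/z values with the peaks of an experimental MS/MS spectrum within a mass tolerance.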

A large fraction of experimental spectra remain unidentified, possibly because of:

Unknown modifications on the peptides

Limitations of the search algorithm

Noisy spectra

Spectra of non-peptidic origin

Peptides missing from the search database

Identified

Unidentified

Proteomics: Challenges

[Figure: distributions of target and decoy scores, with a threshold score marked]

Concatenated target-decoy search*

• FDR = 2 × decoy / (target + decoy)

Separate target and decoy search**

• FDR = decoy / target

* Nature Methods 4, 207–214 (2007)  ** J. Proteome Res. 2008, 7 (01), pp 29–34

Controlling error rates through decoys
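
The two FDR estimates on this slide can be written out as a short sketch. It assumes that a higher score means a better match and that `n_target` and `n_decoy` are the numbers of target and decoy hits at or above the chosen threshold; the function names are hypothetical.

```python
def count_above(scores, threshold):
    """Number of hits whose score is at or above the threshold
    (assumes higher score = better match)."""
    return sum(1 for s in scores if s >= threshold)


def fdr_concatenated(n_target, n_decoy):
    """Concatenated target-decoy search: FDR = 2 * decoy / (target + decoy)."""
    total = n_target + n_decoy
    return 2 * n_decoy / total if total else 0.0


def fdr_separate(n_target, n_decoy):
    """Separate target and decoy searches: FDR = decoy / target."""
    return n_decoy / n_target if n_target else 0.0


if __name__ == "__main__":
    target_scores = [95, 88, 70, 64, 51, 33]   # toy numbers, not real data
    decoy_scores = [60, 35, 30, 12]
    threshold = 50
    t = count_above(target_scores, threshold)
    d = count_above(decoy_scores, threshold)
    print(t, d, fdr_concatenated(t, d), fdr_separate(t, d))
```

In practice the threshold is lowered step by step until the estimated FDR reaches the accepted limit, for example 1%.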

MassWiz: An advanced algorithm for peptide discovery

Intensity of matching peaks

Continuity of y-ions & b-ions

Neutral losses & immonium ions

Fragment mass error sensitive scoring

Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res. 2011

Data: ISB standard protein mix http://regis-web.systemsbiology.net/PublicDatasets

Algorithm comparison: PSMs

Proteogenomics: An alternate proteomic search strategy

Proteogenomics: An alliance of Genomics and Proteomics

Genome Annotation

Known Peptides

Novel Peptides

Proteomic identifications

Novel Gene

Gene model change

Gene on different frame

Gene on opposite strand

Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol. 2009

Genomics Proteomics

Lack of analysis-pipeline/software for integration of proteomics data with genome or genomics data

Bridging the Gap

Developing computational strategies to identify novel protein coding loci from MS data

Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism

Proteogenomic analysis of Mycobacterium tuberculosis

• Genome size: 4.4 Mb (Cole et al. 1998)

• 3924 ORFs annotated in the first genome draft

• 3995 genes in re-annotation (Camus et al. 2002)

• 3988 protein coding genes (NCBI Refseq)

• 3987 protein coding genes (Sanger Institute)

• 3918 protein coding genes (TIGR/JCVI)

• 50% of the genes vary in translation initiation site (TIS) between Sanger and TIGR annotations (de Souza et al. 2008)

• 4,012 protein coding genes (Tuberculist R21)

Does Mycobacterium tuberculosis need re-annotation?

Identified hypothetical: 21%

Unidentified: 20%

Identified: 59%

123 LCMS runs of cell lysate and culture filtrate of Mtb H37Rv

3176 out of 3988 NCBI Refseq proteins (80% Mtb proteome) identified

Translational evidence for 829 Hypothetical proteins

233 of 829 hypothetical proteins identified for the first time

Deep proteome profiling is achieved

In collaboration with Dr. Akhilesh Pandey & IOB

Conservation of Novel proteins

Kelkar DS, Kumar D, et al. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics. 2011

41 Novel protein coding loci

Changes in 79 existing gene models

TIS corrected for 33 proteins and confirmed for 868 proteins

Mtb H37Rv: Novel Translations

Database creation

Spectra processing

Peptide Assignment

FDR estimation

Peptide mapping to genome

Gene coordinate comparison

Peptide classification

Result reporting

Visualization

Challenges in proteogenomics
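
Two of the steps listed above, database creation and peptide mapping to the genome, can be illustrated with a six-frame translation. This is only a sketch under simple assumptions: the standard genetic code, an uppercase or lowercase ACGT genome without ambiguity codes, and exact string matching of peptides. The names `six_frame` and `map_peptide` are hypothetical and are not taken from any tool in the talk.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(genome):
    """Return {(strand, offset): protein} for all six reading frames."""
    genome = genome.upper()
    reverse = genome.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[("+", offset)] = translate(genome[offset:])
        frames[("-", offset)] = translate(reverse[offset:])
    return frames


def map_peptide(peptide, genome):
    """Return sorted (strand, start, end) hits in 1-based, inclusive
    forward-strand coordinates."""
    n = len(genome)
    hits = []
    for (strand, offset), protein in six_frame(genome).items():
        pos = protein.find(peptide)
        while pos != -1:
            begin = offset + 3 * pos            # 0-based on the read strand
            end = begin + 3 * len(peptide)      # exclusive
            if strand == "+":
                hits.append((strand, begin + 1, end))
            else:
                hits.append((strand, n - end + 1, n - begin))
            pos = protein.find(peptide, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(map_peptide("KKL", "ATGAAGAAGCTG"))
```

The returned coordinates are what the later steps (gene coordinate comparison and peptide classification) would compare against the existing annotation.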

MassWiz

OMSSA

X!Tandem

InsPecT

A solution with complete automation and high fidelity of results is required

Integrating results from multiple algorithms: Set theory

Peptides identified by multiple algorithms have few false positives, but this method does not allow the false discovery rate to be controlled or estimated

Integrating results from multiple algorithms: FDRscore

OMSSA: E-value

X!Tandem: P-value

InsPecT: P-value

MassWiz: Score

Metrics from multiple algorithms are not comparable

FDR values from individual algorithms can be processed to generate a common score (Jones AR et al., Proteomics 2009)

[Figure: per-algorithm conversion of Score or P-value → FDR → Q-value → FDRscore]

FDRscore based result integration allowed statistical assessment (FDR) of final results
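
As a rough illustration of the first part of that conversion, the sketch below turns one algorithm's scores into FDR and q-values, which are on a common scale across algorithms. It assumes higher score means better match (an E-value or P-value would first need to be transformed, for example to its negative logarithm) and uses the separate-search estimate FDR = decoy / target. The final interpolation step that produces the FDRscore in Jones et al. is not reproduced here, and the function name `q_values` is hypothetical.

```python
def q_values(psms):
    """psms: list of (score, is_decoy) pairs from one search algorithm.

    Returns a list of (score, is_decoy, q_value) sorted best score first.
    The q-value of a PSM is the lowest FDR at which it would still be
    accepted, so it never decreases as the score gets worse.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys so far / targets so far.
    fdrs, targets, decoys = [], 0, 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / targets if targets else 1.0)

    # Walk from the worst score upwards, keeping the minimum FDR seen.
    qs = fdrs[:]
    for i in range(len(qs) - 2, -1, -1):
        qs[i] = min(qs[i], qs[i + 1])

    return [(score, is_decoy, q)
            for (score, is_decoy), q in zip(ranked, qs)]


if __name__ == "__main__":
    toy = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    for row in q_values(toy):
        print(row)
```

Because q-values from different algorithms share the same meaning, they can be compared or combined where raw scores, E-values and P-values cannot.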

59 novel proteins identified

• 51 Novel proteins with 2 or more unique peptides

• Single-peptide hits are selected only if identified in at least 2 samples and after manual inspection

49 gene model changes identified

• Translation start site suggested upstream of the current annotation

TIS confirmed for 21 genes

• TIS correction for 1 gene

Novel Proteome of B. japonicum

FGENESB operon

A novel protein reveals a novel operon

• ORF length

• Codon bias

• Promoter region

• Ribosome binding site

A gene model change

Novel proteins are short

TTG start codon in Gene model changes

Novel peptides are distributed throughout the genome

Is there a common theme of novel identifications?

Most novel proteins are short proteins

A methylotroph: an organism able to grow on reduced carbon compounds such as methanol or methylamine

Ecologically important: supports vegetation by producing phytohormones

Industrial application: production of important chemicals and biomolecules on methanol feedstock

Model organism: used to study methylotrophic metabolism

Member of the Methylobacteriaceae family: a taxonomically diverse family with many genes specific to one genome

31 Novel protein coding genes

70 gene model changes

104 methylotrophy gene products

2,678 Proteins

Limited conservation and low GC content of novel genes suggest lateral gene transfer as the probable mode of origin

Developing computational strategies to identify novel protein coding loci from MS data

Methods for identifying splice variants from proteomics data and discovering novel translation products in a eukaryotic model organism

1 Exon boundary peptide

Exon junction peptides for detecting splice variants

2 Splice variant

3 New exon

4 A new 3’ splice site

5 A new 5’ splice site

Junction peptide

Peptides map on INTRON

Peptides map on different translation frame

Peptides map on INTERGENIC

Peptides map on NON-CODING GENE

Peptides map on UTR

Peptides map on opposite strand
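
A simplified version of this kind of classification can be sketched by comparing a peptide's genomic coordinates with a single gene model. The gene model layout (a dict with `strand`, `start`, `end`, `exons` and `cds` keys, all 1-based inclusive) and the function name `classify` are assumptions made for this example. It does not check the reading frame or non-coding gene status, so peptides in a different translation frame are not separated out.

```python
def classify(pep_start, pep_end, pep_strand, gene):
    """Classify a mapped peptide relative to one gene model.

    Coordinates are 1-based and inclusive on the forward strand.
    """
    # Entirely outside the gene span.
    if pep_end < gene["start"] or pep_start > gene["end"]:
        return "intergenic"
    # Inside the gene span but read from the other strand.
    if pep_strand != gene["strand"]:
        return "opposite strand"

    exons = gene["exons"]
    inside_exon = any(s <= pep_start and pep_end <= e for s, e in exons)
    if not inside_exon:
        touches_exon = any(pep_start <= e and pep_end >= s for s, e in exons)
        # Partly exonic peptides hint at a new splice site or exon extension.
        return "exon boundary" if touches_exon else "intron"

    cds_start, cds_end = gene["cds"]
    if pep_end < cds_start or pep_start > cds_end:
        return "UTR"
    return "coding exon"


if __name__ == "__main__":
    # Toy gene model with two exons; numbers are illustrative only.
    gene = {"strand": "+", "start": 100, "end": 1000,
            "exons": [(100, 300), (600, 1000)], "cds": (200, 900)}
    for args in [(10, 40, "+"), (400, 430, "+"), (400, 430, "-"),
                 (110, 140, "+"), (250, 280, "+"), (290, 320, "+")]:
        print(args, classify(*args, gene))
```

Peptides that land in "coding exon" support the existing annotation; every other category is a candidate novel peptide that needs further inspection.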

Eukaryotic Proteogenomics

Gene

Peptides

Novel Peptides

Prokaryotic Proteogenomics

Proteogenomics: Prokaryotic vs. Eukaryotic

Eukaryotic Proteogenomics

>TCONS_00006262 gene=XLOC_004176 loc:1|58177-58500|-

ATTTTGGAGTTGTGTAGCCAAT………………………………………………………………………………………………..

>TCONS_00006264 gene=XLOC_004177 loc:1|169401-172238|-

AAGGTTCAAGGTACAAGGTGGGGTATGCC……………………………………………………………………………………

>TCONS00006267_420_548_3

TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC…………………………

>TCONS00006268_769_999_1

SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF……………………….

1 Raw RNA-Seq reads from NCBI-SRA repository

2 Read QC and processing using Trimmomatic

3 Filtered read mapping on reference genome using STAR aligner

4 Transcript assembly by Cufflinks

5 Assembly QC and comparison using cuffcompare and BLAST

6 Fasta of all transcripts generated using gffread

7 Theoretical translated protein database

RNA-seq analysis pipeline to capture transcriptome
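
The last step, turning assembled transcripts into a theoretical translated protein database, might look roughly like the sketch below. It rests on several assumptions: only the three forward frames are translated, each frame is split at stop codons, and entries are named `<transcript>_<start>_<end>_<frame>` with 1-based nucleotide coordinates. That naming is inferred from the example headers shown with this pipeline and may not match the actual pipeline. The function name `orf_database` and the `min_len` cut-off are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def orf_database(transcripts, min_len=10):
    """transcripts: dict of transcript id -> nucleotide sequence.

    Returns a list of (header, protein) pairs, one per stop-to-stop
    segment of at least min_len residues in each forward frame.
    """
    entries = []
    for tid, seq in transcripts.items():
        seq = seq.upper()
        for frame in range(3):
            protein = translate(seq[frame:])
            aa_start = 0
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    nt_start = frame + 3 * aa_start + 1   # 1-based
                    nt_end = nt_start + 3 * len(segment) - 1
                    header = f"{tid}_{nt_start}_{nt_end}_{frame + 1}"
                    entries.append((header, segment))
                aa_start += len(segment) + 1              # skip the stop codon
    return entries


if __name__ == "__main__":
    toy = {"T1": "ATGAAGAAGCTGTAAATGCCC"}   # toy transcript, not real data
    for header, protein in orf_database(toy, min_len=4):
        print(">" + header)
        print(protein)
```

The resulting FASTA entries can then be searched with the same peptide identification engines as a conventional protein database.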

GenoSuite

OMSSA

X!TANDEM

8 Tandem mass spectra

9 Peptide identification

EuGenoSuite: Integrates transcriptomics to proteomics

10 Protein grouping/Protein assembler

Novel / Known categorization

Organism   Genome Size$ (Mb)   Annotated Proteins*
Human      3,284.83            104,763
Mouse      2,796.64            52,165
Rat        2,909.70            25,725

* Ensembl release 74   $ NCBI Genome

Genome size and annotation comparison

Rattus norvegicus

Brain

Liver

Spleen

Testes

Kidney

Colon

Muscle

Lung

Heart

9 tissues and 3 replicates for each

Sequencing instruments

• HiSeq 2000

• Illumina GAII

Case study dataset

Sample 1 Sample 2 Sample 3

T1 T2 T1 T2 T1 T2

T1: Technical Replicate 1 T2: Technical Replicate 2

11,725 Peptides (1%FDR, identified in both T1 and T2)

EuGenoSuite

Transcriptomic analysis pipeline

400 million paired-end reads

2 Million MS/MS spectra

312 Novel Peptides (275 unique mapping)

11,413 mapped to known proteins

45 Spliced peptides

145 intergenic

18 different frame

28 non-coding loci

25 UTR

14 intronic

Discovery of a splice variant for threonyl-tRNA synthetase

Translation of Pseudogene

Pseudogene

Paralog (PCBP2)

105,380 unique transcripts assembled

≈2,900 Annotated proteins identified

Transcripts and peptides for Eight Pseudogenes

Translation of exons annotated as non-coding (15 genes)

45 splice variants detected

Rat Analysis Summary

Translation of a novel gene locus

Summary

Part 1

• Proteomics data, when searched against a genomic background, aids novel protein discovery

• GenoSuite: A fully automated multi-algorithmic proteomics and proteogenomics analysis tool

• Comprehensive proteogenomic analysis of B. japonicum improves protein annotation of rhizobia

• N-terminal acetylation of bacterial proteins

Part 2

• Integrated analysis of RNA-seq and mass spectrometry proteomics data tracks down novel protein isoforms

• EuGenoSuite: An in-house pipeline for eukaryotic proteogenomics

• Translation of pseudogenes in rat microglia

Conclusion

Proteomics

Genomics

Transcriptomics

Data Integration

Novel Discovery

Genome Annotation

GenoSuite

EuGenoSuite

Acknowledgements

IGIB IT Team

IGIB friends and family

&IOB team

Thank you
