![Page 1: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/1.jpg)
11th Sep 2014
Debasis Dash
Workflows and Pipelines for NGS analysis: Lessons from proteomics
Conference on Applying NGS in Basic research Health care and Agriculture
![Page 2: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/2.jpg)
ATGAAGAAGCTGTGTGCTTTCACTATTGCCTTTTTTTCCCTGAAGTTTTGTCTCATCTTGTGCAGTTTGACTGAACCCAATTGCTTTTGGAAGATAAAGAAGAGAGAAGTTAATGATGGAGATTTGCAAAATGAGTGTGGTTTTGTCCTTTTTACACTTGAGAGCCCTATTGAAGAAAATTTTTATAATCACATTATTAATTTTAGGATACCAGCAAGAAAATATGAATTTTTTCTGGTAATGTTTTTTGCTACTGATGAGATCAACAAGAATCCTTATCTTTTATCCAACATGTCTTTGATATTTTCCTTCATTTTTGGTATGTGTGAAGATACAATGGGAGTTCTGGATAAAGCATATTTACATCAAAACAACTATTTCGATCTACTTAATTATAACTGTGGAAGAAAGAAACGTTGTGATGTAAAACTTACAGGACCATCATGGAAAACTTCCTTAAAACTTTCAGTTAATTCAAGGGCACCAAAGATTTTCTTTGGACCATTTAATCCTAACCTGAGTGACCATGACCAGTTTCCCTATATCTATCAGATAGCAACCAAGGACACATATTTGCTCCATGGCATGGTCTCCTTGATGTTTCATTTTGAATGGACTTGGATAGGACTGATCATCACAGATGATGACCAAGGTATTCAGTTTCACTCAGACTTGAGAGAAGAAATGCAAAGGCATGCGATCTGTTTAGCTTTTGTGATTATGATCCCAGAAAGCATTAAGTTATACAACACAAAGTTTAAGATATATGACCAACAACTTATGACATCTTCAGCAAAGGTTACTATCATTTATGGCAAAATGATCTCCACTCTAGAACTCAACTTTGCAAGATGGACATATTTAGTTGCACGGAGAATCTGGATCACAACCTCAAAATTGGATGTCATCACATATGATAAAGATTTCAGCCTTGATTTCTTCCACGGGACTGTCATTTTTGCCCACCACCACAATGACATCGCTACATTTAGAAATTTTATGCAAATAATAAACACATCCAAGTATCCAGTAGATATTTCTCAGTCTATGGGGCAGTGGAATCATTTTAACTGTTCAATCTCAAAGAACAAGAAGAAAATGGATTTTTTTATGTTGAAAAACCCAATGGAATGGTTAACACAGCACACATTTGACATGGTCCTGAGTGAAGAAGGTTACAATTTGTATAATGCTGTGTATGCTGTGGCCCACACCTATCACGAACTCATTTTTCAACAAGTAGAGTCTCAGGAAATGGCCAAACCCAAAGGACTATTCACTGACTGTCAGCAGGTGGCTTCTTTGCTTAAAACTAGGGTATTTACTAACCCTGTTGGAGAGCTGGTGAACATGAATCATAAGGAAAATCAGTGTGCCAAGTATGACATTTTCATCATTTGGAATTTTCCAAATGGCCTTGGATTAAAAGTGAAAATAGGAAGCTATTTTCCTTGTTTGCAACAGAGTCAACATCTTCATATATCTGAAGACTGGGAGTGGGTTACAGGAGAAACATTGGTTCCCTCCTCAGTGTGTAGTGAGACATGTACTGCAGGATTCAGAAAAAGTCATCAGAAACAAACAGCCAACTGCTGCTTTGATTGTGTCCAGTGCCAAGAAAATGAGATTGCCAAT
Where are the protein-coding genes in a genome?
http://www.picgifs.com/
Genome annotation
![Page 3: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/3.jpg)
Genome annotation
Transcriptome
Proteome
Structural biology
Reactome
Metabolome
Interactome
Systems biology
Importance of genome annotation
Armengaud J. Proteogenomics and systems biology: quest for the ultimate missing parts. Expert Rev Proteomics. 2010
![Page 4: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/4.jpg)
Solving a puzzle when pieces are missing or broken
http://www.puzzlewarehouse.com/missing-pieces/
![Page 5: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/5.jpg)
How are proteins detected from samples?
Experimental path: Protein Extraction → Protease Digestion → LCMS (MS1, MS/MS) → Experimental MS/MS Spectrum
Theoretical path: Protein Database → Theoretical Peptide digestion → Peptide fragmentation simulation → Theoretical MS/MS Spectrum
Both spectra are compared by the Peptide Spectrum Match Scorer
A high-throughput method of protein identification
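The theoretical path of this workflow can be sketched in a few lines. This is a minimal illustration, not the code of any search engine named in this talk: it assumes the common trypsin rule (cleave after K or R unless followed by P, no missed cleavages), singly charged b- and y-ions, and standard monoisotopic residue masses. The function names `tryptic_digest` and `fragment_ions` are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def fragment_ions(peptide):
    """Return singly charged b-ion and y-ion m/z lists for one peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


if __name__ == "__main__":
    for pep in tryptic_digest("MKAARPGK"):
        print(pep, fragment_ions(pep))
```

A scorer would then compare these theoretical m/z values with the peaks of each experimental MS/MS spectrum within a mass tolerance.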
![Page 6: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/6.jpg)
A large fraction of experimental spectra remains unidentified. Possible reasons:
Unknown modifications on the peptides
Limitations of the search algorithm
Noisy spectra
Spectra of non-peptidic origin
Peptides missing from the search database
Identified
Unidentified
Proteomics: Challenges
![Page 7: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/7.jpg)
[Figure: target and decoy score distributions with a threshold score; axis: scores]
Concatenated target-decoy search*
• FDR = 2 × decoy / (target + decoy)
Separate target and decoy search**
• FDR = decoy / target
* Nature Methods 4, 207–214 (2007)
** J. Proteome Res., 2008, 7 (01), pp 29–34
Controlling error rates through decoys
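The two estimates on this slide can be written out directly. The sketch below assumes that a higher score is better and that hits are counted at or above the threshold; the scores in the usage lines are invented for illustration.

```python
def count_hits(target_scores, decoy_scores, threshold):
    """Count target and decoy matches scoring at or above the threshold."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return targets, decoys


def fdr_concatenated(targets, decoys):
    """Concatenated target-decoy search: FDR = 2 * decoy / (target + decoy)."""
    total = targets + decoys
    return 2 * decoys / total if total else 0.0


def fdr_separate(targets, decoys):
    """Separate target and decoy searches: FDR = decoy / target."""
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    t, d = count_hits([10, 9, 8, 7, 3], [8.5, 2, 1], threshold=7)
    print(t, d, fdr_concatenated(t, d), fdr_separate(t, d))
```

In practice the threshold is lowered until the estimated FDR reaches the accepted level, for example 1%.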
![Page 8: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/8.jpg)
MassWiz: An advanced algorithm for peptide discovery
Intensity of matching peaks
Continuity of y-ions & b-ions
Neutral losses & Immonium ions
Fragment mass error sensitive scoring
Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J Proteome Res. 2011
![Page 9: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/9.jpg)
Data: ISB standard protein mix
http://regis-web.systemsbiology.net/PublicDatasets
Algorithm comparison: PSMs
![Page 10: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/10.jpg)
A large fraction of experimental spectra remains unidentified. Possible reasons:
Unknown modifications on the peptides
Limitations of the search algorithm
Noisy spectra
Spectra of non-peptidic origin
Peptides missing from the search database
Identified
Unidentified
Proteomics: Challenges
![Page 11: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/11.jpg)
Proteogenomics: An alternate proteomic search strategy
![Page 12: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/12.jpg)
Proteogenomics: An alliance of Genomics and Proteomics
Genome Annotation
Known Peptides
Novel Peptides
Proteomic identifications
Novel Gene
Gene model change
Gene on different frame
Gene on opposite strand
Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol. 2009
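The four outcomes on this slide amount to comparing where a peptide maps with where genes are annotated. The sketch below is one possible way to express that comparison, not the logic of any tool from this talk. It assumes 1-based inclusive coordinates, at most simple overlaps, and a hypothetical `Feature` record; the category names follow the slide, plus "known" for peptides that agree with the annotation.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def in_frame(pep, gene):
    """True if the peptide's codons line up with the gene's reading frame."""
    if gene.strand == "+":
        return (pep.start - gene.start) % 3 == 0
    return (gene.end - pep.end) % 3 == 0


def classify(pep, genes):
    hits = [g for g in genes if overlaps(pep, g)]
    if not hits:
        return "novel gene"          # maps where nothing is annotated
    same_strand = [g for g in hits if g.strand == pep.strand]
    if not same_strand:
        return "opposite strand"
    for g in same_strand:
        if in_frame(pep, g):
            inside = g.start <= pep.start and pep.end <= g.end
            # In frame but reaching past the annotated boundary suggests
            # that the gene model needs to change.
            return "known" if inside else "gene model change"
    return "different frame"


if __name__ == "__main__":
    genes = [Feature(100, 399, "+")]   # hypothetical annotation
    for pep in [Feature(130, 159, "+"), Feature(85, 114, "+"),
                Feature(131, 160, "+"), Feature(130, 159, "-"),
                Feature(500, 529, "+")]:
        print(pep, classify(pep, genes))
```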
![Page 13: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/13.jpg)
Genomics Proteomics
Lack of an analysis pipeline or software for integrating proteomics data with genome or genomics data
![Page 14: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/14.jpg)
Bridging the Gap
Developing computational strategies to identify novel protein coding loci from MS data
Methods for identifying splice variants from proteomics data and discovery of novel translation products in a eukaryotic model organism
![Page 15: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/15.jpg)
Proteogenomic analysis of Mycobacterium tuberculosis
![Page 16: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/16.jpg)
• Genome size: 4.4 Mb (Cole et al. 1998)
• 3924 ORFs annotated in the first genome draft.
• 3995 genes in re-annotation. (Camus et al 2002)
• 3988 protein coding genes (NCBI Refseq)
• 3987 protein coding genes (Sanger Institute)
• 3918 protein coding genes (TIGR/JCVI)
• 50% of the genes differ in translation initiation site (TIS) between the Sanger and TIGR annotations (deSouza et al. 2008)
• 4,012 protein coding genes (Tuberculist R21)
Does Mycobacterium tuberculosis need re-annotation?
![Page 17: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/17.jpg)
Identified hypothetical: 21%
Unidentified: 20%
Identified: 59%
123 LCMS runs of cell lysate and culture filtrate of Mtb H37Rv
3176 out of 3988 NCBI Refseq proteins (80% of the Mtb proteome) identified
Translational evidence for 829 Hypothetical proteins
233 of 829 hypothetical proteins identified for the first time
Deep proteome profiling is achieved
In collaboration with Dr. Akhilesh Pandey & IOB
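The percentages on this slide follow from the counts. The short check below assumes that the "Identified 59%" slice refers to identified proteins that are not hypothetical; that reading is an inference from the numbers, not something the slide states.

```python
refseq_total = 3988   # NCBI Refseq proteins
identified = 3176     # proteins identified by mass spectrometry
hypothetical = 829    # identified proteins annotated as hypothetical


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


shares = {
    "identified": pct(identified - hypothetical, refseq_total),
    "identified hypothetical": pct(hypothetical, refseq_total),
    "unidentified": pct(refseq_total - identified, refseq_total),
}
coverage = pct(identified, refseq_total)

if __name__ == "__main__":
    print(shares, "coverage:", coverage)
```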
![Page 18: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/18.jpg)
Conservation of Novel proteins
Kelkar DS, Kumar D, et al. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry. Mol Cell Proteomics. 2011
41 Novel protein coding loci
Changes in 79 existing gene models
TIS corrected for 33 proteins and confirmed for 868 proteins
Mtb H37Rv: Novel Translations
![Page 19: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/19.jpg)
![Page 20: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/20.jpg)
Database creation
Spectra processing
Peptide Assignment
FDR estimation
Peptide mapping to genome
Gene coordinate comparison
Peptide classification
Result reporting
Visualization
Challenges in proteogenomics
MassWiz
OMSSA
X!Tandem
InsPecT
A solution with complete automation and high fidelity of results is required
![Page 21: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/21.jpg)
![Page 22: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/22.jpg)
Integrating results from multiple algorithms: Set theory
Peptides identified by multiple algorithms have few false positives, but this method does not allow the false discovery rate to be controlled or estimated
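Set-based integration is easy to express, which is part of its appeal. The sketch below uses placeholder peptide strings and a hypothetical `consensus` helper; it shows the union, the intersection and an "at least N engines" vote. None of these operations yields an FDR estimate for the combined list, which is the limitation noted on this slide.

```python
def consensus(results, min_engines=2):
    """Peptides reported by at least `min_engines` search algorithms."""
    counts = {}
    for peptides in results.values():
        for pep in set(peptides):
            counts[pep] = counts.get(pep, 0) + 1
    return {pep for pep, n in counts.items() if n >= min_engines}


# Placeholder identifications, for illustration only.
results = {
    "MassWiz": {"PEPA", "PEPB", "PEPC"},
    "OMSSA": {"PEPA", "PEPB"},
    "X!Tandem": {"PEPA", "PEPD"},
    "InsPecT": {"PEPA"},
}

union = set().union(*results.values())
intersection = set.intersection(*results.values())

if __name__ == "__main__":
    print(sorted(union), sorted(intersection), sorted(consensus(results)))
```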
![Page 23: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/23.jpg)
Integrating results from multiple algorithms: FDRscore
OMSSA: E-value
X!Tandem: P-value
InsPecT: P-value
MassWiz: Score
Metrics from multiple algorithms are not comparable
FDR values from individual algorithms can be processed to generate a common score
Jones AR et al, Proteomics 2009
[Figure: for each algorithm, the native Score or P-value is converted to FDR, then to Q-value, then to FDRscore]
FDRscore-based result integration allowed statistical assessment (FDR) of the final results
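The first two conversions in this scheme, from a native score to FDR and from FDR to Q-value, can be sketched per algorithm as follows. This is a simplified illustration: it assumes higher scores are better (E-values or P-values would be sorted the other way), it uses the decoy/target estimate, and it does not reproduce the FDRscore interpolation of Jones et al. The PSM list is invented.

```python
def q_values(psms):
    """psms: list of (score, is_decoy). Returns (score, is_decoy, q) by rank."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))   # FDR at this score cutoff
    # Q-value: the lowest FDR at which this PSM would still be accepted,
    # i.e. the running minimum taken from the worst score upwards.
    qvals, best = [], float("inf")
    for fdr in reversed(fdrs):
        best = min(best, fdr)
        qvals.append(best)
    qvals.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]


if __name__ == "__main__":
    demo = [(10, False), (9, False), (8, True),
            (7, False), (6, False), (5, True)]
    for row in q_values(demo):
        print(row)
```

Because Q-values are on the same scale for every algorithm, they give a common basis for comparing identifications whose native metrics are not comparable.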
![Page 24: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/24.jpg)
59 novel proteins identified
• 51 Novel proteins with 2 or more unique peptides
• Single peptide hits are selected only if identified in at least 2 samples and after manual inspection
49 gene model changes identified
• Translation start site suggested upstream of the current annotation
TIS confirmed for 21 genes
• TIS correction for 1 gene
Novel Proteome of B. japonicum
![Page 25: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/25.jpg)
FgeneSB operon
A novel protein reveals a novel operon
• ORF length
• Codon Bias
• Promoter region
• Ribosome binding site
![Page 26: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/26.jpg)
A gene model change
![Page 27: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/27.jpg)
Novel proteins are short
TTG start codon in Gene model changes
Novel peptides are distributed throughout the genome
Is there a common theme of novel identifications?
Most novel proteins are short proteins
![Page 28: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/28.jpg)
A methylotroph: an organism able to grow on reduced carbon compounds such as methanol or methylamine
Ecologically important: supports vegetation by producing phytohormones
Industrial application: production of important chemicals and biomolecules on methanol feedstock
Model organism: used to study methylotrophic metabolism
Member of the Methylobacteriaceae family: a diverse taxon with many genes specific to one genome
![Page 29: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/29.jpg)
31 Novel protein coding genes
70 gene model changes
104 methylotrophy gene products
2,678 Proteins
![Page 30: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/30.jpg)
Limited conservation and low GC content of the novel genes suggest lateral gene transfer as their probable mode of origin
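The GC comparison behind this observation is a simple ratio. The sketch below uses a hypothetical `gc_content` helper and made-up sequences; in a real analysis the candidate gene's value would be compared with the genome-wide GC fraction.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    # Invented sequences, for illustration only.
    novel_gene = "ATGAAATTTGCATTAAATTGA"
    background = "ATGGCCGGCGCGCTGGCCGACTGA"
    print(round(gc_content(novel_gene), 2), round(gc_content(background), 2))
```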
![Page 31: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/31.jpg)
Developing computational strategies to identify novel protein coding loci from MS data
Methods for identifying splice variants from proteomics data and discovery of novel translation products in a eukaryotic model organism
![Page 32: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/32.jpg)
1 Exon boundary peptide
2 Splice variant
3 New exon
4 A new 3’ splice site
5 A new 5’ splice site
Exon junction peptides for detecting splice variants
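Whether a peptide is a junction peptide can be decided from coordinates alone once the peptide is placed on the spliced transcript. The sketch below assumes 0-based, half-open nucleotide coordinates on the transcript and a list of exon lengths in transcript order; the helper names and the example numbers are hypothetical.

```python
from itertools import accumulate


def junctions(exon_lengths):
    """Transcript positions where one exon ends and the next begins."""
    return list(accumulate(exon_lengths))[:-1]


def spanned_junctions(pep_start, pep_end, exon_lengths):
    """Junctions lying strictly inside the peptide's span [pep_start, pep_end)."""
    return [j for j in junctions(exon_lengths) if pep_start < j < pep_end]


if __name__ == "__main__":
    exons = [90, 60, 120]                       # invented exon lengths in nt
    print(junctions(exons))                     # boundaries on the transcript
    print(spanned_junctions(75, 105, exons))    # crosses the first junction
    print(spanned_junctions(0, 30, exons))      # lies within exon 1
```

A peptide that spans a junction absent from the annotation is evidence for one of the events listed on this slide, such as a new exon or a new splice site.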
![Page 33: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/33.jpg)
Eukaryotic Proteogenomics
[Figure: Gene, Peptides and Novel Peptides, with novel peptides grouped by where they map]
• Junction peptide
• Peptides map on a different translation frame
• Peptides map on INTERGENIC
• Peptides map on NON-CODING GENE
• Peptides map on UTR
• Peptides map on INTRON
• Peptides map on opposite strand
![Page 34: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/34.jpg)
Prokaryotic Proteogenomics
Proteogenomics: Prokaryotic vs. Eukaryotic
Eukaryotic Proteogenomics
![Page 35: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/35.jpg)
>TCONS_00006262 gene=XLOC_004176 loc:1|58177-58500|-
ATTTTGGAGTTGTGTAGCCAAT………………………………………………………………………………………………..
>TCONS_00006264 gene=XLOC_004177 loc:1|169401-172238|-
AAGGTTCAAGGTACAAGGTGGGGTATGCC……………………………………………………………………………………
>TCONS00006267_420_548_3
TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC…………………………
>TCONS00006268_769_999_1
SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF……………………….
1 Raw RNA-seq reads from NCBI SRA repository
2 Read QC and processing using Trimmomatic
3 Filtered read mapping on reference genome using STAR aligner
4 Transcript assembly by Cufflinks
5 Assembly QC and comparison using cuffcompare and BLAST
6 FASTA of all transcripts generated using gffread
7 Theoretical translated protein database
RNA-seq analysis pipeline to capture transcriptome
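Step 7, turning assembled transcripts into a searchable protein database, can be sketched as a three-frame, stop-to-stop translation. This is an illustration under stated assumptions, not the pipeline's actual code: the header layout `id_start_end_frame` is inferred from the example entries on this slide (for instance, a start of 420 corresponds to frame 3 under this scheme), the `min_aa` cutoff is arbitrary, and `translate_frames` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frames(tid, seq, min_aa=10):
    """Translate three forward frames; emit stop-free stretches as records.

    Returns (header, protein) pairs with header = id_start_end_frame,
    where start and end are 1-based nucleotide positions on the transcript.
    """
    seq = seq.upper()
    records = []
    for frame in range(3):
        start_nt, peptide = frame, []            # 0-based segment start
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")   # X for ambiguous codons
            if aa == "*":
                if len(peptide) >= min_aa:
                    header = f"{tid}_{start_nt + 1}_{i}_{frame + 1}"
                    records.append((header, "".join(peptide)))
                start_nt, peptide = i + 3, []
            else:
                peptide.append(aa)
        if len(peptide) >= min_aa:               # stretch running to the end
            end = start_nt + 3 * len(peptide)
            records.append((f"{tid}_{start_nt + 1}_{end}_{frame + 1}",
                            "".join(peptide)))
    return records


if __name__ == "__main__":
    for header, protein in translate_frames("T1", "ATGGCCTAAGGGTTT", min_aa=2):
        print(">" + header)
        print(protein)
```

Only forward frames are translated here, on the assumption that the transcript FASTA is already written in the transcript's own orientation.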
![Page 36: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/36.jpg)
>TCONS_00006262 gene=XLOC_004176 loc:1|58177-58500|-
ATTTTGGAGTTGTGTAGCCAAT………………………………………………………………………………………………..
>TCONS_00006264 gene=XLOC_004177 loc:1|169401-172238|-
AAGGTTCAAGGTACAAGGTGGGGTATGCC……………………………………………………………………………………
>TCONS00006267_420_548_3
TQTHIGQGRDEYLYDSHGSLSRPSSMSTSLPFNRASEHGICC…………………………
>TCONS00006268_769_999_1
SSKVWWLKYTWPMASGSVRRYGLFGVDVAFEEVCHCGGGMGF……………………….
8 Tandem mass spectra
9 Peptide identification
GenoSuite
OMSSA
X!TANDEM
10 Protein grouping/Protein assembler
Novel / Known categorization
EuGenoSuite: Integrates transcriptomics with proteomics
![Page 37: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/37.jpg)
| Organism | Genome Size$ (Mb) | Annotated Proteins* |
|---|---|---|
| Human | 3,284.83 | 104,763 |
| Mouse | 2,796.64 | 52,165 |
| Rat | 2,909.70 | 25,725 |

* Ensembl release 74
$ NCBI Genome
Genome size and annotation comparison
![Page 38: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/38.jpg)
Rattus norvegicus
Brain
Liver
Spleen
Testes
Kidney
Colon
Muscle
Lung
Heart
9 tissues, 3 replicates for each
Sequencing instruments
• HiSeq 2000
• Illumina GAII
Case study dataset
Sample 1 Sample 2 Sample 3
T1 T2 T1 T2 T1 T2
T1: Technical Replicate 1 T2: Technical Replicate 2
![Page 39: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/39.jpg)
400 million paired-end reads
Transcriptomic analysis pipeline
2 million MS/MS spectra
EuGenoSuite
11,725 peptides (1% FDR, identified in both T1 and T2)
11,413 mapped to known proteins
312 novel peptides (275 unique mapping)
• 45 spliced peptides
• 145 intergenic
• 18 different frame
• 28 non-coding loci
• 25 UTR
• 14 intronic
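The counts on this slide are internally consistent, as a short check shows. The grouping of the six categories under the uniquely mapping novel peptides is an assumption suggested by the totals, not stated on the slide.

```python
total_peptides = 11725
known = 11413
novel = 312
novel_unique = 275

# Category counts as given on the slide.
categories = {
    "spliced": 45,
    "intergenic": 145,
    "different frame": 18,
    "non-coding loci": 28,
    "UTR": 25,
    "intronic": 14,
}

known_plus_novel = known + novel
category_sum = sum(categories.values())

if __name__ == "__main__":
    print(known_plus_novel == total_peptides, category_sum == novel_unique)
```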
![Page 40: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/40.jpg)
Discovery of a splice variant of threonyl-tRNA synthetase
![Page 41: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/41.jpg)
Translation of Pseudogene
Pseudogene
Paralog (PCBP2)
![Page 42: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/42.jpg)
105,380 unique transcripts assembled
≈2,900 Annotated proteins identified
Transcripts and peptides for Eight Pseudogenes
Translation of exons annotated as non-coding (15 genes)
45 splice variants detected
Rat Analysis Summary
![Page 43: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/43.jpg)
Translation of a novel gene locus
![Page 44: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/44.jpg)
Summary
Part 1
• Proteomics data, when searched against a genomic background, aids novel protein discovery
• GenoSuite: a fully automated multi-algorithmic proteomics and proteogenomics analysis tool
• Comprehensive proteogenomic analysis of B. japonicum improves protein annotation of rhizobia
• N-terminal acetylation of bacterial proteins
Part 2
• Integrated analysis of RNA-seq and mass spectrometry proteomics data tracks down novel protein isoforms
• EuGenoSuite: an in-house pipeline for eukaryotic proteogenomics
• Translation of pseudogenes in rat microglia
![Page 45: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/45.jpg)
Conclusion
Proteomics
Genomics
Transcriptomics
Data Integration
Novel Discovery
Genome Annotation
GenoSuite
EuGenoSuite
![Page 46: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/46.jpg)
Acknowledgements
IGIBIT Team
IGIB friends and family
& IOB team
![Page 47: Workflows and Pipelines for NGS analysis: Lessons from … · • Genome size: 4.4 mb (1998) Cole et al. • 3924 ORFs annotated in the first genome draft. • 3995 genes in re-annotation](https://reader034.vdocuments.mx/reader034/viewer/2022050308/5f7099fdcf45b341625ffc0a/html5/thumbnails/47.jpg)
Thank you