next generation sequencing data analysis with biohpc · 2019-06-27 · next generation sequencing...
Next Generation Sequencing Data Analysis with BioHPC
Updated for 2015-04-15
Genomic, transcriptomic sequencing now
commonplace in projects. Now very cheap!
UTSW McDermott Core typical pricing:
Whole Genome PE100 $7,500
Whole Transcriptome PE100 $875
Most common experiment across the University:
Use RNA-Seq to identify gene expression
changes in response to a stimulus / caused by
a disease.
Next Generation Sequencing
Let’s focus on this today – but you can do
other things on our systems!
Typical RNA Seq Workflow
Prepare / obtain samples for different conditions (A, B)
Extract RNA and prepare library for sequencing
Run library on Illumina sequencer
Obtain short-read sequences
Typical RNA Seq Workflow – Data Analysis
Check quality and/or filter reads
Align to the genome or transcriptome
Quantify transcript abundance across conditions
Identify significant differences in expression between conditions
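As an illustration of the first step (check quality and/or filter reads), here is a minimal sketch of a mean-quality filter. It assumes FASTQ input with Phred+33 quality encoding; the function names, the threshold of 20, and the two example reads are hypothetical and not part of the BioHPC pipeline. Dedicated QC tools do much more than this.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines (4 lines per read)."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return mean(ord(c) - offset for c in quality)


def filter_reads(lines, min_mean_quality=20):
    """Keep reads whose mean base quality is at least the threshold."""
    return [rec for rec in parse_fastq(lines)
            if mean_phred(rec[2]) >= min_mean_quality]


# Two made-up reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
kept = filter_reads(example)
print([header for header, _, _ in kept])  # prints ['@read1']
```
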
Powerful processing and lots of storage!
72 x 32/48 core nodes to run mapping that can take days on a
PC.
>1 petabyte of storage. No need to shuffle data between drives
when working on large projects.
Storage is fast. Best at the things NGS analysis does most
(accessing large files sequentially)
Why Use BioHPC?
Tools to help you!
NGS Pipeline: Standard workflows with little effort. Best for beginners.
Galaxy: Powerful environment with many tools, workflow designer.
Batch Scripts: Various NGS tools available as modules on the cluster, for expert users.
This is NOT an attempt to replace comprehensive services that a sequencing core
provides, where data analysis is performed as part of the sequencing service.
But…
–Now common to need to integrate existing public data into projects
–Common to obtain data from collaborators, outside facilities
–Labs often have students/postdocs who have received NGS analysis training
–More flexibility – many tools available to create complex pipelines
But the sequencing core does it for me….
Use our services with caution.
You *should* have a basic understanding of the limitations of the techniques.
Option 1 - BioHPC NGS Pipeline
BioHPC Portal -> Cloud Services -> NGS Pipeline (ngs.biohpc.swmed.edu)
Common workflows, made easy. Currently RNA-SEQ Differential Expression Analysis
Option 2 - BioHPC Galaxy Service
BioHPC Portal -> Cloud Services -> Galaxy (galaxy.biohpc.swmed.edu)
Reproducible workflows, with many available tools, via the web. Widely used by many institutions.
Option 3 - Modules and Sequence Data / Indices
module avail
/project/apps_database/iGenomes
Common NGS tools and Illumina iGenome databases are available on the cluster.
Experts can write their own pipelines using cluster sbatch jobs.
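A hedged sketch of what an expert-user batch job could look like: the helper below only builds the text of an sbatch script for one paired-end TopHat alignment. Only the iGenomes root path and the 32-core node size come from the slides. The module name `tophat`, the hg19 build, the sub-directories under the iGenomes root (the usual iGenomes layout), the time limit, and the file names are all assumptions; check `module avail` and the directory itself before using anything like this.

```python
def make_tophat_job(sample, reads_1, reads_2,
                    igenomes_root="/project/apps_database/iGenomes",
                    genome="Homo_sapiens/UCSC/hg19",  # assumed genome build
                    threads=32):
    """Return the text of a SLURM batch script aligning one sample with TopHat."""
    # Assumed layout below the iGenomes root (standard iGenomes structure).
    index = f"{igenomes_root}/{genome}/Sequence/Bowtie2Index/genome"
    gtf = f"{igenomes_root}/{genome}/Annotation/Genes/genes.gtf"
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name=tophat_{sample}",
        "#SBATCH --nodes=1",
        "#SBATCH --time=24:00:00",  # assumed time limit
        f"#SBATCH --output=tophat_{sample}.%j.out",
        "",
        "module load tophat  # assumed module name; check `module avail`",
        f"tophat -p {threads} -G {gtf} -o tophat_{sample} "
        f"{index} {reads_1} {reads_2}",
        "",
    ])


# Hypothetical sample and file names, for illustration only.
script = make_tophat_job("control_rep1", "control_rep1_1.fastq", "control_rep1_2.fastq")
print(script)
```

The printed text could be saved to a file and submitted with `sbatch`. Generating one script per sample keeps each alignment an independent job on its own node.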
Today we are going to…
Follow a simple and real-world RNA-SEQ differential expression analysis using:
• The BioHPC NGS Pipeline
• BioHPC Galaxy Service
Try it out with your own data!
Email [email protected] with questions
Bring your problems to the BioHPC drop-in coffee session next week!
A ‘toy’ example I can show you in real time (hopefully!)
75,000 reads from chr19, extracted from a larger study
2 Conditions – brain tissue vs adrenal tissue
What’s the difference in expression for the limited number
of transcripts we can see in this data?
Courtesy Galaxy Project, Illumina Body Map:
https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
Example 1 – Brain vs Adrenal
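To make "difference in expression" concrete, here is a small sketch of a log2 fold change between two conditions. The gene names and abundance values are invented for illustration and are not taken from the exercise data; the pseudocount of 1 is an assumption to avoid dividing by zero. This gives an effect size only. Deciding whether a difference is significant is the job of the pipeline's statistical test.

```python
import math


def log2_fold_change(value_a, value_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A, with a pseudocount."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


# Hypothetical abundance estimates (e.g. FPKM) for three made-up transcripts.
abundance = {
    "geneX": {"brain": 3.0, "adrenal": 15.0},
    "geneY": {"brain": 7.0, "adrenal": 1.0},
    "geneZ": {"brain": 5.0, "adrenal": 5.0},
}

for gene, values in abundance.items():
    lfc = log2_fold_change(values["brain"], values["adrenal"])
    print(gene, lfc)
# prints:
# geneX 2.0
# geneY -2.0
# geneZ 0.0
```

A positive value here means higher abundance in adrenal than in brain; a negative value means the reverse.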
A real study from a lab I used to work in.
Public data downloaded via EMBL-EBI ArrayExpress.
We’ll take 4 conditions, all samples MCF7 cells subjected to hypoxia:
– Control (scrambled siRNA)
– HIF1A knock-down by siRNA
– HIF2A knock-down by siRNA
– HIF1A + HIF2A double knock-down by siRNA
2 replicates for each condition. Illumina HiSeq platform.
Example 2 – HIF1α, HIF2α Single and Double siRNA knock-down in MCF7 Cells
Extensive regulation of the non‐coding transcriptome by hypoxia: role of HIF in releasing paused RNApol2
Hani Choudhry, Johannes Schödel, Spyros Oikonomopoulos, Carme Camps, Steffen Grampp, Adrian L Harris, Peter J Ratcliffe, Jiannis Ragoussis, David R Mole
DOI 10.1002/embr.201337642 | Published online 21.12.2013 | EMBO reports (2014) 15, 70-76
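The design of Example 2 (4 conditions, 2 replicates each) can be written down as a sample list before any analysis starts. The sketch below is illustrative only: the sample labels are hypothetical, and comparing each knock-down against the control is an assumed choice of contrasts, not something stated in the study description above.

```python
from itertools import product

# Condition labels are hypothetical short names for the four conditions.
conditions = ["control", "HIF1A_kd", "HIF2A_kd", "HIF1A_HIF2A_kd"]
replicates = [1, 2]

# One entry per sequenced sample: 4 conditions x 2 replicates = 8 samples.
samples = [f"{cond}_rep{rep}" for cond, rep in product(conditions, replicates)]

# Assumed contrasts: every knock-down versus the scrambled-siRNA control.
comparisons = [("control", cond) for cond in conditions if cond != "control"]

print(len(samples))  # prints 8
print(comparisons)
```
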
TopHat / Cufflinks Pipeline
NGS Pipeline Demo
See Handouts
Galaxy Demo
See Handouts
NGS Pipeline
Developed by Yi Du in conjunction with CRI, Zhiyu Zhao.
Galaxy
Many thanks to John Chilton, Martin Chech, Nicola Soranzo, Andrew Robinson, Dannon Baker for
assistance incorporating BioHPC-required changes into the Galaxy project.
Acknowledgements