next generation sequencing data analysis with biohpc · 2019-06-27 · next generation sequencing...

Next Generation Sequencing Data Analysis with BioHPC

Updated for 2015-04-15

Next Generation Sequencing

Genomic and transcriptomic sequencing are now commonplace in projects. Now very cheap!

UTSW McDermott Core typical pricing:

Whole Genome PE100 $7,500

Whole Transcriptome PE100 $875

Most common experiment across the University:

Use RNA-Seq to identify gene expression changes in response to a stimulus / caused by a disease.

Let’s focus on this today, but you can do other things on our systems!

Typical RNA Seq Workflow

Prepare / obtain samples for the different conditions (A, B)

Extract RNA and prepare library for sequencing

Run library on Illumina sequencer

Obtain short-read sequences

Typical RNA Seq Workflow – Data Analysis

Check quality and/or filter reads

Align to the genome or transcriptome

Quantify transcript abundance across conditions

Identify significant differences in expression between conditions
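
The last two steps above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not what any particular pipeline does internally: the read counts are made up, the gene names are hypothetical, and the function names `counts_per_million` and `log2_fold_change` are invented for this example. It normalizes counts by library size and reports a log2 fold change between two conditions.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts by library size so samples are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }

# Made-up counts for illustration only (not from any real sample).
condition_a = {"geneX": 100, "geneY": 900}
condition_b = {"geneX": 400, "geneY": 600}

lfc = log2_fold_change(counts_per_million(condition_a),
                       counts_per_million(condition_b))
for gene, value in sorted(lfc.items()):
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

A fold change alone does not tell you whether a difference is significant. Real differential expression tools also model the variability between replicates, which is why replicates matter in the experimental design.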

Why Use BioHPC?

Powerful processing and lots of storage!

72 x 32/48 core nodes to run mapping that can take days on a PC.

>1 Petabyte of storage. No need to shuffle data between drives when working on large projects.

Storage is fast. Best at the things NGS analysis does most (accessing large files sequentially).

Tools to help you!

NGS Pipeline: Standard workflows with little effort. Best for beginners.

Galaxy: Powerful environment with many tools and a workflow designer.

Batch Scripts: Various NGS tools available as modules on the cluster, for expert users.

But the sequencing core does it for me…

This is NOT an attempt to replace the comprehensive services that a sequencing core provides, where data analysis is performed as part of the sequencing service.

But…

– It is now common to need to integrate existing public data into projects

– It is common to obtain data from collaborators and outside facilities

– Labs often have students/postdocs who have received NGS analysis training

– More flexibility: many tools are available to create complex pipelines

Use our services with caution.

You *should* have a basic understanding of the limitations of the techniques.

Option 1 - BioHPC NGS Pipeline

BioHPC Portal -> Cloud Services -> NGS Pipeline (ngs.biohpc.swmed.edu)

Common workflows, made easy.

Currently: RNA-Seq Differential Expression Analysis

Option 2 - BioHPC Galaxy Service

BioHPC Portal -> Cloud Services -> Galaxy (galaxy.biohpc.swmed.edu)

Reproducible workflows, with many available tools, via the web.

Widely used by many institutions.

Option 3 - Modules and Sequence Data / Indices

module avail

/project/apps_database/iGenomes

Common NGS tools and Illumina iGenome databases are available on the cluster.

Experts can write their own pipelines using cluster sbatch jobs.
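
For expert users, a batch job under Option 3 might look roughly like the script generated below. This is a minimal sketch under several assumptions: the module name `tophat`, the sample file names, and the path beneath the iGenomes directory are all placeholders, not confirmed names on the cluster, and `make_sbatch_script` is a helper invented for this example. Check `module avail` and browse the iGenomes directory for the real values.

```python
def make_sbatch_script(job_name, module_name, command, hours=4):
    """Return the text of a minimal Slurm batch script.

    No partition is set here because partition names are site-specific.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",
        f"#SBATCH --time={hours:02d}:00:00",
        f"#SBATCH --output={job_name}.%j.out",  # %j expands to the job ID
        "",
        f"module load {module_name}",
        "",
        command,
        "",
    ]
    return "\n".join(lines)

# Placeholder path: the layout under the iGenomes directory is an assumption.
index_prefix = "/project/apps_database/iGenomes/PATH/TO/Bowtie2Index/genome"

script = make_sbatch_script(
    job_name="tophat_sample1",
    module_name="tophat",  # hypothetical module name
    command=(
        "tophat -p 8 -o sample1_out "
        f"{index_prefix} sample1_R1.fastq sample1_R2.fastq"
    ),
)
print(script)
```

The printed text can be saved to a file and submitted with `sbatch`. Generating the script from a function makes it easy to produce one job per sample with consistent settings.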

Today we are going to…

Follow a simple and a real-world RNA-Seq differential expression analysis using:

• The BioHPC NGS Pipeline

• BioHPC Galaxy Service

Try it out with your own data!

Email [email protected] with questions

Bring your problems to the BioHPC drop-in coffee session next week!

Example 1 – Brain vs Adrenal

A ‘toy’ example I can show you in real time (hopefully!)

75,000 reads from chr19, extracted from a larger study

2 Conditions – brain tissue vs adrenal tissue

What’s the difference in expression for the limited number of transcripts we can see in this data?

Courtesy Galaxy Project, Illumina Body Map:

https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise

Example 2 – HIF1α, HIF2α Single and Double siRNA knock-down in MCF7 Cells

A real study from a lab I used to work in.

Public data downloaded via EMBL-EBI ArrayExpress.

We’ll take 4 conditions, all samples MCF7 cells subjected to hypoxia:

– Control (scrambled siRNA)

– HIF1A knock-down by siRNA

– HIF2A knock-down by siRNA

– HIF1A + HIF2A double knock-down by siRNA

2 replicates for each condition. Illumina HiSeq platform.

Extensive regulation of the non‐coding transcriptome by hypoxia: role of HIF in releasing paused RNApol2

Hani Choudhry, Johannes Schödel, Spyros Oikonomopoulos, Carme Camps, Steffen Grampp, Adrian L Harris, Peter J Ratcliffe, Jiannis Ragoussis, David R Mole

DOI 10.1002/embr.201337642 | Published online 21.12.2013 | EMBO reports (2014) 15, 70-76
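
The design of Example 2 (4 conditions, 2 replicates each) can be written down as a sample sheet before any analysis starts. The sketch below is illustrative only: the condition labels and sample names are invented for this example and are not the identifiers used in ArrayExpress, and comparing each knock-down against the control is an assumption about how one might set up the contrasts.

```python
from itertools import product

# Hypothetical labels for the four conditions described above.
CONDITIONS = ["control", "HIF1A_kd", "HIF2A_kd", "HIF1A_HIF2A_kd"]
REPLICATES = [1, 2]

def build_sample_sheet(conditions, replicates):
    """One row per (condition, replicate) pair."""
    return [
        {"sample": f"{cond}_rep{rep}", "condition": cond, "replicate": rep}
        for cond, rep in product(conditions, replicates)
    ]

samples = build_sample_sheet(CONDITIONS, REPLICATES)
print(f"{len(samples)} samples")
for row in samples:
    print(row["sample"], row["condition"], row["replicate"], sep="\t")

# Assumed contrasts: each knock-down versus the scrambled-siRNA control.
comparisons = [("control", cond) for cond in CONDITIONS if cond != "control"]
for baseline, treatment in comparisons:
    print(f"compare {treatment} vs {baseline}")
```

Keeping the design in one place like this helps when assigning input files to conditions in the NGS Pipeline or Galaxy, where a mislabeled replicate is easy to miss.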

TopHat / Cufflinks Pipeline

NGS Pipeline Demo

See Handouts

Galaxy Demo

See Handouts

Acknowledgements

NGS Pipeline

Developed by Yi Du in conjunction with CRI, Zhiyu Zhao.

Galaxy

Many thanks to John Chilton, Martin Chech, Nicola Soranzo, Andrew Robinson, Dannon Baker for assistance incorporating BioHPC required changes into the Galaxy project.