Sarek: A portable workflow for WGS analysis of germline and somatic mutations

Maxime U. Garcia maxulysse.github.io @MaxUlysse @gau

Sarek

Sarek, the National Park in Northern Sweden

The most dramatic and grandiose of all

• Long, deep, narrow valleys and wild, turbulent water.
• A tortuous delta landscape.
• Completely lacking in comfortable accommodations.
• Sarek is one of Sweden’s most inaccessible national parks.
• There are no roads leading up to the national park.

Sarek National Park website

Where we’re going we don’t need roads

What is Sarek?

Sarek

http://sarek.scilifelab.se/

• Nextflow pipeline
• Developed at NGI
• In collaboration with NBIS
• Support from The Swedish Childhood Tumor Biobank

Nextflow

https://www.nextflow.io/

• Data-driven workflow language
• Portable (executable on multiple platforms)
• Shareable and reproducible (with containers)

Singularity

https://singularity.lbl.gov/

• Docker-like container engine
• Built specifically for HPC environments
• Avoids the root-user security problem
• Supported by Nextflow
• Can pull containers from Docker Hub

Sarek exists in multiple flavors

Sarek

SarekGermline

SarekSomatic

SarekExome

Preprocessing

https://software.broadinstitute.org/gatk/best-practices/

Based on GATK Best Practices (GATK 4.0)

• Reads mapped to the reference genome with bwa
• Duplicates marked with Picard MarkDuplicates
• Base quality scores recalibrated with GATK BaseRecalibrator

Germline Variant Calling

• SNVs and small indels:
  • HaplotypeCaller
  • Strelka2
• Structural variants:
  • Manta

Somatic Variant Calling

• SNVs and small indels:
  • MuTect2
  • Freebayes
  • Strelka2
• Structural variants:
  • Manta
• Sample heterogeneity, ploidy and CNVs:
  • ASCAT
  • Control-FREEC (being added)

Annotation

• VEP and SnpEff

• ClinVar, COSMIC, dbSNP, GENCODE, gnomAD, PolyPhen, SIFT, etc.

Prioritization

• First step towards clinical use
• Rank scores are computed for all variants
  • COSMIC, ClinVar, SweFreq and MSK-IMPACT (cancerhotspots.org)
• Findings are ranked in three tiers
  • 1st tier: well known, high-impact variants
  • 2nd tier: variants in known cancer-related genes
  • 3rd tier: the remaining variants
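The three-tier ranking above can be illustrated with a minimal Python sketch. It is not Sarek's actual prioritization code: the `Variant` fields, the two boolean flags, and the rule "sort by tier, then by descending rank score" are all assumptions made for illustration. How the rank score is derived from COSMIC, ClinVar, SweFreq and MSK-IMPACT is not shown on the slide and is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    # All fields are hypothetical; the slide does not define a data model.
    name: str
    rank_score: float        # assumed to be precomputed from the annotation sources
    known_high_impact: bool  # assumed flag: well known, high-impact variant
    in_cancer_gene: bool     # assumed flag: lies in a known cancer-related gene


def assign_tier(variant: Variant) -> int:
    """Map a variant to tier 1, 2 or 3 following the slide's tier descriptions."""
    if variant.known_high_impact:
        return 1
    if variant.in_cancer_gene:
        return 2
    return 3


def prioritize(variants: List[Variant]) -> List[Variant]:
    """Order variants by tier first, then by descending rank score (an assumption)."""
    return sorted(variants, key=lambda v: (assign_tier(v), -v.rank_score))
```

Calling `prioritize` on a list of `Variant` objects returns a new list with tier 1 findings first; within a tier, higher rank scores come earlier.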

Reports

http://multiqc.info/

Workflow

[Workflow diagram] File types: fastq → bam → vcf

• Preprocessing: based on GATK Best Practices
• Variant Calling
  • SarekGermline: HaplotypeCaller, Strelka2, Manta
  • SarekSomatic: Freebayes, MuTect2, Strelka2, Manta, ASCAT
• Annotation: snpEff, VEP
• Reports
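The stages and tool lists on this slide can also be written down as a small lookup table. The Python sketch below is only a way of reading the diagram: the tool names come from the slide, while the dictionary layout and the `describe` function are illustrative and do not correspond to anything in the Sarek code base.

```python
from typing import Dict, List

# Variant callers per flavor, as listed on the slide.
VARIANT_CALLERS: Dict[str, List[str]] = {
    "SarekGermline": ["HaplotypeCaller", "Strelka2", "Manta"],
    "SarekSomatic": ["Freebayes", "MuTect2", "Strelka2", "Manta", "ASCAT"],
}

# Annotation tools shared by both flavors.
ANNOTATORS: List[str] = ["snpEff", "VEP"]


def describe(flavor: str) -> List[str]:
    """Return one human-readable line per workflow stage for the given flavor."""
    callers = VARIANT_CALLERS[flavor]  # raises KeyError for an unknown flavor
    return [
        "Preprocessing (GATK Best Practices): fastq -> bam",
        f"Variant Calling ({', '.join(callers)}): bam -> vcf",
        f"Annotation ({', '.join(ANNOTATORS)})",
        "Reports",
    ]
```

For example, `describe("SarekSomatic")` lists the four stages with the somatic callers filled in for the variant calling step.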

Reference genomes

• GRCh37 and GRCh38
• Custom genome
• Other organisms

More AWS testing

Production ready

Sarek at work

• 50 tumor/normal pairs with GRCh37 reference
• 90 tumor/normal pairs (some with relapse samples) with GRCh38 reference
• The whole SweGen dataset with GRCh38 reference
  • 1,000 samples in germline settings
• 4 clinical samples
  • More coming with the Genomic Medicine Sweden initiative

Preprint available at bioRxiv

Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants
Maxime Garcia, Szilveszter Juhos, Malin Larsson, Pall I Olason, Marcel Martin, Jesper Eisfeldt, Sebastian DiLorenzo, Johanna Sandgren, Teresita Diaz de Ståhl, Valtteri Wirta, Monica Nistèr, Björn Nystedt, Max Käller

https://doi.org/10.1101/316976

Get involved!

• Our code is hosted on GitHub
  • https://github.com/SciLifeLab/Sarek
  • https://github.com/nf-core
• We have Gitter channels
  • https://gitter.im/SciLifeLab/Sarek
  • https://gitter.im/nf-core/Lobby

Acknowledgments

Barntumörbanken: Elisa Basmaci, Szilveszter Juhos, Gustaf Ljungman, Monica Nistèr, Gabriela Prochazka, Johanna Sandgren, Teresita Díaz De Ståhl, Katarzyna Zielinska-Chomej

Grupp Nistèr: Saad Alqahtani, Min Guo, Daniel Hägerstrand, Anna Hedrén, Martin Proks, Rong Yu, Jian Zhao

NGI: Johannes Alneberg, Anandashankar Anil, Franziska Bonath, Orlando Contreras-López, Phil Ewels, Sofia Haglund, Max Käller, Anna Konrad, Pär Lundin, Remi-Andre Olsen, Senthilkumar Panneerselvam, Fanny Taborsak, Chuan Wang

Clinical Genetics: Jesper Eisfeldt

NBIS: Sebastian DiLorenzo, Malin Larsson, Marcel Martin, Markus Mayrhofer, Björn Nystedt, Markus Ringnér, Pall I Olason, Jonas Söderberg

Clinical Genomics: Kenny Billiau, Hassan Foroughi Asl, Valtteri Wirta

Nextflow folks: Paolo Di Tommaso, Sven Fillinger, Alexander Peltzer

nf-core

Any questions?

http://sarek.scilifelab.se/

https://github.com/SciLifeLab/Sarek

https://maxulysse.github.io/dnaclub2018