big data biology for pythonistas: getting in on the genomics revolution

BIG DATA BIOLOGY FOR PYTHONISTAS: GETTING IN ON THE GENOMICS REVOLUTION DARYA VANICHKINA

Upload: darya-vanichkina

Post on 11-Apr-2017


Page 1: Big data biology for pythonistas: getting in on the genomics revolution

BIG DATA BIOLOGY FOR PYTHONISTAS: GETTING IN ON THE GENOMICS REVOLUTION

DARYA VANICHKINA

Page 2: Big data biology for pythonistas: getting in on the genomics revolution

STRUCTURE OF MY TALK

▸ Whoami, and why now?

▸ The [meaning] biology of life

▸ The data

▸ The reality (case studies)

▸ Other areas that need development talent

Page 3: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 101

WHY BIOLOGY? WHY NOW?

Page 4: Big data biology for pythonistas: getting in on the genomics revolution

WHY SHOULD *YOU* CARE? - IF YOU’RE A HUMAN BEING IN THE XXI CENTURY

Page 5: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN

THE CENTRAL DOGMA

5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’

GENETIC CODE

NUCLEUS

DNA

DOUBLE HELIX.

Page 6: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN

THE CENTRAL DOGMA

5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’

5’ - AUG UCU UAC AAG UGC GUG - 3’

5’ - AUG UCU UAC AAG UGC GUG - 3’

H2N - MET SER TYR LYS CYS VAL - COOH

GENETIC CODE

NUCLEUS

CYTOPLASM

DNA

RNA

PROTEIN

TRANSCRIPTION

TRANSLATION

DOUBLE HELIX. ATGC. ~6 BILLION BASE PAIRS/HUMAN CELL. [37.2 TRILLION CELLS/BODY] PACKAGED IN 23 PAIRS OF CHROMOSOMES

20K CODING GENES

Page 7: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 201: A SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE

[A BIT] BEYOND THE CENTRAL DOGMA

5’ - ATG TCT TAmC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’

5’ - AUG UCU UAC AAG UGC GUG - 3’

5’ - AUG UCU UIC AAG UGC GUG - 3’

H2N - MET SER pTYR LYS CYS VAL - COOH

NUCLEUS

CYTOPLASM

DNA

RNA

PROTEIN

TRANSCRIPTION

TRANSLATION

5’ - AUGUCUUUCTTAUGCGUG - 3’

NCRNA

H2N - MET SER CYS LYS CYS VAL - COOH
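The edited codon on this slide (UAC -> UIC) changes Tyr to Cys because the ribosome reads inosine (I) as guanosine (G). A minimal sketch of that step, assuming a hypothetical helper `translate_edited` and a codon table restricted to the codons shown on the slide:

```python
# Only the codons that appear on the slide; this is not a full genetic code table.
CODONS = {
    "AUG": "Met", "UCU": "Ser", "UAC": "Tyr",
    "UGC": "Cys", "AAG": "Lys", "GUG": "Val",
}

def translate_edited(rna):
    """Translate an RNA string, reading inosine (I) as guanosine (G)."""
    as_read = rna.replace("I", "G")
    codons = [as_read[i:i + 3] for i in range(0, len(as_read) - 2, 3)]
    return [CODONS[codon] for codon in codons]

unedited = translate_edited("AUGUCUUACAAGUGCGUG")
edited = translate_edited("AUGUCUUICAAGUGCGUG")
print(" ".join(unedited))
print(" ".join(edited))
```

The two printed lines differ only at the third amino acid, which is the Tyr/Cys difference between the two protein lines on the slide.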

Page 8: Big data biology for pythonistas: getting in on the genomics revolution

WHAT THE DATA LOOKS LIKE

CODIFYING THE CENTRAL DOGMA

5’ - ATG TCT TAC AAG TGC GTG - 3’
3’ - TAC AGA ATG TTC ACG CAC - 5’

5’ - AUG UCU UAC AAG UGC GUG - 3’

5’ - AUG UCU UAC AAG UGC GUG - 3’

H2N - MET SER TYR LYS CYS VAL - COOH

GENETIC CODE

CYTOPLASM

DNA [GENOME/EXOME]

RNA [TRANSCRIPTOME]

PROTEIN

TRANSCRIPTION

TRANSLATION

ATGC STRING!

AUGC STRING!

21 LETTER STRING!
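Since all three molecules are strings, the central dogma fits in a few lines of Python. This is a sketch rather than a replacement for a library such as Biopython; the function names are made up for illustration, and the codon table is the standard genetic code written in TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ATGC", "TACG")

def complement(dna):
    """Base-pair partner of each letter (the 3'-5' strand as drawn on the slide)."""
    return dna.translate(COMPLEMENT)

def transcribe(dna):
    """Coding-strand DNA -> mRNA: same letters, with T replaced by U."""
    return dna.replace("T", "U")

def translate(rna):
    """mRNA -> one-letter protein string, stopping at the first stop codon."""
    dna = rna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGTCTTACAAGTGCGTG"
print(complement(dna))             # TACAGAATGTTCACGCAC
print(transcribe(dna))             # AUGUCUUACAAGUGCGUG
print(translate(transcribe(dna)))  # MSYKCV (Met Ser Tyr Lys Cys Val)
```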

Page 9: Big data biology for pythonistas: getting in on the genomics revolution

WHAT DO YOU DO WITH THE DATA?

▸ Try to explain/understand diseases (especially rare/Mendelian ones)

▸ Identify family relationships

▸ Identify ethnic origin

▸ Carrier status

▸ Targeted drug prescription, and rational prediction of side effects

▸ Identify patients at risk of diseases, and “catch” them earlier

THE THEORY

Page 10: Big data biology for pythonistas: getting in on the genomics revolution

EUROPEAN EXAMPLE EXTRA INFO

▸ Taken from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/figure/F1/

▸ a, A statistical summary of genetic data from 1,387 Europeans based on principal component axis one (PC1) and axis two (PC2). Small coloured labels represent individuals and large coloured points represent median PC1 and PC2 values for each country. The inset map provides a key to the labels. The PC axes are rotated to emphasize the similarity to the geographic map of Europe. AL, Albania; AT, Austria; BA, Bosnia-Herzegovina; BE, Belgium; BG, Bulgaria; CH, Switzerland; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NO, Norway; NL, Netherlands; PL, Poland; PT, Portugal; RO, Romania; RS, Serbia and Montenegro; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK, Slovakia; TR, Turkey; UA, Ukraine; YG, Yugoslavia. b, A magnification of the area around Switzerland from panel a, showing differentiation within Switzerland by language. c, Genetic similarity versus geographic distance. Median genetic correlation between pairs of individuals as a function of geographic distance between their respective populations.

Page 11: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1
NNCAATGCTACTCTCAACAAGTTCACAGAGGAACTTAAGAAGTATGGAGTGACGNNTTTGGNTCGNGTTTGTGAT
+
##++**++++FFFFF5::88:=???FFFFFFFFFFFFFFFFF=F<8?############################

“Read”, 10 - 100+ million of these per dataset. Can be paired. https://en.wikipedia.org/wiki/FASTQ_format
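A FASTQ record is four lines: header, bases, a "+" separator, and one quality character per base. A minimal reader might look like the sketch below. It assumes Phred+33 quality encoding (an assumption; older Illumina files used a different offset), and the sample record is a shortened, hypothetical version of the read shown above.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, scores) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Shortened, hypothetical version of the record on the slide.
sample = StringIO(
    "@ERR030890.1 HWI-BRUNOP16X_0001:3:2:1148:1061#0/1\n"
    "NNCAATGC\n"
    "+\n"
    "##++**++\n"
)
records = list(read_fastq(sample))
for name, sequence, scores in records:
    print(name, sequence, scores)
```

Under this encoding "#" is a score of 2, which is why the uncalled "N" bases at the start of the read carry it.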


DNA

Page 12: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

Alignment programs (run independently): bwa, bowtie2
Output: SAM file (Sequence Alignment/Map)

# Example for 1 read:
ERR030890.15421060 272 chr1 564478 3 75M * 0 0 GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN 1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**## AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:73T0T0 YT:Z:UU XS:A:- NH:i:2 CC:Z:chrM CP:i:3929 HI:i:0

https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment
http://genome.sph.umich.edu/wiki/SAM
Official (obtuse) documentation: https://samtools.github.io/hts-specs/SAMv1.pdf

Reference == genome
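To make the SAM example less opaque, here is a sketch that splits one record into its 11 mandatory fields plus optional tags, and decodes the FLAG and CIGAR columns. The helper names are hypothetical. Real SAM files are tab-separated; splitting on any whitespace is used here only because the slide text shows spaces. In practice a library such as pysam does this for you.

```python
import re

SAM_LINE = (
    "ERR030890.15421060 272 chr1 564478 3 75M * 0 0 "
    "GTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTANN "
    "1576:<F<FF=::??=5?DDFFFFF<FFF<?=?=;>>??=?=???66?=;FFFFFFFFFF=???6&)(*++**## "
    "AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:73T0T0 YT:Z:UU XS:A:- "
    "NH:i:2 CC:Z:chrM CP:i:3929 HI:i:0"
)

MANDATORY = ["qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual"]

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "properly paired", 0x4: "unmapped",
    0x8: "mate unmapped", 0x10: "reverse strand", 0x20: "mate reverse strand",
    0x40: "first in pair", 0x80: "second in pair",
    0x100: "secondary alignment", 0x200: "failed QC",
    0x400: "duplicate", 0x800: "supplementary alignment",
}

def parse_sam_line(line):
    """Return (mandatory fields dict, optional tags dict) for one alignment line."""
    fields = line.split()
    record = dict(zip(MANDATORY, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    tags = {}
    for item in fields[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return record, tags

def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def cigar_operations(cigar):
    """'75M' -> [(75, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

record, tags = parse_sam_line(SAM_LINE)
print(record["rname"], record["pos"], decode_flag(record["flag"]))
print(cigar_operations(record["cigar"]), tags["NM"], tags["MD"])
```

For this read, FLAG 272 = 256 + 16: a secondary alignment on the reverse strand, consistent with NH:i:2 (two reported alignments) and the CC:Z:chrM tag naming where the other hit is.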

Page 13: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG chromosome 1
 CTGATGTGCCGCCTCACTTCGGTGGT read1
  TGATGTGCCGCCTCACTACGGTGGTG read2
   GATGTGCCGCCTCACTTCGGTGGTGA read3
GCTGATGTGCCGCCTCACTACGGTG read4
GCTGATGTGCCGCCTCACTACGGTG read5

For visualising SAM - use http://software.broadinstitute.org/software/igv/

CACCTCACCACCGAAGTGAGGCGGCACATCAGC chromosome 1
  CCTCACCA------GTGAGGCGGCACATCA read1
    TCACCA------GTGAGGCGGCACATCAGC read2
CACCTCACCA------GTGAGGCGGCACA read3
   CTCACCA------GTGAGGCGGCACAGC read4
 ACCTCACCA------GTGAGGCGGCAC read5

Mismatch Deletion [Insertion]
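The mismatch example can be reproduced by stacking the reads on the reference and counting bases per column. This is a toy sketch: the read offsets are an assumption, inferred by matching each read against the reference string, and it ignores gaps, base qualities and everything else a real variant caller considers.

```python
from collections import Counter

REFERENCE = "GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG"
# name -> (0-based start on the reference, read sequence); offsets are inferred.
READS = {
    "read1": (1, "CTGATGTGCCGCCTCACTTCGGTGGT"),
    "read2": (2, "TGATGTGCCGCCTCACTACGGTGGTG"),
    "read3": (3, "GATGTGCCGCCTCACTTCGGTGGTGA"),
    "read4": (0, "GCTGATGTGCCGCCTCACTACGGTG"),
    "read5": (0, "GCTGATGTGCCGCCTCACTACGGTG"),
}

def pileup(reference, reads):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for offset, sequence in reads.values():
        for i, base in enumerate(sequence):
            columns[offset + i][base] += 1
    return columns

def mismatches(reference, reads):
    """Return {1-based position: (reference base, observed base counts)}."""
    found = {}
    for index, counts in enumerate(pileup(reference, reads)):
        if any(base != reference[index] for base in counts):
            found[index + 1] = (reference[index], dict(counts))
    return found

sites = mismatches(REFERENCE, READS)
for position, (ref_base, counts) in sites.items():
    print(position, ref_base, counts)
```

Only one column disagrees with the reference: three of the five reads carry A where the reference has T.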

Page 14: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

Mismatch (SNV) Deletion [Insertion]

Find difference to reference

https://usegalaxy.org/

3 - 5 million variants vs reference

Page 15: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 101: A VERY SIMPLIFIED VIEW OF WHAT IT TAKES TO BE ALIVE/HUMAN

CHROMOSOMAL MODE OF INHERITANCE

~60 new mutations arise per generation: a 20-year-old father transmits ~25 mutations to his child, a 40-year-old father ~65 (Kong et al. 2012, Nature, DOI:10.1038/nature11396; Francioli et al. 2015, Nature Genetics, DOI:10.1038/ng.3292)
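The two data points quoted above imply roughly two extra paternal mutations per year of age: (65 - 25) / (40 - 20) = 2. As a worked example, with the caveat that a straight line between the two quoted points is an illustrative assumption rather than the papers' fitted model:

```python
def paternal_mutations(age, young=(20, 25), old=(40, 65)):
    """Interpolate transmitted mutations linearly between two (age, count) points."""
    (age0, count0), (age1, count1) = young, old
    slope = (count1 - count0) / (age1 - age0)  # extra mutations per year of age
    return count0 + slope * (age - age0)

for age in (20, 30, 40):
    print(age, paternal_mutations(age))
```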

Page 16: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

Mismatch (SNV) Deletion [Insertion]

Homozygous/heterozygous

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

VCF file
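The VCF line above can be unpacked with plain string handling. The sketch below uses hypothetical helper names, assumes whitespace-separated columns as shown on the slide (real VCF is tab-separated), and reads homozygous/heterozygous status from the GT field, where 0 is the reference allele and 1 the first alternate allele.

```python
import re

HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003"
RECORD = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")

def parse_vcf_record(header, record):
    """Return (fixed columns, INFO dict, per-sample dicts) for one VCF data line."""
    columns = header.lstrip("#").split()
    fields = record.split()
    fixed = dict(zip(columns[:8], fields[:8]))
    info = {}
    for item in fixed["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flags such as DB have no value
    format_keys = fields[8].split(":")
    samples = {
        name: dict(zip(format_keys, sample.split(":")))
        for name, sample in zip(columns[9:], fields[9:])
    }
    return fixed, info, samples

def zygosity(genotype):
    """Classify a diploid GT string such as '0|0', '1|0' or '1/1'."""
    alleles = re.split(r"[|/]", genotype)
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"

fixed, info, samples = parse_vcf_record(HEADER, RECORD)
calls = {name: zygosity(sample["GT"]) for name, sample in samples.items()}
print(fixed["CHROM"], fixed["POS"], fixed["REF"], ">", fixed["ALT"], info["AF"])
print(calls)
```

Missing genotypes (".") are not handled here; for real files a parser such as pysam or cyvcf2 is the safer route.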

Page 17: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

Mismatch (SNV) Deletion [Insertion]

Homozygous/heterozygous

Good tutorial on this (VLSCI):
https://docs.google.com/document/d/1lfDYNzHjfDA1pHTHd-0w3xHhg7L4TipT1gRfzgiV8es/pub
http://vlsci.github.io/lscc_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
http://vlsci.github.io/lscc_docs/tutorials/var_detect_advanced/var_detect_advanced/

samtools pileup, GATK, FreeBayes -> Variant Call Format (VCF)

Page 18: Big data biology for pythonistas: getting in on the genomics revolution

DATA ANALYSIS

PIPELINE FOR PROCESSING GENOMIC DATA

SEQUENCE GENOME -> MAP READS TO REFERENCE -> CALL VARIANTS -> INTERPRET

What do the differences actually mean?

What we currently do:

1. See if any of the observed variants match disease-associated mutations we’ve seen before (databases like OMIM, dbSNP, ClinVar, SNPedia)

2. Predict whether mutation would “break” protein by introducing a “STOP” earlier in the sequence, or shift the frame, or change a critical amino acid
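Step 2 can be sketched for the simplest case, a change inside a known coding sequence. The functions below are hypothetical and deliberately naive: they assume the sequence starts in frame, ignore splicing and strand, and do not handle changes that remove an existing stop codon. Production annotators such as VEP or SnpEff deal with all of that.

```python
from itertools import product

# Standard genetic code in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

def classify_substitution(cds, position, alt_base):
    """Classify a single-base change at a 0-based position of an in-frame CDS."""
    start = position - position % 3
    ref_codon = cds[start:start + 3]
    offset = position % 3
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense (premature STOP)"
    return f"missense ({ref_aa}>{alt_aa})"

def classify_indel(length_change):
    """Insertions/deletions shift the reading frame unless a multiple of 3."""
    return "in-frame indel" if length_change % 3 == 0 else "frameshift"

CDS = "ATGTCTTACAAGTGCGTG"  # the example sequence from the earlier slides
print(classify_substitution(CDS, 5, "C"))  # TCT -> TCC
print(classify_substitution(CDS, 7, "G"))  # TAC -> TGC
print(classify_substitution(CDS, 8, "A"))  # TAC -> TAA
print(classify_indel(-6), "/", classify_indel(1))
```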

Page 19: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 201: BUT …

BUT THERE ARE MANY CHALLENGES THAT NEED TO BE ADDRESSED

Page 20: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

Page 21: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

UK: GENOMICS ENGLAND 100 000 GENOMES FOR THE NHS

Page 22: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

UK: GENOMICS ENGLAND 100 000 GENOMES FOR THE NHS

JESSICA WRIGHT

▸ Epilepsy, movement disorders, developmental delay

▸ Standard testing (MRI, lumbar puncture, EEGs and other tests, some of them invasive) did not pinpoint a cause

▸ Genomic sequencing identified a de novo mutation in the gene for GLUT1 (SLC2A1), a protein responsible for transporting glucose from the blood into the brain

▸ => Ketogenic diet (low carbohydrate, high fat diet)

Page 23: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

23ANDME: DIRECT-TO-CONSUMER GENETICS

▸ 23andMe

▸ Illumina HumanOmniExpress-24 array

▸ opt-in research

▸ 36 FDA-approved tests + ancestry (vs the original kit: 254 diseases/conditions)

▸ Manuel Corpas - sample data of himself and his family (23andMe, exome sequencing)

Page 24: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

23ANDME: DIRECT-TO-CONSUMER GENETICS

▸ “Genetic information can reveal that someone you thought you were related to is not your biological relative. This happens most frequently in the case of paternity.”

▸ “Learning that your genotype is associated with an increased risk of a particular condition can be difficult, especially if you have seen a friend or family member struggle with a similar issue.”

▸ “Because genetic information is hereditary, knowing something about your genetics also tells you something about those closely related to you. Your family may or may not want to know this information as well, and relationships with others can be affected by learning about your DNA.”

▸ Link & Siblings and half-siblings & Genome view

Page 25: Big data biology for pythonistas: getting in on the genomics revolution

CASE STUDIES

VERIFI/HARMONY GENETIC TESTS (AUSTRALIAN PATHOLOGY)

▸ $450 AUD

▸ Tests for chromosome abnormalities: trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome)

▸ Optional: gender, Turner (monosomy X) and Klinefelter (XXY) syndromes

▸ http://www.sonicgenetics.com.au/nipt/patients/how-it-works/

Page 26: Big data biology for pythonistas: getting in on the genomics revolution

BIOLOGY 201: BUT …

BUT THERE ARE MANY (PRACTICAL) CHALLENGES THAT NEED TO BE ADDRESSED

▸ Speed (of mappers, cleaners, collapsers, annotators) is a *major* problem - in the real world, outside of the Ivory Tower

▸ Tools are not designed to work together

▸ Technical reproducibility between centres

▸ Data sharing issues, lack of consistent nomenclature, and file format horrors (e.g. "chr1" vs "1" chromosome names)

▸ Getting it wrong can have devastating consequences (pathogenic variant later reclassified as benign in prenatal diagnosis; athletes erroneously deemed to be at risk of cardiac failure)

▸ Differences in interpretation between pathologists/doctors - and hence different patient outcomes
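The "chr" problem from the list above is a good example of how mundane these issues are: UCSC-style files name chromosomes "chr1" ... "chrM", Ensembl-style files use "1" ... "MT", and tools often silently return nothing when the two are mixed. A minimal sketch of a converter, with hypothetical function names, covering only the primary chromosomes (unplaced contigs and alternate scaffolds follow other naming rules):

```python
def to_ucsc(name):
    """'1' -> 'chr1', 'MT' -> 'chrM'; names already prefixed are returned unchanged."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else "chr" + name

def to_ensembl(name):
    """'chr1' -> '1', 'chrM' -> 'MT'; names without the prefix are returned unchanged."""
    if not name.startswith("chr"):
        return name
    bare = name[3:]
    return "MT" if bare == "M" else bare

for name in ("1", "X", "MT", "chr20"):
    print(name, "->", to_ucsc(name), "->", to_ensembl(to_ucsc(name)))
```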

Page 27: Big data biology for pythonistas: getting in on the genomics revolution

BEYOND THE GENOME

Page 28: Big data biology for pythonistas: getting in on the genomics revolution

THE ONE SLIDE ABOUT WHAT I ACTUALLY DO…

▸ GENCODE 25

▸ hg38

Page 29: Big data biology for pythonistas: getting in on the genomics revolution

ADDITIONAL RESOURCES

▸ Galaxy tutorials and walk-throughs (for when you’re starting out) https://wiki.galaxyproject.org/Learn/GalaxyNGS101

▸ Broad Institute (Harvard/MIT) Public Lectures

▸ Genomics England YouTube channel

▸ PyCon talk by Titus Brown, with example of how to run bcbio on Ashkenazi trio dataset

▸ Bcbio sample datasets and analyses, especially the exome and whole genome variant analysis, tumour vs normal comparisons [Good for trying out variant analysis, not so good for RNA at the moment]

Page 30: Big data biology for pythonistas: getting in on the genomics revolution

IF YOU WANT TO TRY THIS AT HOME…

WHERE TO GET DATA, AND HOW TO PROCESS IT

▸ Look for a research study you’re interested in on PubMed, and find where the authors link to the raw data (Methods section and supplementary tables, with “weird” identifiers; the data itself is in fastq format)

▸ Data from all research studies *[must be] is usually* deposited in the European Nucleotide Archive (ENA), where you can download it in fastq format.

▸ First, try to process it to reproduce the authors’ results. Galaxy provides a web interface that runs many standard command-line tools and allows you to look at the output - good as “leading strings”

▸ Frameworks such as bcbio provide managed environments for analysis

▸ Most biological software runs on Linux, and can be chained together using bash. I would go from an exploratory analysis in Galaxy to an analysis that chains together existing tools via bash or a complex bioinformatics pipeline management system (Wikipedia)
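For Pythonistas, the same chaining can be done with `subprocess` instead of bash. The sketch below covers only the "map reads to reference" step and defaults to a dry run that prints the commands. File names are hypothetical, `bwa` and `samtools` are assumed to be installed and on the PATH, and a real pipeline would add paired-end handling, threading options and error recovery.

```python
import subprocess

def build_mapping_plan(reference, fastq, prefix):
    """Return a list of (argv, stdout_path) steps; stdout_path is None if unused."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        (["bwa", "index", reference], None),
        (["bwa", "mem", reference, fastq], sam),  # bwa mem writes SAM to stdout
        (["samtools", "sort", "-o", bam, sam], None),
        (["samtools", "index", bam], None),
    ]

def run_plan(plan, dry_run=True):
    """Print each step, or execute it and stop at the first failure."""
    for argv, stdout_path in plan:
        if dry_run:
            redirect = f" > {stdout_path}" if stdout_path else ""
            print(" ".join(argv) + redirect)
        elif stdout_path:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)

# Hypothetical file names, for illustration only.
plan = build_mapping_plan("reference.fa", "reads.fastq", "sample")
run_plan(plan, dry_run=True)
```

Keeping the plan as data makes it easy to log, test, or hand over to a workflow manager once the exploratory phase is done.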

Page 31: Big data biology for pythonistas: getting in on the genomics revolution

IF YOU WANT TO TRY THIS AT HOME…

DANGER, WILL ROBINSON! DANGER!

▸ BUT: Because of the latest technologies, you as a programming-literate individual are in a better position to understand this data than most

▸ Understanding and playing with this data is addictive - and beautiful…

▸ This is coming to a hospital near you

Page 32: Big data biology for pythonistas: getting in on the genomics revolution

OTHER “BIOLOGY” OF INTEREST…

▸ “Algorithms stuff” (Talk tomorrow!)

▸ Biological image analysis (fMRI, microscopy)

▸ Contribute to projects such as galaxy and bcbio

▸ Machine learning of patient records

▸ Integrating IoT and wearables with medical data and patient records

▸ Cool stuff in cataloguing the genetic diversity of life, choosing which areas should be made into national parks based on data, or understanding disease spread (e.g. flu across Asia)

Page 33: Big data biology for pythonistas: getting in on the genomics revolution

ACKNOWLEDGEMENTS (I.E. THE PEOPLE I WORK WITH, WHO ARE AWESOME)

CURE THE FUTURE

RASKOLAB.GITHUB.IO

Page 34: Big data biology for pythonistas: getting in on the genomics revolution

QUESTIONS?

@dvanichkina Slides & Questions http://daryavanichkina.com/blog/pycon2016.html

Four domains of Big Data in 2025. In each of the four domains, the projected annual storage and computing needs are presented across the data lifecycle. Big Data: Astronomical or Genomical? http://dx.doi.org/10.1371/journal.pbio.1002195

Page 35: Big data biology for pythonistas: getting in on the genomics revolution

IMAGES USED

▸ Genomics England

▸ https://www.genomicsengland.co.uk/wp-content/uploads/2016/05/PhilMynott_004-1024x681.jpg

▸ NHGRI

▸ https://www.genome.gov/sequencingcostsdata/

▸ Lung tumour image: http://edoc.hu-berlin.de/dissertationen/pietas-agnieszka-2004-11-22/HTML/chapter3.html

▸ Open Clip Art

Page 36: Big data biology for pythonistas: getting in on the genomics revolution

IMAGES USED

▸ Spurious correlations http://www.tylervigen.com/spurious-correlations

▸ http://phys.org/news/2009-11-conquer-social-network-cells.html

▸ http://lobsangstudio.com/nc_pop.cfm?id=291

▸ BBC education - splicing http://www.bbc.co.uk/education/guides/zgrccdm/revision/2

▸ https://www.dnastar.com/arraystar_help/index.html#!Documents/snptable.htm

▸ http://circgenetics.ahajournals.org/content/7/6/911/F2.expansion.html