The Genome Access Course
November 2014
Next Generation DNA Sequencing
Illumina HiSeq X
1.8 Tbp (3 billion reads) in ~3 days
(as of 11/6/2014)
Whole Genome Shotgun Sequencing
Randomly Fragment Genomic DNA
Sequence Fragments
...ATCCGTAAATGGGCTGATACTACTAATGC
TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...
Genome Assembly
...ATCCGTAAATGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG...
Contiguous Sequence (Contig)
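The assembly step on this slide can be sketched in a few lines: find the longest suffix of one fragment that exactly matches a prefix of the other, then join the two. The function name `merge_by_overlap` and the `min_overlap` threshold are illustrative assumptions, not part of the course material; real assemblers also handle sequencing errors, reverse complements, and millions of reads.

```python
def merge_by_overlap(left, right, min_overlap=10):
    """Join two fragments if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at
    least `min_overlap` bases exists. Exact matching only (an assumption).
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first so the best match wins.
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# The two fragments shown on the slide.
fragment_1 = "ATCCGTAAATGGGCTGATACTACTAATGC"
fragment_2 = "TGGGCTGATACTACTAATGCCAAACTGTACTAGTCCTG"

contig = merge_by_overlap(fragment_1, fragment_2)
print(contig)
```

For these two fragments the shared 20-base stretch TGGGCTGATACTACTAATGC is the overlap, and the merged result is the contig shown on the slide.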
RNA Sequencing (RNA-Seq)
1. Characterize all RNA in sample
2. Gene expression level proportional to number of reads
3. Detect alternatively spliced transcripts
Garber et al, Nat Methods (2011)
Sequence Fragments
cDNA made from RNA
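Point 2 on this slide (expression level proportional to number of reads) is usually applied after correcting for sequencing depth. A minimal sketch, assuming counts-per-million (CPM) as the normalization; CPM is a common choice swapped in here for illustration, not something the slide specifies, and the gene names and counts are hypothetical. CPM does not correct for gene length.

```python
def counts_per_million(read_counts):
    """Scale raw per-gene read counts to counts per million mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical read counts for three genes in one sample.
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
cpm = counts_per_million(raw_counts)
print(cpm)
```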
Typical Next Gen Experiments
• Genome sequencing
– Novel genomes
– Resequencing
• Transcriptome sequencing (RNA-seq)
– Characterize transcripts with or without reference genome
• Typical length
• Short (microRNAs, …)
– Find differentially expressed transcripts
• Other
– Methyl-seq
– ChIP-seq
Illumina Sequencing
DNA Sample
Construct Library
Cluster Generation in Flow Cell
200+ million reads per lane (>100 bp reads)
Sequencing by Synthesis
Types of Sequencing Libraries
Single-End Reads - 5’ or 3’ (random)
Paired-End Reads - 5’ and 3’ of 200-500 bp fragments
Mate-Pair Reads - 5’ and 3’ of 2-5 kbp fragments
Taken from GIGA Newsletter 13 – Université de Liège
What Does the Data Look Like?
FASTQ File Format
Sequence
Quality (ASCII character for each base)
> 200 million reads in one lane
Files are so large that they are split into pieces of 40 million reads each
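To make the FASTQ layout concrete, here is a minimal reader sketch. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding (the slide does not say which offset the course data uses); the record itself is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Decode one ASCII character per base into a Phred score (offset assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(phred):
    """Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# A single made-up record held in memory instead of a file on disk.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
name, sequence, quality = records[0]
scores = phred_scores(quality)
print(name, sequence, scores)
```

Under the Phred+33 assumption, `I` decodes to 40 (about 1 error in 10,000) and `#` decodes to 2, so the last base of this example read would be a candidate for trimming.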
Example Analysis Workflow
Paired-End FASTQ Files
FASTQ (_R1.txt), FASTQ (_R2.txt)
FastQC (Diagnostics)
Trim Reads (if needed)
Align Reads to Genome
SAM File
BAM File
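One way to script this workflow is to build the command lines first and run them only when the tools are installed. The sketch below is an assumption-heavy illustration: the slide does not name an aligner, so bowtie2 is swapped in as an example, the file names and the `genome_index` prefix are hypothetical, and the optional trimming step is left out. By default it only prints the commands (a dry run).

```python
import subprocess


def build_workflow(r1, r2, index_prefix, sample):
    """Return the command lines for diagnostics, alignment, and SAM-to-BAM steps."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    return [
        ["fastqc", r1, r2],                                              # diagnostics
        ["bowtie2", "-x", index_prefix, "-1", r1, "-2", r2, "-S", sam],  # align, write SAM
        ["samtools", "view", "-b", "-o", bam, sam],                      # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],                     # coordinate sort
        ["samtools", "index", sorted_bam],                               # index for viewers
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


commands = build_workflow("sample_R1.txt", "sample_R2.txt", "genome_index", "sample")
run_workflow(commands)
```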
Sequence Composition Diagnostics
Unbiased Reads
Biased Reads
First Position Nearly Always “T”
GC Bias in First ~15 bp Due to Random Hexamer Priming
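The composition plots behind these diagnostics boil down to counting bases at each read position. A minimal sketch of that calculation follows; it is not how FastQC is implemented, the function names are illustrative, and the four reads are invented so that the first position is always "T".

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read position, the fraction of each base seen there."""
    if not reads:
        return []
    columns = [Counter() for _ in range(max(len(read) for read in reads))]
    for read in reads:
        for position, base in enumerate(read.upper()):
            columns[position][base] += 1
    return [
        {base: n / sum(column.values()) for base, n in column.items()}
        for column in columns
    ]


def gc_fraction_by_position(reads):
    """Fraction of G plus C at each read position."""
    return [
        column.get("G", 0.0) + column.get("C", 0.0)
        for column in base_composition(reads)
    ]


# Invented reads in which every read starts with T.
reads = ["TACG", "TGCA", "TCGA", "TTAC"]
composition = base_composition(reads)
gc = gc_fraction_by_position(reads)
print(composition[0], gc)
```

A position where one base dominates, or where the GC fraction departs sharply from the rest of the read, is the kind of bias the diagnostic plots are meant to expose.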
Trim Sequences Prior To Analysis
• Make sure sequencing adapters are removed
• Trim ends of sequence based on quality scores
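Both trimming steps can be sketched directly on a sequence and its quality string. This is a simplified illustration, not the algorithm used by any particular trimming tool: adapter matching is exact, quality trimming just walks in from the 3' end, Phred+33 encoding is assumed, and the adapter string and read are chosen for the example.

```python
def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, quality
    return sequence[:index], quality[:index]


def trim_low_quality_3prime(sequence, quality, min_phred=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# Example read: 8 genomic bases followed by an adapter sequence (illustrative).
adapter = "AGATCGGAAGAGC"
sequence = "ACGTACGT" + adapter
quality = "IIIIII##" + "I" * len(adapter)

sequence, quality = trim_adapter(sequence, quality, adapter)
trimmed_sequence, trimmed_quality = trim_low_quality_3prime(sequence, quality)
print(trimmed_sequence, trimmed_quality)
```

Removing the adapter first matters here: the low-quality bases only become the 3' end of the read once the adapter has been cut away.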
Trimmomatic
FastX Toolkit – Hannon Lab at CSHL
Sequence Alignment/Map (SAM) Format
Common file format to store:
 - Reads
 - Quality of each base
 - How reads align to a reference sequence
Generated by most next gen analysis software
samtools software package
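Each alignment line in a SAM file is tab-separated, with eleven mandatory fields that carry the read, its base qualities, and where it aligned. A minimal parsing sketch follows; the class and function names are illustrative, the example line is made up, optional tags after the eleventh field are ignored, and header lines (starting with `@`) are assumed to have been skipped by the caller.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base quality string


def parse_sam_line(line):
    """Parse the eleven mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


def is_reverse_strand(record):
    return bool(record.flag & 0x10)


def is_unmapped(record):
    return bool(record.flag & 0x4)


# A made-up alignment line for illustration.
line = "read1\t99\tchr1\t272\t60\t7M\t=\t400\t135\tGATTACA\tIIIIIII\n"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```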
samtools Used to Manipulate SAM Files
SAM File
BAM File
PileUp File
samtools
Pileup output file
chr1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr1 276 G 22 TTTTTTTTTTTTTTTTTTTTTTT 33;+<<7=7<<7<&<<1;<<6<
chr1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
chr1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
chr1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
Call Variants …
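The read-bases column of the pileup output can be tallied with a short parser. In that column `.` and `,` mean a match to the reference on the forward and reverse strand, a letter is a mismatch, `^` plus one character marks the start of a read, `$` marks the end of a read, and `+N`/`-N` introduce indels. The sketch below is a simplified counter under those rules, not a variant caller, and the function name is illustrative.

```python
import re
from collections import Counter


def count_pileup_bases(ref_base, bases):
    """Tally the base each read shows at one position of a pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # read start marker plus its mapping-quality character
            continue
        if ch == "$":
            i += 1  # read end marker
            continue
        if ch in "+-":
            digits = re.match(r"\d+", bases[i + 1:])
            if digits is None:
                raise ValueError("indel marker without a length")
            # Skip the sign, the length digits, and the inserted/deleted bases.
            i += 1 + len(digits.group()) + int(digits.group())
            continue
        if ch in ".,":
            counts[ref_base.upper()] += 1  # matches the reference
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1        # mismatch
        elif ch == "*":
            counts["*"] += 1               # base deleted in this read
        i += 1
    return counts


# Read-bases columns copied from positions 272 and 277 of the output above.
counts_272 = count_pileup_bases("T", ",.$.....,,.,.,...,,,.,..^+.")
counts_277 = count_pileup_bases("T", "....,,.,.,.C.,,,.,..G.")
print(counts_272, counts_277)
```

At position 277, 20 of the 22 reads agree with the reference T while one shows C and one shows G, which looks more like sequencing error than a variant. Position 276, where every read shows T against a reference G, is the kind of site a variant caller would report.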
Binary Alignment/Map (BAM) Files
• Common file format to store reads and their alignment to a reference sequence
– Generated by most next gen analysis software
• samtools software package
• UCSC Genome Browser and Ensembl can display them as a custom track
– IGV from Broad very useful
UCSC Genome Browser with 1,000 Genomes Project Data
Integrative Genomics Viewer (IGV)
LookSeq at Sanger Mouse Genomes Project
Glo1 CNV Present in Mouse Genomes Data for A/J
Proximal Flank (Chr17: 30.5 Mb): max ~50x coverage
Glo1 Locus (Chr17: 30.7 Mb): max >100x coverage
Distal Flank (Chr17: 31.2 Mb): max ~50x coverage
Scale: 50 kb (each panel)
Glo1 CNV Not Present in Mouse Genomes Data for NZO
Proximal Flank (Chr17: 30.5 Mb): max ~25x coverage
Glo1 Locus (Chr17: 30.7 Mb): max ~25x coverage
Distal Flank (Chr17: 31.2 Mb): max ~25x coverage
Scale: 50 kb (each panel)
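The comparison between the two strains comes down to simple arithmetic: divide coverage at the locus by coverage in the flanking regions. The sketch below uses the approximate maxima quoted on the A/J and NZO slides, taking 100 as a lower bound for ">100x"; the function name is illustrative, and real CNV calling works from per-base depth with proper normalization rather than eyeballed maxima.

```python
import math
from statistics import mean


def copy_ratio(locus_coverage, flank_coverages):
    """Ratio of coverage at a locus to the mean coverage of its flanks."""
    return locus_coverage / mean(flank_coverages)


# Approximate maxima from the slides (A/J locus value is a lower bound).
aj_ratio = copy_ratio(100, [50, 50])
nzo_ratio = copy_ratio(25, [25, 25])

print("A/J:", aj_ratio, "log2 =", math.log2(aj_ratio))
print("NZO:", nzo_ratio, "log2 =", math.log2(nzo_ratio))
```

A ratio of about 2 or more in A/J is consistent with extra copies of the Glo1 region, while a ratio of about 1 in NZO indicates no copy number change relative to the flanks.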
Galaxy (http://main.g2.bx.psu.edu)
Public Data Repositories
EBI
NCBI
SRA Formatted Files
SRA ToolKit
FASTQ Files
Automatically Forward FASTQ Files to Galaxy
NCBI BioProject
NCBI Gene Expression Omnibus
Overall Analysis Workflow
FASTQ Files
Secondary Analysis
1. Read Preprocessing & Diagnostics
2. Align Reads to Reference
3. Analysis of Aligned Reads
e.g., Read counts per gene from RNA-Seq
Tertiary Analysis
1. Analysis of Read Counts
e.g., Differentially expressed genes
2. Analysis of Gene Lists
1. Enrichment
2. Pathway and networks
3. Analysis of Expression Patterns
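The hand-off from secondary to tertiary analysis is a table of read counts per gene. A minimal sketch of that counting step, assuming each aligned read is reduced to a chromosome and start position and each gene to a single interval; the gene names, coordinates, and reads are hypothetical, and dedicated counting tools additionally handle strand, exons, and reads that overlap several genes.

```python
def count_reads_per_gene(genes, read_positions):
    """Count reads whose start position falls inside each gene interval.

    genes: {name: (chromosome, start, end)} with inclusive 1-based coordinates.
    read_positions: iterable of (chromosome, position) for aligned reads.
    """
    counts = {name: 0 for name in genes}
    for chromosome, position in read_positions:
        for name, (gene_chromosome, start, end) in genes.items():
            if chromosome == gene_chromosome and start <= position <= end:
                counts[name] += 1
    return counts


# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}
aligned_reads = [("chr1", 120), ("chr1", 450), ("chr1", 900), ("chr2", 150)]

gene_counts = count_reads_per_gene(genes, aligned_reads)
print(gene_counts)
```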
Push-Button Bioinformatics … Be Careful
Third Generation Sequencing