DNA microarrays and NGS
Fundamentals of bioinformatics data processing
Bratislava, 10-11 November 2015, Ľuboš Kľučár, CC Attribution-ShareAlike License
NGS
Next Generation Sequencing
DNA sequencing
1st generation
– Maxam-Gilbert
– Sanger
2nd generation (next-generation sequencing, NGS)
– massively parallel sequencing
3rd generation
– real-time sequencing of single DNA molecules
2nd-generation sequencing
• Roche/454 FLX
• Illumina GA
• SOLiD
• IonTorrent
3rd-generation sequencing
• Heliscope
• PacBio
Next-generation sequencing (NGS)
Advantages
– high throughput
– low cost
– de novo sequencing
Disadvantages
– inaccurate sequencing of long homopolymer stretches
– more demanding data analysis
NGS workflow
• preparation of the DNA template (DNA library)
• amplification of the DNA library
• sequencing
Sanger vs. NGS: advantages of NGS
• clonal amplification: no in vivo cloning, transformation, or colony picking required
• sequencing on chips: a higher degree of parallelization than capillary sequencing
Alex Sánchez, VHIR Vall d’Hebron Institut de Recerca, 2011
http://www.slideshare.net/ueb52/introduction-to-next-generation-sequencing-v2
a. Emulsion PCR (emPCR): A reaction mixture consisting of an oil–aqueous emulsion is created to encapsulate bead–DNA complexes into single aqueous droplets. PCR amplification is performed within these droplets to create beads containing several thousand copies of the same template sequence. EmPCR beads can be chemically attached to a glass slide or deposited into PicoTiterPlate wells.
b. Solid-phase amplification: Composed of two basic steps: initial priming and extending of the single-stranded, single-molecule template, and bridge amplification of the immobilized template with immediately adjacent primers to form clusters.
NGS – clonal amplification
NGS – imaging
Illumina
Helicos BioSciences
SOLiD
Ion Torrent - Process overview
Merriman B et al. Electrophoresis 33(23), 3397-417 (2012). doi: 10.1002/elps.201200424
Ion Torrent
JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242
Sensor, well and chip architecture.
Ion Torrent
Wafer, die and chip packaging.
JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242
Ion Torrent Moore's Law style scaling of successive chip generations
Merriman B et al. Electrophoresis 33(23), 3397-417 (2012). doi: 10.1002/elps.201200424
Ion PGM™ Sequencer
Ion Torrent Single read accuracy
JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242
Ion Torrent
Vibrio fischeri, E. coli, Rhodopseudomonas palustris and Homo sapiens
JM Rothberg et al. Nature 475, 348-352 (2011) doi:10.1038/nature10242
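Comparison of NGS technologies
Generation | Technology | Cost/instrument | Cost/run | Cost/Mb | Run time | Read length (bp) | Number of reads | Yield | Error type and rate
-----------|------------|-----------------|----------|---------|----------|------------------|-----------------|-------|--------------------
1. | 3730XL | $100k | $100 | $1,600 | 2 h | 800 | (not given) | 60 kb | substitutions, 0.1-1%
2. | 454 FLX | $500k | $6,000 | $7 | 23 h | 750 | 1 million | 900 Mb | indels, 1%
2. | HiSeq 2000 | $700k | $24,000 | $0.04 | 11 days | 2 x 100 | 3 billion | 600 Gb | substitutions, >0.1%
2. | SOLiD 5500 | $600k | $10,000 | $0.07 | 14 days | 75 + 35 | 1.5 billion | 160 Gb | indels, >0.01%
2. (desktop) | 454 GS JR | $100k | $1,000 | $22 | 9 h | 400 | 100 thousand | 50 Mb | indels, 1%
2. (desktop) | MiSeq | $120k | $1,000 | $1 | 27 h | 2 x 150 | 5 million | 1 Gb | substitutions, >0.1%
2. (desktop) | IonTorrent | $80k; $150k | (314) $200; (316) $400; (318) $600; (P) $1,000 | $20; $4; $0.6; $0.1 | 3 h | 200; 100 | 600 thousand; 3 million; 6 million; 82 million | 10 Mb; 100 Mb; 1 Gb; 10 Gb | indels, ~1%
3. | PacBio RS | $700k | $300 | $3-14 | 1.5 h | >1,000 | 50 thousand | 100 Mb | indels, ~15%
3. | HeliScope | $1M | (not given) | $0.5 | 8 days | 55 | 1 million | 35 Gb | deletions, 3%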
Quality control
Quality assessment: the Phred score
Phred score (Phred+33 | Phred+64 character)   Probability of an incorrect base call   Base call accuracy
10 (+|J)                                      1 in 10                                 90%
20 (5|T)                                      1 in 100                                99%
30 (?|^)                                      1 in 1,000                              99.9%
40 (I|h)                                      1 in 10,000                             99.99%
50 (S|r)                                      1 in 100,000                            99.999%
Phred score Q
• a quantity logarithmically related to the probability P of an incorrect base call: Q = -10 log10(P)
• ASCII encoding: add 33 to the Phred score (Illumina pipelines 1.3 to 1.7 add 64)
http://en.wikipedia.org/wiki/Phred_quality_score
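The relationship between Q, P and the ASCII character can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example and do not come from any library.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def phred_to_char(q, offset=33):
    """ASCII character that encodes Q with the given offset (33 or 64)."""
    return chr(q + offset)

def char_to_phred(ch, offset=33):
    """Phred score encoded by one quality character."""
    return ord(ch) - offset

# Rebuild the rows of the Phred table: score, both characters, error probability.
for q in (10, 20, 30, 40, 50):
    print(q, phred_to_char(q, 33), phred_to_char(q, 64), phred_to_error_prob(q))
```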
Quality encoding
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
0.........................26...31.......40
                         -5....0........9.............................40
                               0........9.............................40
                                  3.....9.............................40
0.........................26...31........41
S - Sanger Phred+33, raw reads typically (0, 40)
X - Solexa Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64, raw reads typically (3, 40)
with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (underline)
L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
http://en.wikipedia.org/wiki/Phred_quality_score
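A minimal sketch of reading FASTQ records and guessing the quality offset from the characters present. The helper names are hypothetical, and the guess is only a heuristic based on the ranges in the chart above: characters below ';' (ASCII 59) occur only in Phred+33 data, so a file of uniformly high-quality Phred+33 reads could be misclassified.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from FASTQ lines.

    Assumes well-formed input with exactly four lines per record.
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def guess_offset(quality):
    """Heuristic: 33 if any character is below ASCII 59, otherwise 64."""
    return 33 if min(ord(c) for c in quality) in range(33, 59) else 64

def decode_quality(quality, offset):
    """Convert a quality string to a list of integer scores."""
    return [ord(c) - offset for c in quality]

# First ten quality characters of the example record above.
prefix = "efcfffffcf"
offset = guess_offset(prefix)
print(offset, decode_quality(prefix, offset))
```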
Data distribution
[Figure: histogram of the example values and the corresponding box-and-whisker plot]
Raw values: 7 6 6 5 7 2 1 6 6 6 6 7 5 8 9 7 8 5 5 4 7 6 5 6 6 4 3 9 8 4 7 3
Sorted values: 1 2 3 3 4 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 9 9
Box and whiskers: min, q1, median, q3, max
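The five values that define a box-and-whisker plot can be computed with the Python standard library. This is a sketch only: quartile conventions differ between tools, and `statistics.quantiles` with its default method is just one of them. For these 32 values the result is min 1, q1 5, median 6, q3 7, max 9.

```python
import statistics

# The 32 example values from the slide, one digit per observation.
values = [int(d) for d in "76657216666758978554765664398473"]

def five_number_summary(data):
    """Return (min, q1, median, q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

print(sorted(values))
print(five_number_summary(values))
```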
Reads statistics
1. C T T G G G A A A T A A T T T A T A A T
8 8 8 8 8 8 8 4 4 4 4 4 3 3 3 3 3 1 1 1
2. C T T G G G A A A T A A T T T A T A A A
7 7 7 7 7 7 4 4 4 4 4 4 4 4 3 3 1 1 1 1
3. T A T A A C A A A A T C C T T T T T A T
9 9 9 8 8 8 8 7 8 8 7 7 7 4 5 4 3 1 2 2
4. T G T A T C A A A A C A G C T T G G G A
9 8 8 5 7 7 7 7 8 8 7 5 4 4 4 4 4 1 2 2
5. G T T A G T G T G T G T A T C A A A A C
7 7 7 7 7 7 7 6 6 6 4 4 3 3 4 4 4 2 2 2
6. A T C T G T T A G T G T G T G T A T C A
8 8 8 5 5 5 5 5 4 5 5 5 4 4 4 4 4 2 2 2
7. A T A A C A A A A T C C T T T T T A T A
8 8 8 8 8 8 8 5 5 5 5 5 4 4 4 4 4 2 2 2
8. T T A T C G A T T A A A G A T A G A A A
8 8 8 8 8 8 8 5 5 5 5 5 4 4 4 4 4 2 2 2
9. T A G A G T A T C T G T T A G T G T G T
5 6 5 6 5 6 6 5 4 5 5 5 4 4 3 4 4 2 2 2
10. T A G A G T A T C T G T T A G T G T G T
4 6 5 4 4 8 6 5 4 5 4 5 2 4 3 4 4 1 1 1
[Figure: Per sequence quality scores; x-axis: mean sequence quality (Phred score)]
[Figure: Per base sequence quality; x-axis: position in read (bp)]
average (per base): 7 8 7 7 7 7 7 5 5 6 5 5 4 4 4 4 4 2 2 2
median (per base): 8 8 8 7 7 8 7 5 5 5 5 5 4 4 4 4 4 2 2 2
average (per sequence), reads 1-10: 5 4 6 6 5 5 5 5 4 4
FastQC
quality control tool for high throughput sequence data
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
http://www.youtube.com/watch?v=bz93ReOv87Y
FastQC - Per base sequence quality
good sequence bad sequence
FastQC - Per sequence quality scores
good sequence bad sequence
FastQC - Per tile sequence quality
FastQC - Per base sequence content
good sequence bad sequence
FastQC - Per base GC content
good sequence bad sequence
FastQC - Per sequence GC content
good sequence bad sequence
FastQC - Per base N content
good sequence bad sequence
FastQC - Sequence Length Distribution
good sequence bad sequence
FastQC - Sequence Duplication Levels
good sequence bad sequence
FastQC - Overrepresented sequences
good sequence bad sequence
No overrepresented sequences
FastQC - Kmer Content
good sequence bad sequence
Statistical parameters
depth / depth of coverage / coverage / read coverage
• the (average) number of sequenced nucleotides (reads) covering each position of a given region (e.g. 30x)
genome coverage
• the fraction of reference genome bases covered by the assembled contigs (e.g. 99.9%)
N50
• the contig length L such that contigs of length L or longer contain at least half of all assembled bases (e.g. 18,654 bp)
maximum / median / average contig size, number of contigs
• the maximum, median, and average contig length, and the number of contigs
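A short sketch of two of these parameters. N50 is computed by sorting contigs from longest to shortest and accumulating lengths until half of the assembly is reached. The expected mean depth uses the common estimate (number of reads x read length / genome size). Both function names are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

def mean_depth(read_count, read_length, genome_size):
    """Expected average depth of coverage: reads * read length / genome size."""
    return read_count * read_length / genome_size

contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # toy contig lengths, 32 bases in total
print(n50(contigs))
print(mean_depth(3_000_000, 100, 10_000_000))
```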
Chipster Open source platform for data analysis
http://chipster.csc.fi/
Galaxy Data intensive biology for everyone
http://galaxyproject.org/
IGV Integrative Genomics Viewer
http://www.broadinstitute.org/igv/
Libraries
Library preparation
single-end
+ simpler
+ 100 ng–1 μg DNA
- unsuitable for joining contigs
paired-end
+ more accurate alignment
+ helps with repeats
- 100 ng–1 μg DNA
mate pair
+ de novo
+ genome finishing
+ structural variant detection
- 5-120 μg DNA
Insert sizes shown on the slide: 200-800 bp; 2-20 kbp
Mate pair libraries
Berglund et al. Investigative Genetics 2011 2:23 doi:10.1186/2041-2223-2-23
Alignment and de novo assembly
Berglund et al. Investigative Genetics 2011 2:23 doi:10.1186/2041-2223-2-23
Alignment of paired-end reads
De novo assembly
Sample multiplexing (barcoding)
multiplex → de-multiplex → align
Illumina: An Introduction to Next-Generation Sequencing Technology.
+ up to 120 different samples in a single run
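The de-multiplexing step can be illustrated with a small sketch. It assumes that the barcode (index) is stored in the read header between '#' and '/', as in the FASTQ example earlier in these slides, and that only exact barcode matches are accepted. Production de-multiplexers usually tolerate mismatches; the function names and the sample sheet here are hypothetical.

```python
from collections import defaultdict

def barcode_from_header(header):
    """Extract the index sequence from a header such as '...#ATCACG/1'."""
    tail = header.rsplit("#", 1)[1]
    return tail.split("/", 1)[0]

def demultiplex(records, sample_by_barcode):
    """Group (header, sequence, quality) records by sample name."""
    bins = defaultdict(list)
    for header, seq, qual in records:
        barcode = barcode_from_header(header)
        sample = sample_by_barcode.get(barcode, "undetermined")
        bins[sample].append((header, seq, qual))
    return dict(bins)

# Hypothetical sample sheet and reads, for illustration only.
sample_sheet = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
reads = [
    ("run:1:1#ATCACG/1", "ACGT", "IIII"),
    ("run:1:2#CGATGT/1", "TTGA", "IIII"),
    ("run:1:3#GGGGGG/1", "CCCC", "IIII"),
]
for sample, group in demultiplex(reads, sample_sheet).items():
    print(sample, len(group))
```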
De novo assembly
.fastq → assembler → .fasta
Velvet (de Bruijn graphs)
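The idea behind a de Bruijn graph can be shown with a toy example: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. This is a conceptual sketch only, not Velvet's implementation, which also handles sequencing errors, reverse complements, and graph simplification.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)

# Two overlapping toy reads; with k = 3 the nodes are 2-mers.
for node, successors in de_bruijn_graph(["ACGTC", "GTCA"], 3).items():
    print(node, "->", successors)
```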
Mapping
.fastq → aligner → .sam / .bam
BWA
Bowtie
SAM / BAM
SAM (Sequence Alignment/Map format)
BAM (Binary Alignment/Map) • binary version of the SAM format
http://samtools.sourceforge.net/SAM1.pdf
SRR017937.312 16 chr20 43108717 37 76M * 0 0
TGAGCCTCCGGGCTATGTGTGCTCACTGACAGAAGACCTGGTCACCAAAGCCCGGGAAGAGCTGCAGGAAAAGCCG
?,@A=A<5=,@==A:BB@=B9(.;A@B;>@ABBB@@9BB@:@5<BBBB9)>BBB2<BBB@BBB?;;BABBBBBBB@
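QNAME: Query name of the read
FLAG: Bitwise flag (e.g. 16 = read mapped to the reverse strand)
RNAME: Reference sequence name
POS: Position of alignment in reference sequence (1-based)
MAPQ: Mapping quality (Phred-scaled)
CIGAR: Specifics of the alignment against the reference
MRNM: Reference sequence name of the mate (* if unavailable)
MPOS: Position of the mate (0 if unavailable)
ISIZE: Inferred insert size (0 if unavailable)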
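A minimal sketch of reading the first nine SAM fields from a tab-separated line, using the field names listed above. The example line repeats the record shown on the slide with the SEQ and QUAL columns left out for brevity; the class and function names are made up for this illustration.

```python
import re
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int

def parse_sam_line(line):
    """Parse the first nine tab-separated fields of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]))

def is_reverse_strand(record):
    """Bit 0x10 (decimal 16) of FLAG marks a read mapped to the reverse strand."""
    return bool(record.flag & 0x10)

def cigar_operations(cigar):
    """Split a CIGAR string such as '76M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

line = "SRR017937.312\t16\tchr20\t43108717\t37\t76M\t*\t0\t0"
record = parse_sam_line(line)
print(record.rname, record.pos, is_reverse_strand(record), cigar_operations(record.cigar))
```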
Variant detection
.bam → [pileup] SAMtools → .bed
Rudy G: A Hitchhiker’s Guide to Next Generation Sequencing.
Statistical assumptions
• variants are expected at a rate of about 1 per 1,000 bp
• about 85% of variants are already known (dbSNP)
• the transition/transversion ratio should be > 2 for high-quality variants (and even higher for variants from coding regions)
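The transition/transversion check can be sketched as follows. Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) substitutions; every other single-base change is a transversion. The function names and the variant list are hypothetical, and the input is assumed to contain only valid single-nucleotide variants.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A/G or C/T substitutions, in either direction."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair.issubset(PURINES) or pair.issubset(PYRIMIDINES))

def ti_tv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")

# Made-up variant calls, for illustration only.
calls = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```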
Pileup
• a summary of the alignment at each reference position
http://samtools.sourceforge.net/pileup.shtml
chr1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
chr1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
chr1 274 T 23 ,.$....,,.,.+1A,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
chr1 275 A 23 ,$....,,.,-2AG,...,-2AG,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
chr1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
ReferenceSeq [string] - name of the reference sequence
Coordinate [integer] - position in the reference sequence
ReferenceBase [A/C/G/T/N] - reference base at that position
#Reads [integer] - number of reads aligning to that base
ReadBases [variable length string, see below]
BaseQualities [variable length string, Phred encoded]
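The ReadBases column can be decoded with a small parser. In the SAMtools pileup notation, '.' and ',' are matches to the reference on the forward and reverse strand, a letter is a mismatch, '^' marks the start of a read and is followed by one mapping-quality character, '$' marks the end of a read, and '+n' or '-n' followed by n bases is an insertion or deletion. The sketch below only counts these events; the function name is hypothetical.

```python
from collections import Counter

def count_pileup_bases(read_bases):
    """Count reference matches, mismatched bases, and indels in a ReadBases string."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":        # read start; the next character encodes mapping quality
            i += 2
        elif ch == "$":      # read end marker
            i += 1
        elif ch in "+-":     # indel: sign, length digits, then that many bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            length = int(read_bases[i + 1:j])
            counts["ins" if ch == "+" else "del"] += 1
            i = j + length
        elif ch in ".,":     # match on forward or reverse strand
            counts["ref"] += 1
            i += 1
        else:                # mismatch; lower case means reverse strand
            counts[ch.upper()] += 1
            i += 1
    return counts

# ReadBases of position chr1:273 from the example above.
print(count_pileup_bases(",.....,,.,.,...,,,.,..A"))
```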
BED
BED (Browser Extensible Data format) • defines data for display in a genome browser
http://www.ensembl.org/info/website/upload/bed.html
chr22 25043062 25043063 I 0 +
chr22 25745895 25745896 I 0 +
chr22 26769465 26769466 I 0 -
chr22 26886954 26886955 D 0 +
chrom - the name of the chromosome (e.g. chr3, chrY)
chromStart - the starting position of the feature in the chromosome (0-based: the first base is numbered 0)
chromEnd - the ending position of the feature in the chromosome
name - defines the name of the feature
score - a score between 0 and 1000 (e.g for coloring the feature)
strand - defines the strand - either '+' or '-'.
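Because chromStart is 0-based and chromEnd is exclusive, the feature length is simply chromEnd - chromStart. A minimal sketch of parsing the six columns listed above and converting to 1-based inclusive coordinates; the class and function names are made up for this example.

```python
from typing import NamedTuple

class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    name: str
    score: int
    strand: str

def parse_bed_line(line):
    """Parse the first six whitespace-separated BED columns."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedFeature(chrom, int(start), int(end), name, int(score), strand)

def to_one_based(feature):
    """Return (start, end) as 1-based inclusive coordinates."""
    return feature.start + 1, feature.end

feature = parse_bed_line("chr22 25043062 25043063 I 0 +")
print(feature, feature.end - feature.start, to_one_based(feature))
```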
RNA-seq
• studies gene expression levels and detects novel genes and non-coding RNAs
• the NGS alternative to DNA microarrays
• not limited to known genes
• provides information on alternative splicing
• also detects sequence variants
• minimal or no background
• no upper limit for quantification: a dynamic range of about 4 orders of magnitude (compared with roughly 2 for DNA microarrays)
RNA-seq
FPKM
Fragments Per Kilobase of transcript per Million mapped reads
$$\mathrm{FPKM} = \frac{\text{total fragments}}{\text{exon length (kb)} \times \text{mapped reads (millions)}}$$
RPKM
Reads Per Kilobase of transcript per Million mapped reads
$$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{exon length (kb)} \times \text{mapped reads (millions)}}$$
Quantification
• Normalization
• RPKM (reads per kilobase per million mapped reads)
Sample (mapped reads) | Gene A (0.6 kb) | Gene B (1.1 kb) | Gene C (1.4 kb)
Sample 1 (6 million) | 12 reads, RPKM = 12/(0.6*6) = 3.33 | 24 reads, RPKM = 24/(1.1*6) = 3.64 | 11 reads, RPKM = 11/(1.4*6) = 1.31
Sample 2 (8 million) | 19 reads, RPKM = 19/(0.6*8) = 3.96 | 28 reads, RPKM = 28/(1.1*8) = 3.18 | 16 reads, RPKM = 16/(1.4*8) = 1.43
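The RPKM values for the two samples can be recomputed directly from the read counts, gene lengths, and library sizes given on the slide. This is a small sketch; the function name and the data layout are chosen for this example only.

```python
def rpkm(read_count, gene_length_kb, mapped_reads_millions):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count / (gene_length_kb * mapped_reads_millions)

gene_length_kb = {"A": 0.6, "B": 1.1, "C": 1.4}
# Library size in millions of mapped reads, then read counts per gene.
samples = {
    "Sample 1": (6, {"A": 12, "B": 24, "C": 11}),
    "Sample 2": (8, {"A": 19, "B": 28, "C": 16}),
}

for name, (millions, counts) in samples.items():
    for gene, n in counts.items():
        print(name, gene, round(rpkm(n, gene_length_kb[gene], millions), 2))
```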
RNA-seq
Workflow, run for condition A and condition B:
reads (.fastq) + reference genome → TopHat → mapped reads (.bam)
mapped reads (.bam) → Cufflinks → assembled transcripts (.gtf)
assembled transcripts (.gtf) → Cuffmerge → final transcriptome assembly (.gtf)
mapped reads (.bam) + final transcriptome assembly (.gtf) → Cuffdiff → differentially expressed genes
Full protocol
• for genomes with a poorly studied transcriptome (new genomes, cancer genomes)
Shortened protocol
• well-studied genomes
• an incomplete or incorrectly annotated genome leads to inaccurate estimates of gene expression levels
Trapnell et al. (2012) Nature Protocols 7(3): 562-578.
Ribo-seq
(Ribo-seq = ribosome profiling)
• uses NGS to monitor translation in vivo
• unlike RNA-seq, it targets only mRNAs currently bound to ribosomes, i.e. those being actively translated
Applications:
– identification of translation start sites
– measuring the rate of protein synthesis
– predicting protein abundance
Ingolia, NT (2014) Nat Rev Genet. 15:205-213