Overview Analyzing Sequence Data

Eric L. Stevens, Ph.D.

International Policy Analyst

International Affairs [email protected]

Center for Food Safety and Applied Nutrition

Overview: Genome Assembly

• Understand the process of going from raw sequence reads (FASTQ or FASTA) to a genome assembly (SAM or BAM format)

• Have a conceptual understanding of the difference between alignment and assembly, however you hear them described (i.e. with or without a reference sequence as a guide)

• Have a conceptual understanding of how reads are aligned/assembled and the pipelines that exist

• Understand how genome sequences are QC’ed and what the SAM and BAM formats look like

Overview: Variant Detection

• Understand the process from taking aligned reads (or raw reads) to identifying variants

• Understand the rationale behind the different options (k-mer, reference-free, reference-based)

• Have a conceptual understanding of the major concepts underlying variant identification
– Important for explaining how robust the data and analysis are!

• Understand how this genetic variation will set us up for phylogenetic tree construction

What Is a Computer?

• Computer
– Does computations and carries out logical decisions
– Faster than humans

• Computer program
– Set of instructions that a computer uses to do something

• Hardware
– Physical components of a computer system

• Software
– Computer programs that run on a computer

Other Options for Data Analysis

FASTQ

• A more compact format to store sequences and qualities

• Normally on 4 lines:
– “@” followed by the sequence ID
– Sequence
– “+”
– The quality scores

• Quality score:
– ASCII encoding of Phred scores
– Sanger has one scale, Illumina has 3 different (…)

• Can be gzipped and used as such by some programs

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
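As a minimal sketch (not part of the original slides), the four-line record above could be read in Python roughly like this. The names `FastqRecord`, `parse_fastq_record` and `phred33_scores` are made up for illustration, and the decoding assumes the Sanger scale (ASCII value minus 33); the other scales mentioned above use a different offset.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def parse_fastq_record(lines: List[str]) -> FastqRecord:
    """Parse one 4-line FASTQ record (illustrative helper, not a full parser)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record is expected to have exactly 4 lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("line 1 must start with '@' and line 3 with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return FastqRecord(header[1:], sequence, quality)


def phred33_scores(quality: str) -> List[int]:
    """Decode a quality string, assuming the Sanger (Phred+33) encoding."""
    return [ord(symbol) - 33 for symbol in quality]


record = parse_fastq_record([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
scores = phred33_scores(record.quality)
print(record.seq_id, len(record.sequence), min(scores), max(scores))
```

Real FASTQ files hold many such records back to back, so a production reader would loop over the file four lines at a time (or use an established library) rather than take a single list.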

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Field     Meaning
EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read fails filter (read is bad), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence
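The field breakdown above can be sketched in code. This is only an illustration that assumes every header follows exactly the layout shown (seven colon-separated fields, a space, then four more); the dictionary keys are labels chosen here, not official field names.

```python
from typing import Dict

# Labels chosen for this sketch; they mirror the descriptions above.
LEFT_FIELDS = ["instrument", "run_id", "flowcell_id", "lane", "tile", "x", "y"]
RIGHT_FIELDS = ["pair_member", "fails_filter", "control_bits", "index_sequence"]


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split a header of the form shown above into named fields."""
    left, right = header.lstrip("@").split(" ", 1)
    left_values = left.split(":")
    right_values = right.split(":")
    if len(left_values) != len(LEFT_FIELDS) or len(right_values) != len(RIGHT_FIELDS):
        raise ValueError("header does not match the expected layout")
    return dict(zip(LEFT_FIELDS + RIGHT_FIELDS, left_values + right_values))


fields = parse_illumina_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
for name, value in fields.items():
    print(f"{name:15s}{value}")
```

Headers written by other instruments or software versions may be laid out differently, which is why the sketch raises an error instead of guessing.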

Phred Score

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
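The table follows the standard Phred relationship Q = -10 · log10(P), where P is the probability of an incorrect base call. A small sketch that reproduces the rows above (the function names are illustrative):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):g} %")
```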

FASTA

• Most basic file format to represent nucleotide or amino-acid sequences

• Each sequence is represented by:
– A single description line (shouldn’t exceed 80 characters):
  • Starts with “>”
  • Followed by the sequence ID, a space, and then
  • More information (description)
– The sequence, over one or several lines (the number of characters per line is generally 70 or 80, but it doesn’t matter)
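A minimal sketch of reading this layout, assuming plain uncompressed text. The helper name `parse_fasta` and the example record are hypothetical; the description after the ID is simply discarded here.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence ID: sequence}, joining sequences split over several lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split(None, 1)  # ID, then optional description
            if not parts:
                raise ValueError("description line has no sequence ID")
            current_id = parts[0]
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before any '>' line")
        else:
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


# Hypothetical record: the sequence is wrapped over two lines.
example = [
    ">isolate1 example description",
    "AGGAATGT",
    "TGGCAGTCG",
]
sequences = parse_fasta(example)
print(sequences)
```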

Data Preprocessing

• The bases of sequencing reads are not always correct

• Different sequencing platforms have different errors
– Therefore, you must know how the sequencing was performed

• There are several common steps:
– Remove bad sequences that we have low confidence in
– Remove leftover adaptor sequences
– Trim the ends of the reads
– k-mer correction …
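To make the "trim the ends of the reads" step concrete, here is a deliberately simple sketch that removes low-quality bases from the 3’ end of one read. The threshold of 20 and the Phred+33 offset are assumptions chosen for the example, and dedicated trimming tools use more careful strategies than this base-by-base rule.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until one reaches min_quality.

    min_quality and offset are illustrative defaults, not recommendations.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: 'I' decodes to Q40 and '#' to Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```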

What else could go wrong?

• GC Bias

• Homopolymers

• Low quality bases (especially on the 3’ end)

• Clonal duplicates

• Contamination

• Sequencing artifacts…
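Two of the issues listed above, GC bias and homopolymers, are easy to measure per read. The sketch below is only an illustration with made-up function names and a hypothetical read; what counts as a "bad" value depends on the organism and the sequencing platform.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of one repeated base."""
    best = 0
    run = 0
    previous = None
    for base in sequence.upper():
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


read = "ACGGGGGTTA"  # hypothetical read
print(gc_fraction(read), longest_homopolymer(read))
```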

Current WGS Methodology (Simplified)

- Isolate bacterial genome and break it up into many small fragments

- Sequence the fragments using Short or Long-Read Sequencing Technology

- Assemble the thousands to millions of sequenced reads back into the entire genomic sequence of the isolate

- Find genetic differences between isolates (SNPs)

- Use SNPs to create a phylogenetic tree to inform genetic relatedness of foodborne isolates


Isolate and copy bacterial genome

Overview: Current WGS methodology involves growing the bacterium to be sequenced as a pure culture (one bacterial colony) so that there are many copies of the same bacterial genome. Then the bacterial DNA is extracted.

Bacterial Genome

• 1 circular genome

• 3-5 million nucleotides (A, G, C, T)

• May have extra DNA (plasmids)

Break up multiple copies of same genome into different pieces

Overview: Current WGS methodology involves cutting multiple copies of the same genome at different positions to generate shorter fragments (of a similar length, which you can size-select for) that will overlap later on when reassembling the genome (note: this is key to generating a whole-genome sequence)

Genome copy 1

Genome copy 2

Genome copy 3

• Depending on sequencing chemistry you will have thousands to millions of reads

• Short read = up to ~800bp (can be millions of reads) and involves Paired-End Sequencing

• Long read = up to 40kb in length (thousands of reads) and fragmentation is different


[Figure: Genome Copy 1 and Genome Copy 2, each broken into fragments labeled 1, 2 and 3]


How Assembly Works (Simplified)

Overview: Find overlaps (the same string of nucleotides) between reads with different breakpoints.

[Figure: fragments 1, 2 and 3 lined up by their overlaps; bold = reassembled genome sequence]
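The overlap idea can be sketched in a few lines. This toy version assumes error-free reads that are already given in left-to-right order, which real assemblers cannot assume; the three reads and the function names are hypothetical.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(reads: List[str], min_overlap: int = 3) -> str:
    """Chain reads together by their overlaps (reads must already be ordered)."""
    contig = reads[0]
    for read in reads[1:]:
        shared = overlap_length(contig, read, min_overlap)
        if shared == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        contig += read[shared:]  # append only the part that is new
    return contig


# Hypothetical reads from the same genome, cut at different breakpoints.
reads = ["GGAATGTTGGCAGT", "GAATGTTGGCAGTC", "AATGTTGGCAGTCG"]
contig = merge_reads(reads)
print(contig)
```

With sequencing errors, repeats, and reads in unknown order and orientation, finding the right overlaps becomes much harder, which is what dedicated assemblers are for.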


Condensing multiple reads into a draft or complete genome

AGGATTGTTGGCAG
 GGAATGTTGGCAGT
  GAATGTTGGCAGTC
   AATGTTGGCAGTCG

• 4 sequenced reads that have been aligned together from the SAME isolate

• Red letter (the T at the fifth position of the first read) indicates a possible sequencing error that is resolved by looking at the nucleotides from other reads at that position

AGGAATGTTGGCAGTCG

• Genomic sequence of the isolate that takes into account the other sequencing reads
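One simple way to resolve such an error is a per-position majority vote. The sketch below assumes the two sequence lines above hold four 14-base reads, each starting one base to the right of the previous one; the offsets are therefore an interpretation of the slide rather than something it states, and `consensus` is an illustrative name.

```python
from collections import Counter
from typing import List, Tuple


def consensus(aligned_reads: List[Tuple[str, int]]) -> str:
    """Majority-vote consensus from (read, start offset) pairs.

    Positions with no coverage become 'N'. Ties go to the base seen first.
    """
    length = max(offset + len(read) for read, offset in aligned_reads)
    columns = [Counter() for _ in range(length)]
    for read, offset in aligned_reads:
        for index, base in enumerate(read):
            columns[offset + index][base] += 1
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )


# The four reads from the slide, with assumed offsets 0 to 3.
aligned_reads = [
    ("AGGATTGTTGGCAG", 0),  # fifth base T is the suspected error
    ("GGAATGTTGGCAGT", 1),
    ("GAATGTTGGCAGTC", 2),
    ("AATGTTGGCAGTCG", 3),
]
result = consensus(aligned_reads)
print(result)
```

At the fifth position three reads say A and one says T, so the vote picks A, matching the genomic sequence shown on the slide.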

Assembly without a Reference

• Requires mate-pair (read-pair) sequence information
– Ideally a combination of the two with different insert sizes

• Analysis can be complicated
– Assembly, scaffolding, finishing
– Could require some manual steps

• Getting easier all the time
– Increased read length
– Better algorithms

• Necessary if you do not want to be biased by a reference genome (a reference-guided assembly will not assemble novel genomic loci)

Assembly with a Reference

• Requires a sequenced genome that is very similar (usually same species or closer)

• Sequencing reads are then aligned to the reference genome

• Can help guide where the reads go
– But can also mislead
– Will not align reads that do not match the reference!
– Will miss plasmids, novel genes, antimicrobial resistance (AMR) genes, etc.

• Can be faster and good for a quick check to make sure you have sequenced what you think you did

Sequence Alignment: the beginnings of assembling a genome

• When two sequences are compared and their similarity is statistically significant, it is highly likely that the two sequences are evolutionarily related

• Expect differences
– SNPs (e.g. AGTC -> AGTT)
– Insertions or deletions (indels)
  • AGTC
  • A_TC
  • AGGTC
  • A__C
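Once two sequences are aligned (padded with a gap character so they have equal length), the differences above can be listed position by position. This sketch assumes the alignment has already been done and uses "_" for gaps as on the slide; the function name and the tuple layout are choices made for the example.

```python
from typing import List, Tuple


def describe_differences(reference: str, sample: str,
                         gap: str = "_") -> List[Tuple[int, str, str, str]]:
    """List (position, kind, reference base, sample base) for each mismatch.

    Both inputs must already be aligned, i.e. have the same length.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1):
        if ref_base == sample_base:
            continue
        if sample_base == gap:
            kind = "deletion"
        elif ref_base == gap:
            kind = "insertion"
        else:
            kind = "SNP"
        differences.append((position, kind, ref_base, sample_base))
    return differences


print(describe_differences("AGTC", "AGTT"))    # a SNP
print(describe_differences("AGTC", "A_TC"))    # one deleted base
print(describe_differences("AG_TC", "AGGTC"))  # one inserted base
print(describe_differences("AGTC", "A__C"))    # two deleted bases
```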

High-level overview

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..)

                 Input                       Output
High coverage:   many pieces to assemble     a few contigs, a few gaps
Low coverage:    a few pieces to assemble    many contigs, many gaps
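Coverage is commonly estimated as (number of reads × read length) / genome size. That rule of thumb is not stated on the slide, and the read counts and read length below are hypothetical; the genome size is taken from the 3-5 million nucleotide range given earlier.

```python
def expected_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average number of times each genome position is expected to be sequenced."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * read_length / genome_size


GENOME_SIZE = 5_000_000  # upper end of the bacterial range mentioned earlier

high = expected_coverage(read_count=1_000_000, read_length=150, genome_size=GENOME_SIZE)
low = expected_coverage(read_count=50_000, read_length=150, genome_size=GENOME_SIZE)
print(f"high-coverage run: {high:g}x, low-coverage run: {low:g}x")
```

The actual depth varies from position to position, so an average like this only indicates whether a run falls in the "few gaps" or "many gaps" regime.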

Practical Example

• Isolate Genome = “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness”

• Ambiguously placed reads:

• it was the

• times it w

• was the ag

• the age of

• Unambiguously placed reads:

• the best o

• the worst

• age of foo

• f wisdom i

• Reconstructed Genome = “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness”
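The split between ambiguously and unambiguously placed reads can be checked by counting how many places each read fits in the genome: a read that fits in exactly one place is unambiguous. A small sketch (the helper name is illustrative):

```python
def count_occurrences(genome: str, read: str) -> int:
    """Count every position where `read` matches `genome`, overlaps included."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


genome = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")

reads = ["it was the", "times it w", "was the ag", "the age of",
         "the best o", "the worst", "age of foo", "f wisdom i"]

for read in reads:
    placements = count_occurrences(genome, read)
    label = "unambiguous" if placements == 1 else "ambiguous"
    print(f"{read!r:14} fits in {placements} place(s): {label}")
```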


Sequence Alignment Map (SAM)

Standard format for reporting short read alignment data

• BAM is the compressed binary version

Header

Alignment info

http://samtools.sourceforge.net/
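In a SAM file, header lines start with "@" and each alignment line has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), optionally followed by tags. The sketch below splits one alignment line; the example line is made up for illustration, and real work would normally go through samtools or a library rather than hand parsing.

```python
from typing import NamedTuple, Tuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "17M"
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: Tuple[str, ...]


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line (not a header line) into named fields."""
    if line.startswith("@"):
        raise ValueError("header lines start with '@' and are not alignments")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def is_unmapped(alignment: SamAlignment) -> bool:
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(alignment.flag & 0x4)


# Hypothetical alignment line, for illustration only.
example_line = "read1\t0\tref\t7\t60\t17M\t*\t0\t0\tAGGAATGTTGGCAGTCG\t*"
alignment = parse_sam_line(example_line)
print(alignment.qname, alignment.rname, alignment.pos, alignment.cigar,
      is_unmapped(alignment))
```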


The Data: BAM

Assessing Assembly Quality

Comparing a single isolate’s reads against a reference genome

Overview: Identify the genetic differences (called Single Nucleotide Polymorphisms, or SNPs) between genomes

Comparing the sequences of different bacterial isolates by their genetic differences

• In this example, the first and the fourth isolates have sequences that are genetically different from the sequences of the second and third isolates, which are identical

• Based on this data you could say that isolate 2 and isolate 3 are more closely related genetically because they have all the same nucleotides

Isolate 1   TGGAATGTTGGCAGTCG
Isolate 2   AGGAATGTTGGCAGTCG
Isolate 3   AGGAATGTTGGCAGTCG
Isolate 4   AGGAATGTTGGCAGTCC
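Counting the positions at which two isolates differ gives a simple pairwise SNP distance. The sketch below assumes the sequence data above is four 17-base sequences, one per isolate in order, and that they are already aligned to each other.

```python
from itertools import combinations

isolates = {
    "Isolate 1": "TGGAATGTTGGCAGTCG",
    "Isolate 2": "AGGAATGTTGGCAGTCG",
    "Isolate 3": "AGGAATGTTGGCAGTCG",
    "Isolate 4": "AGGAATGTTGGCAGTCC",
}


def snp_distance(first: str, second: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(first, second))


distances = {
    (name_a, name_b): snp_distance(seq_a, seq_b)
    for (name_a, seq_a), (name_b, seq_b) in combinations(isolates.items(), 2)
}
for (name_a, name_b), distance in distances.items():
    print(f"{name_a} vs {name_b}: {distance} SNP(s)")
```

A table of pairwise distances like this is one possible starting point for grouping isolates, although tree-building methods typically work from the full alignment.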

Reference Genome

• Sequencing reads that reveal a variant (A) when compared to a reference genome

• Note that the reference genome can change and a SNP would not be called if the reference genome chosen had an A instead of a T at this position

Reads from a single isolate
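The dependence on the reference can be shown with a toy variant caller for a single position. The read bases and the 90 % support threshold are hypothetical values picked for the example; real variant callers also weigh base quality, mapping quality and depth.

```python
from collections import Counter
from typing import Optional, Sequence


def call_variant(reference_base: str, read_bases: Sequence[str],
                 min_fraction: float = 0.9) -> Optional[str]:
    """Return the variant base at one position, or None if no SNP is called.

    A SNP is called when the most common base in the reads differs from the
    reference and is supported by at least min_fraction of the reads.
    """
    if not read_bases:
        return None  # no reads cover this position
    base, support = Counter(read_bases).most_common(1)[0]
    if base != reference_base and support / len(read_bases) >= min_fraction:
        return base
    return None


reads_at_position = ["A", "A", "A", "A", "A"]  # hypothetical pileup

print(call_variant("T", reads_at_position))  # reference has T: variant A
print(call_variant("A", reads_at_position))  # reference has A: nothing called
```

The same reads yield a SNP against one reference and none against another, which is why the reference used should always be reported alongside the variants.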

Note:

• These slides are for teaching purposes only and have been collected from images that I have made, from the CDC and FDA, and from around the web.

• The findings and conclusions in this report are those of the author and do not necessarily represent the official position of the Food and Drug Administration.
