overview analyzing sequence data

Overview Analyzing Sequence Data Eric L. Stevens, Ph.D. International Policy Analyst International Affairs Staff [email protected] Center for Food Safety and Applied Nutrition


Post on 02-Apr-2022


Page 1: Overview Analyzing Sequence Data

Overview Analyzing Sequence Data

Eric L. Stevens, Ph.D.

International Policy Analyst

International Affairs Staff

[email protected]

Center for Food Safety and Applied Nutrition

Page 2: Overview Analyzing Sequence Data

Overview: Genome Assembly

• Understand the process of going from raw sequence reads (FASTQ or FASTA) to a genome assembly or alignment (alignments are stored in SAM or BAM format)

• Have a conceptual understanding of the difference between alignment and assembly, however you hear them described (i.e. with or without using a reference sequence as a guide)

• Have a conceptual understanding of how reads are aligned/assembled and the pipelines that exist

• Understand how genome sequences are QC’ed and what the SAM and BAM formats look like

Page 3: Overview Analyzing Sequence Data

Overview: Variant Detection

• Understand the process from taking aligned reads (or raw reads) to identifying variants

• Understand the rationale behind the different options (k-mer, reference-free, reference-based)

• Have a conceptual understanding of the major concepts underlying variant identification

– Important for explaining how robust the data and analysis are!

• Understand how this genetic variation will set us up for phylogenetic tree construction

Page 4: Overview Analyzing Sequence Data

What Is a Computer?

• Computer

– Does computations and carries out logical decisions

– Faster than humans

• Computer program

– Set of instructions that a computer uses to do something

• Hardware

– Physical components of a computer system

• Software

– Computer programs that run on a computer

Page 5: Overview Analyzing Sequence Data

Other Options for Data Analysis

Page 6: Overview Analyzing Sequence Data

FastQ

• A more compact format to store sequence and qualities

• Normally on 4 lines:

– “@” followed by the sequence ID

– Sequence

– “+”

– The quality score

• Quality score:

– ASCII encoding of Phred scores

– Sanger has one scale, Illumina has 3 different (…)

• Can be gzip-ped and used as such by some programs

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
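The four-line layout above can be read with a few lines of Python. This is only a sketch: the names `FastqRecord`, `read_fastq` and `decode_phred33` are made up for illustration, and the quality decoding assumes the Sanger scale (ASCII offset 33). Other scales use a different offset.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four non-empty lines: @ID, sequence, +, quality."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input should contain 4 lines per read")
    for i in range(0, len(cleaned), 4):
        header, sequence, plus, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at non-empty line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def decode_phred33(quality: str) -> list[int]:
    """Convert quality characters to Phred scores, assuming an offset of 33."""
    return [ord(char) - 33 for char in quality]


# The example record from the slide.
record_lines = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
for record in read_fastq(record_lines):
    scores = decode_phred33(record.quality)
    print(record.seq_id, len(record.sequence), min(scores), max(scores))
```

For a gzip-ped file, the same function could be fed from `gzip.open(path, "rt")` in the standard library instead of a list of lines.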

Page 7: Overview Analyzing Sequence Data

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Field     Meaning
EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read fails filter (read is bad), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence
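As a rough illustration, the identifier above can be split into its fields by cutting at the space and then at the colons. The function name `parse_illumina_header` and the dictionary keys are hypothetical, and the sketch assumes every header has exactly this layout.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina-style FASTQ identifier into its named fields."""
    location, read_info = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    pair_member, fails_filter, control_bits, index_seq = read_info.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x_pos),
        "y": int(y_pos),
        "pair_member": int(pair_member),
        "fails_filter": fails_filter == "Y",  # Y means the read is bad
        "control_bits": int(control_bits),
        "index": index_seq,
    }


fields = parse_illumina_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
print(fields["instrument"], fields["lane"], fields["fails_filter"])
```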

Page 8: Overview Analyzing Sequence Data

Phred Score

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
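The table follows from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. A small sketch (function names are illustrative) that regenerates the rows:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):.3f} %")
```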

Page 9: Overview Analyzing Sequence Data

Fasta

• Most basic file format to represent nucleotide or amino-acid sequences

• Each sequence is represented by:

– A single description line (shouldn’t exceed 80 characters):

• Starts with “>”

• Followed by the sequence ID, and a space, then

• More information (description)

– The sequence, over one or several lines (the number of characters per line is generally 70 or 80, but it doesn’t matter)
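A minimal sketch of reading this layout: everything after “>” up to the first space is taken as the ID, the rest of that line as the description, and the following lines are joined into one sequence. The function name `read_fasta` and the example records are made up.

```python
from typing import Iterable


def read_fasta(lines: Iterable[str]) -> dict:
    """Return {sequence_id: (description, sequence)} from FASTA-formatted lines."""
    parts = {}
    seq_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ID is the text up to the first space; the rest is the description.
            seq_id, _, description = line[1:].partition(" ")
            parts[seq_id] = (description, [])
        elif seq_id is not None:
            parts[seq_id][1].append(line)
    return {key: (desc, "".join(chunks)) for key, (desc, chunks) in parts.items()}


example = [
    ">isolate_1 hypothetical example record",
    "AGGAATGT",
    "TGGCAGTCG",
    ">isolate_2",
    "AGGAATGTTGGCAGTCC",
]
for name, (description, sequence) in read_fasta(example).items():
    print(name, description or "(no description)", len(sequence))
```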

Page 10: Overview Analyzing Sequence Data

Data Preprocessing

• The bases of sequencing reads are not always correct

• Different sequencing platforms have different errors

– Therefore, you must know how the sequencing is performed

• There are several common steps

– Remove bad sequences that we have low confidence in

– Remove leftover adaptor sequences

– Prune the ends of the reads

– k-mer correction …
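Two of these steps, pruning the read ends and dropping low-confidence reads, can be sketched as follows. This is a deliberately simple version with made-up function names and thresholds (Phred 20, offset 33); real trimming tools generally use more careful rules.

```python
def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33):
    """Remove bases from the 3' end until one reaches the quality threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (0.0 for an empty string)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def keep_read(sequence: str, quality: str, min_len: int = 4, min_mean_q: float = 20.0) -> bool:
    """Decide whether a trimmed read is still long and confident enough to keep."""
    return len(sequence) >= min_len and mean_quality(quality) >= min_mean_q


seq, qual = trim_3prime("ACGTAC", "IIII#!")
print(seq, qual, keep_read(seq, qual))
```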

Page 11: Overview Analyzing Sequence Data

More could go wrong?

• GC Bias

• Homopolymers

• Low quality bases (especially on the 3’ end)

• Clonal duplicates

• Contamination

• Sequencing artifacts…
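A few of these problems can be screened for with very simple counts. The sketch below is illustrative only: the function names are invented, and counting exact-sequence duplicates is just a crude stand-in for detecting clonal duplicates.

```python
from collections import Counter


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def longest_homopolymer(sequence: str):
    """Return (base, length) of the longest run of one repeated base."""
    best_base, best_len = "", 0
    run_base, run_len = "", 0
    for base in sequence.upper():
        if base == run_base:
            run_len += 1
        else:
            run_base, run_len = base, 1
        if run_len > best_len:
            best_base, best_len = run_base, run_len
    return best_base, best_len


def count_exact_duplicates(reads) -> int:
    """Number of reads that are exact copies of a read already seen."""
    return sum(count - 1 for count in Counter(reads).values())


print(gc_fraction("GGCCAATT"))
print(longest_homopolymer("ACGGGGTTA"))
print(count_exact_duplicates(["AC", "AC", "GT", "AC"]))
```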

Page 12: Overview Analyzing Sequence Data
Page 13: Overview Analyzing Sequence Data
Page 14: Overview Analyzing Sequence Data

Current WGS Methodology (Simplified)

- Isolate bacterial genome and break it up into many small fragments

- Sequence the fragments using Short or Long-Read Sequencing Technology

- Assemble the thousands to millions of sequenced reads back into the entire genomic sequence of the isolate

- Find genetic differences between isolates (SNPs)

- Use SNPs to create a phylogenetic tree to inform genetic relatedness of foodborne isolates

Page 15: Overview Analyzing Sequence Data

Isolate and copy bacterial genome

Overview: Current WGS methodology involves growing the bacterium to be sequenced in pure culture (from one bacterial colony) so that there are many copies of the same bacterial genome. The bacterial DNA is then extracted.

Page 16: Overview Analyzing Sequence Data

Isolate and copy bacterial genome

Bacterial Genome

• 1 circular genome

• 3-5 million nucleotides (A, G, C, T)

• May have extra DNA (plasmids)

Page 17: Overview Analyzing Sequence Data

Break up multiple copies of same genome into different pieces

Overview: Current WGS methodology involves cutting multiple copies of the same genome at different positions to generate shorter fragments of similar length (which you can size-select for). Because the cuts fall in different places in each copy, the fragments overlap when the genome is reassembled (note: this is key to generating a whole-genome sequence).

Page 18: Overview Analyzing Sequence Data

Genome copy 1

Genome copy 2

Genome copy 3

• Depending on sequencing chemistry you will have thousands to millions of reads

• Short read = up to ~800bp (can be millions of reads) and involves Paired-End Sequencing

• Long read = up to 40kb in length (thousands of reads) and fragmentation is different

Page 19: Overview Analyzing Sequence Data

Break up multiple copies of same genome into different pieces

Page 20: Overview Analyzing Sequence Data

[Figure: Genome Copy 1 and Genome Copy 2, each broken into fragments numbered 1, 2 and 3]

Page 21: Overview Analyzing Sequence Data

How Assembly works: (simplified)

Overview: Find overlaps (same string of nucleotides) between reads with different breakpoints.

[Figure: fragments 1, 2 and 3 from the genome copies lined up by their overlaps. Bold = reassembled genome sequence]
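The overlap idea can be sketched as a toy greedy assembler: repeatedly find the two pieces whose end and start share the longest identical string, and merge them. This is a teaching sketch with made-up function names, assuming error-free reads from one strand. It is not how any particular assembler is implemented.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap: int = 3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: what is left are separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Four error-free reads cut from a short example sequence at different breakpoints.
reads = ["GAATGTTGGCAGTC", "AGGAATGTTGGCAG", "AATGTTGGCAGTCG", "GGAATGTTGGCAGT"]
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_overlap` bases stay as separate contigs, which is what a gap in an assembly looks like.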


Page 22: Overview Analyzing Sequence Data

Condensing multiple reads into a draft or complete genome

AGGATTGTTGGCAG
 GGAATGTTGGCAGT
  GAATGTTGGCAGTC
   AATGTTGGCAGTCG

• 4 sequenced reads that have been aligned together from the SAME isolate

• Red letter indicates a possible sequencing error (here the first T in the first read), which is resolved by looking at the nucleotides from the other reads at that position

AGGAATGTTGGCAGTCG

• Genomic sequence of the isolate that takes into account the other sequencing reads
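Resolving the error by looking at the other reads amounts to a per-position majority vote. The sketch below assumes the four reads on the slide are staggered by one base each (offsets 0 to 3), which is how they line up against the final sequence. The function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(placed_reads) -> str:
    """Majority base at each position, given (offset, read) pairs."""
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Positions covered by no read are reported as N.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


placed = [
    (0, "AGGATTGTTGGCAG"),  # carries the suspected error (T instead of A)
    (1, "GGAATGTTGGCAGT"),
    (2, "GAATGTTGGCAGTC"),
    (3, "AATGTTGGCAGTCG"),
]
print(consensus(placed))
```

With three reads saying A and one saying T at the disputed position, the vote returns A. Ties would be broken arbitrarily here, so real pipelines also weigh the base qualities.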


Page 23: Overview Analyzing Sequence Data

Assembly without a Reference

• Requires mate-pair (read-pair) sequence information

– Ideally a combination of the two with different insert sizes

• Analysis can be complicated

– Assembly, scaffolding, finishing

– Could require some manual steps

• Getting easier all the time

– Increased read length

– Better algorithms

• Necessary if you do not want to be biased by a reference genome (a reference-guided approach will not assemble novel genomic loci)

Page 24: Overview Analyzing Sequence Data

Assembly with a Reference

• Requires a sequenced genome that is very similar (usually the same species or closer)

• Sequencing reads are then aligned to the reference genome

• Can help guide where the reads go

– But can also mislead

• Will not align reads that do not match the reference!

– Will miss plasmids, novel genes, AMR, etc.

• Can be faster and good for a quick check to make sure you have sequenced what you think you did

Page 25: Overview Analyzing Sequence Data
Page 26: Overview Analyzing Sequence Data

Sequence Alignment: the beginnings of assembling a genome

• When two sequences are compared and their similarity is statistically significant, it is highly likely that the two sequences are evolutionarily related

• Expect differences

– SNPs (e.g. AGTC -> AGTT)

– Insertions or Deletions (indels)

• AGTC

• A_ TC

• AGGTC

• A_ _C

Page 27: Overview Analyzing Sequence Data

High-level overview

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..

Page 28: Overview Analyzing Sequence Data

Coverage        Input                       Output
High coverage   many pieces to assemble     a few contigs, a few gaps
Low coverage    a few pieces to assemble    many contigs, many gaps
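Coverage is usually estimated as (number of reads × read length) / genome size. The sketch below uses made-up numbers: a 5-million-base genome, in line with the bacterial genome sizes mentioned earlier, and 150-base reads, which is an assumption.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


genome_size = 5_000_000  # hypothetical bacterial genome
read_length = 150        # hypothetical short-read length

print(expected_coverage(1_000_000, read_length, genome_size))
print(reads_needed(30, read_length, genome_size))
```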

Page 29: Overview Analyzing Sequence Data

Practical Example

• Isolate Genome = “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness”

• Ambiguously placed reads:

• it was the

• times it w

• was the ag

• the age of

• Unambiguously placed reads:

• the best o

• the worst

• age of foo

• f wisdom i

• Reconstructed Genome = “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness”
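Whether a read is ambiguously placed comes down to how many places in the genome it matches. A small sketch using the sentence above as the “genome” (the function name `find_positions` is illustrative):

```python
GENOME = (
    "it was the best of times it was the worst of times "
    "it was the age of wisdom it was the age of foolishness"
)


def find_positions(genome: str, read: str):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


reads = [
    "it was the", "times it w", "was the ag", "the age of",
    "the best o", "the worst ", "age of foo", "f wisdom i",
]
for read in reads:
    hits = find_positions(GENOME, read)
    label = "unambiguous" if len(hits) == 1 else "ambiguous"
    print(f"{read!r}: {len(hits)} placement(s), {label}")
```

Reads that fall entirely inside a repeated phrase match more than once, which is the text analogue of reads from repetitive regions of a genome.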

Page 30: Overview Analyzing Sequence Data

• Isolate Genome = “it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness”

Practical Example

Page 31: Overview Analyzing Sequence Data
Page 32: Overview Analyzing Sequence Data

Sequence Alignment Map (SAM)

• Standard format for reporting short read alignment data

• BAM is the compressed binary version

[Figure: example SAM file with its two parts labeled: Header and Alignment info]

http://samtools.sourceforge.net/
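To make the two parts concrete: header lines start with “@”, and each alignment line has 11 mandatory tab-separated fields. The sketch below splits one alignment line into those fields. The example line is invented for illustration, and in practice a dedicated library would normally be used instead of hand-parsing.

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line: str):
    """Return a dict for an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)   # flag bit 4: read unmapped
    record["is_reverse"] = bool(record["FLAG"] & 0x10)   # flag bit 16: reverse strand
    record["tags"] = columns[11:]                         # optional fields, if any
    return record


# Hypothetical alignment line, not taken from a real file.
example = "\t".join(
    ["read1", "16", "chr1", "100", "60", "10M", "*", "0", "0",
     "ACGTACGTAC", "IIIIIIIIII", "NM:i:0"]
)
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```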

Page 33: Overview Analyzing Sequence Data


The Data: BAM

Page 34: Overview Analyzing Sequence Data

Assessing Assembly Quality

Page 35: Overview Analyzing Sequence Data
Page 36: Overview Analyzing Sequence Data

Comparing a single isolate’s reads against a reference genome

Overview: Identify the genetic differences (called Single Nucleotide Polymorphisms, or SNPs) between genomes

Page 37: Overview Analyzing Sequence Data

Comparing the sequences of different bacterial isolates by their genetic differences

• In this example, the first and fourth isolates each differ genetically from the second and third isolates, which are identical to each other

• Based on these data you could say that isolate 2 and isolate 3 are more closely related genetically because they have the same nucleotide at every position

Isolate 1   TGGAATGTTGGCAGTCG
Isolate 2   AGGAATGTTGGCAGTCG
Isolate 3   AGGAATGTTGGCAGTCG
Isolate 4   AGGAATGTTGGCAGTCC
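The comparison can be put into numbers by counting the positions at which each pair of isolates differs. The sketch assumes the sequence string on the slide splits into four 17-base sequences, one per isolate. The function names are illustrative.

```python
from itertools import combinations

ISOLATES = {
    "Isolate 1": "TGGAATGTTGGCAGTCG",
    "Isolate 2": "AGGAATGTTGGCAGTCG",
    "Isolate 3": "AGGAATGTTGGCAGTCG",
    "Isolate 4": "AGGAATGTTGGCAGTCC",
}


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def distance_table(sequences: dict) -> dict:
    """Pairwise SNP distances keyed by (name, name)."""
    return {
        (name_a, name_b): snp_distance(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sequences, 2)
    }


for pair, distance in distance_table(ISOLATES).items():
    print(pair, distance)
```

A table of pairwise distances like this is one common starting point for building a phylogenetic tree.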


Page 38: Overview Analyzing Sequence Data
Page 39: Overview Analyzing Sequence Data

Reference Genome

[Figure: reads from a single isolate aligned beneath the reference genome]

• Sequencing reads that reveal a variant (A) when compared to a reference genome

• Note that the choice of reference genome matters: a SNP would not be called if the reference genome chosen had an A instead of a T at this position
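A toy version of reference-based SNP calling: pile the aligned reads up over the reference, and report positions where most reads agree on a base that differs from the reference. The sequences, read placements and thresholds (`min_depth`, `min_fraction`) are all made up for illustration. Real variant callers also model base and mapping qualities.

```python
from collections import Counter


def call_snps(reference: str, placed_reads, min_depth: int = 2, min_fraction: float = 0.8):
    """Return (position, ref_base, alt_base, depth) for well-supported differences.

    placed_reads is a list of (offset, read) pairs; positions are 1-based.
    """
    pileup = [Counter() for _ in reference]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            if 0 <= offset + i < len(reference):
                pileup[offset + i][base] += 1
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything at this position
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            calls.append((pos + 1, reference[pos], base, depth))
    return calls


# Hypothetical reads from one isolate, all carrying an A at position 5.
reads = [
    (0, "AGGAATGTTG"),
    (2, "GAATGTTGGC"),
    (3, "AATGTTGGCAGTCG"),
    (4, "ATGTTGGCAG"),
]
reference_with_t = "AGGATTGTTGGCAGTCG"
reference_with_a = "AGGAATGTTGGCAGTCG"

print(call_snps(reference_with_t, reads))  # SNP reported at position 5
print(call_snps(reference_with_a, reads))  # same reads, no SNP reported
```

Running the same reads against the two references shows the point of the note: the variant call depends on which reference is chosen.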

Page 40: Overview Analyzing Sequence Data
Page 41: Overview Analyzing Sequence Data
Page 42: Overview Analyzing Sequence Data


Page 43: Overview Analyzing Sequence Data


Page 44: Overview Analyzing Sequence Data


Page 45: Overview Analyzing Sequence Data


Page 46: Overview Analyzing Sequence Data

Note:

• These slides are for teaching purposes only and have been collected from images that I have made, from the CDC and FDA, and from around the web.

• The findings and conclusions in this report are those of the author and do not necessarily represent the official position of the Food and Drug Administration.

Page 47: Overview Analyzing Sequence Data