quality control of sequencing data - wordpress.com · 2017. 3. 28. · quality control of...

22
Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY [email protected] // Twitter:@ SahaSurya BTI Plant Bioinformatics Course 2017 3/27/2017 BTI Plant Bioinformatics Course 2017 1

Upload: others

Post on 11-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Quality Control of Sequencing Data

Surya Saha

Sol Genomics Network (SGN)

Boyce Thompson Institute, Ithaca, [email protected] // Twitter:@SahaSurya

BTI Plant Bioinformatics Course 2017

3/27/2017 BTI Plant Bioinformatics Course 2017 1

Page 2: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

1. Exploration

2. Evaluation

3. Preprocessing

Quality Control of NGS Data

3/27/2017 BTI Plant Bioinformatics Course 2017 2

Page 3: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Goal:

Use shell commands and installed tools to explore using the

file from TAIR

Data: /home/bioinfo/Data/TAIR10_pep.fasta.gz

Tools:

(fasta toolkit)

Exploration

3/27/2017 BTI Plant Bioinformatics Course 2017 3

Page 4: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Exercise 1:

1. Type fasta and TAB key to find all the commands starting

with fasta

2. Uncompress and find the length of sequences in the file: TAIR10_pep.fasta.gz

3. Split the above file into 3 fasta files

Exploration

3/27/2017 BTI Plant Bioinformatics Course 2017 4

Tip: Use gzip and ‘fastalength’

Tip: Use ‘fastasplit’

Page 5: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Goal:

Learn the use of read evaluation programs keeping

attention in relevant parameters such as quality score and

length distributions and unusual reads duplications.

Data: Illumina data for two tomato ripening stages

/home/bioinfo/Data/Slch04_demo.tar.gz

Tools: tar -zxvf OR –xvf (command line, untar and unzip the files)

head (command line, take a quick look of the files)

mv (command line, change the name of the files)

grep (command line, find/count patterns in files)

FASTX toolkit (command line, process fasta/fastq)

FastQC (gui, to calculate several stats for each file)

Evaluation

3/27/2017 BTI Plant Bioinformatics Course 2017 5

Page 6: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Exercise 2:

1. Uncompress the file:

/home/bioinfo/Data/Slch04_demo.tar.gz

2. Raw data will be found in 4 files. Print the first 10 lines for

the files: SRR404331_ch4.fq, SRR404333_ch4.fq,

SRR404334_ch4.fq and SRR404336_ch4.fq.

Question 2.1: Do these files have fastq format?

Evaluation

3/27/2017 BTI Plant Bioinformatics Course 2017 6

Page 7: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Exercise 2:

3. Count number of sequences in each fastq file using

commands you learnt last time.

4. Convert the fastq files to fasta.

5. Explore other tools in the FASTX toolkit.

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

6. Now count the number of sequences in fasta file and see

if the number of sequences has changed.

Evaluation

Tip: Use ‘grep’

Tip: Use ‘fastq_to_fasta -h’ to see helpUse Google if you are stuck

3/27/2017 BTI Plant Bioinformatics Course 2017 7

Page 8: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Sequence Quality

Good Illumina dataset

3/27/2017 BTI Plant Bioinformatics Course 2017 8

Page 9: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Sequence Quality

3/27/2017 BTI Plant Bioinformatics Course 2017 9

Good Illumina dataset

Poor Illumina dataset

Page 10: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Sequence Quality

3/27/2017 BTI Plant Bioinformatics Course 2017 10

454

Pacific Biosciences

Page 11: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Sequence Content

Good Illumina dataset

3/27/2017 BTI Plant Bioinformatics Course 2017 11

Page 12: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Sequence Content

3/27/2017 BTI Plant Bioinformatics Course 2017 12

Good Illumina dataset

Poor Illumina dataset

Page 13: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Duplication

Good Illumina dataset

3/27/2017 BTI Plant Bioinformatics Course 2017 13

Page 14: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Duplication

3/27/2017 BTI Plant Bioinformatics Course 2017 14

Good Illumina dataset

Poor Illumina dataset

Page 15: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Overrepresented Sequences

Good Illumina dataset

3/27/2017 BTI Plant Bioinformatics Course 2017 15

Page 16: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Overrepresented Sequences

3/27/2017 BTI Plant Bioinformatics Course 2017 16

Good Illumina dataset

Poor Illumina dataset

Page 17: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Kmer content

Good Illumina dataset

3/27/2017 BTI Plant Bioinformatics Course 2017 17

Page 18: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Evaluation: Kmer content

3/27/2017 BTI Plant Bioinformatics Course 2017 18

Good Illumina dataset

Poor Illumina dataset

Page 19: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Question 3.2: How many sequences there are per file in FastQC?

Question 3.3: Which is the length range for these reads?

Question 3.4: Which is the quality score range for these reads? Which

one looks best quality-wise?

Question 3.5: Do these datasets have read overrepresentation?

Question 3.6: Looking into the kmer content, do you think that the samples

have an adaptor?

EvaluationExercise 3:

1.Type ‘fastqc’ to start the FastQC program. Load the four

fastq sequence files in the program.

3/27/2017 BTI Plant Bioinformatics Course 2017 19

Page 20: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Goal:

Trim the low quality ends of the reads and remove

the short reads.

Data: Illumina data for two tomato ripening stages

/home/bioinfo/Data/Slch04_demo.tar.gz

Tools: fastq-mcf (command line tool to process reads)

FastQC (gui, to calculate several stats for each file)

Preprocessing

3/27/2017 BTI Plant Bioinformatics Course 2017 20

Page 21: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Exercise 4:

• Download the file: adapters1.fa from ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/a

dapters1.fa

• Run the read processing program over each of the datasets

using

• Min. qscore of 30

• Min. length of 40 bp

• Type ‘fastqc’ to start the FastQC program. Load the four

new fastq sequence files. Compare the results with the

previous datasets.

Preprocessing

Tip: Use ‘fastq-mcf -h’ to see help

3/27/2017 BTI Plant Bioinformatics Course 2017 21

Page 22: Quality Control of Sequencing Data - WordPress.com · 2017. 3. 28. · Quality Control of Sequencing Data Surya Saha Sol Genomics Network (SGN) Boyce Thompson Institute, Ithaca, NY

Bonus Material

Here are some simple exercises using the file from TAIR /home/bioinfo/Data/TAIR10_pep.fasta.gz

• Number of unknown proteins per chromosome• Number of proteins that are not unknown proteins

per chromosome and are on the forward strand• Number of proteins which are located within the

first 5MB of the chromosome (awk)• All genes of length greater than 2000bp (awk)• All proteins of length greater than 200aa (awk)

3/27/2017 BTI Plant Bioinformatics Course 2017 22