How to do differential expression analysis from FASTQ format data on HPC
Written by: yiran.zhang
Pipeline
• ShortRead: Quality assessment, filtering and trimming
• FastQC: Quality control
• Btrim: Filtering and trimming
• HISAT2: Align the reads to the reference genome
• HTSeq: Count reads per feature with htseq-count
• DESeq: Differential gene expression analysis based on the negative binomial distribution
ShortRead
• Introduction:
• The ShortRead package provides functionality for working with FASTQ files from high-throughput sequence analysis.
• Environment required:
• R
• Function:
• Quality assessment
• Filtering and trimming
• Write your own filter function to guarantee that the retained reads contain no ‘N’ base calls.
> qaSummary[["baseCalls"]]
                    A        C        G        T    N
gu2_read1.fq 21685857 28412307 28130219 21767568 4049
gu2_read2.fq 21729895 28722063 27816884 21730591  567
gu3_read1.fq 21723444 28346939 28174527 21751407 3683
gu3_read2.fq 21734048 28697947 27824570 21742736  699
ye1_read1.fq 21675483 28443112 28095517 21781660 4228
ye1_read2.fq 21702486 28762839 27785294 21748745  636
ye3_read1.fq 21795076 28360347 27968237 21872354 3986
ye3_read2.fq 21807964 28695030 27618484 21877864  658
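The slides do not show the filter function itself, which was presumably written in R with ShortRead. The sketch below only illustrates the same idea in plain Python, assuming paired-end files (as the read1/read2 names suggest) whose records are in the same order. The helper names `read_fastq` and `drop_pairs_with_n` are hypothetical. A pair is kept only when neither mate contains an 'N', so the two output files stay in sync.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

# One FASTQ record: header, sequence, separator ('+'), quality string.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an iterable of lines (e.g. an open file)."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(chunk)


def drop_pairs_with_n(
    reads1: Iterable[Record], reads2: Iterable[Record]
) -> Iterator[Tuple[Record, Record]]:
    """Keep a read pair only if neither mate contains an 'N' base call."""
    for r1, r2 in zip(reads1, reads2):
        if "N" not in r1[1].upper() and "N" not in r2[1].upper():
            yield r1, r2
```

With real data, `read_fastq` would be given two open file handles (for example gu2_read1.fq and gu2_read2.fq) and the surviving pairs written to two new files.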
FastQC: Quality control
• Introduction:
• FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.
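A FastQC report can also be checked programmatically. The sketch below assumes the `summary.txt` file that FastQC writes into its output archive, with three tab-separated columns (status, module name, file name). The function name `flagged_modules` is hypothetical, and this is only one way to scan the report.

```python
from typing import Dict, List


def flagged_modules(summary_text: str) -> Dict[str, List[str]]:
    """Collect the FastQC modules whose status is WARN or FAIL."""
    flagged: Dict[str, List[str]] = {"WARN": [], "FAIL": []}
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: status <TAB> module <TAB> file name
        status, module, _filename = line.split("\t")
        if status in flagged:
            flagged[status].append(module)
    return flagged
```

Running it on the reports from before and after trimming gives a quick comparison of which checks improved.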
Btrim
• ShortRead drops the reads containing ‘N’, but low-quality bases still appear to remain, so we filter and trim the ShortRead output with Btrim.
• Introduction:
• Btrim is a fast and lightweight tool to trim adapters and low quality regions in reads from ultra high-throughput next-generation sequencing machines.
• Note:
• Run FastQC again to get a new quality control report and check whether the filtered and trimmed reads are reasonable. Just edit your previous fastqc.pbs command and submit it.
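To make "trimming low quality regions" concrete, here is a generic sliding-window sketch in Python. It is not Btrim's actual algorithm or interface. It assumes Phred+33 quality encoding, and the window size, quality cutoff and minimum length are illustrative values chosen for this example. The read is cut at the first window whose mean quality falls below the cutoff, and dropped if what remains is too short.

```python
from typing import List, Optional, Tuple


def phred33(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def trim_3prime(
    seq: str,
    qual: str,
    window: int = 5,       # illustrative window size
    cutoff: float = 15.0,  # illustrative mean-quality cutoff
    min_len: int = 25,     # illustrative minimum read length
) -> Optional[Tuple[str, str]]:
    """Cut the read at the first low-quality window; return None if too short."""
    scores = phred33(qual)
    keep = len(seq)
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < cutoff:
            keep = start  # everything from this window onward is discarded
            break
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]
```

For paired-end data, a read dropped here would also require dropping its mate to keep the two files in sync.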
HISAT2
• Introduction:
• HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as against a single reference genome).
• Advantage: Highly efficient
• Note:
• It creates a SAM file that can be used directly as input to htseq-count.
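The original command is not shown in the slides, so the following Python sketch only assembles a plausible paired-end HISAT2 call. The options used (`-x` index prefix, `-1`/`-2` mate files, `-S` SAM output, `-p` threads) are standard HISAT2 options. The index prefix and the file names are hypothetical, and the index is assumed to have been built beforehand with `hisat2-build`.

```python
import shlex
from typing import List


def hisat2_command(
    index_prefix: str, reads1: str, reads2: str, sam_out: str, threads: int = 1
) -> List[str]:
    """Build the argument list for a paired-end HISAT2 alignment."""
    return [
        "hisat2",
        "-p", str(threads),
        "-x", index_prefix,  # prefix of the index built with hisat2-build
        "-1", reads1,
        "-2", reads2,
        "-S", sam_out,       # SAM file later passed to htseq-count
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the command so it can be pasted into a PBS script.
    cmd = hisat2_command(
        "genome_index", "gu2_read1.trim.fq", "gu2_read2.trim.fq", "gu2.sam", threads=8
    )
    print(shlex.join(cmd))
```

On a cluster, the printed line would go into the job script, or the list could be passed to `subprocess.run(cmd, check=True)`.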
HTSeq
• Introduction:
• HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
• Usage:
• htseq-count [options] <alignment_file> <gff_file>
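htseq-count writes one tab-separated "feature, count" line per feature. Before differential expression analysis, the per-sample outputs have to be combined into one count matrix. The Python sketch below shows one way to do that. It assumes the special counters (such as `__no_feature` and `__ambiguous`) carry the double-underscore prefix used by recent HTSeq versions. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_htseq_counts(text: str) -> Dict[str, int]:
    """Parse htseq-count output, skipping the special '__' counters."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous
        counts[feature] = int(value)
    return counts


def count_matrix(
    per_sample: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genes, samples, rows) with one row of counts per gene."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows
```

The resulting matrix (genes as rows, samples as columns) has the shape that count-based tools such as DESeq expect as input.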
DESeq
• Introduction:
• Differential gene expression analysis based on the negative binomial distribution.
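Why a negative binomial rather than a Poisson model? Read counts from biological replicates usually vary more than a Poisson model allows. The small Python sketch below uses one common parametrization of the negative binomial, variance = mean + dispersion × mean², to show how the extra term grows with expression level. The dispersion value is made up for illustration, and this is not DESeq's estimation procedure.

```python
def nb_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mean + dispersion * mean ** 2


if __name__ == "__main__":
    dispersion = 0.1  # illustrative value, not estimated from data
    for mean in (10.0, 100.0, 1000.0):
        # A Poisson model would give variance == mean.
        print(mean, mean, nb_variance(mean, dispersion))
```

With a dispersion of zero the formula reduces to the Poisson case, so the dispersion directly measures the variability beyond counting noise.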
Further work
• Connect the steps into a single pipeline so the whole analysis runs with fewer commands and steps.
• Adapt the programs to our system to handle ambiguously mapped reads.
Thank you