Comparing Methods for Identifying Transcription Factor Target Genes

Software Praktikum (software practical course), 1.2.2013

Alena van Bömmel (R 3.3.73), Matthew Huska (R 3.3.18)

Max Planck Institute for Molecular Genetics (Max-Planck-Institut für molekulare Genetik)




Transcriptional Regulation

TF not bound = no gene expression

TF bound = gene expression


Problem: There are many genes and many TFs. How do we identify the targets of a TF?


Methods for Identifying TF Target Genes

Microarray

PWM Genome Scan

ChIP-seq


PWM Genome Scan

• Purely computational method
• Input:
  o position weight matrix (PWM) for your TF
  o genomic region(s) of interest
• Pros:
  o No need to do wet lab experiments
• Cons:
  o Many false positives
  o Not able to take biological conditions into account

[Figure label: score threshold]
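A minimal sketch of a hit-based scan with a score threshold, to make the idea concrete. Everything here is an assumption for illustration: the toy count matrix, the pseudocount, the uniform background of 0.25 and the threshold value are hypothetical, and only the forward strand is scanned. Real tools also scan the reverse strand and use calibrated thresholds.

```python
import math

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (position, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-bp motif with consensus CAGT (not a real TF matrix).
toy_counts = [
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]
toy_pwm = pwm_from_counts(toy_counts)
hits = scan("TTCAGTAACAGTTT", toy_pwm, threshold=4.0)
for position, score in hits:
    print(position, round(score, 2))
```

Lowering the threshold reports more windows, which is where the many false positives of this method come from.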

PWM genome scan

1) Download the PWMs of your TF of interest from the database (they might include >1 motif)

2) Define the sequences to analyze (promoter sequences)

3) Run the PWM genome scan (hit-based method or affinity prediction method)

4) Rank the genomic sequences by the affinity signal

Suggested Reading:

• Roider et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007).

• Thomas-Chollier et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011).

PWM-PSCM

TRAP

1) Convert the PSSM (position specific scoring matrix) to a PSEM (position specific energy matrix)

2) Scan the sequences of interest with TRAP

3) Results in 1 score per sequence = binding affinity

4) Doesn’t separate the exact TF binding sites (easier for ranking)

5) Sequences must have the same length!

ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release

TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
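The steps above can be illustrated with a simplified affinity sketch. This is not the TRAP implementation: the energy definition, the scaling parameter `lam`, the constant `r0`, the toy matrix and the toy promoters are all assumptions chosen for illustration, and only the forward strand is used. It shows the two properties the slide stresses: one score per sequence, and ranking of equal-length sequences by that score.

```python
import math

def energy_matrix(counts, pseudocount=1.0, lam=0.7):
    """PSEM-like matrix: mismatch energy relative to the best base at each position."""
    psem = []
    for column in counts:
        smoothed = {b: column.get(b, 0) + pseudocount for b in "ACGT"}
        best = max(smoothed.values())
        psem.append({b: math.log(best / smoothed[b]) / lam for b in "ACGT"})
    return psem

def sequence_affinity(sequence, psem, r0=1.0):
    """Sum the binding probabilities of all forward-strand windows into one score."""
    width = len(psem)
    total = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue
        energy = sum(psem[i][b] for i, b in enumerate(window))
        weight = r0 * math.exp(-energy)
        total += weight / (1.0 + weight)
    return total

# Hypothetical count matrix (consensus CAGT) and hypothetical equal-length promoters.
toy_counts = [
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]
promoters = {"geneA": "TTCAGTTT", "geneB": "TTTTTTTT"}

# All sequences must have the same length, otherwise the summed scores are not comparable.
assert len({len(seq) for seq in promoters.values()}) == 1

psem = energy_matrix(toy_counts)
ranked = sorted(
    ((name, sequence_affinity(seq, psem)) for name, seq in promoters.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, affinity in ranked:
    print(name, round(affinity, 4))
```

Because every window contributes, longer sequences accumulate more affinity by chance, which is why the equal-length requirement matters for ranking.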

Matrix-scan

1) Uses the PSSM directly

2) Finds all TFBS which exceed a predefined threshold (e.g. p-value)

3) More complicated to create ranked lists of genomic sequences (more hits in the sequence)

4) Exact location of the binding site reported

matrix-scan http://rsat.ulb.ac.be/

Finding the target genes

• target genes will be the top-ranked genes (promoters)

• which are the top-ranked genes? (top-100,500,1000...?)

• There’s no exact definition of promoters, usually 2000bp upstream, 500bp downstream of the TSS
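As a sketch of the promoter definition above, the following computes a strand-aware window of 2000 bp upstream and 500 bp downstream of the TSS. The coordinate convention (0-based TSS position, half-open interval, TSS counted as the first downstream base) and the function name are assumptions made for this example.

```python
def promoter_window(tss, strand, upstream=2000, downstream=500):
    """Return a 0-based, half-open (start, end) promoter interval around the TSS.

    "Upstream" depends on the strand: on the minus strand it lies at higher
    genomic coordinates than the TSS.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end  # clip at the chromosome start

print(promoter_window(10000, "+"))
print(promoter_window(10000, "-"))
```

Whatever convention is chosen, it should be applied identically to all genes, so that every promoter has the same length before scoring.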


Microarrays

→ R/Bioconductor (details later)

Microarrays (2)

• Pros:
  o There is a lot of microarray data already available (might not have to generate the data yourself)
  o Inexpensive and not very difficult to perform
  o Computational workflow is well established
• Cons:
  o Cannot distinguish between indirect regulation and direct regulation


ChIP-seq

Map reads to the genome

Call peaks to determine most likely TF binding locations

ChIP-seq (2)

• Pros:
  o Direct measure of genome-wide protein-DNA interaction (*)
• Cons:
  o Don't know whether binding causes changes in gene expression
  o More complicated experimentally and in terms of computational analysis
  o Most expensive
  o Need an antibody against your protein of interest
  o Biases are not as well understood as with microarrays

ChIP-seq analysis

1) Download the reads from given source (experiments and controls)

2) Quality control of the reads and statistics (fastqc)

3) Mapping the reads to the reference genome (bwa/Bowtie)

4) Peak calling (MACS)

5) Visualization of the peaks in a genome browser (genome browser, IGV)

6) Finding the closest genes to the peaks (Bioconductor/ChIPpeakAnno)

[Figure: visualised peaks in a genome browser]

Suggested Reading:

• Bailey et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol (2013).

• Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols (2012).

Sequencing data

Analysis

1) Quality control with fastqc

2) Filtering of reads with adapter sequences

3) Mapping of the reads to the reference genome (bwa or Bowtie)

[Figure: example of a fastq data file]

• raw data = reads, usually a very large file (a few GB)

• format: fastq (ENCODE) or SRA (Sequence Read Archive of NCBI)


Quality control with fastqc

• per base quality
• sequence quality (avg. > 20)
• sequence length
• sequence duplication level (duplication by PCR)
• overrepresented sequences/kmers (adapter sequences)
• produces an HTML report
• manual (read it!)
• software at the MPI: FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc

[Figure: example of per base sequence quality scores]
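To show what the per base quality and the "avg. > 20" checks measure, here is a small sketch that parses FASTQ records and computes both. It is not how fastqc is implemented. It assumes four lines per record and Phred+33 quality encoding (some older Illumina files use Phred+64), and the three reads are made up.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]

def per_base_mean_quality(records):
    """Mean Phred score at each read position across all records."""
    sums, counts = [], []
    for _, _, quality in records:
        for i, score in enumerate(phred_scores(quality)):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += score
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

# Hypothetical reads: 'I' encodes Phred 40, '#' encodes Phred 2.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII##\n@read3\nACGT\n+\n####\n"
records = list(read_fastq(io.StringIO(toy_fastq)))

per_base = per_base_mean_quality(records)
kept = [
    name for name, _, quality in records
    if sum(phred_scores(quality)) / len(quality) > 20
]
print(per_base)
print(kept)
```

A drop in the per-base means towards the end of the reads, as in this toy data, is the pattern that usually motivates read trimming before mapping.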


Mapping with bwa

• mapping the sequencing reads to a reference genome
• manual (read it!)
• map the experiments and the controls

1) reference genome in fasta format (hg19)
2) create an index of the reference file for faster mapping (only if not available)
3) align the reads (specify parameters, e.g. number of mismatches, read trimming, threads used...)
4) generate alignments in the SAM format (different commands for single-end and paired-end reads!)

software and data at the MPI:

BWA = /scratch/ngsvin/bin/executables/bwa

hg19: /scratch/ngsvin/MappingIndices/hg19.fa

bwa index: /scratch/ngsvin/MappingIndices/BWA/hg19
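Steps 3 and 4 can be sketched as command lines assembled in Python. This assumes the `bwa aln` / `bwa samse` / `bwa sampe` workflow with results written to standard output; the read file names, the output prefix and the parameter values are hypothetical, and the commands are only printed, not executed. Check the bwa manual for the options that fit your data.

```python
import shlex

BWA = "/scratch/ngsvin/bin/executables/bwa"
INDEX = "/scratch/ngsvin/MappingIndices/BWA/hg19"

def single_end_commands(reads, prefix, max_mismatches=2, threads=4):
    """Return (argv, stdout_file) pairs for aligning single-end reads."""
    sai = prefix + ".sai"
    return [
        ([BWA, "aln", "-n", str(max_mismatches), "-t", str(threads), INDEX, reads], sai),
        ([BWA, "samse", INDEX, sai, reads], prefix + ".sam"),
    ]

def paired_end_commands(reads1, reads2, prefix, max_mismatches=2, threads=4):
    """Return (argv, stdout_file) pairs for aligning paired-end reads."""
    sai1, sai2 = prefix + "_1.sai", prefix + "_2.sai"
    aln = [BWA, "aln", "-n", str(max_mismatches), "-t", str(threads), INDEX]
    return [
        (aln + [reads1], sai1),
        (aln + [reads2], sai2),
        ([BWA, "sampe", INDEX, sai1, sai2, reads1, reads2], prefix + ".sam"),
    ]

# Hypothetical input files; run once for the experiment and once for the control.
single = single_end_commands("chip_rep1.fastq", "chip_rep1")
paired = paired_end_commands("chip_1.fastq", "chip_2.fastq", "chip")
for argv, stdout_file in single + paired:
    print(shlex.join(argv), ">", shlex.quote(stdout_file))
```

To execute a step, `subprocess.run(argv, stdout=open(stdout_file, "w"), check=True)` would be one way to apply the redirection without a shell.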


File manipulation with samtools

• utilities that manipulate SAM/BAM files
• manual (read it!)

1) merge the replicates in one file (still separate experiment and control)
2) convert the SAM file into a BAM file (binary version of SAM, smaller)
3) sort and index the BAM file

Now the sequencing files are ready for further analysis.

software at the MPI: SAMTOOLS = /scratch/ngsvin/bin/executables/samtools
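A sketch of these steps as samtools command lines, assembled in Python and only printed. It assumes the option syntax of samtools 1.x (`sort -o`); older 0.1.x releases use a different `sort` syntax, so check the manual of the installed version. The file names are hypothetical. Since `samtools merge` expects sorted inputs, this sketch converts and sorts each replicate first and merges afterwards, which differs from the order listed on the slide.

```python
import shlex

SAMTOOLS = "/scratch/ngsvin/bin/executables/samtools"

def prepare_bam_commands(replicate_prefixes, merged_prefix):
    """Return argv lists: SAM -> BAM, sort each replicate, merge, index."""
    commands = []
    sorted_bams = []
    for prefix in replicate_prefixes:
        bam = prefix + ".bam"
        sorted_bam = prefix + ".sorted.bam"
        commands.append([SAMTOOLS, "view", "-b", "-o", bam, prefix + ".sam"])
        commands.append([SAMTOOLS, "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    merged = merged_prefix + ".bam"
    commands.append([SAMTOOLS, "merge", merged] + sorted_bams)
    commands.append([SAMTOOLS, "index", merged])
    return commands

# Hypothetical replicate names; experiment and control are processed separately.
commands = prepare_bam_commands(["chip_rep1", "chip_rep2"], "chip_merged")
for argv in commands:
    print(shlex.join(argv))
```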


Peak finding with MACS

• find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound
• manual (read it!)

1) call the peaks using the experiment (treatment) data vs. control
2) set the parameters, e.g. fragment length, treatment of duplicate reads
3) analyse the MACS results (BED file with peaks/summits)

software at the MPI: MACS = /scratch/ngsvin/bin/executables/macs
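For step 3, a minimal sketch of reading a peak BED file and keeping the strongest peaks. It assumes a tab-separated BED with five columns (chrom, start, end, name, score); check the MACS README for what the score column of your output file means. The toy peaks and the cutoff of 100 are made up.

```python
def parse_bed_peaks(lines):
    """Parse BED lines into dicts, skipping comment, track and browser lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        peaks.append({
            "chrom": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            "name": fields[3] if len(fields) > 3 else "",
            "score": float(fields[4]) if len(fields) > 4 else 0.0,
        })
    return peaks

def top_peaks(peaks, min_score):
    """Keep peaks with score >= min_score, strongest first."""
    kept = [peak for peak in peaks if peak["score"] >= min_score]
    return sorted(kept, key=lambda peak: peak["score"], reverse=True)

# Hypothetical peak file content.
toy_bed = (
    "track name=toy\n"
    "chr1\t100\t400\tpeak_1\t250.5\n"
    "chr1\t900\t1200\tpeak_2\t45.0\n"
    "chr2\t50\t300\tpeak_3\t120.0\n"
)
peaks = parse_bed_peaks(toy_bed.splitlines())
strong = top_peaks(peaks, min_score=100)
print([peak["name"] for peak in strong])
```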


Finding the target genes

• find the genes which are closest to the (significant) peaks
• how to define the closest distance? (± X kb)
• use ChIPpeakAnno in Bioconductor or bedtools
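The nearest-gene idea can be sketched in plain Python as below. This is not what ChIPpeakAnno or bedtools do internally; it is a simplified illustration that represents each gene by its TSS only, ignores strand and gene bodies, and uses a hypothetical cutoff of ±10 kb. Gene names and coordinates are made up.

```python
import bisect

def assign_peaks_to_genes(peaks, genes, max_distance=10000):
    """Assign each peak summit to the gene with the nearest TSS within max_distance.

    peaks: list of (chrom, summit); genes: list of (chrom, tss, name).
    Returns (chrom, summit, gene, tss - summit) tuples for assigned peaks.
    """
    by_chrom = {}
    for chrom, tss, name in genes:
        by_chrom.setdefault(chrom, []).append((tss, name))
    for entries in by_chrom.values():
        entries.sort()

    assignments = []
    for chrom, summit in peaks:
        entries = by_chrom.get(chrom, [])
        if not entries:
            continue
        # The nearest TSS is either just left or just right of the insertion point.
        idx = bisect.bisect_left(entries, (summit, ""))
        candidates = entries[max(idx - 1, 0):idx + 1]
        tss, name = min(candidates, key=lambda entry: abs(entry[0] - summit))
        if abs(tss - summit) <= max_distance:
            assignments.append((chrom, summit, name, tss - summit))
    return assignments

# Hypothetical annotation and peak summits.
genes = [("chr1", 1000, "geneA"), ("chr1", 50000, "geneB"), ("chr2", 2000, "geneC")]
peaks = [("chr1", 1500), ("chr1", 30000), ("chr2", 1900), ("chr3", 5)]
targets = assign_peaks_to_genes(peaks, genes)
print(targets)
```

The choice of X directly changes the size of the target list, so it is worth reporting results for more than one cutoff.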


Methods for Identifying TF Target Genes

Microarray

PWM Genome Scan

ChIP-seq

Thresholds


Bioinformatics

• Read mapping (Bowtie/bwa)

• Peak Calling (MACS/Bioconductor)

• Peak-Target Analysis (Bioconductor)


• Microarray data analysis (Bioconductor)

• Differential Genes (R)

• GSEA

• PWM Genome Scan (TRAP/MatScan)

• Statistics (R)

• Data Integration (R/Python/Perl)

• Statistical Analysis (R)
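One possible form of the data integration step is to compare the target gene lists produced by the three methods. The sketch below computes pairwise overlaps, the Jaccard index and a hypergeometric tail probability for each pair. The gene sets, the universe of 20 genes and the choice of these statistics are assumptions for illustration, not a prescribed analysis.

```python
from itertools import combinations
from math import comb

def hypergeom_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn from a universe containing size_a targets."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total

def compare_target_sets(target_sets, universe):
    """Pairwise overlap, Jaccard index and hypergeometric p-value between methods."""
    results = {}
    for (name_a, set_a), (name_b, set_b) in combinations(target_sets.items(), 2):
        shared = set_a & set_b
        results[(name_a, name_b)] = {
            "overlap": len(shared),
            "jaccard": len(shared) / len(set_a | set_b),
            "p_value": hypergeom_tail(len(shared), len(set_a), len(set_b), universe),
        }
    return results

# Hypothetical target lists from the three methods, out of 20 genes in total.
target_sets = {
    "PWM": {"g1", "g2", "g3", "g4", "g5"},
    "microarray": {"g3", "g4", "g5", "g6", "g7"},
    "ChIP-seq": {"g4", "g5", "g6", "g10"},
}
results = compare_target_sets(target_sets, universe=20)
for pair, stats in results.items():
    print(pair, stats["overlap"], round(stats["jaccard"], 3), round(stats["p_value"], 4))
```

The p-value depends strongly on the chosen universe (all genes, or only genes measured by every method), so that choice should be stated alongside the result.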


Bioinformatics tools

• Bowtie bowtie-bio.sourceforge.net/manual.shtml
• bwa bio-bwa.sourceforge.net/bwa.shtml
• MACS github.com/taoliu/MACS/blob/macs_v1/README.rst
• TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
• matrix-scan http://rsat.ulb.ac.be/
• Bioconductor www.bioconductor.org/ (more info in R course)

READ THE MANUALS!

Databases

• GEO www.ncbi.nlm.nih.gov/geo/
• ENCODE genome.ucsc.edu/ENCODE/
• SRA www.ncbi.nlm.nih.gov/sra
• JASPAR http://jaspar.genereg.net/


Schedule

• 03.03. Introduction lecture, R course
• 04.03. R & Bioconductor homework submission
• 11.03. Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work), 10:30am, 11:30am
• every Tuesday 10:30am, 11:30am progress meetings
• 17.04. Final report deadline
• 24.04. (tentative) Presentations
• 28.04. Final meeting, discussion of final reports


GR Group

• Expression and ChIP-seq data: Luca F, Maranville JC, et al., PLoS ONE, 2013

• PWM database: jaspar.genereg.net


c-Myc Group

• Expression data: Cappellen, Schlange, Bauer et al., EMBO reports, 2007

• Musgrove et al., PLoS One, 2008

• ChIP-seq data: ENCODE Project

• PWM database: jaspar.genereg.net


Additional analysis

Binding motifs

• are the overrepresented motifs in the ChIP-peak regions different?

• do we find any co-factors?

Recommended tool: RSAT rsat.ulb.ac.be