Page 1: Motif Finding Workshop Project

1 MF workshop 08 © Ron Shamir

Motif Finding Workshop Project

Chaim Linhart
January 2008

Page 2: Motif Finding Workshop Project

Outline

1. Some background again…
2. The project

Page 3: Motif Finding Workshop Project

1. Background

Slides with Ron Shamir and Adi Akavia

Page 4: Motif Finding Workshop Project

Gene: from DNA to protein

DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein

Page 5: Motif Finding Workshop Project

DNA
• DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
  Forward/sense strand: AACTTGCG
  Reverse-complement/anti-sense strand: TTGAACGC (aligned base-by-base under the forward strand, i.e. read 3’→5’; read 5’→3’ it is CGCAAGTT)
• Directional: from 5’ to 3’:
  5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
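A minimal Java sketch of the strand relationship described on this slide. The class and method names are hypothetical, and keeping N as N (and preserving case) is an assumption based on the input description later in the deck. The method returns the anti-sense strand written 5’→3’, so for AACTTGCG it yields CGCAAGTT, which is the slide's TTGAACGC read in the opposite direction.

```java
final class ReverseComplement {

    /** Complement of a single base; N (unknown/masked) maps to itself. Case is preserved. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'N': return 'N';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            case 'n': return 'n';
            default:
                throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    /** Walks the sequence from its 3' end to its 5' end, complementing each base. */
    static String reverseComplement(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            out.append(complement(seq.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACTTGCG")); // prints CGCAAGTT
    }
}
```

A two-strand search can then scan both the input sequence and its reverse complement with the same code.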

Page 6: Motif Finding Workshop Project

Gene structure (eukaryotes)

[Figure: gene structure. DNA (coding strand) with the promoter and the transcription start site (TSS). Transcription (RNA polymerase) gives the pre-mRNA (exon, intron, exon). Splicing (spliceosome) gives the mature mRNA (5’ UTR, start codon, coding region, stop codon, 3’ UTR). Translation (ribosome) gives the protein.]

Page 7: Motif Finding Workshop Project

Translation
• Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation

• Stop codons - signal termination of the protein synthesis process

http://ntri.tamuk.edu/cell/ribosomes.html

Page 8: Motif Finding Workshop Project

Genome sequences
• Many genomes have been sequenced, including those of viruses, microbes, plants and animals.

• Human:
  – 23 pairs of chromosomes
  – 3+ Gbps (bps = base pairs), only ~3% are genes
  – ~25,000 genes

• Yeast:
  – 16 chromosomes
  – 20 Mbps
  – 6,500 genes

Page 9: Motif Finding Workshop Project

Regulation of Expression

• Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks

• Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, and physiological conditions

• Main regulatory mechanism – transcriptional regulation

Page 10: Motif Finding Workshop Project

Transcriptional regulation

•Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs)

•TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS)

•BSs of a particular TF share a common pattern, or motif

•Some TFs operate together – TF modules

[Figure: TFs bound at binding sites (BS) near the TSS of a gene; strand drawn 5’ to 3’]

Page 11: Motif Finding Workshop Project

TFBS motif models

•Consensus (“degenerate”) string:
  [Figure: binding sites in the promoters of gene 1 … gene 10]
  Sites: AACTGT, CACTGT, CACTCT, CACTGT, AACTGT
  Consensus: [AC]ACT[CG]T

•Statistical models…

•Motif logo representation
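A minimal Java sketch of how a consensus (“degenerate”) string can be derived from aligned binding sites of equal length. The class name, the method name and the bracket notation for positions with more than one base are illustrative assumptions, not part of the workshop material.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class ConsensusString {

    /**
     * Builds a degenerate consensus: a column with a single base is written as that base,
     * a column with several bases is written as a bracketed, alphabetically sorted set.
     */
    static String consensus(List<String> sites) {
        if (sites.isEmpty()) {
            throw new IllegalArgumentException("At least one site is required");
        }
        int length = sites.get(0).length();
        StringBuilder out = new StringBuilder();
        for (int col = 0; col < length; col++) {
            TreeSet<Character> bases = new TreeSet<>();
            for (String site : sites) {
                if (site.length() != length) {
                    throw new IllegalArgumentException("Sites must have equal length");
                }
                bases.add(Character.toUpperCase(site.charAt(col)));
            }
            if (bases.size() == 1) {
                char only = bases.first();
                out.append(only);
            } else {
                out.append('[');
                for (char b : bases) {
                    out.append(b);
                }
                out.append(']');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("AACTGT", "CACTGT", "CACTCT", "CACTGT", "AACTGT");
        System.out.println(consensus(sites)); // prints [AC]ACT[CG]T
    }
}
```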

Page 12: Motif Finding Workshop Project

Human G2+M cell-cycle genes: The CHR – NF-Y module

CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18

CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23

CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0

CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110

CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45

CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10

TFs: NF-Y, CHR

BSs are short, non-specific, hiding in both strands and at various locations along the promoters

Page 13: Motif Finding Workshop Project

The computational challenge

• Given a set of co-regulated genes (e.g., from gene expression chips)

• Find a motif that is over-represented (occurs unusually often) in their promoters

• This may be the TF binding site motif

• Find TF modules – over-represented motifs that tend to co-occur

Page 14: Motif Finding Workshop Project

The computational challenge (II)

• Motifs can also be found w/o a given target-set – “genome-wide”

• Find a motif that is localized - occurs more often near the TSS of genes

• Find a motif with a strand bias – occurs more often on the genes’ coding strand

• Find TF modules with biases in their order / orientation / distance

Page 15: Motif Finding Workshop Project

Motif finding algorithms
• >100 motif finding algs
• Main differences between them:
  – Type of analysis & input:
    • Target-set vs. genome-wide
    • Single vs. multi-species (conservation)
    • Single motifs vs. modules
  – Motif model
  – Score for evaluating motif
  – Motif search technique:
    • Combinatorial (enumeration) vs. Statistical optimization

Page 16: Motif Finding Workshop Project

Example - Amadeus

Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle:

[Figure: the two reported motifs, CHR and NF-Y]

Page 17: Motif Finding Workshop Project

2. The project

Page 18: Motif Finding Workshop Project

General goals
• Develop software from A-Z:
  – Design
  – Implementation
  – (Optimization)
  – Execution & analysis of real data
• A taste of bioinformatics
• Have fun
• Get credit…

Page 19: Motif Finding Workshop Project

The computational task
• Given a set of DNA sequences
• Find “interesting” pairs of motifs:
  – Order bias
  – Other scores…
• Main challenges:
  – Performance (time, memory)
  – Output redundancy

Page 20: Motif Finding Workshop Project

Input
File with DNA sequences in “fasta” format:

>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANNCGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgctagcggatnggctannnnnatctag

Page 21: Motif Finding Workshop Project

Input (II)
• Ignore the header lines
• Sequence may span multiple lines or one long line
• Sequence contains the characters A,C,G,T,N in upper or lower case
• “N” means unknown or masked base
• Sample input files will be supplied
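The input rules above can be sketched in Java roughly as follows. This is one possible reading, not the supplied workshop code: the class and method names are hypothetical, and converting everything to upper case and rejecting characters other than A, C, G, T, N are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FastaReader {

    /** Returns one upper-case string per record; header lines (starting with '>') are ignored. */
    static List<String> parse(Iterable<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A header closes the previous record and opens a new one.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (current == null) {
                    throw new IllegalArgumentException("Sequence data before the first header line");
                }
                String upper = line.toUpperCase(Locale.ROOT);
                for (int i = 0; i < upper.length(); i++) {
                    if ("ACGTN".indexOf(upper.charAt(i)) < 0) {
                        throw new IllegalArgumentException("Unexpected character: " + upper.charAt(i));
                    }
                }
                // A sequence may span several lines, so keep appending.
                current.append(upper);
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        List<String> sequences = parse(Files.readAllLines(Paths.get(args[0])));
        System.out.println(sequences.size() + " sequences read");
    }
}
```

Taking the lines as an `Iterable<String>` keeps the parsing logic independent of file I/O, which makes it easy to test with in-memory data.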

Page 22: Motif Finding Workshop Project

Input (III)
• Search parameters:
  – Length of motifs (between 5-10)
  – Min. + Max. distance between the motifs:
      ACGGATTGATNNNTGGATGCCAT   distance=9
  – Single vs. two strands search
  – Min. number of occurrences (hits) of pair (don’t count overlaps, e.g. AAAAAA):
      GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA   hit hit hit
  – Max. p-value
  – Additional parameters…
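A small Java sketch of two of the search parameters above: the distance between two motif occurrences, and counting hits without overlaps. The names are hypothetical. Taking the distance as the number of bases between the end of the first motif and the start of the second is an assumption; it gives 9 for the slide's example if the two motifs are GGATT and ATGCC, which is also a guess, since the transcript does not show which substrings were highlighted.

```java
final class PairDistance {

    /** Number of bases strictly between motif A (at startA, length lengthA) and motif B (at startB). */
    static int gap(int startA, int lengthA, int startB) {
        return startB - (startA + lengthA);
    }

    /** Counts occurrences left to right, jumping past each hit so that hits never overlap. */
    static int countNonOverlapping(String seq, String motif) {
        if (motif.isEmpty()) {
            throw new IllegalArgumentException("Motif must not be empty");
        }
        int count = 0;
        int from = seq.indexOf(motif);
        while (from >= 0) {
            count++;
            from = seq.indexOf(motif, from + motif.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String seq = "ACGGATTGATNNNTGGATGCCAT";
        int startA = seq.indexOf("GGATT");
        int startB = seq.indexOf("ATGCC");
        System.out.println(gap(startA, 5, startB));               // prints 9
        System.out.println(countNonOverlapping("AAAAAA", "AAA")); // prints 2
    }
}
```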

Page 23: Motif Finding Workshop Project

Output
A. A list of the string pairs with the best order-bias score (smallest p-values):

Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12

B. A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores

Page 24: Motif Finding Workshop Project

Part A: String pairs with order bias

• nA = # of A→B ; nB = # of B→A
• WLOG, nA > nB
• n = nA + nB
• H0 = random order: nA ~ B(n, 0.5)
• p-value = prob for at least nA occurrences of A→B = tail of B(n, 0.5)
• Normal approximation (central limit thm.)
• Fix for multiple testing: x2

$$\text{Binomial tail}(n, p, k) = \sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}$$
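A minimal Java sketch of the order-bias score, assuming the exact binomial tail is summed in log space rather than using the normal approximation mentioned on the slide. The names are hypothetical, and reading "x2" as doubling the one-sided tail (capped at 1) is an assumption.

```java
final class OrderBias {

    /** P(X >= k) for X ~ B(n, p), computed as an exact sum of log-space terms. */
    static double binomialTail(int n, double p, int k) {
        if (n < 0 || !(p > 0.0 && p < 1.0)) {
            throw new IllegalArgumentException("Need n >= 0 and 0 < p < 1");
        }
        if (k <= 0) {
            return 1.0;
        }
        if (k > n) {
            return 0.0;
        }
        // logFact[i] = ln(i!), built incrementally to avoid overflow for large n.
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        double logP = Math.log(p);
        double logQ = Math.log1p(-p);
        double sum = 0.0;
        for (int j = k; j <= n; j++) {
            double logTerm = logFact[n] - logFact[j] - logFact[n - j] + j * logP + (n - j) * logQ;
            sum += Math.exp(logTerm);
        }
        return Math.min(1.0, sum);
    }

    /** Order-bias p-value for a string pair, doubled because the larger of the two counts is tested. */
    static double orderBiasPValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int k = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * binomialTail(n, 0.5, k));
    }

    public static void main(String[] args) {
        System.out.println(binomialTail(4, 0.5, 3));
        System.out.println(orderBiasPValue(97, 17));
    }
}
```

The exact sum is O(n) per pair; for very large counts the normal approximation named on the slide would be cheaper.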

Page 25: Motif Finding Workshop Project

Part B: Non-redundant list of motif pairs

• Collect similar strings to motif with better score (motif = consensus):

String pair (p-value)
ACGTT , GGATT (4.3E-15)
ACGAT , GGATT (2.4E-11)
AGGAT , GGTTT (1.7E-5)
AGGTT , GGTTT (5.9E-5)

Motif pair
[motif logos not recoverable] (8.1E-31)

• Don’t report similar motif pairs:
  – Motifs that consist of similar strings
  – Motif pairs that are small shifts of one another
  – Palindromes

Page 26: Motif Finding Workshop Project

Part B (cont.): Additional score

Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A
sB = # of sequences that contain motif B
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = prob for at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution

$$\text{HG tail}(N, s_A, s_B, s_{AB}) = \sum_{i=s_{AB}}^{\min(s_A, s_B)} \frac{\binom{s_B}{i} \binom{N-s_B}{s_A-i}}{\binom{N}{s_A}}$$
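A minimal Java sketch of the co-occurrence score, assuming the hypergeometric tail is summed directly over i from sAB to min(sA, sB) using log-factorials. The names are hypothetical, and sB (the number of sequences that contain motif B) is taken as a fourth input alongside N, sA and sAB.

```java
final class CoOccurrence {

    /** ln of the binomial coefficient C(n, k), using a precomputed table of ln(i!). */
    private static double logChoose(double[] logFact, int n, int k) {
        return logFact[n] - logFact[k] - logFact[n - k];
    }

    /** P(at least sAB sequences contain both motifs) when sA and sB hits are placed independently. */
    static double hgTail(int total, int sA, int sB, int sAB) {
        if (total < 0 || sA < 0 || sB < 0 || sA > total || sB > total) {
            throw new IllegalArgumentException("Need 0 <= sA, sB <= total");
        }
        double[] logFact = new double[total + 1];
        for (int i = 1; i <= total; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        // Overlaps below sA + sB - total are impossible, so those terms are skipped.
        int lo = Math.max(sAB, Math.max(0, sA + sB - total));
        int hi = Math.min(sA, sB);
        double logDenominator = logChoose(logFact, total, sA);
        double sum = 0.0;
        for (int i = lo; i <= hi; i++) {
            double logTerm = logChoose(logFact, sB, i)
                    + logChoose(logFact, total - sB, sA - i)
                    - logDenominator;
            sum += Math.exp(logTerm);
        }
        return Math.min(1.0, sum);
    }

    public static void main(String[] args) {
        System.out.println(hgTail(4, 2, 2, 2));
    }
}
```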

Page 27: Motif Finding Workshop Project

Part B (cont.): Additional score

Option II: Distance bias
Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?

Option III: Gap variability
Are the sequences between the motifs conserved (H0), or are they highly variable?

Other options??

Page 28: Motif Finding Workshop Project

Implementation
• Java (Eclipse) ; Linux
• GUI: Simple graphical user interface for supplying the input parameters and reporting the results
• Packages for motif logo and statistical scores will be supplied
• Time performance will be measured only for part A
• Reasonable documentation
• Separate packages for data-structures, scores, GUI, I/O, etc.

Page 29: Motif Finding Workshop Project

Design document
• Due in 3 weeks (Feb 24)
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and output of program
• Describe main data structures, algorithms, and scores for parts A+B
• Meet with me before submission

Page 30: Motif Finding Workshop Project

Fin