i529 - lab 6 go+meme + twinscan

I529 - Lab 6GO+MEME + TwinScan

AI : Kwangmin Choi

Today’s topics• Gene Ontology prediction/mapping

– AmiGo• http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

– PFP• http://dragon.bio.purdue.edu/pfp/

– GOtcha• http://www.compbio.dundee.ac.uk/gotcha/

• Pathway prediction/mapping– KAAS

• http://www.genome.jp/kegg/kaas• MEME• TwinScan

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

http://dragon.bio.purdue.edu/pfp/

http://www.compbio.dundee.ac.uk/gotcha/

http://www.genome.jp/kegg/kaas

Gene Ontology

• In a species-independent manner., the GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated

GO:biological process• A biological process is series of events accomplished by one

or more ordered assemblies of molecular functions. – E.g. cellular physiological process or signal transduction. – E.g. pyrimidine metabolic process or alpha-glucoside transport.

• It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct steps.

• A biological process is not equivalent to a pathway; at present, GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway.

GO: molecular functions• Molecular function describes activities, such as catalytic or

binding activities, that occur at the molecular level.

• GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions,

• GO milecular function terms do not specify where or when, or in what context, the action takes place. – E..g. (general) catalytic activity, transporter activity, or binding

etc.– E.g. (specific) adenylate cyclase activity, Toll receptor binding

etc.

GO: cellular components• A cellular component is just that, a component of a

cell, but with the proviso that it is part of some larger object;

• Less informative

• This may be an anatomical structure – e.g. rough endoplasmic reticulum or nucleus

• or a gene product group – e.g. ribosome, proteasome or a protein dimer

AmiGO• URL http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

• AmiGO is the official tool for searching and browsing the Gene Ontology database

• Simple blast search is provided (not useful)

• AmiGO consists of a controlled vocabulary of terms covering biological concepts, and a large number of genes or gene products whose attributes have been annotated using GO terms.

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

PFP (Automated Protein Function Prediction Server)

• Hawkins, T., Luban, S. and Kihara, D. 2006. Enhanced Automated Function Prediction Using Distantly Related Sequences and Contextual Association by PFP. Protein Science 15: 1550-6.

• The PFP algorithm has been shown to increase coverage of sequence-based function annotation more than fivefold by extending a PSI-BLAST search to extract and score GO terms individually

• It applies the Function Association Matrix (FAM), to score significantly associating pairs of annotations.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16672240&query_hl=21&itool=pubmed_docsum



PFP method• PFP uses a scoring scheme to rank GO

annotations assigned to all of the most similar sequences according to – (1) their frequency of occurrence in those sequences – (2) the degree of similarity of the originating

sequence to the query.

• This is similar to the scoring basis for the R-value used by the GOtcha method to score annotations from pairwise alignment matches (Martin et al. 2004)

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16672240#b08

PFP method

• A GO term, fa

• s(fa) is the final score assigned to the GO term, fa • N is the number of the similar sequences retrieved by PSI-BLAST • E_value(i) is the E-value given to the sequence I• b = 2 (or log10[100]) to allow the use of sequence matches to an E-value of 100.• Function Association Matrix (FAM),

– fj is a GO term assigned to the sequence i. – P(fa | fj) is the conditional probability that fa is associated with fj, – c(fa, fj) is number of times fa and fj are assigned simultaneously to each sequence in UniProt – c(fj) is the total number of times fj appeared in UniProt, – μ is the size of one dimension of the FAM (i.e., the total number of unique GO terms)– ɛ is the pseudo-count.

PFP

• Web server http://dragon.bio.purdue.edu/pfp/queue/1168_kw.f.result.html

• Local installation– http://dragon.bio.purdue.edu/pfp/dist– You need to specify the path of blastp – And also need BLOSUM62

http://dragon.bio.purdue.edu/pfp/queue/1168_kw.f.result.html

http://dragon.bio.purdue.edu/pfp/queue/1168_kw.f.result.html

http://dragon.bio.purdue.edu/pfp/dist

PFP (Automated Protein Function Prediction Server)

• PFP output– http://darwin.informatics.indiana.edu/col/courses/I529-11/Lab/Lab5/Data/pfp_data/0008.out

• Columns– 1: predicted GO term– 2: GO category (f/p/c)– 3: raw term score– 4: term p-value– 5: rank (by p-value)– 6: confidence to be exact match– 7: rank (by column 7)– 8: confidence within 2 edges on the GO DAG– 9: rank (by column 8)– 10: confidence within 4 edges on the GO DAG– 11: rank (by column 10)– 12: GO term short definition

GOtcha

• The GOtcha method – Martin et al. BMC Bioinformatics (2004) 5:178.

• GOtcha assigns functional terms transitively based upon sequence similarity.

• These terms are ranked by probability and displayed graphically on a subtree of Gene Ontology.

http://www.biomedcentral.com/1471-2105/5/178





• GOtcha performs a BLAST search of the query sequence against individual well annotated genomes.

• Annotations are transitively assigned from all hits, with a score corresponding to the E-value, individual GO-terms receiving cumulative scores from multiple sequence similarity matches.

• Cumulative scores are normalized and, for each term, two scores are obtained

– the I-score which is normalized to the root node,

– the C-score which is the cumulative score at the root node.

• For each GO-term a precomputed scoring table is used to establish the assignment likelihood for that term given that I-score and that C-score. This is represented as a probability

Gotcha method

Results: http://darwin.informatics.indiana.edu/col/courses/I529-11/Lab/Lab5/Data/Gotcha/gotcha.4451/

KAAS• KAAS (KEGG Automatic Annotation Server) provides

functional annotation of genes in a genome by BLAST comparisons against a manually curated set of ortholog groups in KEGG GENES.

• The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways.

• Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A., and Kanehisa, M.; KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35, W182-W185 (2007). [NAR]

http://nar.oxfordjournals.org/cgi/content/full/35/suppl_2/W182

KAAS• Web server: http://www.genome.jp/kegg/kaas/

• KAAS works best when a complete set of genes in a genome is known. Prepare query amino acid sequences and use the BBH (bi-directional best hit) method to assign orthologs.

• KAAS can also be used for a limited number of genes. Prepare query amino acid sequences and use the SBH (single-directional best hit) method to assign orthologs.

• When ESTs are comprehensive enough, a set of consensus contigs can be generated by the EGassembler server and used as a gene set for KAAS with the BBH method. Otherwise, use ESTs as they are with the SBH method.

http://www.genome.jp/kegg/kaas/

http://egassembler.hgc.jp/

KAAS workflow

Pathway mapping

• KAAS returns – KO list– KEGG Atlas Metabolism map [Create atlas]– Pathway maps [Create all maps]– Hierarchy files

• You can highlight KEGG maps using KEGG API– http://www.genome.jp/kegg/soap/doc/keggapi_man

ual.html– See: color_pathway_by_objects

http://www.genome.jp/kegg/kaas/files/dl/1235141399/query.ko

http://www.genome.jp/kaas-bin/kaas_main?mode=markatlas&id=1235141399&[email protected]

http://www.genome.jp/kaas-bin/kaas_main?mode=markmap&id=1235141399&[email protected]

http://www.genome.jp/kegg/kaas/files/dl/1235141399/hier.tar.gz

http://www.genome.jp/kegg/soap/doc/keggapi_manual.html

http://www.genome.jp/kegg/soap/doc/keggapi_manual.html

21

• Discover (conserved) motifs in a group of unaligned and related sequences (DNA or protein)

• Unsupervised: Automatically choose the following (with little or no prior knowledge)– Best width of motifs– Number of occurrences in each sequence– Composition of each motif

MEME• MEME (Multiple EM for Motif Elicitation) discovers patterns in

nucleotide and amino acid sequences. • MEME paper • Few tutorials - 1( Pg 6 ), 2, 3, 4

• MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. – MEME uses both alternately EM to find model parameters that

maximize the likelihood of the model

• Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

http://www.sdsc.edu/~tbailey/papers/ismb94.pdf

http://ai.stanford.edu/~serafim/CS262_2007/notes/lecture18.pdf

Motif finding using the EM algorithm MEME (Bailey & Elkan 1995)

http://meme.sdsc.edu/meme/intro.html

• MEME works by iteratively refining matrix and identifying sites:– Estimate motif model

• Start with an n-mer seed (random or specified)• Build a matrix by incorporating some of background frequencies

– Identify examples of the model• For every n-mer in the input set, identify its probability given the

matrix model– Re-estimate the motif model

• Calculate a new matrix, based on the weighted frequencies of all n-mers in the set

– Iteratively refine the matrix and identify sites until convergence.

24

Expectation Maximization

1. Expectation step – initial guess about the location of a (variable) sequence pattern in a set of sequences

2. Maximization step – improve/update pattern. Iterative search

25

Expectation Maximization Algorithm

dataset - unaligned set of sequences (training data) S1, S2, …, Si, …, Sn each of length L

W - width of motifp - matrix of probabilities that the

motif starts in position j in Si Z - matrix representing the

probability of character c in column k (the character c will be A, C, G, or T for DNA sequences or one of the 20 protein characters)

e - epsilon value

26

MEME Algorithm

S1: GGCTATTGCAGATGACGAGATGAGGCCCAGACC

S2: GGATGACTTATATAAAGGACGATAAGAGATGAC

S3: CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

S4: GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

1. MEME uses an initial EM heuristic to estimate the bestStarting-point matrix:

G 0.26 0.24 0.18 0.26 0.25 0.26A 0.24 0.26 0.28 0.24 0.25 0.22T 0.25 0.23 0.30 0.25 0.25 0.25C 0.25 0.27 0.24 0.25 0.25 0.27

Problem: find a 6-mer motif in 4 sequences

GGCTATTGCATATGACGAGATGAGGCCCAGACC

GGATGACTTATATAAAGGACCGTGATAAGAGATTAC

CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC

GATGACGGAGTATTAAAGACTCGATGAGTTATACGA

2. MEME scores the match of all 6-mers to current matrixHere, just consider the

underlined 6-mers, Although in reality all

6-mers are scored





2. MEME scores the match of all 6-mers to current matrix

3. Reestimate the matrix based on the weighted contribution of all 6 mers

G 0.29 0.24 0.17 0.27 0.24 0.30A 0.22 0.26 0.27 0.22 0.28 0.18T 0.24 0.23 0.33 0.23 0.24 0.28C 0.24 0.27 0.23 0.28 0.24 0.24

The height of the basesabove corresponds tohow much that 6-mer counts in calculatingthe new matrix





MEME scores the match of all 6-mers to current matrix





Reestimate the matrix based on the weighted contribution of all 6 mers

G 0.40 0.20 0.15 0.42 0.24 0.30A 0.30 0.30 0.20 0.24 0.46 0.18T 0.15 0.30 0.45 0.16 0.15 0.28C 0.15 0.20 0.20 0.16 0.15 0.24

The height of the basesabove corresponds tohow much that 6-mer counts in calculatingthe new matrix





MEME scores the match of all 6-mers to current matrix

Iterations continue until convergence (ie. numbers don’t change much between iterations)

Final motif

G 0.85 0.05 0.10 0.80 0.20 0.35A 0.05 0.60 0.10 0.05 0.60 0.10T 0.05 0.30 0.70 0.05 0.20 0.10C 0.05 0.05 0.10 0.10 0.10 0.35

34

Expectation Maximization Idea

MEME web server• MEME website

• Protein motif• Sample Input • Fill in email address• Keep all else default

• DNA motif• Sample Input • Minimum motif width: 4• Number of different motifs: 2

36

Types of Possible Motif Models

1. OOPS• One occurrence per sequence of the

motif in the dataset

2. ZOOPS• Zero or one motif occurrences per

dataset sequence• TCM• Motif to appear any number of times

in a sequence (two-component mixture)

A simple DNA examplememe dna.fasta -dna -mod oops -pal -o output_dir

• MEME looks for a single motif in the file dna.fasta which contains DNA sequences in FASTA format.

• The OOPS model is used so MEME assumes that every sequence contains exactly one occurrence of the motif.

• The palindrome switch is given so the motif model (PSPM) is converted into a palindrome by combining corresponding frequency columns.

• If MEME decides that a motif is a palindrome, it averages the letter frequencies in corresponding columns together.

• For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together.

• MEME automatically chooses the best width for the motif in this example since no width was specified.

Searching for motifs on both DNA strands

meme dna.fasta -dna -mod oops -revcomp -o output_dir

• The -revcomp switch tells MEME to consider both DNA strands, and

• The -pal switch is absent so the palindrome conversion is omitted.

• When DNA uses both DNA strands, motif occurrences on the two strands may not overlap. That is, any position in the sequence given in the training set may be contained in an occurrence of a motif on the positive strand or the negative strand, but not both.

A simple protein examplememe protein.fasta -mod oops -maxw 20 -nmotifs 2 -o output_dir

• The -dna switch is absent, so MEME assumes the input file as protein sequences.

• Each motif is assumed to occur in each of the sequences because the OOPS model is specified.

• Specifying -maxw 20 makes MEME run faster since it does not have to consider motifs longer than 20.

• MEME searches for two motifs each of width less than or equal to 20.

MEME on BigRed

• MEME is installed on BigRed.• KB: http://kb.iu.edu/data/awwa.html

TWINSCAN

TwinScan• TwinScan finds genes in a "target" genomic sequence by simultaneously

maximizing the probability of the gene structure in the target and the evolutionary conservation derived from "informant" genomic sequences.

• The target sequence (i.e. the sequence to be annotated) should generally be of draft or finished quality.

• The informant can range from a single sequence to a whole genome in any condition from raw shotgun reads to finished assembly.

• References– http://mblab.wustl.edu/media/publications/Hu-Brent-2003-Proof.pdf– http://mrw.interscience.wiley.com/emrw/9780471250951/cp/cpbi/article/

bi0408/current/abstract– http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/

S140

Requirements• TwinScan/ Nscan 4.0 executable

• DNA sequence in FASTA format

• Twinscan parameter file (/parameters/twinscan_parameters) – Each filename contains the name of the target organism: eg maize_twinscan.zhmm.

• Conservation sequence (see examples/example.conseq)– A symbolic representation of the the best alignments between the target and informant sequences. – To create this conservation sequence, WU-BLAST (http://blast.wustl.edu) is used. – For NCBI BLAST, the input parameters need to be changed. (see parameters/examples/example_blast_parameters.txt)– xdformat (WU-BLAST package) formats the informant sequences to create a blast database. – After running BLAST, the output must be formatted with conseq, which is included in this package.– Example (using WU-BLAST)

• xdformat -n informant.fa• Blast M=1 N=-1 Q=5 R=1 B=10000 V=100 -cpus=1 -warnings -lcfilter filter=seg filter=dust topcomboN=1 informant.fa target.fa >

blast.out• conseq target.fa blast.out > conseq.fa

– Note: The runTwinscan2 script will run these steps without user intervention (see below).

• EST sequence (optional)– EST sequence is a symbolic representation of evidence from ESTs that align to the target sequence. – The estseq script included in the distribution creates EST sequence when given a DNA sequence and a (set of) BLAT reports of

the the ESTs aligned to the target. – For downloading BLAT, go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3

Web service and Installation• Web service

– http://mblab.wustl.edu/nscan/submit/

• Local installation– http://mblab.wustl.edu/software/twinscan/

• Installation step$ tar xvzf iscan-4.1.2.tar.gz$ cd N-SCAN$ make linux$ ./test-executable

How to run

• Usagetwinscan <parameter file> <masked sequence file> -c=<conseq file> [-

e=estseq_file] > <outputfile>

• Example:./bin/iscan ./parameters/twinscan_parameters/human_twinscan.zhmm

./examples/example.fa.masked -c=./examples/example.conseq > mySequence.gtf

Running Twinscan using the runTwinscan2 script

• In summary, there are 5 steps required to run Twinscan:– Step 1: Mask target sequence with RepeatMasker– Step 2: Create informant BLAST database– Step 3: Run BLAST– Step 4: Create conservation sequence– (Step 4b: Create EST sequence)– Step 5: Run Twinscan

• These five steps are all contained in the example script runTwinscan2 (see ./bin)

• The default BLAST parameters used by runTwinscan2 are those for C.elegans (see parameters/blast_params/Celegans.blast.param).

– This can and should be changed for any other species with the -B option to the runTwinscan2 script.

• The file example.output in the /examples directory contains the output from runTwincan2 using the BLAST parameters found in the script.

Local environment seeting for runTwinscan2

• Example– ../bin/runTwinscan2 -r ../parameters/twinscan_parameters/human_twinscan.zhmm -d

output -B ../parameters/blast_params/Hsapiens.blast.param example.fa.masked informant.fa

• After running you can find output files in the newly created /output directory.

• Several programs must be installed on your system to run runTwinscan.

• You may need to change runTwinscan to point it to these programs. – my $REPEATMASKER = "RepeatMasker"; # Format for local environment– my $BLASTN = "blastn"; # Format for local environment– my $BLAT = "blat"; # Format for local environment– my $XDFORMAT = "xdformat"; # Format for local environment– my $PRESSDB = "pressdb"; # Format for local environment

i529 - lab 6 go+meme + twinscan

Documents

molecular function terms

milecular function terms

automated function prediction

cellular physiological

based function annotation

gene products

biological concepts

binding activities