introduction to the functional annotation · 2010. 6. 24. · • annotation contained in the go...

Lecture 4 – Introduction to Functional Annotation José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute of Bioinformatics – Johannes Kepler University June 2010

Lecture 4 – Introduction to Functional Annotation

José Luis Mosquera

Computational Lab on Microarrays Data Analysis
Special Topics in Computer Science

Institute of Bioinformatics – Johannes Kepler University
June 2010

Outline

1. Introduction
   1. Biological significance
   2. The Gene Ontology

2. Methods
   1. Some approaches to find biological meaning
   2. Hypergeometric and related approaches

3. Tools
   1. Evolution of the GO Tools
   2. SerbGO: Searching for the best GO Tool
   3. FatiGO

Biological Significance (1/5)

• With the advent of genomic technologies it has become possible to perform high-throughput biological experiments routinely.

• This has highlighted different challenges:

1. The experiment itself
2. Statistical analysis of the results
3. Biological interpretation

• These experiments often yield lists of identifiers (genes, peptides,...) which are selected using some specific criteria to assign them statistical significance.

High throughput experiments

Biological Significance (2/5)

• Sometimes either

1. the number of items selected as statistically significant is very high, or
2. the items do not show any statistical significance.

• Whatever the reason, they are expected to “mean something” biologically.

High throughput experiments

Synthesis

What the list means from the biological point of view.

Biological Significance (3/5)

• The usual (reasonable?) way to proceed is to shift the focus from “statistical” to “biological” significance.

• Whereas there is clear agreement about what statistical significance means...

• There is no consensus definition of biological significance,

• Although everyone talks about it…

If biological significance is the answer, what was the question?

Biological Significance (4/5)

• Interestingly, biological significance is often recast in terms of statistical significance

Biological significance means Statistical significance

R. Díaz-Uriarte, CAMDA 2002

...to understand the biological relevance of statistical differences in gene expression data... by examining significant differences in the distribution of (GO) terms related to biological processes or molecular function.

Biological Significance (5/5)

• Although it is not necessarily so

Biological significance does not mean Statistical significance

GeneSifter website

... to characterize the biology involved in a particular experiment, and to identify particular genes of interest... combining the identification of broad biological themes with the ability to focus on a particular gene...

The Gene Ontology (1/12)

Let it be clear what is…

The Challenge

How can we attribute a biological interpretation to large lists of genes (identifiers)?

The Gene Ontology (2/12)

• Look for existing annotations in databases that help relate the selected genes to biological knowledge.

• Bioinformatic resources often store data in a scientific natural language.

Rationale

Drawback

Annotation in this way is human readable and understandable, but it is difficult to interpret computationally.

The Gene Ontology (3/12)

What’s a cell?

• The same name can be used to describe different concepts.

• A concept can be described using different names.

• Comparison is difficult, especially across species or databases

The Gene Ontology (4/12)

• The most important thing you want to know is what the gene products are concerned with, i.e. their function.

• The best functional annotation systems use human beings who read the literature before assigning a function to a gene.

Functional Annotation

Some difficulties

• Different people use different words for the same function.
• They mean different things by the same word.
• The context in which a gene was found may not be associated with its function.

• Inference of a function from sequence alone is error-prone and sometimes unreliable.

The Gene Ontology (5/12)

Functional Annotation

What can we do?

Ontologies are useful annotation systems that attempt to overcome some of these drawbacks.

The Gene Ontology (6/12)

What’s an ontology?

Definition

An ontology is an entity that provides a set of vocabulary terms covering a conceptual domain.

• These terms must
1. have an exhaustive and rigorous definition
2. be placed within a (hierarchical data) structure of relationships.

• The terms may be linked with two kinds of relationships
1. “is-a” between parent and child.
2. “part-of” between part and the whole.

• They may have one or more
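The structure described above can be sketched as a small directed graph. This is only an illustration: the term names are hypothetical toy terms rather than real GO identifiers, and the helper names `build_parents` and `ancestors` are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy terms; these are NOT real GO identifiers.
# Each edge is (child, relationship, parent).
EDGES = [
    ("nucleus", "is_a", "organelle"),
    ("nucleus", "part_of", "cell"),
    ("organelle", "part_of", "cell"),
    ("nucleolus", "part_of", "nucleus"),
]

def build_parents(edges):
    """Map each term to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents

def ancestors(term, parents):
    """Return every term reachable by following is_a / part_of links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

PARENTS = build_parents(EDGES)
print(sorted(ancestors("nucleolus", PARENTS)))
```

In this toy graph the term "nucleus" has two parents, one through each kind of relationship, so the structure is a graph rather than a strict tree.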

The Gene Ontology (7/12)

What’s an ontology?

But… What about the biological field?

A powerful ontology to perform biological interpretation of “our” experiments is the Gene Ontology (usually named GO)

The Gene Ontology (8/12)

• The GO project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.

• It is developed and maintained by the Gene Ontology Consortium.

• The GO is organized around three basic ontologies

GO ontologies

Ontology                   Number of terms (1)
Molecular Function (MF)    7220
Biological Process (BP)    9529
Cellular Component (CC)    1536
Total GO terms             18235

(1) May 2005

[Diagram: Gene Ontology, branching into Molecular Function, Biological Process and Cellular Component]

The Gene Ontology (9/12)

GO graph

The Gene Ontology (10/12)

• The annotation contained in the GO database consists of two essential parts:

1. The ontologies, which provide a structured vocabulary.
2. The annotations, which link gene products to the associated terms that define their function.

• The GO database assigns annotations in a species-independent way.

• Most major databases have cross-references to the GO database.

GO database

The Gene Ontology (11/12)

A given gene product may

• represent one or more molecular functions,

• be used in one or more biological processes and

• appear in one or more cellular components.

GO database

The Gene Ontology (12/12)

GO Annotations

Evidence Codes

IEA Inferred from Electronic Annotation

ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non-traceable Author Statement

IC Inferred by Curator

ND No biological Data available
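As a rough sketch of how evidence codes can be used in practice, the snippet below stores the codes listed above and filters a list of annotations by code. The gene names, term names and the helpers `filter_by_evidence` and `terms_by_ontology` are hypothetical. Excluding IEA is shown only as one possible choice (these annotations are assigned electronically), not as something the slide prescribes.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Hypothetical annotations: (gene product, ontology, term, evidence code).
ANNOTATIONS = [
    ("geneA", "MF", "term_1", "IDA"),
    ("geneA", "BP", "term_2", "IEA"),
    ("geneA", "CC", "term_3", "TAS"),
    ("geneB", "BP", "term_2", "ISS"),
    ("geneB", "BP", "term_4", "IEA"),
]

def filter_by_evidence(annotations, excluded=("IEA",)):
    """Keep only annotations whose evidence code is not in `excluded`."""
    return [a for a in annotations if a[3] not in excluded]

def terms_by_ontology(annotations, gene):
    """Group the terms annotated to one gene product by ontology (MF, BP, CC)."""
    grouped = {"MF": [], "BP": [], "CC": []}
    for product, ontology, term, _code in annotations:
        if product == gene:
            grouped[ontology].append(term)
    return grouped

print(filter_by_evidence(ANNOTATIONS))
print(terms_by_ontology(ANNOTATIONS, "geneA"))
```

The grouping helper mirrors the earlier point that a single gene product may carry terms from all three ontologies.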

Some Approaches To Find Biological Meaning

• Annotate results using appropriate biological databases

• Rely on some form of grouping methods

1. Gene Set Enrichment: Hypergeometric tests, Fisher's Exact, GSEA,...
2. Holistic Approaches: Category, globaltest, GlobalAncova,...
3. Minimal Acceptance Strength
4. ...

• Take a more global approach, relying on some type of

1. Graph-theoretic analysis, or
2. Pathway analysis.

• Or the most (up-to-date) global approach: Systems biology

Quick overview

Gene Set Enrichment (1)

Consider the following setting

• N genes on a microarray

• M of them belong to a given GO term category (A),

• N − M do not belong to it (category A^c)

• K of these N genes are selected and assigned to a given class (e.g. regulated genes)

• x of these K genes will be in A

Hypergeometric test

Statistical Hypothesis

H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes.
H1: GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray.

Example

Gene Set Enrichment (2)

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modeled by a hypergeometric distribution with parameters (N, M, K).

Hypergeometric distribution

Question

Assuming sampling without replacement, what is the probability of having exactly x genes of category A?
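The formula does not appear in the text, so as a reminder, the standard hypergeometric probability mass function is given here in the notation of the previous slide (N genes on the array, M in category A, K selected, x selected and in A):

```latex
P(X = x) \;=\; \frac{\dbinom{M}{x}\,\dbinom{N-M}{K-x}}{\dbinom{N}{K}},
\qquad \max\bigl(0,\; K-(N-M)\bigr) \;\le\; x \;\le\; \min(K,\, M)
```

The numerator counts the ways to choose x genes from A and the remaining K − x genes from A^c; the denominator counts all ways to choose K genes from the N on the array.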

Gene Set Enrichment (3)

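• The preceding model allows a user to compute a p-value, the upper tail p = P(X ≥ x), for the test in which rejecting the null hypothesis corresponds to deciding that the category being tested is over-represented.

• To test for under-representation we would use the opposite tail, P(X ≤ x); this is approximately, but not exactly, 1 − p, because both tails include the term P(X = x).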
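A minimal sketch of both tail probabilities, using only the Python standard library and the numbers from the Appendix example (N = 10000, M = 500, K = 200, x = 25). The function names are chosen for this illustration and do not come from any of the tools discussed.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in category A."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def p_over(x, N, M, K):
    """Upper tail P(X >= x), used for over-representation."""
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, min(K, M) + 1))

def p_under(x, N, M, K):
    """Lower tail P(X <= x), used for under-representation."""
    lowest = max(0, K - (N - M))
    return sum(hypergeom_pmf(i, N, M, K) for i in range(lowest, x + 1))

# Values from the Appendix example.
N, M, K, x = 10000, 500, 200, 25

expected = K * M / N  # number of category-A genes expected by chance
print("expected by chance:", expected)
print("P(X >= x):", p_over(x, N, M, K))
print("P(X <= x):", p_under(x, N, M, K))
```

With these numbers about 10 genes of category A would be expected by chance among the 200 selected, so observing 25 gives a very small upper-tail probability.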

Hypergeometric distribution

Gene Set Enrichment (5)

Hypergeometric distribution

Some considerations…

• Different programs use slightly different approaches, most of which are equivalent in some sense

1. Fisher's exact test
2. Chi-squared test
3. Binomial test
4. …

• Because dozens or more tests may be run simultaneously, some form of multiple testing adjustment is necessary, and most programs apply one.
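The slide does not say which adjustment to use. As one common choice, the sketch below implements the Benjamini-Hochberg false discovery rate adjustment in plain Python; the p-values are made up for the example and the function name is an assumption of this illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up p-values, one per GO category tested.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Other adjustments (for example Bonferroni) are stricter; which one is appropriate depends on how many false positives the analysis can tolerate.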

Gene Set Enrichment (6)

• Proposed by Subramanian et al. (2005)

STEP 1

● Compute a gene-wise measure (e.g. absolute t-statistics)
● Rank genes according to this measure

STEP 2

● Assign label A to genes belonging to a gene group of interest and B to all the other genes

● If group A is enriched with interesting genes, many of its genes will have high ranks and we will observe a separation in the ordered list

A B A A B A A A B A B B B A B B B B A B B B

Gene Set Enrichment Analysis (GSEA)

• Assign score n_B to all genes A and −n_A to all genes B (n_A and n_B being the number of genes labelled A and B)

• Draw the cumulative sum of these scores

• Is the maximum M of the cumulative sum unusually high? (Kolmogorov-Smirnov test)
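The running-sum idea can be sketched with the label sequence shown on the previous slide. This follows the simplified unweighted scoring described here, not the full weighted statistic of the published GSEA method. The significance step uses a label permutation as a stand-in for the Kolmogorov-Smirnov test mentioned above; that substitution, the function names and the number of permutations are assumptions of this example.

```python
import random

# Ranked list of labels from the slide (best-ranked gene first).
LABELS = "A B A A B A A A B A B B B A B B B B A B B B".split()

def running_sum(labels):
    """Cumulative sum with +n_B for each A gene and -n_A for each B gene."""
    n_a = labels.count("A")
    n_b = labels.count("B")
    total, path = 0, []
    for label in labels:
        total += n_b if label == "A" else -n_a
        path.append(total)
    return path

def max_statistic(labels):
    """Maximum M of the cumulative sum."""
    return max(running_sum(labels))

def permutation_p_value(labels, n_perm=1000, seed=0):
    """Share of random label orderings whose maximum is at least the observed one."""
    rng = random.Random(seed)
    observed = max_statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

print(running_sum(LABELS))
print("maximum M:", max_statistic(LABELS))
print("permutation p-value:", permutation_p_value(LABELS))
```

Because the scores are +n_B and −n_A, the cumulative sum always returns to zero at the end of the list, so a large maximum reflects A genes concentrated near the top of the ranking.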

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment (7)

Evolution of the GO Tools

• In recent years many similar tools to analyze biological significance using the GO have been published and made available.

• Draghici (2005) reviews 15 of them.

• Huang (2008) classifies 68 tools.

From the initial gap to the crowd

SerbGO: Searching for the best GO Tool (1/4)

• There are many tools to do not-so-many things.

• It is a bidirectional application. The user can...

1. ask for some features to get the appropriate tools for their interests

2. compare tools to check which capabilities are implemented in each one.

The project…

SerbGO Tool

It is intended to help users determine which microarray tools for gene expression analysis that make use of the GO ontologies are best suited to their projects.

SerbGO: Searching for the best GO Tool (2/4)

http://estbioinfo.stat.ub.es/apli/serbgo

SerbGO: Searching for the best GO Tool (3/4)

Which tools perform what tasks?

Many functionalities are available

Check your options in the form and move forward

SerbGO: Searching for the best GO Tool (4/4)

Comparing GO tools by their capabilities

Tick the tools that you want to compare

Appendix

Example

N = 10000 genes on the microarray

M = 500 genes in the GO category A

N − M = 9500 genes belong to A^c

K = 200 genes differentially expressed

x = 25 of these K genes are in A
