Annotation-based meta-analysis of microarray experiments

Chris Stoeckert

Yale Biostatistics Seminar Series

Feb. 26, 2008



Data Integration at CBIL

http://www.cbil.upenn.edu

Databases

Knowledge Representation

Database schemas (GUS) and Data standards (MGED, OBI)

Data Modeling

Integrative tools for ortholog identification, expression analysis, chromosomal aberrations, TF regulatory networks, protein interaction networks


• Meta-analysis
– Examples illustrating information gained and problems caused by incomplete annotations

• Standards for annotating experiments
– Standards from the MGED Society and multi-community standards (e.g., OBI)

• Computing with Annotations
– Dissimilarity measures to quantitatively compare experiments and assays based on annotations
– Sample applications using dissimilarity measures

The Problem

• Analysis of microarray datasets has led to new challenges in statistics (many genes, few samples).

• Focus of the analysis has been on the genes
– Look for correlations, differences in expression
– Look for greater than expected associations in types of genes

• What can be learned from an analysis of sample characteristics and experimental parameters?
– If experiments were better annotated, what would we be able to do? What are the benefits of better annotation?
– What statistical measures and tests can be applied for this purpose?

Meta-analysis of Microarray Datasets

• Meta-analyses have been performed using microarray data from different experiments studying similar conditions to identify genes with significant signatures in those conditions.

• Generally, these analyses look for robust signals that overcome experiment-specific biases in sample types, collection, and treatment and rely on the fact that with enough experiments these effects will wash out in the noise.

• Detailed information about both the biological intent and context of a study is crucial for meaningful selection of experiments to be input into a meta-analysis. Meta-analysis is complicated by differences in experimental technologies, data post-processing, database formats, and inconsistent gene and sample annotations.

Meta-analysis example: “Creation and implications of a phenome-genome network”

Butte and Kohane. Nat Biotech. 2006

• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.

• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.

• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”

Another Example: CAMDA 2007 Dataset

• ~ 6000 arrays of diseased and normal human samples and cell lines collected from ArrayExpress and GEO.

• Meta-analysis on large scale open to many but …

• Issues relating to annotation remain. On what basis do you compare:
– “nasal_epithelium”
– “nasal_epithelium, pulmonary_disease_cystic_fibrosis”



Provided by ArrayExpress

What is MGED?

• The Microarray and Gene Expression Data Society.

• A grass roots organization started in 1999 to develop standards for the sharing and storing of microarray data

• A society with participants from academia, industry, government, and journals

• A series of meetings to showcase cutting edge work and promote standards.

The MGED Society Mission

The Microarray and Gene Expression Data (MGED) Society is an international organization of biologists, computer scientists, and data analysts that aims to facilitate the sharing of data generated using the microarray and other functional genomics technologies for a variety of applications including expression profiling. The scope of MGED includes data generated using any technology when applied to genome-scale studies of gene expression, binding, modification and other related applications. The focus is on establishing standards for data quality, management, annotation and exchange; facilitating the creation of tools that leverage these standards; working with other standards organizations and promoting the sharing of high quality, well annotated data within the life sciences and biomedical communities.

http://www.mged.org

MGED Standards

• What information is needed for a microarray experiment?
– MIAME: Minimal Information About a Microarray Experiment

• How do you “code up” microarray data?
– MAGE-OM: MicroArray Gene Expression Object Model

• What words do you use to describe a microarray experiment?
– MO: MGED Ontology

MIAME: Minimal Information About a Microarray Experiment

• Brazma et al. Nature Genetics. 2001

• Version 2.0 proposal available
– The raw data for each hybridisation
– The final processed data for the set of hybridisations in the experiment (study)
– The essential sample annotation, including experimental factors and their values
– The experiment design including sample data relationships
– Sufficient annotation of the array design
– Essential experimental and data processing protocols

[Figure: Sample → RNA extract → labelled nucleic acid → hybridisation to a microarray (array design), with a protocol attached to each step; the hybridisations of an experiment are combined by normalization and integration into a gene expression data matrix (genes).]

MIAME in a nutshell

Stoeckert et al. Drug Discovery Today TARGETS 2004

MAGE-OM: MicroArray Gene Expression Object Model

• MAGE-ML
– XML version of MAGE-OM
– Spellman et al. Genome Biology 2002
– Version 1.1
– V2.0 will be part of FuGE: Functional Genomics standard with participation from HUPO (Human Proteome Organization), the Metabolomics Society, and other communities.
  • Jones et al. Nature Biotech. 2007

• MAGE-TAB
– Tab-delimited
– Rayner et al. BMC Bioinformatics 2006
– Investigation Description Format (IDF)
– Sample and Data Relationship Format (SDRF)
– Array Design Format (ADF)

MGED Ontology

• Whetzel et al. Bioinformatics 2006

• Now in version 1.3.1

• Version 2 will be derived from OBI (Ontology for Biomedical Investigations).
– Like FuGE, OBI is a standard resource being built by multiple communities.

Ecosystem of Biomedical Standards

Many Communities: MGED PSI MSI OBO BIRN CaBIG …

Many Community Standards: MIAME, MIAPE, CIMR, GO, MAGE-ML, GelML, spML, ChEBI, MAGE-TAB, mzDataXML, PATO, MGED Ontology, PSI-MI, sepCV, NMR Ontology

Integrative Standards: FuGE MIBBI OBI


OBI – Ontology for Biomedical Investigations

Diverse background
• Omics standardization effort people (MGED, PSI, MSI)
• People ‘running’ (public) repositories, primary + secondary databases
  - Software engineers, modellers, biologists, data-miners
• People from the semantic web technology
• Vendors and manufacturers (new)

Different maturity stages
• Some need to ‘rebuild’, e.g. MGED Ontology (microarray)
• Some are starting now, e.g. MSI (metabolomics), EnvO (environment)

Plurality of (prospective) usage
• Driving data entry and annotation
  - Indexing of experimental data, minimal information lists, x-db queries
• Applying it to text-mining
  - Benchmarking, enrichment, annotation
• Encoding facts from literature
  - Building knowledge bases relying on RDF triple stores

http://obi.sf.net

OBI – Communities and Structure

1. Coordination Committee (CC): Representatives of the communities -> Monthly conferences

2. Developers WG: CC and other communities’ members -> Weekly conference calls

3. Advisors:

-> National Center for Biomedical Ontologies


Sending terms to other OBI branches or external resources, e.g.

OBI – Main Activities and Timelines

Continue branch activities (iterative process)
• Branches editors working on submitted terms
  - Normalize terms, add metadata tags (e.g. definition and source)
  - Bin terms into the relevant top level classes and identify relationships
  - Sort terms by relevance to one or other branch, or to other ontologies

First evaluation of OBI draft + Release OBI 0.1 (Feb 08)
• Review branches and merge with the trunk into a core
• Apply use cases and competency questions
  - Evaluate how the ontology performs, also what is missing, what is wrong

5th and 6th face-2-face meeting for Coordinators and Developers
• BBCCRC, Vancouver, Canada (Jan/Feb 08), self funded + MGED sponsor
• EBI, Cambridge, UK (Summer 08), BBSRC funds









Elisabetta Manduchi, Junmin Liu

Potential General Applications

• Identification of experiments/assays of interest from large DBs

• Organization of web resources

• QC (annotation and ontology development)

• Meta-analysis assessments

• Guidance in data QC and pre-processing

Computing with Annotations

• Ideal situation: complete, accurate, consistent annotations

• Example of application (from CAMDA/AE use case):
– Assess assay quality using NUSE (normalized unscaled standard errors)
– Need to work with appropriate groups of assays

• Flow: Annotated Assays → Compute Dissimilarities → Cluster → Applications

• Actual situation:
– incomplete annotation (missing values)
– heterogeneous granularity
– variation in ontologies used for a given annotation field

• Flow: Annotated Assays → Compute Dissimilarities → Cluster → QC → Refine Annotation → Applications

Test Case: EPConDB

• 24 published EPConDB experiments manually classified into 5 classes

• Preliminary study computing with experiment annotation
– intent
– context

• Explored dissimilarity measures between experiments
– Identified 5 annotation components, which could be weighted as desired
– For each component, defined component-wise dissimilarity
– Took weighted average of component-wise dissimilarities

Pancreatic Growth after Partial Pancreatectomy and Exendin-4 Treatment (De Leon et al., 2006).

• Manually classified under “Pancreas development and growth”

• Experiment Design Types: MO terms providing a high level description for the experiment. For this experiment, these are:
– MethodologicalDesign.time_series_design,
– PerturbationalDesign.compound_treatment_design,
– PerturbationalDesign.stimulus_or_stress_design

• Experimental Factor Types: MO terms describing the type of factors under test in the experiment. In our example these are:
– ComplexAction.compound_based_treatment (samples were treated with either Exendin-4 or Vehicle, or nothing),
– ComplexAction.specified_biomaterial_action (some samples had a pancreatectomy, others did not, others had a sham operation),
– ComplexAction.timepoint (samples treated with a compound were treated for different numbers of hours).

• Organisms: the organisms to which the biosources used in the various assays belong. In our example this was just “Mus musculus”.

• Other Biomaterial Characteristics of these biosources. In our example:
– Age.birth (initial time point for computing the age), Age.weeks (unit of measure for age);
– DevelopmentalStage.Theiler Stage 28,
– OrganismPart.pancreas,
– Sex.male,
– StrainOrLine.BALB/c.

• Treatment Types: the types of treatments applied to the original biosources which led to the final labeled extracts hybridized to the array. In our example, these were:
– ComplexAction.specified_biomaterial_action,
– ComplexAction.compound_based_treatment,
– ComplexAction.split,
– ComplexAction.nucleic_acid_extraction,
– ComplexAction.labeling.

Component-wise experiment dissimilarity

• For each of the 5 annotation components and for each pair of experiments
– We have two sets of terms, one per experiment: say A and B
– Define the component-wise dissimilarity between these two experiments using the Jaccard or the Kulczynski distance
– Choose one or the other according to how you want to weigh containments

• With these distances, iteratively (leave one out) classified each experiment based on smallest distance to other experiments.

Jaccard: $1 - \dfrac{|A \cap B|}{|A \cup B|}$

Kulczynski: $1 - \dfrac{1}{2}\left(\dfrac{|A \cap B|}{|A|} + \dfrac{|A \cap B|}{|B|}\right)$
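The two set distances, the weighted average over annotation components, and the leave-one-out classification described above can be sketched in Python as follows. This is a minimal illustration, not the code used for the EPConDB study: the function names, the handling of empty term sets, the tie-breaking rule, and the four toy experiments are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set

TermSet = Set[str]
Distance = Callable[[TermSet, TermSet], float]


def jaccard(a: TermSet, b: TermSet) -> float:
    """1 - |A n B| / |A u B|. Two empty sets count as identical (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def kulczynski(a: TermSet, b: TermSet) -> float:
    """1 - 0.5 * (|A n B| / |A| + |A n B| / |B|)."""
    if not a and not b:
        return 0.0  # assumption: both empty means no difference
    if not a or not b:
        return 1.0  # assumption: one empty set is maximally distant
    shared = len(a & b)
    return 1.0 - 0.5 * (shared / len(a) + shared / len(b))


def experiment_dissimilarity(x: Sequence[TermSet], y: Sequence[TermSet],
                             weights: Sequence[float], distance: Distance) -> float:
    """Weighted average of the component-wise dissimilarities."""
    total = sum(weights)
    return sum(w * distance(a, b) for w, a, b in zip(weights, x, y)) / total


def leave_one_out(experiments: List[Sequence[TermSet]], labels: List[str],
                  weights: Sequence[float], distance: Distance) -> List[str]:
    """Give each experiment the label of its nearest other experiment.

    Ties go to the first nearest neighbour here (a simplification).
    """
    predicted = []
    for i, x in enumerate(experiments):
        others = [j for j in range(len(experiments)) if j != i]
        nearest = min(others, key=lambda j: experiment_dissimilarity(
            x, experiments[j], weights, distance))
        predicted.append(labels[nearest])
    return predicted


# Toy data with two components per experiment: design types and organism.
experiments = [
    [{"time_series_design", "compound_treatment_design"}, {"Mus musculus"}],
    [{"time_series_design"}, {"Mus musculus"}],
    [{"stimulus_or_stress_design"}, {"Mus musculus"}],
    [{"stimulus_or_stress_design", "compound_treatment_design"}, {"Mus musculus"}],
]
labels = ["class_1", "class_1", "class_2", "class_2"]  # hypothetical classes

# Containment is penalized less by Kulczynski than by Jaccard.
print(jaccard({"a", "b"}, {"a", "b", "c", "d"}))      # 0.5
print(kulczynski({"a", "b"}, {"a", "b", "c", "d"}))   # 0.25
print(leave_one_out(experiments, labels, [1, 0], kulczynski))
```

When one term set is contained in the other, Kulczynski yields the smaller distance, which is the sense in which the choice between the two measures controls how containments are weighed.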

Automated vs. Manual Classification of EPConDB Studies

Distance     Weights     % Correct   % Ambiguous   % Incorrect
either       1,1,1,1,1   62.5        0             37.5
either       0,1,1,1,1   45.83       0             54.17
either       1,0,1,1,1   50          0             50
Kulczynski   1,1,0,1,1   75          0             25
Jaccard      1,1,0,1,1   66.67       0             33.33
either       1,1,1,0,1   62.5        0             37.5
either       1,1,1,1,0   58.33       0             41.67
Kulczynski   1,1,0,0,1   75          0             25
Jaccard      1,1,0,0,1   70.83       0             29.17

• 5 Annotations used (in weight order)
– Experiment design
– Experimental factors
– Organism
– Other Biomaterial Characteristics
– Treatment

• Manual Classifications
– Pancreas development and growth
– Differentiation of insulin-producing cells
– Islet/beta-cell stimulation/injury
– Tissue surveys
– Targets and roles of transcriptional regulators

• Result: achieved ~ 75% correct classification using all but organism. Note this reflects manual classification based on intent; other classifications might have different optimal weights. Need to retry with more studies (~75 now).

Test Case: RAD

• Clustering of 62 public experiments from RAD
– No predefined classifications

• Tried PAM and k from 5 to 20

• Use silhouettes to determine optimal clusters
– Manual assessment
  • QC value

Silhouettes

• Rousseeuw P.J. (1987), J. Comput. Appl. Math., 20, 53–65

• For each study i:
– a(i) = average dissimilarity between i and all other studies of the cluster to which i belongs
– d(i,C) = average dissimilarity of i to all studies in cluster C, for each cluster C other than i's own
– b(i) = min_C d(i,C): dissimilarity between i and its “neighbor” cluster
– s(i) = ( b(i) - a(i) ) / max( a(i), b(i) )
– If i is in a singleton cluster, then s(i) = 0.

• large s(i) (almost 1): very well clustered

• small s(i) (around 0): the experiment lies between two clusters

• negative s(i): probably placed in the wrong cluster.
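The silhouette definitions above translate directly into code that works on a precomputed dissimilarity matrix. The following Python sketch assumes a symmetric matrix given as nested lists and at least two clusters; the function name and the toy matrix are illustrative only.

```python
from typing import Dict, List, Sequence


def silhouettes(diss: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Silhouette value s(i) for every item, given pairwise dissimilarities.

    Assumes at least two clusters are present in `labels`.
    """
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # singleton cluster: s(i) = 0 by definition
            continue
        # a(i): average dissimilarity to the other members of i's cluster
        a = sum(diss[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster
        b = min(sum(diss[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        largest = max(a, b)
        values.append((b - a) / largest if largest > 0 else 0.0)
    return values


# Toy matrix: items 0 and 1 are close, items 2 and 3 are close.
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
good = silhouettes(D, [0, 0, 1, 1])   # all values well above 0
bad = silhouettes(D, [0, 1, 0, 1])    # item 0 gets a negative value
print(sum(good) / len(good), bad[0])
```

Averaging s(i) over all items gives the single score used to compare clusterings with different k or different annotation weights.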

Unsupervised classification of microarray experiments in RAD



• Best silhouette ave. value was 0.36 with Kulczynski with weights 1,1,0,0,1 and PAM with k=8 or weights 1,1,1,0,1 and PAM with k=14

• Singleton and odd clusters revealed misannotated studies (QC)

• Not optimized, but gives us a sense of whether there is sufficient signal in the annotations (at least in our database) to usefully organize experiments

CAMDA 2007 Dataset = E-TABM-185

• ~ 6000 arrays of diseased and normal human samples and cell lines all on Affymetrix HG-U133A collected from ArrayExpress and GEO.

• Available at ArrayExpress as E-TABM-185
– http://www.ebi.ac.uk/microarray-as/aer/?#ae-browse/q=E-TABM-185[2]

• Real use case for identifying quality issues (R. Irizarry) that required appropriate groups to distinguish biological from technical factors



Provided by ArrayExpress

Partial view of Annotations from E-TABM-185.sdrf

• Snippet from MAGE-TAB (courtesy Helen Parkinson, EBI)

• Ten distinct annotations to choose from

• Drawn from multiple studies so not annotated by the same person

• Many missing values

What gains would organizing E-TABM-185 provide?

• Compare studies at the assay level (individual samples). How should we define dissimilarity measures between assays?

• Improve power. Group related assays based on all relevant annotations - not just on one or two.

• Make relevant comparisons. The way a sample is processed can affect expression as much as what tissue it came from so grouping on one or two annotations can add variability if chosen poorly.

• Interpret clusters. Just as overenrichment of GO terms can help interpret clusters of genes, overenrichment of specific annotations may help interpret biclusters

Dissimilarity between Assays

• First need to select which annotation fields are of interest

• Typically these are all “context” fields as “intent” refers to an experiment as a whole

• Original approach applied to assays:
– Choose annotation fields of interest: e.g., organism part, disease, etc.
– Pull them together into one annotation set
– Compute dissimilarities based on the overlap of the annotation sets (Kulczynski or Jaccard).

Issues with original approach

Example: OrganismPart and DiseaseState:

• Suppose A1 and A2 have the following annotations:
A1: nasal_epithelium, --
A2: nasal_epithelium, --
and A3 and A4 have the following annotations:
A3: nasal_epithelium, pulmonary_disease_cystic_fibrosis
A4: nasal_epithelium, pulmonary_disease_cystic_fibrosis

• These 2 pairs have the same Jaccard and Kulczynski distances, equal to 0.

• Shouldn’t the 2nd pair be considered “closer” since we have more info indicating that?

Try again

• Penalize missing values
– In principle we might want to penalize missing values due to incomplete annotation differently from those due to not applicable annotation…

• Group annotation fields when appropriate. For our use case of comparing Affy probes, we want to see where they differ. This will be due to cross-hybridization and degradation that are dependent on:
– CellType, OrganismPart, CellLine
– DiseaseState, DiseaseStage

• Weigh groups and annotation fields within a group

Dissimilarity revised

• Base it on Hamming distance idea
– Number of annotation fields where the annotations differ

• Layer it by groups and add weights
– Provide a configuration file, e.g.

3 {3:CellType | 2:OrganismPart | 1:CellLine}

1 {3:DiseaseState | 1:BioSourceType | 2:DiseaseStage}

(The number before each brace is the group weight w_i; the number before each field name is the subweight s_i,j.)

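Given assays (each semicolon-separated block is one annotation group):

$A = (a_{1,1}, a_{1,2}, \dots, a_{1,n_1};\ a_{2,1}, a_{2,2}, \dots, a_{2,n_2};\ \dots;\ a_{m,1}, a_{m,2}, \dots, a_{m,n_m})$

$B = (b_{1,1}, b_{1,2}, \dots, b_{1,n_1};\ b_{2,1}, b_{2,2}, \dots, b_{2,n_2};\ \dots;\ b_{m,1}, b_{m,2}, \dots, b_{m,n_m})$

weights: $w_1, w_2, \dots, w_m$

and subweights: $(s_{1,1}, s_{1,2}, \dots, s_{1,n_1};\ s_{2,1}, s_{2,2}, \dots, s_{2,n_2};\ \dots;\ s_{m,1}, s_{m,2}, \dots, s_{m,n_m})$

Define:

$$\mathrm{diss}(A,B) = \frac{1}{W} \sum_{i=1}^{m} \frac{w_i}{S_i} \sum_{j=1}^{n_i} s_{i,j}\, I_{i,j}$$

where

$I_{i,j}$ is 1 if either one of $a_{i,j}$ or $b_{i,j}$ is missing or $a_{i,j} \neq b_{i,j}$, and $I_{i,j}$ is 0 otherwise;

$W$ is the sum of the $w_i$'s;

$S_i$ is the sum of the $s_{i,j}$'s.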

Assay Dissimilarity

• Example with:
– A={-|nasal_epithelium|-}{-|-|-}
– B={-|nasal_epithelium|-}{pulmonary_disease_cystic_fibrosis|-|-}

• Then for:

w1=3 {s1,1=3:CellType | s1,2=2:OrganismPart | s1,3=1:CellLine}

w2=1 {s2,1=3:DiseaseState | s2,2=1:BioSourceType | s2,3=2:DiseaseStage}

we have

$$\mathrm{diss}(A,B) = \frac{3}{4}\left(\frac{3}{6}\cdot 1 + \frac{2}{6}\cdot 0 + \frac{1}{6}\cdot 1\right) + \frac{1}{4}\left(\frac{3}{6}\cdot 1 + \frac{1}{6}\cdot 1 + \frac{2}{6}\cdot 1\right) = \frac{3}{4}$$

Self-dissimilarity

• Note that, in the presence of missing values, the dissimilarity of an assay to itself will be non-zero with this definition

• E.g. with

w1=3 {s1,1=3:CellType | s1,2=2:OrganismPart | s1,3=1:CellLine}

w2=1 {s2,1=3:DiseaseState | s2,2=1:BioSourceType | s2,3=2:DiseaseStage}

and A={-|nasal_epithelium|-}{-|-|-}, we have

$$\mathrm{diss}(A,A) = \frac{3}{4}\left(\frac{3}{6}\cdot 1 + \frac{2}{6}\cdot 0 + \frac{1}{6}\cdot 1\right) + \frac{1}{4}\left(\frac{3}{6}\cdot 1 + \frac{1}{6}\cdot 1 + \frac{2}{6}\cdot 1\right) = \frac{3}{4}$$
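The revised dissimilarity and both worked examples can be checked with a short Python sketch. It assumes each assay is a dictionary from field name to value, with missing annotations simply absent (or set to None), and it hard-codes the two-group configuration shown above rather than parsing a configuration file; the names `CONFIG` and `diss` are illustrative.

```python
from fractions import Fraction
from typing import Dict, List, Optional, Tuple

# One group = (group weight w_i, [(subweight s_ij, field name), ...]).
Group = Tuple[int, List[Tuple[int, str]]]
Assay = Dict[str, Optional[str]]

CONFIG: List[Group] = [
    (3, [(3, "CellType"), (2, "OrganismPart"), (1, "CellLine")]),
    (1, [(3, "DiseaseState"), (1, "BioSourceType"), (2, "DiseaseStage")]),
]


def diss(a: Assay, b: Assay, config: List[Group] = CONFIG) -> Fraction:
    """Grouped, weighted Hamming-style dissimilarity with a missing-value penalty."""
    total_w = sum(w for w, _ in config)
    result = Fraction(0)
    for w, fields in config:
        total_s = sum(s for s, _ in fields)
        # I_ij = 1 when either value is missing or the two values differ.
        penalty = sum(
            s for s, name in fields
            if a.get(name) is None or b.get(name) is None or a[name] != b[name]
        )
        result += Fraction(w, total_w) * Fraction(penalty, total_s)
    return result


A = {"OrganismPart": "nasal_epithelium"}
B = {"OrganismPart": "nasal_epithelium",
     "DiseaseState": "pulmonary_disease_cystic_fibrosis"}

print(diss(A, B))  # 3/4
print(diss(A, A))  # 3/4: non-zero self-dissimilarity due to missing values
print(diss(B, B))  # 5/8
```

Two assays annotated like B are closer to each other (5/8) than two assays annotated like A (3/4), so the extra shared annotation now counts, which the set-overlap distances could not express.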

Hierarchical Clustering of Assays

• Use clustering to evaluate utility of measures
– Are we gaining anything?

• Clustered with the PHYLIP neighbor software and the UPGMA method (agglomerative, average linkage method)
– Note that our starting point is NOT a gene-expression dataset but a dissimilarity matrix, which limits the choice of tools
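For readers unfamiliar with UPGMA, the following Python sketch shows average-linkage agglomerative clustering run directly on a dissimilarity matrix and stopped when a requested number of clusters remains. It is not the PHYLIP neighbor program used for the analysis; the function name is hypothetical, and stopping at n clusters is one assumed way of cutting the tree. The naive search is only suitable for small inputs.

```python
from typing import List, Sequence


def upgma_clusters(diss: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    if not 1 <= n_clusters <= len(diss):
        raise ValueError("n_clusters must be between 1 and the number of items")
    clusters = [[i] for i in range(len(diss))]

    def average_link(c1: List[int], c2: List[int]) -> float:
        # UPGMA: mean of all pairwise dissimilarities between the two clusters.
        return sum(diss[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


# Toy matrix: items 0 and 1 are close, items 2 and 3 are close.
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(upgma_clusters(D, 2))  # [[0, 1], [2, 3]]
```

Because only the matrix is needed, the same routine works whether the dissimilarities come from set overlap or from the grouped, weighted measure.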

Clustering the E-TABM-185 Assays Based on Annotations

• We had 2 annotation files:
– The original one
– One with higher-level OrganismPart terms (manually curated by Helen Parkinson @ EBI)

• For each, we built a dissimilarity matrix and then a tree

• Cut clusters and ran silhouettes to partition and evaluate

• For each, we generated clusterings with n varying from 100 to 600 in steps of 10

Forester ATV (http://www.phylosoft.org/atv/)

How did we do?

• Try to use some ‘gold-standard’

{-|human_universal_reference|-}{-|frozen_sample|-}

– select the smallest n where these (24) assays constitute all the assays in a single cluster
– n=220 (original annotation) and n=140 (manually curated annotation), respectively

• Use silhouette measure to pick the best
– Original annotation: n=260, s=0.22
– Manually curated annotation: n=150, s=0.21

• Conclusion: we were able to automatically partition the assays in a meaningful way but need to improve

Issues: synonyms

• The current dissimilarity treats synonymous terms as different, e.g.
– “frontal cortex” and “frontal lobe”
– “malignant neoplasm” and “cancer”

• Improvement ideas:
– Map terms to a thesaurus, e.g. NCI Metathesaurus (same spirit as Butte and Kohane with UMLS but do this in a directed and automated fashion)

Issues: term hierarchies

• The current dissimilarity considers related terms as different as unrelated terms, e.g.
– “muscle” and “skeletal muscle”

• Improvement ideas:
– Text mining: hard
  • “left arm” is more different from “left lobe” than “muscle” is from “skeletal muscle”
– Ontologies are richer than controlled vocabularies: use the relationships encoded in them to define a distance between terms
– Incorporate ontology distance between terms into the dissimilarity measure (i.e. expand possible values for Ii,j)

Issues: missing annotation

• Lots of missing values for the fields

• Improvement ideas:
– Minimize incomplete fields
  • Manual curation: not a scalable solution
  • Text mining: hard
– Optimize cfg file by picking and weighing annotation fields according to % of missing values
– Maybe also weigh according to entropy?
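As a starting point for the last two ideas, the fraction of missing values and the Shannon entropy of each annotation field can be computed as in the Python sketch below. How these statistics should be turned into weights is left open, as in the slides; the function name, the dictionary representation of assays, and the toy values are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Assay = Dict[str, Optional[str]]


def field_stats(assays: List[Assay], fields: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {field: (fraction missing, entropy in bits of observed values)}."""
    stats = {}
    for name in fields:
        observed = [a[name] for a in assays if a.get(name) is not None]
        missing = 1.0 - len(observed) / len(assays)
        entropy = 0.0
        for count in Counter(observed).values():
            p = count / len(observed)
            entropy -= p * math.log2(p)
        stats[name] = (missing, entropy)
    return stats


# Toy assays; the annotation values are made up for the example.
assays = [
    {"OrganismPart": "lung", "DiseaseState": "cancer"},
    {"OrganismPart": "lung"},
    {"OrganismPart": "brain"},
    {},
]
stats = field_stats(assays, ["OrganismPart", "DiseaseState"])
for name, (missing, entropy) in stats.items():
    print(f"{name}: {missing:.0%} missing, entropy {entropy:.3f} bits")
```

A field that is mostly missing, or that takes a single value wherever it is present (entropy 0), carries little information for separating assays and would be a candidate for a low weight.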

Computing with Annotations Future steps

• Classification and clustering of studies
– Take advantage of having 75 studies instead of 24 classified for EPConDB.

• Need to address issues of synonyms, hierarchies, missing annotations.
– Repeat analysis of E-TABM-185

• Generate software modules

Framework for Computing with Annotations



Summary

• Inconsistent and missing annotations are an obstacle to meta-analysis

• Standards are in place to guide and encourage appropriate annotations with terms drawn from ontologies

• Initial steps to automate usage of annotations for clustering studies and assays are promising ….

• But much more needs to be done (and suggestions are welcome!)

www.mgedmeeting.org

Keynote speakers: Naama Barkai, Weizmann Institute of Science; Ewan Birney, EMBL-EBI; Joe Gray, UCSF/Lawrence Berkeley Nat'l Laboratory; Steve Oliver, Cambridge University
