Annotation-based meta-analysis of microarray experiments
Chris Stoeckert
Yale Biostatistics Seminar Series
Feb. 26, 2008
Data Integration at CBIL
http://www.cbil.upenn.edu
Databases
Knowledge Representation
Database schemas (GUS) and Data standards (MGED, OBI)
Data Modeling
Integrative tools for ortholog identification, expression analysis, chromosomal aberrations, TF regulatory networks, protein interaction networks
Annotation-based meta-analysis of microarray experiments
• Meta-analysis
  – Examples illustrating information gained and problems caused by incomplete annotations
• Standards for annotating experiments
  – Standards from the MGED Society and multi-community standards (e.g., OBI)
• Computing with Annotations
  – Dissimilarity measures to quantitatively compare experiments and assays based on annotations
  – Sample applications using dissimilarity measures
The Problem
• Analysis of microarray datasets has led to new challenges in statistics (many genes, few samples).
• Focus of the analysis has been on the genes
  – Look for correlations, differences in expression
  – Look for greater than expected associations in types of genes
• What can be learned from an analysis of sample characteristics and experimental parameters?
  – If experiments were better annotated, what would we be able to do? What are the benefits of better annotation?
  – What statistical measures and tests can be applied for this purpose?
Meta-analysis of Microarray Datasets
• Meta-analyses have been performed using microarray data from different experiments studying similar conditions to identify genes with significant signatures in those conditions.
• Generally, these analyses look for robust signals that overcome experiment-specific biases in sample types, collection, and treatment, and rely on the fact that with enough experiments these effects will wash out in the noise.
• Detailed information about both the biological intent and context of a study is crucial for meaningful selection of experiments to be input into a meta-analysis. Meta-analysis is complicated by differences in experimental technologies, data post-processing, database formats, and inconsistent gene and sample annotations.
Meta-analysis example: “Creation and implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.
• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.
• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”
Another Example: CAMDA 2007 Dataset
• ~ 6000 arrays of diseased and normal human samples and cell lines collected from ArrayExpress and GEO.
• Meta-analysis on large scale open to many but …
• Issues relating to annotation remain. On what basis do you compare:
  – “nasal_epithelium”
  – “nasal_epithelium, pulmonary_disease_cystic_fibrosis”
[Figure provided by ArrayExpress]
Annotation-based meta-analysis of microarray experiments
• Meta-analysis
  – Examples illustrating information gained and problems caused by incomplete annotations
• Standards for annotating experiments
  – Standards from the MGED Society and multi-community standards (e.g., OBI)
• Computing with Annotations
  – Dissimilarity measures to quantitatively compare experiments and assays based on annotations
  – Sample applications using dissimilarity measures
What is MGED?
• The Microarray and Gene Expression Data Society.
• A grassroots organization started in 1999 to develop standards for the sharing and storing of microarray data.
• A society with participants from academia, industry, government, and journals.
• A series of meetings to showcase cutting-edge work and promote standards.
The MGED Society Mission
The Microarray and Gene Expression Data (MGED) Society is an international organization of biologists, computer scientists, and data analysts that aims to facilitate the sharing of data generated using the microarray and other functional genomics technologies for a variety of applications including expression profiling. The scope of MGED includes data generated using any technology when applied to genome-scale studies of gene expression, binding, modification and other related applications. The focus is on establishing standards for data quality, management, annotation and exchange; facilitating the creation of tools that leverage these standards; working with other standards organizations and promoting the sharing of high quality, well annotated data within the life sciences and biomedical communities.
http://www.mged.org
MGED Standards
• What information is needed for a microarray experiment?
  – MIAME: Minimal Information About a Microarray Experiment
• How do you “code up” microarray data?
  – MAGE-OM: MicroArray Gene Expression Object Model
• What words do you use to describe a microarray experiment?
  – MO: MGED Ontology
MIAME: Minimal Information About a Microarray Experiment
• Brazma et al. Nature Genetics. 2001
• Version 2.0 proposal available
  – The raw data for each hybridisation
  – The final processed data for the set of hybridisations in the experiment (study)
  – The essential sample annotation, including experimental factors and their values
  – The experiment design including sample data relationships
  – Sufficient annotation of the array design
  – Essential experimental and data processing protocols
[Figure: MIAME in a nutshell. Diagram labels: Sample, RNA extract, labelled nucleic acid, hybridisation, array, Array design, Microarray, Protocol, normalization, integration, Gene expression data matrix (genes), Experiment.]
Stoeckert et al. Drug Discovery Today TARGETS 2004
MAGE-OM: MicroArray Gene Expression Object Model
• MAGE-ML
  – XML version of MAGE-OM
  – Spellman et al. Genome Biology 2002
  – Version 1.1
  – V2.0 will be part of FuGE: Functional Genomics standard with participation from HUPO (Human Proteome Organization), the Metabolomics Society, and other communities.
    • Jones et al. Nature Biotech. 2007
• MAGE-TAB
  – Tab-delimited
  – Rayner et al. BMC Bioinformatics 2006
  – Investigation Description Format (IDF)
  – Sample and Data Relationship Format (SDRF)
  – Array Design Format (ADF)
MGED Ontology
• Whetzel et al. Bioinformatics 2006
• Now in version 1.3.1
• Version 2 will be derived from OBI (Ontology for Biomedical Investigations).
  – Like FuGE, OBI is a standard resource being built by multiple communities.
Ecosystem of Biomedical Standards
Many Communities: MGED, PSI, MSI, OBO, BIRN, CaBIG, …
Many Community Standards: MIAME, MIAPE, CIMR, GO, MAGE-ML, GelML, spML, ChEBI, MAGE-TAB, mzDataXML, PATO, MGED Ontology, PSI-MI, sepCV, NMR Ontology
Integrative Standards: FuGE, MIBBI, OBI
OBI – Ontology for Biomedical Investigations
Diverse background
• Omics standardization effort people (MGED, PSI, MSI)
• People ‘running’ (public) repositories, primary + secondary databases
  - Software engineers, modellers, biologists, data-miners
• People from the semantic web technology
• Vendors and manufacturers (new)
Different maturity stages
• Some need to ‘rebuild’, e.g. MGED Ontology (microarray)
• Some are starting now, e.g. MSI (metabolomics), EnvO (environment)
Plurality of (prospective) usage
• Driving data entry and annotation
  - Indexing of experimental data, minimal information lists, x-db queries
• Applying it to text-mining
  - Benchmarking, enrichment, annotation
• Encoding facts from literature
  - Building knowledge bases relying on RDF triple stores
http://obi.sf.net
OBI – Communities and Structure
1. Coordination Committee (CC): Representatives of the communities -> Monthly conferences
2. Developers WG: CC and other communities’ members -> Weekly conference calls
3. Advisors: -> National Center for Biomedical Ontology
Sending terms to other
OBI branches or external resources,
e.g.
OBI – Main Activities and Timelines
Continue branch activities (iterative process)
• Branch editors working on submitted terms
  - Normalize terms, add metadata tags (e.g. definition and source)
  - Bin terms into the relevant top level classes and identify relationships
  - Sort terms by relevance to one or other branch, or to other ontologies
First evaluation of OBI draft + Release OBI 0.1 (Feb 08)
• Review branches and merge with the trunk into a core
• Apply use cases and competency questions
  - Evaluate how the ontology performs, and also what is missing and what is wrong
5th and 6th face-2-face meetings for Coordinators and Developers
• BBCCRC, Vancouver, Canada (Jan/Feb 08), self funded + MGED sponsor
• EBI, Cambridge, UK (Summer 08), BBSRC funds
Annotation-based meta-analysis of microarray experiments
• Meta-analysis
  – Examples illustrating information gained and problems caused by incomplete annotations
• Standards for annotating experiments
  – Standards from the MGED Society and multi-community standards (e.g., OBI)
• Computing with Annotations
  – Dissimilarity measures to quantitatively compare experiments and assays based on annotations
  – Sample applications using dissimilarity measures
Elisabetta Manduchi, Junmin Liu
Potential General Applications
• Identification of experiments/assays of interest from large DBs
• Organization of web resources
• QC (annotation and ontology development)
• Meta-analysis assessments
• Guidance in data QC and pre-processing
Computing with Annotations
• Ideal situation: complete, accurate, consistent annotations
• Example of Application (from CAMDA/AE use case):
  – Assess assay quality using NUSE (normalized unscaled standard errors)
  – Need to work with appropriate groups of assays
• Flow: Annotated Assays → Compute Dissimilarities → Cluster → Applications
• Actual situation:
  – incomplete annotation (missing values)
  – heterogeneous granularity
  – variation in ontologies used for a given annotation field
• Flow: Annotated Assays → Compute Dissimilarities → Cluster → QC → Refine Annotation → Applications
Test Case: EPConDB
• 24 published EPConDB experiments manually classified into 5 classes
• Preliminary study computing with experiment annotation
  – intent
  – context
• Explored dissimilarity measures between experiments
  – Identified 5 annotation components, which could be weighted as desired
  – For each component, defined component-wise dissimilarity
  – Took weighted average of component-wise dissimilarities
Pancreatic Growth after Partial Pancreatectomy and Exendin-4 Treatment (De Leon et al., 2006).
• Manually classified under “Pancreas development and growth”
• Experiment Design Types: MO terms providing a high level description for the experiment. For this experiment, these are:
  – MethodologicalDesign.time_series_design
  – PerturbationalDesign.compound_treatment_design
  – PerturbationalDesign.stimulus_or_stress_design
• Experimental Factor Types: MO terms describing the type of factors under test in the experiment. In our example these are:
  – ComplexAction.compound_based_treatment (samples were treated with either Exendin-4 or Vehicle, or nothing)
  – ComplexAction.specified_biomaterial_action (some samples had a pancreatectomy, others did not, others had a sham operation)
  – ComplexAction.timepoint (samples treated with a compound were treated for different numbers of hours)
• Organisms: the organisms to which the biosources used in the various assays belong. In our example this was just “Mus musculus”.
• Other Biomaterial Characteristics of these biosources. In our example:
  – Age.birth (initial time point for computing the age), Age.weeks (unit of measure for age)
  – DevelopmentalStage.Theiler Stage 28
  – OrganismPart.pancreas
  – Sex.male
  – StrainOrLine.BALB/c
• Treatment Types: the types of treatments applied to the original biosources which led to the final labeled extracts hybridized to the array. In our example, these were:
  – ComplexAction.specified_biomaterial_action
  – ComplexAction.compound_based_treatment
  – ComplexAction.split, ComplexAction.nucleic_acid_extraction
  – ComplexAction.labeling
Component-wise experiment dissimilarity
• For each of the 5 annotation components and for each pair of experiments
  – We have two sets of terms, one per experiment: say A and B
  – Define the component-wise dissimilarity between these two experiments using the Jaccard or the Kulczynski distance
  – Choose one or the other according to how you want to weigh containments
• With these distances, iteratively (leave one out) classified each experiment based on smallest distance to other experiments.
Jaccard: $1 - \dfrac{|A \cap B|}{|A \cup B|}$
Kulczynski: $1 - \dfrac{1}{2}\left(\dfrac{|A \cap B|}{|A|} + \dfrac{|A \cap B|}{|B|}\right)$
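The two distances and the weighted average over annotation components can be sketched in Python as follows. This is a minimal illustration and not the code used for the EPConDB study: the function names, the handling of empty term sets, and the toy term sets are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty term sets count as identical
    return 1.0 - len(a & b) / len(union)


def kulczynski(a, b):
    """Kulczynski distance: 1 - (1/2)(|A ∩ B|/|A| + |A ∩ B|/|B|)."""
    if not a and not b:
        return 0.0  # assumption: two empty term sets count as identical
    if not a or not b:
        return 1.0  # assumption: empty vs. non-empty is maximally dissimilar
    shared = len(a & b)
    return 1.0 - 0.5 * (shared / len(a) + shared / len(b))


def experiment_dissimilarity(x, y, weights, distance=kulczynski):
    """Weighted average of component-wise distances.

    x and y are lists of term sets, one set per annotation component,
    given in the same order as the weights.
    """
    total = sum(weights)
    return sum(w * distance(p, q) for w, p, q in zip(weights, x, y)) / total


# Containment example: A is a subset of B (hypothetical terms).
A = {"time_series_design", "compound_treatment_design"}
B = A | {"stimulus_or_stress_design", "dose_response_design"}
print(jaccard(A, B))     # 0.5
print(kulczynski(A, B))  # 0.25
```

The containment example shows why the choice matters: when one term set is contained in the other, Kulczynski penalizes the pair less than Jaccard does.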
Automated vs. Manual Classification of EPConDB Studies

Distance     Weights      % Correct   % Ambiguous   % Incorrect
either       1,1,1,1,1    62.5        0             37.5
either       0,1,1,1,1    45.83       0             54.17
either       1,0,1,1,1    50          0             50
Kulczynski   1,1,0,1,1    75          0             25
Jaccard      1,1,0,1,1    66.67       0             33.33
either       1,1,1,0,1    62.5        0             37.5
either       1,1,1,1,0    58.33       0             41.67
Kulczynski   1,1,0,0,1    75          0             25
Jaccard      1,1,0,0,1    70.83       0             29.17

• 5 Annotations used (weights are listed in this order)
  – Experiment design
  – Experimental factors
  – Organism
  – Other Biomaterial Characteristics
  – Treatment
• Manual Classifications
  – Pancreas development and growth
  – Differentiation of insulin-producing cells
  – Islet/beta-cell stimulation/injury
  – Tissue surveys
  – Targets and roles of transcriptional regulators
• Result: achieved ~ 75% correct classification using all but organism. Note this reflects manual classification based on intent; other classifications might have different optimal weights. Need to retry with more studies (~75 now).
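A leave-one-out evaluation of this kind can be sketched as below, assuming a precomputed square dissimilarity matrix and one manual class label per experiment. The slides do not say how "ambiguous" was defined; here it is assumed to mean that the nearest experiments are tied and carry different labels. The function name and the toy data are hypothetical.

```python
def leave_one_out(dist, labels):
    """Classify each item by the label of its nearest other item.

    dist   : square matrix (list of lists) of pairwise dissimilarities
    labels : manual class label for each item
    Returns one outcome per item: "correct", "ambiguous" or "incorrect".
    """
    outcomes = []
    n = len(labels)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = min(dist[i][j] for j in others)
        # Labels of all items tied at the smallest distance.
        candidates = {labels[j] for j in others if dist[i][j] == nearest}
        if len(candidates) > 1:
            outcomes.append("ambiguous")  # assumption: ties across classes
        elif labels[i] in candidates:
            outcomes.append("correct")
        else:
            outcomes.append("incorrect")
    return outcomes


# Toy data (not from EPConDB): two well-separated classes.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = ["growth", "growth", "survey", "survey"]
outcomes = leave_one_out(dist, labels)
percent_correct = 100.0 * outcomes.count("correct") / len(outcomes)
print(outcomes, percent_correct)
```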
Test Case: RAD
• Clustering of 62 public experiments from RAD
  – No predefined classifications
• Tried PAM with k from 5 to 20
• Use silhouettes to determine optimal clusters
  – Manual assessment
• QC value
Silhouettes
• Rousseeuw P.J. (1987), J. Comput. Appl. Math., 20, 53–65
• For each study i:
  – a(i) = average dissimilarity between i and all other studies of the cluster to which i belongs
  – d(i,C) = average dissimilarity of i to all studies in cluster C, for each cluster C other than i's own
  – b(i) = min_C d(i,C): dissimilarity between i and its “neighbor” cluster
  – s(i) = ( b(i) - a(i) ) / max( a(i), b(i) )
  – If i is in a singleton cluster, then s(i) = 0.
• large s(i) (almost 1): very well clustered
• small s(i) (around 0): the experiment lies between two clusters
• negative s(i): probably placed in the wrong cluster.
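The silhouette definition above translates directly into code. The sketch below assumes a square dissimilarity matrix and a list of cluster ids. The function name is hypothetical, and returning 0 when there is only one cluster is an added convention, since b(i) is undefined in that case.

```python
def silhouette(dist, assignment):
    """Return s(i) for every item, following the definitions above.

    dist       : square symmetric matrix of pairwise dissimilarities
    assignment : cluster id of each item
    """
    clusters = {}
    for index, cluster_id in enumerate(assignment):
        clusters.setdefault(cluster_id, []).append(index)

    scores = []
    for i, own_id in enumerate(assignment):
        own = [j for j in clusters[own_id] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # singleton cluster (or a single cluster overall)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for cluster_id, members in clusters.items()
            if cluster_id != own_id
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return scores


# Toy dissimilarity matrix (not from RAD) with two clear clusters.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
scores = silhouette(dist, [0, 0, 1, 1])
print(sum(scores) / len(scores))  # average silhouette width
```

The average of s(i) over all items is the quantity compared across values of k when choosing a clustering.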
Unsupervised classification of microarray experiments in RAD
• Best average silhouette value was 0.36, obtained with the Kulczynski distance and either weights 1,1,0,0,1 and PAM with k=8, or weights 1,1,1,0,1 and PAM with k=14
• Singleton and odd clusters revealed misannotated studies (QC)
• Not optimized, but gives us a sense of whether there is sufficient signal in the annotations (at least in our database) to usefully organize experiments
CAMDA 2007 Dataset = E-TABM-185
• ~ 6000 arrays of diseased and normal human samples and cell lines all on Affymetrix HG-U133A collected from ArrayExpress and GEO.
• Available at ArrayExpress as E-TABM-185
  – http://www.ebi.ac.uk/microarray-as/aer/?#ae-browse/q=E-TABM-185[2]
• Real use case for identifying quality issues (R. Irizarry) that required appropriate groups to distinguish biological from technical factors
[Figure provided by ArrayExpress]
Partial view of Annotations from E-TABM-185.sdrf
• Snippet from MAGE-TAB (courtesy Helen Parkinson, EBI)
• Ten distinct annotations to choose from
• Drawn from multiple studies so not annotated by the same person
• Many missing values
What gains would organizing E-TABM-185 provide?
• Compare studies at the assay level (individual samples). How should we define dissimilarity measures between assays?
• Improve power. Group related assays based on all relevant annotations - not just on one or two.
• Make relevant comparisons. The way a sample is processed can affect expression as much as what tissue it came from so grouping on one or two annotations can add variability if chosen poorly.
• Interpret clusters. Just as overenrichment of GO terms can help interpret clusters of genes, overenrichment of specific annotations may help interpret biclusters
Dissimilarity between Assays
• First need to select which annotation fields are of interest
• Typically these are all “context” fields as “intent” refers to an experiment as a whole
• Original approach applied to assays:
  – Choose annotation fields of interest: e.g., organism part, disease, etc.
  – Pull them together into one annotation set
  – Compute dissimilarities based on the overlap of the annotation sets (Kulczynski or Jaccard).
Issues with original approach
Example: OrganismPart and DiseaseState:
• Suppose A1 and A2 have the following annotations:
  A1: nasal_epithelium, --
  A2: nasal_epithelium, --
  and A3 and A4 have the following annotations:
  A3: nasal_epithelium, pulmonary_disease_cystic_fibrosis
  A4: nasal_epithelium, pulmonary_disease_cystic_fibrosis
• These 2 pairs have the same Jaccard and Kulczynski distances, equal to 0.
• Shouldn’t the 2nd pair be considered “closer” since we have more info indicating that?
Try again
• Penalize missing values
  – In principle we might want to penalize missing values due to incomplete annotation differently from those due to not-applicable annotation…
• Group annotation fields when appropriate. For our use case of comparing Affy probes, we want to see where they differ. This will be due to cross-hybridization and degradation that are dependent on:
  – CellType, OrganismPart, CellLine
  – DiseaseState, DiseaseStage
• Weigh groups and annotation fields within a group
Dissimilarity revised
• Base it on Hamming distance idea
  – Number of annotation fields where the annotations differ
• Layer it by groups and add weights
  – Provide a configuration file, e.g.
    3 {3:CellType | 2:OrganismPart | 1:CellLine}
    1 {3:DiseaseState | 1:BioSourceType | 2:DiseaseStage}
    (each brace is an annotation group; the number before the brace is the group weight $w_i$, and the numbers inside are the subweights $s_{i,j}$)
Given assays:
$A = (a_{1,1}, a_{1,2}, \ldots, a_{1,n_1};\ a_{2,1}, a_{2,2}, \ldots, a_{2,n_2};\ \ldots;\ a_{m,1}, a_{m,2}, \ldots, a_{m,n_m})$
$B = (b_{1,1}, b_{1,2}, \ldots, b_{1,n_1};\ b_{2,1}, b_{2,2}, \ldots, b_{2,n_2};\ \ldots;\ b_{m,1}, b_{m,2}, \ldots, b_{m,n_m})$
weights: $w_1, w_2, \ldots, w_m$
and subweights: $(s_{1,1}, s_{1,2}, \ldots, s_{1,n_1};\ s_{2,1}, s_{2,2}, \ldots, s_{2,n_2};\ \ldots;\ s_{m,1}, s_{m,2}, \ldots, s_{m,n_m})$
Define:
$$\mathrm{diss}(A,B) = \frac{1}{W} \sum_{i=1}^{m} \frac{w_i}{S_i} \sum_{j=1}^{n_i} s_{i,j}\, I_{i,j}$$
where
$I_{i,j}$ is 1 if either one of $a_{i,j}$ or $b_{i,j}$ is missing or $a_{i,j} \neq b_{i,j}$, and $I_{i,j}$ is 0 otherwise;
$W$ is the sum of the $w_i$’s;
$S_i$ is the sum of the $s_{i,j}$’s.
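A configuration line of this shape could be read with a few lines of Python. The sketch below infers the syntax from the two example lines only (a group weight, then `subweight:FieldName` entries separated by `|` inside braces); the actual file format and the function name `parse_config_line` are assumptions.

```python
import re


def parse_config_line(line):
    """Parse one line such as '3 {3:CellType | 2:OrganismPart | 1:CellLine}'.

    Returns (group_weight, [(subweight, field_name), ...]).
    The syntax is inferred from the example lines, not from a specification.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*\{(.*)\}\s*", line)
    if match is None:
        raise ValueError(f"unrecognized configuration line: {line!r}")
    group_weight = float(match.group(1))
    fields = []
    for entry in match.group(2).split("|"):
        subweight, name = entry.split(":", 1)
        fields.append((float(subweight), name.strip()))
    return group_weight, fields


config = [
    parse_config_line("3 {3:CellType | 2:OrganismPart | 1:CellLine}"),
    parse_config_line("1 {3:DiseaseState | 1:BioSourceType | 2:DiseaseStage}"),
]
for weight, fields in config:
    print(weight, fields)
```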
Assay Dissimilarity
• Example with:
  – A={-|nasal_epithelium|-}{-|-|-}
  – B={-|nasal_epithelium|-}{pulmonary_disease_cystic_fibrosis|-|-}
• Then for:
  w1=3 {s1,1=3:CellType | s1,2=2:OrganismPart | s1,3=1:CellLine}
  w2=1 {s2,1=3:DiseaseState | s2,2=1:BioSourceType | s2,3=2:DiseaseStage}
  we have
$$\mathrm{diss}(A,B) = \frac{3}{4}\left(\frac{3}{6}\cdot 1 + \frac{2}{6}\cdot 0 + \frac{1}{6}\cdot 1\right) + \frac{1}{4}\left(\frac{3}{6}\cdot 1 + \frac{1}{6}\cdot 1 + \frac{2}{6}\cdot 1\right) = \frac{3}{4}$$
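The worked example can be checked with a short Python sketch of the weighted, grouped dissimilarity. Missing values are represented by `None`, weights are assumed to be integers so that exact fractions can be used, and the function name `diss` simply mirrors the notation on the slide; none of this is taken from the authors' software.

```python
from fractions import Fraction

MISSING = None  # representation of a missing annotation (assumption)


def diss(a, b, weights, subweights):
    """diss(A,B) = (1/W) * sum_i (w_i / S_i) * sum_j s_ij * I_ij.

    a, b       : one list of field values per annotation group
    weights    : integer group weights w_i
    subweights : one list of integer subweights s_ij per group
    """
    total_weight = sum(weights)
    result = Fraction(0)
    for w, s_group, a_group, b_group in zip(weights, subweights, a, b):
        # I_ij = 1 when either value is missing or the two values differ.
        penalty = sum(
            s
            for s, x, y in zip(s_group, a_group, b_group)
            if x is MISSING or y is MISSING or x != y
        )
        result += Fraction(w, total_weight) * Fraction(penalty, sum(s_group))
    return result


weights = [3, 1]
subweights = [[3, 2, 1], [3, 1, 2]]  # CellType, OrganismPart, CellLine / DiseaseState, BioSourceType, DiseaseStage
A = [[MISSING, "nasal_epithelium", MISSING], [MISSING, MISSING, MISSING]]
B = [[MISSING, "nasal_epithelium", MISSING],
     ["pulmonary_disease_cystic_fibrosis", MISSING, MISSING]]

print(diss(A, B, weights, subweights))  # 3/4
print(diss(A, A, weights, subweights))  # 3/4
```

Because a missing value is always penalized, comparing A with itself gives the same 3/4 as comparing A with B.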
Self-dissimilarity
• Note that, in the presence of missing values, the dissimilarity of an assay to itself will be non-zero with this definition
• E.g. with
  w1=3 {s1,1=3:CellType | s1,2=2:OrganismPart | s1,3=1:CellLine}
  w2=1 {s2,1=3:DiseaseState | s2,2=1:BioSourceType | s2,3=2:DiseaseStage}
  and A={-|nasal_epithelium|-}{-|-|-}, we have
$$\mathrm{diss}(A,A) = \frac{3}{4}\left(\frac{3}{6}\cdot 1 + \frac{2}{6}\cdot 0 + \frac{1}{6}\cdot 1\right) + \frac{1}{4}\left(\frac{3}{6}\cdot 1 + \frac{1}{6}\cdot 1 + \frac{2}{6}\cdot 1\right) = \frac{3}{4}$$
Hierarchical Clustering of Assays
• Use clustering to evaluate utility of measures– Are we gaining anything?
• Clustered with the PHYLIP neighbor software and the UPGMA method (agglomerative, average linkage method)
  – Note that our starting point is NOT a gene-expression dataset, rather a dissimilarity matrix which limits choice of tools
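To illustrate clustering that starts from a dissimilarity matrix rather than from expression values, here is a small pure-Python sketch of average-linkage (UPGMA-style) agglomerative clustering that stops when n clusters remain. It is not PHYLIP and is far too slow for thousands of assays; the function name and the toy matrix are assumptions for the example.

```python
def upgma_clusters(dist, n_clusters):
    """Average-linkage agglomerative clustering on a dissimilarity matrix.

    Repeatedly merges the two clusters with the smallest average pairwise
    dissimilarity until n_clusters remain. Returns sorted lists of indices.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist))]

    def average_linkage(c1, c2):
        # Unweighted mean over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy dissimilarity matrix (not from E-TABM-185).
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(upgma_clusters(dist, 2))  # [[0, 1], [2, 3]]
```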
Clustering the E-TABM-185 Assays Based on Annotations
• We had 2 annotation files:
  – The original one
  – One with higher-level OrganismPart terms (manually curated by Helen Parkinson @ EBI)
• For each, we built a dissimilarity matrix and then a tree
• Cut clusters and ran silhouettes to partition and evaluate
• For each, we generated clusterings with n varying from 100 to 600 in steps of 10
Forester ATV (http://www.phylosoft.org/atv/)
How did we do?
• Try to use some ‘gold-standard’: the assays annotated as {-|human_universal_reference|-}{-|frozen_sample|-}
  – select the smallest n where these (24) assays constitute all the assays in a single cluster
  – n=220 and n=140 for the original and the manually curated annotation, respectively
• Use silhouette measure to pick the best
  – Original annotation: n=260, s=0.22
  – Manually curated annotation: n=150, s=0.21
• Conclusion: we were able to automatically partition the assays in a meaningful way but need to improve
Issues: synonyms
• The current dissimilarity treats terms that are synonyms as different, e.g.
  – “frontal cortex” and “frontal lobe”
  – “malignant neoplasm” and “cancer”
• Improvement ideas:
  – Map terms to a thesaurus, e.g. NCI Metathesaurus (same spirit as Butte and Kohane with UMLS but do this in a directed and automated fashion)
Issues: term hierarchies
• The current dissimilarity treats related terms as being just as different as unrelated terms, e.g.
  – “muscle” and “skeletal muscle”
• Improvement ideas:
  – Text mining: hard
    • “left arm” is more different from “left lobe” than “muscle” is from “skeletal muscle”
  – Ontologies are richer than controlled vocabularies: use the relationships encoded in them to define a distance between terms
  – Incorporate ontology distance between terms into the dissimilarity measure (i.e. expand possible values for I_{i,j})
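One way such an ontology distance might look is sketched below: count the steps between two terms through their nearest common ancestor in an is-a hierarchy and squash the count into [0, 1), so that the value could stand in for the 0/1 indicator I_{i,j}. Everything here is an assumption for illustration: the tiny hierarchy, the function names, and the scaling steps / (steps + 1) are not taken from the talk or from any ontology release.

```python
def ancestors_with_steps(term, parent):
    """Map a term and each of its ancestors to the number of steps up from the term."""
    steps_to = {}
    steps = 0
    while term is not None:
        steps_to[term] = steps
        term = parent.get(term)  # None once the root (or an unknown term) is reached
        steps += 1
    return steps_to


def term_distance(x, y, parent):
    """Path length through the nearest common ancestor, scaled into [0, 1].

    Returns 0 for identical terms and 1 for terms with no common ancestor.
    The scaling steps / (steps + 1) is an arbitrary illustrative choice.
    """
    up_x = ancestors_with_steps(x, parent)
    up_y = ancestors_with_steps(y, parent)
    common = set(up_x) & set(up_y)
    if not common:
        return 1.0
    steps = min(up_x[t] + up_y[t] for t in common)
    return steps / (steps + 1)


# Hypothetical is-a hierarchy (child -> parent); assumed to be a tree.
parent = {
    "skeletal muscle": "muscle",
    "muscle": "anatomical structure",
    "frontal lobe": "brain",
    "brain": "anatomical structure",
}
print(term_distance("muscle", "skeletal muscle", parent))        # 0.5
print(term_distance("skeletal muscle", "frontal lobe", parent))  # 0.8
```

With a measure like this, “muscle” and “skeletal muscle” come out closer than terms from unrelated branches instead of all being counted as simply different.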
Issues: missing annotation
• Lots of missing values for the fields
• Improvement ideas:
  – Minimize incomplete fields
    • Manual curation: not a scalable solution
    • Text mining: hard
  – Optimize cfg file by picking and weighing annotation fields according to % of missing values
  – Maybe also weigh according to entropy?
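The two quantities mentioned here, the fraction of missing values and the entropy of a field, are straightforward to compute per annotation field. The sketch below assumes assays are represented as dictionaries with `None` for missing values; the function name and the toy records are hypothetical, and how the statistics should be turned into weights is left open, as it is on the slide.

```python
import math
from collections import Counter


def field_statistics(assays, fields):
    """Return {field: (missing_fraction, entropy_in_bits)}.

    assays : list of dicts mapping field name -> value (None when missing)
    Entropy is the Shannon entropy of the observed (non-missing) values.
    """
    stats = {}
    for field in fields:
        values = [assay.get(field) for assay in assays]
        present = [v for v in values if v is not None]
        missing_fraction = 1.0 - len(present) / len(values)
        entropy = 0.0
        for count in Counter(present).values():
            p = count / len(present)
            entropy -= p * math.log2(p)
        stats[field] = (missing_fraction, entropy)
    return stats


# Toy records (not from E-TABM-185).
assays = [
    {"OrganismPart": "lung", "DiseaseState": None},
    {"OrganismPart": "liver", "DiseaseState": None},
    {"OrganismPart": "lung", "DiseaseState": "cancer"},
    {"OrganismPart": None, "DiseaseState": None},
]
stats = field_statistics(assays, ["OrganismPart", "DiseaseState"])
for field, (missing, entropy) in stats.items():
    print(field, missing, round(entropy, 3))
```

A field that is mostly missing or that takes a single value everywhere (entropy 0) carries little information for separating assays, which is the intuition behind weighing by these numbers.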
Computing with Annotations: Future steps
• Classification and clustering of studies
  – Take advantage of having 75 studies instead of 24 classified for EPConDB.
• Need to address issues of synonyms, hierarchies, missing annotations.
  – Repeat analysis of E-TABM-185
• Generate software modules
Framework for Computing with Annotations
Summary
• Inconsistent and missing annotations are an obstacle to meta-analysis
• Standards are in place to guide and encourage appropriate annotations with terms drawn from ontologies
• Initial steps to automate usage of annotations for clustering studies and assays are promising ….
• But much more needs to be done (and suggestions are welcome!)
www.mgedmeeting.org
Keynote speakers:
Naama Barkai, Weizmann Institute of Science
Ewan Birney, EMBL-EBI
Joe Gray, UCSF/Lawrence Berkeley Nat'l Laboratory
Steve Oliver, Cambridge University