1 miame the miame website: © 2002 norman morrison for manchester bioinformatics
TRANSCRIPT
1
MIAME
The MIAME website:
http://www.mged.org
© 2002 Norman Morrison for Manchester Bioinformatics.
2
Overview
• Why capture meta-data?• The data capture challenges
– What to capture?– How to capture it?– Who agrees what to capture?
3
Post-genome data
bioinformatics
genometranscriptome
proteome
interactome
metabolometextome
mobileome
phenome
4
Why meta-data?
• Genome data is static• Post-genome is very state-dependant
– Transcriptome = no. of cell types * no. no of environmental conditions– Annotation matters– Data comparisons matter– Learn from the gene debacle
• Protein-tyrosine phosphatase, non-receptor type 6, Protein-tyrosine phosphatase 1C, PTP-1C, Hematopoietic cell protein-tyrosine phosphatase, SH-PTP1, Protein-tyrosine phosphatase SHP-1
• LARD, death receptor 3 beta, WSL-1R protein, lymphocyte associated receptor of death, death receptor 3
• We need repositories
5
Microarray Repositories
• A repository is a primary source of data generated by experimentalists. Its main role is to enforce standards and quality thresholds and to make data widely available.
• Needs standards.
6
Microarray Repositories II
• Repositories allow for easier data exchange between groups
• Ensure that key details are kept
• BUT:
– What should be captured and how
• Requires international cooperation
– Minimal Information for the Annotation of Microarray experiments (MIAME)
– Developed within MGED
7
MIAME – Major Sections
• Array design– Reporters– Features– Control elements
• Experimental design– Experiment type– Sample details– Hybridisations– Measurements
8
The Six Parts of MIAME
1. Experimental design: the set of hybridization experiments as a whole
2. Array design: each array used and each element (spot, feature) on the array
3. Samples: samples used, extract preparation and labeling
4. Hybridizations: procedures and parameters
5. Measurements: images, quantification and specifications
6. Normalization controls: types, values and specifications
10
Value of audit
• Based on (qualifier, value, source)
• Qualifier: cell type• Value: epithelial• Source: Gray’s anatomy (38th ed.)
or• Qualifier: treatment• Value: 15heat shock• Source: Smith and Jones, Nature Genet. (1992)
11
MIAME definitions
• Available from www.mged.org• A minimum document to be read• All details mentioned in MIAME should be captured somewhere
– Know where they are
• Latest draft: Version 1.1 (Draft 5, March 5, 2002)– Discussed at MGED IV
• See also: A. Brazma, et al., Nature Genetics, vol 29 (December 2001), pp 365 - 371
12
MIAME part 1:– array description
• In principle this is someone else’s problem– (e.g. Affymetrix, Clonetech, etc.)
• Three levels of array design elements: – feature – the location on the array– reporter – the nucleotide sequence present in a particular location on the array– composite sequence – a set of reporters used collectively to measure an expression
of a particular gene, exon, or splice-variant
• Array design has 5 parts:1.1 Array related information1.2 Reporter information1.3 Feature information1.4 Composite sequences1.5 Control elements
13
MIAME part 2:Experimental design
• This is your problem• Experimental design has four parts
2.1 Experimental design
2.2 Sample
2.3 Hybridisation
2.4 Measurements
14
2.1 Experimental design
• Design and purpose of the set of hybridisations• Author, lab and contact• Experiment type• Experimental factors• Number of hybridisations• Common reference• QC steps• Experiment description (plus refs)• Anything else
15
2.2 Sample
• Biosource properties – organism, contact, cell type, sex,….• Biomaterial manipulation – growth conditions, in vivo treatment,
compound• Sample labelling – label used, amount, method• Spiked controls – feature, type• Anything else
16
2.3 Hybridisation
• Relationship between samples and arrays• Protocol – full description • Anything else
17
2.4 Measurement
• Raw data – scanner files, scanning protocol• Scanning protocol – parameter settings• Analysis and quantification – analysis output, protocol – e.g.
algorithms• Normalisation – strategies and algorithms, final gene expression
table• Anything else
18
MAGE, ontologies and maxd
The MAGE website:
http://www.mged.org
© 2002 Norman Morrison for Manchester Bioinformatics.
19
Outline
• MIAME is useful, but …..– How can we represent it computationally?– How can we use it to share and exchange data?– The wonderful world of XML– The evil that is free text
• ontologies and controlled vocabularies
• Maxd – MIAME supportive, MAGE-ML compliant analysis of microarray data.
MAGE-ML, MAGE-OM
• MIAME sets a standard for what knowledge (meta-data) to capture
• But how to do it?• Need a knowledge model – a schema to represent the
knowledge and the relationships between them.
Knowledge capture
• UML – Universal Modelling Language provides a methodology for capturing knowledge in ways that are computationally tractable (cf database schemas)
• MAGE-OM is the MGED approved UML model which attempts to capture the concepts in MIAME
XML
• A UML diagram is not useful by itself• MAGE-ML is an attempt to capture MAGE-OM in XML
(eXtended Markup Language) – the next generation HTML• MAGE-ML provides a structure for a text document (marked up
with tags) which describes a microarray experiment
MAGE-ML
• MAGE-ML is not nice!– Complex– Not easily human readable– Needs software tools to help create it– Very rich
• MAGE-ML is the standard we have to work to.
ArrayExpress
• ArrayExpress is the new public microarray data repository based at the EBI
• Provides tools to help create MAGE-ML• Experiments will not be entered unless the annotation is of a
high quality
Making MAGE useable
• For a repository we need a relational database – not an object model
• We have created a relational implementation of the MAGE-OM which is MIAME compliant (based on an early UML diagram for arrayexpress) - maxdSQL
Outstanding issues – free text
• MAGE provides a structure for the knowledge – not a prescription for what gets put in
• How to control what people put in the free text areas of MIAME/MAGE (the mickey mouse / . problem)
• How do we define what is meant in ways that other people/software understand
Solution 1
• Controlled vocabularies– Agreed lists of terms (and definitions) that a community agree to use
• Pros: technical simple, easy to implement• Cons: limiting, how to get agreement?, terms on there own are
not very descriptive
Solution 2
• Ontologies– Can be thought of as a set of agreed terms and the relationships between
them (a taxonomy is a simple ontology in which the only relationship allowed is an is-a relationship)
• Pros: a very rich and powerful infrastructure• Cons: complex• Many developments – a space to watch
– Chris Stoeckert and Helen Parkinson http://www.cbil.upenn.edu/Ontology/MGED_ontology.html