gsc-brc metadata standards
Post on 23-Feb-2016
GSC-BRC Metadata Standards
Richard H. Scheuermann, U.T. Southwestern Medical Center
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
Dengue Clinical Metadata
Virus Isolate Information
Complex Query Interface
Additional Clinical Characteristics
GSC-BRC Metadata Standards Working Group
• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
• Develop metadata standards for pathogen isolate sequencing projects
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide definitions, synonyms, allowed value sets preferably using controlled vocabularies, expected syntax, examples, data categories and data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
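The "identify core fields" and "describe each data field" steps can be sketched in Python. This is a minimal illustration, not the working group's tooling: the `FieldSpec` class, the `split_core_fields` helper, the project names and the field names are all hypothetical. The rule that a field is "core" when it appears in every project is an assumption made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class FieldSpec:
    """Hypothetical record holding the per-field attributes named in the process."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # None means free text
    example: str = ""
    category: str = ""

    def is_valid(self, value: str) -> bool:
        # A value passes if the field is free text or the value is in the allowed set.
        return self.allowed_values is None or value in self.allowed_values


def split_core_fields(
    projects: Dict[str, Set[str]], min_fraction: float = 1.0
) -> Tuple[Set[str], Set[str]]:
    """Split field names into (core, project_specific).

    Assumption: a field is core if at least min_fraction of projects use it.
    """
    counts = Counter(name for fields in projects.values() for name in fields)
    threshold = min_fraction * len(projects)
    core = {name for name, n in counts.items() if n >= threshold}
    return core, set(counts) - core


# Made-up example metadata sets for three projects in one pathogen subgroup.
projects = {
    "project_a": {"specimen_id", "collection_date", "host_species", "passage_history"},
    "project_b": {"specimen_id", "collection_date", "host_species"},
    "project_c": {"specimen_id", "collection_date", "host_species", "vaccination_status"},
}
core, specific = split_core_fields(projects)

host = FieldSpec(
    name="host_species",
    definition="Species of the organism the specimen was taken from",
    allowed_values={"Homo sapiens", "Gallus gallus"},  # stand-in for a controlled vocabulary
)
```

Lowering `min_fraction` would treat fields shared by most, rather than all, projects as core; where to draw that line is a curation decision the slides do not specify.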
GSC-BRC Metadata Working Groups
Example Metadata
Virus Core Metadata Sheet
Metadata Merge
[Figure: semantic network of the merged core metadata. Entities: specimen source (organism or environmental); specimen collector; specimen isolation process; isolation protocol; specimen; microorganism; microorganism genomic NA; sample processing; enriched NA sample; input sample; reagents; technician; equipment; sequencing assay; primary data; data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection); sequence data; genotype/serotype/gene data; data archiving process; sequence data record; GenBank ID; type, ID, qualities; temporal-spatial region. Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
Legend: node types are independent continuant, dependent continuant, occurrent and temporal-spatial region; italics mark relations.
Network Overview
[Figure: the semantic network from the preceding slide, repeated with its regions labeled Investigation, Specimen Isolation, Material Processing, Sequencing Assay and Data Processing.]
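One way to work with a semantic network like this is as a list of subject, relation, object triples. The sketch below is illustrative only: the entity and relation names are taken from the slide, but the specific pairings in `TRIPLES` are assumptions about how the diagram connects them, and `index_by_subject` and `objects` are hypothetical helpers.

```python
from collections import defaultdict

# Illustrative triples; the pairings are assumed, not read off the original figure.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_specification", "isolation protocol"),
    ("specimen isolation process", "has_output", "specimen"),
    ("specimen", "has_part", "microorganism"),
    ("sequencing assay", "has_input", "input sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("sequence data", "is_about", "microorganism genomic NA"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def index_by_subject(triples):
    """Build {subject: {relation: [objects]}} for quick lookups."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in triples:
        index[subject][relation].append(obj)
    return index


def objects(index, subject, relation):
    """Return the objects linked to subject by relation (empty list if none)."""
    return list(index.get(subject, {}).get(relation, []))


network = index_by_subject(TRIPLES)
isolation_outputs = objects(network, "specimen isolation process", "has_output")
```

Keeping the network as plain triples makes it straightforward to later map each relation to its OBI counterpart during harmonization.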
Metadata Categories
• Investigation
• Host/Source Characterization
• Specimen Isolation
• Pathogen Detection
• Pathogen Isolation
• Pathogen Characterization
• Specimen Processing
• Sample Shipment
• Sequencing Sample Preparation
• Sequencing Assay
• Data Transformation
Host/Source Characterization
[Figure: semantic network for host/source characterization. Entities: organism; environmental material; specimen source role; species/strain; organism ID; common name; age, gender, symptom; specimen isolation procedure X; temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, plays, denotes, has_quality, instance_of, has_part, located_in. Row labels on the slide: v10, v11, v12, v13, b14, b15, b16, b17, b19, b20.]
Legend: vX – row X in virus sheet; node types are independent continuant, dependent continuant, occurrent and temporal-spatial region; italics mark relations.
Specimen Isolation
[Figure: semantic network for specimen isolation. Entities: organism; organism part; environmental material; environment; equipment; person (name, affiliation); specimen source role; specimen capture role; specimen collector role; specimen isolation procedure X; specimen isolation procedure type; isolation protocol; specimen X (ID, specimen type); hypothesis; IRB/IACUC approval; Comments (marked "????" on the slide); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality. Row labels on the slide: v2, v3-4, v5-6, v7, v8, v9, v15 to v19, b18, b22 to b30.]
Pathogen Detection
[Figure: semantic network for pathogen detection. Entities: specimen X (ID, specimen type, amount); microorganism X; species/strain; pathogen detection process X; pathogen detection method; pathogen detection protocol; data about pathogen presence; temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_part, located_in, instance_of, has_input, has_specification, has_output, denotes, has_quality, is_about. Row labels on the slide: v15, v16, v27, v28, b21.]
Pathogen Isolation
[Figure: semantic network for pathogen isolation. Entities: specimen X (ID, specimen type, amount); microorganism X; species/strain; pathogen isolation process X; pathogen isolation method; pathogen isolation protocol; pathogen isolate X (ID, pathogen type, amount); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_part, instance_of, denotes, has_quality, located_in, has_input, has_specification, has_output. Row labels on the slide: v15, v16, v26, v34.]
Pathogen Characterization
[Figure: semantic network for pathogen characterization, extending the pathogen isolation network (specimen X, microorganism X, species/strain, pathogen isolation process X with its method and protocol, pathogen isolate X, temporal-spatial region). Assays: biological characteristic assay X; antigenic characteristic assay X; pathologic characteristic assay X; genetic characteristic assay X; chromosome/plasmid assay X; antibiotic sensitivity assay X; genus/species/strain determination assay X. Characteristics: biovar; serovar; pathovar; genotype; chromosome/plasmid; antibody sensitivity; genus/species/strain. Relations: has_part, instance_of, denotes, has_quality, located_in, has_input, has_specification, has_output, is_about. Row labels on the slide: v15, v16, v27, v29, v30, v31, v32, v34, b2 to b13.]
Specimen Processing
[Figure: semantic network for specimen processing. Entities: specimen X; microorganism X; species/strain; aliquoting process X; aliquoting protocol; specimen X aliquot Y (examples: specimen A aliquot B, specimen M aliquot N, specimen T aliquot U), each with ID, specimen type and amount; sample set X; sample set assembly process X; sample set assembly protocol; repository specimen X; information record; repository deposition process X; specimen repository; temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_part, has_specification, located_in, instance_of, denotes, has_quality. Row labels on the slide: v15, v16, v20, v22, v23, v27, b40, b41, b42, b43.]
Sample Shipment
[Figure: semantic network for sample shipment. Entities: sample set X; sample set X in transit; sample set X at GSC; sample X at GSC; each sample set with ID, sample set type and amount; sample with ID, sample type and amount; sample shipment process X; sample shipment protocol; sample receipt process X; sample receipt protocol; temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, denotes, instance_of, has_quality, located_in. Row labels on the slide: v21, v24, v25.]
Sequencing Sample Preparation
[Figure: semantic network for sequencing sample preparation. Entities: specimen X; microorganism X; microorganism genomic NA; species/strain; aliquoting process X; aliquoting protocol; specimen aliquot X; NA enrichment process X; NA enrichment protocol; enriched NA sample X; NA amplification process X; NA amplification protocol; NA amplified sample X; library construction protocol; each material with ID, specimen type and amount; temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_part, has_specification, located_in, instance_of, denotes, has_quality. Row labels on the slide: v15, v16, v27, v33, v35 to v39, b31, b32, b33.]
Sequencing Assay
[Figure: semantic network for the sequencing assay. Entities: sequencing assay X (run ID, sequencing assay type); sample material X (sample ID, sample type; template role); material X (lot #, reagent type; reagent role); person X (name, species; sequencing tech. role); equipment X (serial #, equipment type; signal detection role); sequencing protocol; objectives – coverage, genome type targeted, finishing; primary data; temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, located_in, plays, denotes, instance_of. Row labels on the slide: v14, v40, v41, b34, b38.]
Data Transformations
[Figure: semantic network for data transformations. Entities: data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection); primary data; sequence data; genotype data; serotype data; gene data; finishing status; microorganism X; microorganism genomic NA; species/strain; algorithm; software; data transfer protocol; data archiving process; sequence data record; GenBank ID; run ID; person X (name, species; bioinformatics tech. role); temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, part_of, is_about, denotes, located_in, instance_of, plays, has_quality. Row labels on the slide: v29 to v32, v42 to v47, b35, b36, b37, b39.]
Generic Assay
[Figure: generic assay pattern. Entities: assay X (run ID, assay type); input sample material X (sample ID, sample type; target role); analyte X; quality x; material X (lot #, reagent type; reagent role); person X (name, species; technician role); equipment X (serial #, equipment type; signal detection role); assay protocol; objectives; primary data; temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, located_in, plays, denotes, instance_of.]
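The generic assay pattern (an assay with a run ID and type, inputs that each play a role, a protocol as its specification, and primary data as output) could be captured as a small record type. This is a sketch under assumptions: the `Participant` and `GenericAssay` classes, the pairing of each input with a role, and the example values are hypothetical and only mirror the labels on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

# Role names as they appear on the slide; treating all four as required is an assumption.
REQUIRED_ROLES = ("target role", "reagent role", "technician role", "signal detection role")


@dataclass
class Participant:
    identifier: str  # denotes: sample ID, lot #, name or serial #
    type_name: str   # instance_of: sample type, reagent type, species or equipment type
    role: str        # plays: one of the role names above


@dataclass
class GenericAssay:
    run_id: str      # denotes the assay
    assay_type: str  # instance_of
    protocol: str    # has_specification
    inputs: List[Participant] = field(default_factory=list)  # has_input
    outputs: List[str] = field(default_factory=list)         # has_output

    def inputs_with_role(self, role: str) -> List[Participant]:
        return [p for p in self.inputs if p.role == role]

    def missing_roles(self, required: Sequence[str] = REQUIRED_ROLES) -> List[str]:
        # Report required roles that no input plays, in the order given.
        present = {p.role for p in self.inputs}
        return [r for r in required if r not in present]


# Made-up example: an assay record with only the sample and the instrument filled in.
assay = GenericAssay(
    run_id="run-001",
    assay_type="sequencing assay",
    protocol="example sequencing protocol",
    inputs=[
        Participant("sample-42", "enriched NA sample", "target role"),
        Participant("SN-1234", "sequencer", "signal detection role"),
    ],
    outputs=["primary data"],
)
gaps = assay.missing_roles()
```

A check like `missing_roles` is one way a submission template could flag incomplete assay records before they reach a BRC.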
Generic Material Transformation
[Figure: generic material transformation pattern. Entities: material transformation X (run ID, material transformation type); sample material X (sample ID, sample type, quality x; target role); material X (lot #, reagent type; reagent role); person X (name, species; technician role); equipment X (serial #, equipment type; signal detection role); material transformation protocol; objectives; output material X (sample ID, material type, quality x); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, has_part, has_quality, located_in, plays, denotes, instance_of.]
Generic Data Transformation
[Figure: generic data transformation pattern. Entities: data transformation X (run ID, data transformation type); input data; output data; material X; algorithm; software; person X (name; data analyst role); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_input, has_output, has_specification, is_about, located_in, has_part, plays, denotes, instance_of.]
Generic Material (IC)
[Figure: generic material pattern. Entities: material X (ID, material type, quality x); material Y and material Z as parts; quality y; temporal-spatial regions (spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_quality, has_part, denotes, instance_of, located_in.]
OBI specimen creation
[Figure: OBI representation of specimen creation. Entities: specimen creation; specimen creation objective; protocol; organism (for 'collecting specimen from an organism'); human being; material entity (for 'environmental material collection'); anatomical entity ('portion of body substance' or 'portion of tissue'); specimen; infectious agent; individual organism identifier; written name; CRID symbol; synonym; quality; measurement datum; time measurement datum; geographic location; organization; textual entity; document; information content entity; growth environment; treatment; genetic characteristics information. Relations: has_specified_input, has_specified_output, realizes, unfolds_in, denotes, has_quality, is_about, located_in, is_duration_of, has_participant, is_member_of_organization, is_a, achieves_planned_objective, has_supplier, is_quality_measured_as, has_agent. Element labels on the slide: e14 to e27, e29 to e33, e35 to e47, e50.]
Status
• Core metadata merge process nearly complete
• Comprehensive semantic networks developed
• Begun the OBI harmonization process
• Begun the MIGS/MIMS harmonization process
• Still need to:
– Compare, harmonize, map with BioProjects and BioSamples
– Decide what to do about metadata fields that appear to be project specific
– Develop metadata submission templates
– Report process and results