Data integration in MicrobeDB.jp using Semantic Web technology
Hiroshi Mori 1), Ikuo Uchiyama 2), Yasukazu Nakamura 3), Hideaki Sugawara 3),
Ken Kurokawa 1), and MicrobeDB.jp Project Team 1, 2, 3)
ICCC13, September 26, 2013, Beijing, China
WDCM and CODATA Joint Workshop
1) Tokyo Institute of Technology, 2) National Institute for Basic Biology, 3) National Institute of Genetics, Japan
Many microbial databases (DBs) exist …
DB categories: Ortholog, Taxonomy, Pathogen, Gene Function, Metagenome, Genome, Culture Collection
Which DBs should we use?
From National Research Council (USA)
Microbes live almost everywhere on Earth and interact with their environments.
Knowledge of microbes has high potential for scientific and commercial applications.
Promoting the Integrated Use of Life Science Databases in Japan
・ FY 2007-2010 “Integrated Database Project” → Database Center for Life Science (DBCLS)
・ FY 2011- → National Bioscience Database Center (NBDC)
About NBDC
・ Established in April 2011
・ Part of the Japan Science and Technology Agency (JST), a funding agency supported by MEXT
URL: http://biosciencedbc.jp/?lng=en
Activities by NBDC
1. Formulation of strategies related to coordination and integration of DBs, and international cooperation
2. Creation and management of a portal website from existing life science DBs http://biosciencedbc.jp/?lng=en
3. Funding of R&D of new technology necessary for organizing and linking life science DBs
4. Funding of R&D that coordinates existing and emerging DBs in specific research fields
Includes microbes (PI: Ken KUROKAWA)
Aim of MicrobeDB.jp: to integrate several kinds of microbial data (including omics, taxonomy/cultures, and habitats) using Semantic Web technology
MicrobeDB.jp integrates a wide range of data related to microbes.
In particular, we integrate microbial data that can be linked to genomes.
Ortholog: MBGD
Genome: GTPS/RefSeq
Annotation: TogoAnnotation
Culture Collection: NBRC/JCM
Metadata: INSDC SRA
Metagenome: INSDC SRA
Taxonomy: NCBI Taxonomy
http://microbedb.jp/
Gene Taxon Environment
Red color indicates our collaborators.
Other data
How can we simplify the integration of data from other domains?
RDF is a standard data model of Semantic Web technology
RDF (Resource Description Framework)
Data model which uses Triples (Subject – Predicate – Object)
Gene1 hasFunction GO:0003700
Genome1 organism Escherichia coli
Organism1 hasGenome Genome1
Organism1 inhabit Lake
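The triple model above can be sketched in a few lines of Python. This is only an illustration: the tuples reuse the example names from the slide (Gene1, Genome1, Organism1), and the `match` helper is a hypothetical stand-in for the pattern matching that a triple store performs, not part of MicrobeDB.jp.

```python
# Each RDF statement is a (subject, predicate, object) triple.
triples = {
    ("Gene1", "hasFunction", "GO:0003700"),
    ("Genome1", "organism", "Escherichia coli"),
    ("Organism1", "hasGenome", "Genome1"),
    ("Organism1", "inhabit", "Lake"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard.

    Hypothetical helper, assumed here only to illustrate triple patterns.
    """
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything stated about Organism1.
organism1_facts = match(triples, s="Organism1")
```

Leaving a position as `None` plays roughly the role of a variable in a SPARQL triple pattern.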
RDF
Ontology
Triple store
SPARQL
S P O
gtps:Gene1 rdfs:label “16S rRNA gene”
KO:03043
<URI> <URI> <URI>/Literal
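A URI node can be linked to other nodes: the object of one triple can be the subject of another (S P O/S P O).
A Literal node cannot be linked further (S P O ×).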
When data are prepared in RDF, the database management system automatically recognizes identical resources (those with the same URI).
Gene1 hasFunction GO:0003700
Genome1 organism Escherichia coli
Genome1 organism Organism 1
Organism1 hasGenome Genome1
Organism1 inhabit Lake
DB 1
Gene1 hasFunction GO:0003700
Genome1 organism Escherichia coli
Organism1 hasGenome Genome1
DB 2
Organism 1 canProduce Enzyme 1
Enzyme 1 canUse Compound 1
Organism 1 canGrow Medium 1
owl:sameAs
How to integrate the data from two different DBs?
1. When two DBs use the same URI, their data are already integrated.
2. If not, you can integrate the two DBs' data by adding one Triple (db1:A owl:sameAs db2:B).
How can we discriminate whether two DBs' resources are the same or not?
You don't need to place all of these data in one DB management system.
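The two integration cases can be sketched as follows. This is a minimal illustration under stated assumptions: the `db1:`/`db2:` resource names are made up, and the `canonicalize` function is a naive, hypothetical way to apply owl:sameAs by rewriting identifiers. Real triple stores typically handle owl:sameAs through inference or at query time instead.

```python
# Two hypothetical databases; all resource names are illustrative.
db1 = {
    ("db1:Organism1", "hasGenome", "db1:Genome1"),
    ("db1:Gene1", "hasFunction", "GO:0003700"),
}
db2 = {
    ("db2:OrganismA", "canProduce", "db2:Enzyme1"),
    ("db2:OrganismA", "canGrow", "db2:Medium1"),
}

# Case 1: resources that share a URI merge automatically under set union.
# Case 2: resources with different URIs are linked by one extra triple.
link = ("db1:Organism1", "owl:sameAs", "db2:OrganismA")
merged = db1 | db2 | {link}


def canonicalize(graph):
    """Rewrite owl:sameAs aliases to one representative URI.

    Naive sketch: does not follow chains of owl:sameAs links.
    """
    alias = {o: s for s, p, o in graph if p == "owl:sameAs"}
    return {
        (alias.get(s, s), p, alias.get(o, o))
        for s, p, o in graph
        if p != "owl:sameAs"
    }


integrated = canonicalize(merged)

# Facts from both DBs are now attached to the same organism.
facts = sorted((p, o) for s, p, o in integrated if s == "db1:Organism1")
```

After the link is applied, a query about `db1:Organism1` sees the genome from DB 1 together with the enzyme and medium from DB 2.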
You should describe your resources using ontologies.
An ontology is a structured controlled vocabulary to describe properties and types of resources.
MEO (Microbes Environmental Ontology)
PDO (Pathogenic Disease Ontology)
MCCV (Microbial Culture Collection Vocabulary)
MSV (Metagenome Sample Vocabulary)
MPO (Microbial Phenotype Ontology)
MBGD Ortholog Ontology
Most of them can be obtained from
For example, to answer: What is soil? What is the relationship between soil and sand?
We have converted most of our data to RDF, developed many ontologies, and built an RDF-based microbial DB.
More than 1 billion Triples!
JCM/NBRC Culture Collection data
1. Strain_Number
2. Other_Collection_Numbers
3. Name
4. Organism_Type
5. History_of_Deposit
6. Date_of_Isolation
7. Isolated_from
8. Geographic_Origin
9. Status
10. Optimum_Temperature_for_Growth
11. Maximum_Temperature_for_Growth
12. Minimum_Temperature_for_Growth
13. Medium
14. Application
15. Literature
RDF conversion example
Example of NBRC Culture Collection RDF data (node and edge labels from the graph figure):
nbrc:NBRC_12841
rdf:type
:MCCV_000001 (Culture)
<http://www.dsmz.de/catalogues/details/culture/DSM-40226.html>
:MCCV_000025
:MCCV_000012
"Streptomyces griseus subsp. griseus (Krainsky 1914) Waksman and Henrici 1948"
:MCCV_000014 "Optimal growth temperature"
<http://identifiers.org/taxonomy/67263>
<http://www.ncbi.nlm.nih.gov/taxonomy/67263>
<http://purl.uniprot.org/taxonomy/67263>
"DSM 40226"
:MCCV_000026
"28"^^<http://www.w3.org/2001/XMLSchema#integer>
:MCCV_00018 "Strain Number"
nbrcmedium:NBRC_227
:MCCV_000033 "Application"
"Thienamycins production ; Vitamin B12 (Cyanocobalamine) production ; Steroid conversion"
<http://identifiers.org/taxonomy/67274>
<http://www.ncbi.nlm.nih.gov/taxonomy/67274>
<http://purl.uniprot.org/taxonomy/67274>
"IFO 12841 <-- SAJ <-- OWU (ISP 5226) <-- Squibb & Sons (F. Arnow, MD 2428, ETH 24234, NIHJ 501)"
:MCCV_000027 "History of deposit"
"Soil"
:MCCV_000028 "Isolated from"
meo:MEO_0000007
rdfs:label
dc:identifier
"false"^^xsd:boolean
:MCCV_000017 "Type Strain"
:MCCV_00023
:MCCV_00022
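A conversion from a flat culture collection record to triples might look roughly like the sketch below. It is not the project's actual converter: the `ex:` predicate names are placeholders (the real data use MCCV terms, and no mapping to them is assumed here), and the choice of which fields are integers is also an assumption.

```python
# Flat record using field names from the JCM/NBRC list above.
record = {
    "Strain_Number": "NBRC 12841",
    "Isolated_from": "Soil",
    "Optimum_Temperature_for_Growth": "28",
    "Application": "",  # empty field, should produce no triple
}

# Placeholder predicates; the real data use MCCV vocabulary terms instead.
FIELD_TO_PREDICATE = {
    "Strain_Number": "ex:strainNumber",
    "Isolated_from": "ex:isolatedFrom",
    "Optimum_Temperature_for_Growth": "ex:optimumGrowthTemperature",
    "Application": "ex:application",
}

# Fields whose values are converted to typed (integer) literals (assumption).
INTEGER_FIELDS = {"Optimum_Temperature_for_Growth"}


def record_to_triples(subject, rec):
    """Emit one (subject, predicate, object) triple per non-empty mapped field."""
    out = []
    for field, predicate in FIELD_TO_PREDICATE.items():
        value = rec.get(field, "")
        if value == "":
            continue  # skip missing values rather than emitting empty literals
        obj = int(value) if field in INTEGER_FIELDS else value
        out.append((subject, predicate, obj))
    return out


triples = record_to_triples("nbrc:NBRC_12841", record)
```

Keeping the field-to-predicate mapping in one table makes it easy to swap the placeholders for real vocabulary terms later.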
Overall data structure of MicrobeDB.jp
http://microbedb.jp/
Keyword example: lake
・ Taxonomic composition from 16S amplicon sequencing of samples from lake
・ Metagenome samples obtained from lake
・ Abundant Orthologs in metagenome samples obtained from lake
・ JCM/NBRC Strains isolated from lake
・ Genome sequenced strains isolated from lake
MEO hierarchical structure: lake → (meo:pond is_a meo:lake; Strain_A mccv:isolation_source meo:pond) → Strain_A
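The way the MEO hierarchy widens a keyword search can be sketched as below. Only meo:pond, meo:lake, and Strain_A come from the slide; Strain_B, Strain_C, and meo:soil are hypothetical, and the sketch assumes each term has at most one parent and the hierarchy has no cycles.

```python
# Toy is_a hierarchy: child term -> broader term.
IS_A = {"meo:pond": "meo:lake"}

# Isolation source of each strain (Strain_B and Strain_C are hypothetical).
ISOLATION_SOURCE = {
    "Strain_A": "meo:pond",
    "Strain_B": "meo:lake",
    "Strain_C": "meo:soil",
}


def ancestors(term):
    """Yield the term itself and every broader term reachable through is_a."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def strains_from(environment):
    """Strains whose isolation source is the environment or a narrower term."""
    return sorted(
        strain
        for strain, source in ISOLATION_SOURCE.items()
        if environment in ancestors(source)
    )


lake_strains = strains_from("meo:lake")
```

Searching for lake returns Strain_A even though it was recorded as isolated from a pond, because pond is_a lake.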
MicrobeDB.jp will facilitate the exploration of existing, scattered information about microbes.
Acknowledgements
・ Ken Kurokawa (Tokyo Institute of Technology): Junichi Takehara, Koji Yoshino, Nozomi Yamamoto, Takuji Yamada, Fumikazu Konishi
・ Yasukazu Nakamura (National Institute of Genetics, DDBJ): Takatomo Fujisawa, Eri Kaminuma, Hideaki Sugawara
・ Ikuo Uchiyama (National Institute for Basic Biology): Hirokazu Chiba, Hiroyo Nishide
Advisor (DataBase Center for Life Science): Shinobu Okamoto, Shuichi Kawashima, Toshiaki Katayama, Yasunori Yamamoto, Shoko Kawamoto
NBRC Culture Collection data: Ken’ichiro Suzuki, Masami Ichihara, Natsuko Ichikawa
JCM Culture Collection data: Moriya Ohkuma, Takuji Kudo
Funding
http://microbedb.jp/