NETTAB 2005, Naples
Retrieving factual data and documents using IMGT-ML in the IMGT information system®
Denys Chaume, CNRS/IGH
Table of Contents
• The IMGT information system®
• IMGT-ONTOLOGY
• IMGT-ML
• IMGT-ML, a query language
• Use examples:
  – Database queries (LIGM-DB)
  – Tool queries (Junction Analysis)
  – Document queries (SEFID)
• Conclusion
Part 1 summary
• IMGT, the international ImMunoGeneTics information system®
• Relations between subsystems
• IMGT-ONTOLOGY
• IMGT-ML Schemas
• IMGT-ML: seqData
• IMGT-ML Architecture
IMGT, the international ImMunoGeneTics information system®
• A high-quality integrated knowledge resource specialized in the immunoglobulins, T cell receptors, major histocompatibility complex and related proteins of the immune system of human and other vertebrates.
• Created in 1989 by M.-P. Lefranc (CNRS/UM II)
• On the Web since 1995
• 90,000 sequences, 110 species
• The international reference for immunogenetics
IMGT information system®
• 6 databases: IMGT/LIGM-DB, IMGT/GENE-DB, IMGT/PRIMER-DB, IMGT/PROTEIN-DB, IMGT/3Dstructure-DB, IMGT/MHC-DB
• 10 processing tools: IMGT/V-QUEST, IMGT/JunctionAnalysis, IMGT/GeneView, IMGT/GeneSearch, IMGT/LocusView, IMGT/Allele-Align, IMGT/PhyloGene, IMGT/StructuralQuery, IMGT/GeneFrequency, IMGT/GeneInfo
• ~8000 HTML documents: IMGT Scientific Chart, IMGT Repertoire, IMGT Index, IMGT Bloc-notes
Relations between subsystems
[Diagram: databases and tools linked by labelled relations.
Databases: IMGT/LIGM-DB, IMGT/GENE-DB, IMGT/PRIMER-DB, IMGT/PROTEIN-DB, IMGT/3Dstructure-DB.
Tools: IMGT/V-QUEST, IMGT/JunctionAnalysis, IMGT/GeneView, IMGT/GeneSearch, IMGT/LocusView, IMGT/Allele-Align, IMGT/PhyloGene, IMGT/StructuralQuery, IMGT/GeneFrequency.
Relation labels: primer/sequences; genes/reference sequences; sequences/mutations; mutation/genes; genes/phylogeny; genes/3D structures; 3D structures query; protein/3D structures; nucleotidic/protein sequences; gene localization; sequences/specificity; gene and allele identification; analysis of junction.]
IMGT-ONTOLOGY
• Identification: characteristics
• Description: annotation
• Classification: genome
• Obtention: origin, methodology
• Numerotation: IMGT unique numbering
IMGT-ML Schemas
[Diagram of the schema packages; other labels shown: imgt, External schemas, (biological & structural)]
• IMGTOntology (Identification, Description, Classification, Obtention, Numerotation) defines simpleTypes and complexTypes.
• IMGTData (knowledge, seqData) defines elements using types defined by the IMGTOntology schemas.
• IMGTQuery (querySeqData, queryKnowledge, domain, responseTemplate) defines complementary elements to formulate queries.
Namespace: http://www.imgt.org/IMGT-ML
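IMGT-ML: seqData
[Diagram: structure of the seqData element.
Elements shown: seqData, catalogue (simpleCatalogEntry), identification, classification, annotation, sequence, references (extRef), Literature, keywords, Numerotation.
Attributes shown, marked required (req) or optional (opt): req@id, req@credate, opt@moddate, req@numacc, opt@name, req@id, opt@refid, req@reltype, opt@secid, req@reftype, req@dbid, req@seqid, opt@complement, req@seqlen.
Occurrence markers used in the diagram: •, ?, +.]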
IMGT-ML Architecture
[Diagram; labels: Biological expertise; IMGT-ONTOLOGY; Controlled vocabulary (nomenclature); Data modeling; XML Schema (simpleTypes & complexTypes); Documentation; Distribution format; Data consistency; Web service interactions; XSLT]
Part 2 summary
• IMGT-ML: a database query language
  – Why it works
• IMGTQuery package
• Examples
  – IMGT/LIGM-DB query
  – Junction Analysis tool
A very simple database
Data are vectors: (author, title, year)
                   author                               title                  year
primitive types    char 10                              varchar 255            int
user types         starts with an uppercase character   printable characters   >1970
actual data        Chaume                               IMGT, the internat….   1997
Data definition domains
[Diagram of three domains: Primitive domain; User domain (rules); Actual domain (data)]
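The three domains can be read as successively stricter membership tests. The following is a minimal sketch in Python, assuming the column types and rules of the simple database above; the names `LitRef`, `in_primitive_domain` and `in_user_domain` are illustrative and do not come from IMGT-ML.

```python
from typing import NamedTuple


class LitRef(NamedTuple):
    """One data vector: (author, title, year)."""
    author: str
    title: str
    year: int


def in_primitive_domain(ref: LitRef) -> bool:
    """Primitive types only: char 10, varchar 255, int."""
    return (
        isinstance(ref.author, str) and len(ref.author) <= 10
        and isinstance(ref.title, str) and len(ref.title) <= 255
        and isinstance(ref.year, int)
    )


def in_user_domain(ref: LitRef) -> bool:
    """User rules, applied on top of the primitive types."""
    return (
        in_primitive_domain(ref)
        and ref.author[:1].isupper()   # starts with an uppercase character
        and ref.title.isprintable()    # printable characters
        and ref.year > 1970            # >1970
    )


# The actual domain is simply the set of vectors stored in the database.
actual_data = [LitRef("Chaume", "IMGT, the internat….", 1997)]
```

Every stored vector is expected to pass both tests, whereas a vector such as `LitRef("chaume", "x", 1960)` is well typed but breaks the user rules.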
Actual data XML representation
<litRefList>
  <litRef>
    <author name="Chaume"/>
    <title>IMGT, the internat…. </title>
    <date year="1997"/>
  </litRef>
  …
</litRefList>
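As an illustration of this mapping from data vectors to XML, here is a short Python sketch using the standard `xml.etree.ElementTree` module. The function name `to_lit_ref_list` is an assumption made for the example; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET


def to_lit_ref_list(records):
    """Build a litRefList element from (author, title, year) vectors."""
    root = ET.Element("litRefList")
    for author, title, year in records:
        lit_ref = ET.SubElement(root, "litRef")
        ET.SubElement(lit_ref, "author", name=author)
        ET.SubElement(lit_ref, "title").text = title
        ET.SubElement(lit_ref, "date", year=str(year))
    return root


root = to_lit_ref_list([("Chaume", "IMGT, the internat….", 1997)])
xml_text = ET.tostring(root, encoding="unicode")
```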
XML schema
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="litRefList">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="litRef" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="litRef">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="author"/>
        <xs:element ref="title"/>
        <xs:element ref="date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  .....
XML schema 1
  .....
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name" type="xs:string"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  .....
Schema 1 definition domain
[Diagram: Schema 1 domain]
XML schema 2
  .....
  <!-- author name: at most 10 characters, an uppercase letter followed by
       lowercase letters (XML Schema patterns are implicitly anchored) -->
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:maxLength value="10"/>
            <xs:pattern value="[A-Z][a-z]+"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <!-- title: at most 255 characters -->
  <xs:element name="title">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:maxLength value="255"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <!-- year: an integer, because range facets do not apply to xs:string -->
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year">
        <xs:simpleType>
          <xs:restriction base="xs:integer">
            <xs:minInclusive value="1970"/>
            <xs:maxExclusive value="10000"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  .....
Schema 2 definition domains
[Diagram: Schema 2 domain]
XML instance and instance domain
An XML instance:
<litRefList>
  <litRef>
    <author name="Chaume"/>
  </litRef>
</litRefList>
The same instance read as a schema, in which the author name is restricted to the single value "Chaume" and the other elements are left unrestricted:
  .....
  <xs:element name="author">
    <xs:complexType>
      <xs:attribute name="name">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Chaume"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="date">
    <xs:complexType>
      <xs:attribute name="year" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  .....
Instance and query result domains
The result of the query is the intersection of the "instance domain" and the "actual domain".
[Diagram: Instance domain]
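This query-by-example idea can be sketched in a few lines of Python: the instance is used as a template, and a stored record belongs to the result when it satisfies every constraint the template states. The matching rule below (same tag, same attribute values, each template child matched by some record child) is an assumption about the semantics, and the second record is hypothetical test data.

```python
import xml.etree.ElementTree as ET

# Actual domain: the stored data (the second litRef is hypothetical).
ACTUAL = ET.fromstring("""
<litRefList>
  <litRef><author name="Chaume"/><title>IMGT, the internat….</title><date year="1997"/></litRef>
  <litRef><author name="Doe"/><title>Hypothetical title</title><date year="2001"/></litRef>
</litRefList>
""")


def matches(template, candidate):
    """True if candidate satisfies every constraint stated by template."""
    if template.tag != candidate.tag:
        return False
    # Every attribute given in the template must have the same value.
    if any(candidate.get(key) != value for key, value in template.attrib.items()):
        return False
    # Non-blank template text must be equal to the candidate text.
    text = (template.text or "").strip()
    if text and text != (candidate.text or "").strip():
        return False
    # Each template child must be matched by at least one candidate child.
    return all(
        any(matches(t_child, c_child) for c_child in candidate)
        for t_child in template
    )


def query(instance_xml, actual=ACTUAL):
    """Return the stored litRef elements lying inside the instance domain."""
    template = ET.fromstring(instance_xml)
    return [c for c in actual if any(matches(t, c) for t in template)]


hits = query('<litRefList><litRef><author name="Chaume"/></litRef></litRefList>')
```

An instance that states no constraint, such as `<litRefList><litRef/></litRefList>`, has the widest possible domain and therefore returns every stored record.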
IMGTQuery package
• querySeqData: seqData (+)
• queryKnowledge: @data: labelList, keywordList, functionalityList, chainTypeList, specificityList, moleculeTypeList, configurationList
• domain (in any seqData sub-element): @complement (boolean); @of: @xxxx, text(); facets: enumeration, minInclusive, maxInclusive, minExclusive, maxExclusive, pattern
• responseTemplate: seqData
Namespace: http://www.imgt.org/IMGT-ML/IMGTQuery
Part 3 summary
• Example of IMGT/LIGM-DB queries
• Request from Identification concept
• AND operator (request)
• AND operator (result)
• OR operator
• Request with domain restriction
• Result control
Example of IMGT/LIGM-DB queries
Request:
<q:querySeqData>
  <seqData>
    <simpleCatalogEntry numacc="M26678"/>
  </seqData>
</q:querySeqData>
Result:
<seqDataList>
  <seqData>
    <catalogEntry id="M26678">
      <simpleCatalogEntry numacc="M26678" name="MMIGKZZZ"/>
    </catalogEntry>
    <identification>
      <partIdent moleculeType="DNA" configuration="rearranged">
        <taxon taxonName="Mus musculus"/>
      </partIdent>
    </identification>
    <classification>
      <group name="IGKV">
        <subgroup name="IGKV8"/>
      </group>
    </classification>
    ….
  </seqData>
</seqDataList>
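A request of this shape can be generated programmatically. The sketch below builds it with Python's standard `xml.etree.ElementTree`, using the IMGTQuery namespace given on the IMGTQuery package slide. It assumes that `seqData` and its children are written without a prefix, as on the slide; a real service may expect them in the IMGT-ML namespace. The function name `accession_query` is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace stated on the IMGTQuery package slide.
Q_NS = "http://www.imgt.org/IMGT-ML/IMGTQuery"
ET.register_namespace("q", Q_NS)


def accession_query(numacc: str) -> str:
    """Serialize a querySeqData request selecting one accession number."""
    query = ET.Element(f"{{{Q_NS}}}querySeqData")
    # Assumption: seqData and its children are unqualified, as on the slide.
    seq_data = ET.SubElement(query, "seqData")
    ET.SubElement(seq_data, "simpleCatalogEntry", numacc=numacc)
    return ET.tostring(query, encoding="unicode")


request = accession_query("M26678")
```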
Request from Identification concept
Request:
<q:querySeqData>
  <seqData>
    <partIdent chainType="Ig-Light">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</q:querySeqData>
Result:
<seqDataList>
  <seqData id="AF001788">
    <catalogue> …. </catalogue>
    <identification>
      <partIdent chainType="Ig-Light">
        <taxon taxonName="Homo sapiens"/>
      </partIdent>
      ….
    </identification>
  </seqData>
  <seqData id="AF001799">
    <catalogue> ….
  </seqData>
</seqDataList>
AND operator (request)
<q:querySeqData>
  <seqData>
    <annotation>
      <entity name="C-GENE"/>
      <region name="D-REGION"/>
    </annotation>
  </seqData>
</q:querySeqData>
AND = intersection
AND operator (result)
<seqDataList>
  <seqData id="U97590">
    <catalogue> …. </catalogue>
    <identification> …. </identification>
    <annotation>
      <cluster name="D-J-C-CLUSTER">
        <start location="1"/>
        <entity name="D-GENE">
          <start location="1"/>
          <region name="D-REGION">
            <start location="464"/>
            <end location="475"/>
          </region>
          <end location="600"/>
        </entity>
        ….
        <entity name="C-GENE" partial="true">
          <start location="4334"/>
          ….
          <end location="5440"/>
        </entity>
        <end location="5440"/>
      </cluster>
    </annotation>
    ….
OR operator
Request:
<request>
  <seqData>
    <partIdent chainType="Ig-Light-Kappa" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
  <seqData>
    <partIdent chainType="Ig-Light-Lambda" functionality="pseudogene">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</request>
Result:
<response>
  <seqData> <catalogEntry id="A25907"> …. </seqData>
  <seqData> <catalogEntry id="AF026482"> …. </seqData>
  <seqData> <catalogEntry id="AF026483"> …. </seqData>
  <seqData> <catalogEntry id="AF026484"> …. </seqData>
  …
</response>
OR = union
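The two operators reduce to set algebra on result sets: constraints placed inside one seqData template are intersected (AND), and several seqData templates in one request are united (OR). The Python sketch below illustrates this on a hypothetical in-memory index; the accession labels `ACC1` to `ACC4` and the function names are invented for the example and do not describe IMGT/LIGM-DB entries.

```python
# Hypothetical mini-database: accession -> set of (element, attribute, value) facts.
DB = {
    "ACC1": {("entity", "name", "C-GENE"), ("region", "name", "D-REGION")},
    "ACC2": {("entity", "name", "C-GENE")},
    "ACC3": {("partIdent", "chainType", "Ig-Light-Kappa")},
    "ACC4": {("partIdent", "chainType", "Ig-Light-Lambda")},
}


def select(constraint):
    """Accessions whose facts include one given constraint."""
    return {acc for acc, facts in DB.items() if constraint in facts}


def query_seq_data(constraints):
    """One seqData template: AND = intersection of the constraint results."""
    result = set(DB)
    for constraint in constraints:
        result &= select(constraint)
    return result


def request(*seq_data_templates):
    """Several seqData templates in one request: OR = union."""
    result = set()
    for template in seq_data_templates:
        result |= query_seq_data(template)
    return result


and_hits = query_seq_data([("entity", "name", "C-GENE"), ("region", "name", "D-REGION")])
or_hits = request(
    [("partIdent", "chainType", "Ig-Light-Kappa")],
    [("partIdent", "chainType", "Ig-Light-Lambda")],
)
```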
Request with domain restriction
A request on an exact value:
<q:querySeqData>
  <seqData>
    <sequence length="300"/>
  </seqData>
</q:querySeqData>
A request in which the value is restricted to a domain (length between 100 and 400, bounds included):
<q:querySeqData>
  <seqData>
    <sequence>
      <q:domain of="@length">
        <q:maxInclusive value="400"/>
        <q:minInclusive value="100"/>
      </q:domain>
    </sequence>
  </seqData>
</q:querySeqData>
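How a server might evaluate such a domain can be sketched as follows in Python. The sketch assumes that the range facets behave like their XML Schema namesakes and that `complement="true"` simply negates the test; both points are assumptions, and the `enumeration` and `pattern` facets are left out for brevity. The names `FACETS` and `in_domain` are illustrative.

```python
import xml.etree.ElementTree as ET

Q = "{http://www.imgt.org/IMGT-ML/IMGTQuery}"

# Range facets, assumed to behave like the XML Schema facets of the same name.
FACETS = {
    "minInclusive": lambda value, bound: value >= bound,
    "maxInclusive": lambda value, bound: value <= bound,
    "minExclusive": lambda value, bound: value > bound,
    "maxExclusive": lambda value, bound: value < bound,
}


def in_domain(domain, value):
    """True if a numeric value satisfies every range facet of a q:domain element."""
    ok = all(
        FACETS[facet.tag[len(Q):]](value, float(facet.get("value")))
        for facet in domain
        if facet.tag.startswith(Q) and facet.tag[len(Q):] in FACETS
    )
    # Assumption: complement="true" selects the values outside the domain.
    complement = domain.get("complement", "false") == "true"
    return ok != complement


DOMAIN = ET.fromstring(
    '<q:domain xmlns:q="http://www.imgt.org/IMGT-ML/IMGTQuery" of="@length">'
    '<q:maxInclusive value="400"/><q:minInclusive value="100"/>'
    '</q:domain>'
)
```

With this domain, a sequence of length 300 is selected and one of length 50 is not; setting `complement="true"` on the element would invert both answers.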
Result control
Request:
<q:querySeqData>
  <seqData>
    <partIdent chainType="Ig" moleculeType="cDNA" configuration="rearranged"
               specificity="anti-thyroid peroxidase (TPO)">
      <taxon taxonName="Homo sapiens"/>
    </partIdent>
  </seqData>
</q:querySeqData>
Response template:
<q:responseTemplate>
  <seqData>
    <classification><gene/></classification>
  </seqData>
</q:responseTemplate>
Expected result:
<seqDataList nb="27">
  <seqData id="AF306350">
    <classification>
      <gene name="IGHV1-69"/>
      <gene name="IGHD3-10"/>
      <gene name="IGHJ6"/>
    </classification>
  </seqData>
  <seqData id="AF306376">
    <classification>
      <gene name="IGHV1-3"/>
      <gene name="IGHD4-4"/>
      <gene name="IGHJ4"/>
    </classification>
  </seqData>
  .. ..
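The effect of a response template can be pictured as a projection: from each matching entry, only the elements named in the template are kept. The Python sketch below applies that idea to the first entry of the expected result. The rule used here (keep attributes everywhere, recurse only into the children the template names) is an assumption about how the template is interpreted, and `project` is an illustrative name.

```python
import xml.etree.ElementTree as ET


def project(element, template):
    """Copy element, keeping only the children named (recursively) in template."""
    out = ET.Element(element.tag, dict(element.attrib))
    # Text is kept only at the leaves of the template.
    out.text = element.text if len(template) == 0 else None
    for t_child in template:
        for child in element.findall(t_child.tag):
            out.append(project(child, t_child))
    return out


# A full entry: values taken from the request and expected result on the slide.
full = ET.fromstring("""
<seqData id="AF306350">
  <identification><partIdent chainType="Ig"/></identification>
  <classification>
    <gene name="IGHV1-69"/><gene name="IGHD3-10"/><gene name="IGHJ6"/>
  </classification>
</seqData>
""")
template = ET.fromstring("<seqData><classification><gene/></classification></seqData>")
slim = project(full, template)
```

The projected entry keeps its `id` and the three `gene` elements, while `identification` is dropped because the template does not mention it.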
Tools: Junction Analysis
• Input: a sequence list; each sequence has a V gene and a J gene.
• Output: each sequence is annotated with the locations of the V, D and J genes and of the N and P regions, and the sequences are aligned.
[Diagram: seqDataList (seqData: classification) → IMGTjcta → seqDataList (seqData: annotation, alignement, proSequence)]
Literature document queries (ORIEL project)
How can the IMGT-ONTOLOGY vocabulary be used to index literature documents?
• Authors do not use this vocabulary.
• GO terms are too limited to describe genes involved in immunoglobulin and T cell receptor synthesis.
• Existing search engines do not index text with this vocabulary.
SEFID Prototype search engine pipeline
[Diagram of the pipeline, from an IMGT-ML query to full text publications.
Stages: IMGT-ML query; IMGT/LIGM-DB; SOAP server at EBI; Doc2Loc; Location (NCBI, INIST, SUDOC…); Text analysis (LT-POS, …); Statistics analysis (Oriel server); General purpose search engine (E-BioSci, Collexis, …); Full text publications.
Data exchanged: Literature ref.; Accession #s; Medline #s; Abstract locations; Abstracts; Useful words; Word signatures.
Servers: IMGT SOAP server; EBI SOAP server (can be IMGT or DDBJ SOAP server as well); E-BioSci Doc2Loc SOAP server; E-BioSci SOAP server; Oriel SOAP server implementing LT-Chunk from LTG (Edinburgh) (home development); Oriel SOAP server, word frequencies (fingerprint) (home development); E-BioSci SOAP server or any other search engine (Google).]
Conclusion
• IMGT-ML, being very close to IMGT-ONTOLOGY, is consistent with the biology it describes.
• Using the same IMGT-ML format as input and output of modules (Web services) allows them to be chained.
• This eases the development of IMGT-CHOREOGRAPHY, which is our next planned development.
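The chaining argument can be illustrated with a small Python sketch: when every module takes an XML document and returns one in the same format, a pipeline is just function composition. Both modules below are hypothetical stand-ins for Web services, and the sample document is invented for the example; only the `seqDataList`, `seqData`, `classification` and `gene` element names and the `nb` attribute appear in the slides.

```python
import xml.etree.ElementTree as ET
from functools import reduce


def keep_classified(doc: str) -> str:
    """Hypothetical module: keep the seqData entries that carry a gene classification."""
    root = ET.fromstring(doc)
    for seq in list(root):
        if seq.find("classification/gene") is None:
            root.remove(seq)
    return ET.tostring(root, encoding="unicode")


def count_entries(doc: str) -> str:
    """Hypothetical module: record the number of entries in the nb attribute."""
    root = ET.fromstring(doc)
    root.set("nb", str(len(root)))
    return ET.tostring(root, encoding="unicode")


def chain(*modules):
    """Compose modules that share one input and output format."""
    return lambda doc: reduce(lambda current, module: module(current), modules, doc)


pipeline = chain(keep_classified, count_entries)
result = pipeline(
    '<seqDataList>'
    '<seqData id="A"><classification><gene name="IGHV1-69"/></classification></seqData>'
    '<seqData id="B"/>'
    '</seqDataList>'
)
```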
People
• Kora Combres (CNRS/IGH, ORIEL project)
• Véronique Giudicelli (CNRS/IGH, IMGT-ONTOLOGY)
• Professor Marie-Paule Lefranc (UM II, CNRS/IGH, IMGT project)
• Denys Chaume (CNRS/IGH, IMGT, ORIEL)