The C²M system
2
The C²M system
Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren
ECCI
CTIT, University of Twente, Netherlands
[email protected]@utwente.nl
3
Setting
• Scientist working with multiple, heterogeneous resources like
  º Databases
  º Knowledge bases
  º Programs
• Task requires co-operation of resources
• Whether resources are in-house or remote makes no difference
4
SciDashboard™
• Long-term vision: scientist's dashboard
• SciDashboard™ allows scientist to visually:
  º Select resources
  º Connect resources
  º Identify sources and sinks
  º Specify data transformations underway
• C²M first step towards SciDashboard™
5
Co-operating resources
• First problem: format multiplicity
• Format multiplicity is unavoidable
  º Standardisation is a social process with high stakes
  º No format caters for all needs
• Second problem: combining resources
  º Merging, comparing, deduplicating
6
Format multiplicity
• Chemical example: molecular structure files
[Figure: structure diagram of an example molecule; atom labels O, N, NH, NH, O, O]
7
Molecular structure files
• About 20 formats in daily use, for example:
  º MDL Molfile (MOL)
  º Connection table (CT)
  º Standard Molecular Description file (SMD)
• Almost all formats specify plaintext files with record-field structure
• Delimiters often space and newline characters
8
CT-file ethanol CH₃CH₂OH
ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
10
CT-file ethanol CH₃CH₂OH
ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C (1)
0.0000 0.2500 0.0000 C (2)
0.8667 -0.2500 0.0000 O (3)
1 2 1 1
2 3 1 1
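The record-field structure of the CT example can be read with a few lines of Python. This is a minimal sketch, not part of C²M: it assumes the first line is a title, the second line holds the atom and bond counts, each atom record is "x y z element", and each bond record starts with two 1-based atom numbers followed by a field read here as the bond order. The fourth bond field is ignored because the slides do not explain it. The names `Atom`, `Bond` and `parse_ct` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int   # 1-based number of the first atom
    second: int  # 1-based number of the second atom
    order: int   # third field of the bond record (assumed to be the bond order)


def parse_ct(text: str) -> tuple[str, list[Atom], list[Bond]]:
    """Parse a CT file laid out like the ethanol example (assumed layout)."""
    lines = [line for line in text.splitlines() if line.strip()]
    title = lines[0].strip()
    # Fields are delimited by spaces, records by newlines.
    n_atoms, n_bonds = (int(field) for field in lines[1].split()[:2])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), element))
    bonds = []
    for line in lines[2 + n_atoms:2 + n_atoms + n_bonds]:
        first, second, order = (int(field) for field in line.split()[:3])
        bonds.append(Bond(first, second, order))
    return title, atoms, bonds


ETHANOL_CT = """ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
"""

title, atoms, bonds = parse_ct(ETHANOL_CT)
print(title, [atom.element for atom in atoms], bonds)
```

A hand-written reader like this is exactly the kind of "roll your own" wrapper that has to be rewritten for every new format.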
11
MOL-file ethanol CH₃CH₂OH
ethanol.mol
ChemDraw03070310372D
3 2 0 0 0 0 0 0 0 0999 V2000
-1.2975 -0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0025 0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 -0.3750 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
12
Solving format multiplicity: Wrappers
[Figure: user desktop connected to several wrappers]
13
Wrappers
• Wrapper tools exist such as
  º Chemistry: Babel, ChemDraw
  º Molecular biology: SRS
  º Bibliography management: EndNote, bp
• Disadvantage: adding a new format is impossible or very difficult
• "Roll your own" wrappers: awk, perl
• Difficult to maintain
14
Wrapper generators
• Basic idea: produce wrapper from high-level description of formats
• Often two-step process: A → R → B with R an internal representation
• Obvious argument: two-step process takes fewer converters than direct conversion
• Disadvantage: R fixed and dedicated
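The converter-count argument can be made concrete with a little arithmetic. The sketch below assumes every format must be convertible to every other format and that converters are one-directional. The function names are illustrative only.

```python
def direct_converters(n_formats: int) -> int:
    """One converter per ordered pair (source, target) of distinct formats."""
    return n_formats * (n_formats - 1)


def two_step_converters(n_formats: int) -> int:
    """One reader (format -> R) and one writer (R -> format) per format."""
    return 2 * n_formats


for n in (2, 3, 4, 20):
    print(n, direct_converters(n), two_step_converters(n))
```

Under these assumptions the two-step route pays off once there are more than three formats. For the roughly 20 molecular structure formats mentioned earlier it needs 40 converters instead of 380.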
15
Preparing for middleware
• Keyword: modularisation
• Stakeholders are responsible for their own specifications, for example:
  º Content provider offers syntactic format description
  º User determines internal representation
• Internal representation allows combination of resources
16
The C²M system
• C²M: chemical configurable middleware
• Implemented in Quintus Prolog
• Current state: a wrapper generator
• Wrappers produced from high-level specifications of formats and internal representation
• Internal representation chosen by user, if desired per task
• C²M can be extended to middleware
17
Current C²M is …
• a specification language
  º for specifying the format of foreign files
  º for specifying the internal representation
• a programming language
  º for programming wrappers by means of specifications
  º for inserting copious documentation
• a system
  º for producing wrappers and their documentation
18
C²M system overview
[Figure: system overview; labelled components: specs, compiler, code generator, documenter, modules, core, runtime system, converter, code, docs for humans]
19
File conversion by C²M
[Figure: file in → read → internal repr → write → file out, with format A spec, ontology spec and format B spec as inputs]
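The read and write steps around a shared internal representation can be sketched in Python. This is a toy illustration under stated assumptions, not the C²M implementation (which is written in Quintus Prolog and generates its readers and writers from specifications). The two formats, the `readers` and `writers` tables and the `convert` function are all hypothetical, and the internal representation is simply a list of words.

```python
from typing import Callable

# Internal representation for this toy example: a list of words.
Sentence = list[str]

# One reader per format: file text -> internal representation.
readers: dict[str, Callable[[str], Sentence]] = {
    "space-separated": lambda text: text.split(),
    "comma-separated": lambda text: [w.strip() for w in text.split(",") if w.strip()],
}

# One writer per format: internal representation -> file text.
writers: dict[str, Callable[[Sentence], str]] = {
    "space-separated": " ".join,
    "comma-separated": ",".join,
}


def convert(text: str, source: str, target: str) -> str:
    """Convert via the internal representation: read, then write."""
    internal = readers[source](text)
    return writers[target](internal)


print(convert("a b c", "space-separated", "comma-separated"))
```

Adding a format means adding one reader and one writer. No existing entry has to change.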
20
C²M specifications
• Two kinds of specifications:
  º Specification of internal representation
  º Specification of file format
  each in a file of its own
• Internal representation: ontology
• File format specification: read-only, write-only, or both read and write
21
Language design principles
• Adhere to well-known designs
  º HTML (tags and tag attributes)
  º context-free grammar (as in BNF)
  º functions
• Use or mimic well-known symbols
  º grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF)
  º instantiation: lhs <- funct(arg1, arg2) (←)
22
Ontology
• Frame system
• Tree structure with concepts and attributes
• Three kinds of concepts:
  º concept1 = concept2 concept3 concept4
  º concept1 = repeated(concept2)
  º primitive concepts (leaves)
• Leaves hold information
23
Ontology example
<C2M-SPECIFICATION type="ontology"
name="simple-ont">
<ONTOLOGY>
sentence = repeated(word)
</ONTOLOGY>
</C2M-SPECIFICATION>
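The three kinds of concepts map naturally onto a small tree data structure. The Python sketch below is one possible reading, not C²M's internal data model. The class names `Primitive`, `Repeated` and `Composite` and the helper `leaf_names` are assumptions made for illustration, and concept attributes are left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Primitive:
    """A leaf concept; leaves are where the information is held."""
    name: str


@dataclass
class Repeated:
    """concept1 = repeated(concept2)"""
    name: str
    item: "Concept"


@dataclass
class Composite:
    """concept1 = concept2 concept3 concept4"""
    name: str
    parts: list["Concept"]


Concept = Union[Primitive, Repeated, Composite]


def leaf_names(concept: Concept) -> list[str]:
    """Collect the names of the primitive concepts below a concept."""
    if isinstance(concept, Primitive):
        return [concept.name]
    if isinstance(concept, Repeated):
        return leaf_names(concept.item)
    names = []
    for part in concept.parts:
        names.extend(leaf_names(part))
    return names


# The "simple-ont" ontology: sentence = repeated(word)
simple_ont = Repeated("sentence", Primitive("word"))
print(leaf_names(simple_ont))
```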
24
File format specification
• File format specification: grammar + semantic bindings
• Grammar specifies structure
• System uses grammar to produce parse tree
• Semantic bindings map nodes in parse tree onto concepts in internal representation
25
File format spec example
<C2M-SPECIFICATION type="file-format"
name="simple-form">
<READGRAM>
....
</READGRAM>
<SBREAD>
....
</SBREAD>
....
</C2M-SPECIFICATION>
26
File format spec: readgram
<READGRAM>
<ULG>
line -> string
line -> sp-string+
sp-string -> spaces string
</ULG>
<LLG>
spaces -> space+
string -> printable-char+
</LLG>
</READGRAM>
27
File format spec: sbread
<SBREAD>
sentence =^ line
word <- identity(string)
</SBREAD>
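How a read grammar and its semantic bindings cooperate can be sketched by hand in Python. This is an illustration under assumptions, not code generated by C²M. It implements the grammar rules exactly as written (a line is either a single string or one or more space-prefixed strings), approximates "printable-char" as any non-whitespace character, reads `sentence =^ line` as "the line node yields the sentence concept", and reads `word <- identity(string)` as "each string node becomes a word unchanged". The names `parse_line`, `identity` and `read_sentence` are hypothetical.

```python
import re

STRING = re.compile(r"\S+")           # string -> printable-char+
SP_STRING = re.compile(r"( +)(\S+)")  # sp-string -> spaces string


def parse_line(line: str) -> list[str]:
    """Return the string nodes found when parsing one line."""
    if STRING.fullmatch(line):        # line -> string
        return [line]
    strings = []
    pos = 0
    while pos < len(line):            # line -> sp-string+
        match = SP_STRING.match(line, pos)
        if match is None:
            raise ValueError(f"line does not match the grammar at position {pos}")
        strings.append(match.group(2))
        pos = match.end()
    if not strings:
        raise ValueError("empty line does not match the grammar")
    return strings


def identity(value: str) -> str:
    return value


def read_sentence(line: str) -> dict[str, list[str]]:
    """Apply the bindings: sentence =^ line, word <- identity(string)."""
    return {"sentence": [identity(s) for s in parse_line(line)]}


print(read_sentence(" the quick fox"))
```

Note that under the grammar as written, a line such as "hello world" is rejected, because the first string is not preceded by spaces.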
28
Claims
C²M is
• sufficiently expressive
• fully declarative
• a literate programming environment (specification and documentation in one)
• easy to learn
• amenable to division of labour
29
Claims (contd.)
• Compared to ChemDraw and the like, C²M:
  º Allows for easy addition of new formats
  º Allows format specifications to be reused
  º Prepares for true middleware
• Compared to "roll-your-own" wrappers, C²M:
  º Facilitates reuse and adaptation
  º Facilitates extensive documentation
30
To be done (short term)
• Stabilise system
• Experiment
• Provide extensive manual and documentation
• Prepare system for others to experiment
  º But current version implemented in proprietary software platform
31
To be done (long term)
• Test language by means of user surveys
• Develop version 2
• Version x may well be wholly visual
• Embed system in larger environment
  º SciDashboard™
  º "Habitable Interfaces"