
Page 1: The C2M system. Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren. ECCI, CTIT, University of Twente, Netherlands. p.e.vandervet@utwente.nl

The C2M system

Page 2: The C2M system

Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren

ECCI
CTIT, University of Twente, Netherlands

p.e.vandervet@utwente.nl

Page 3: Setting

• Scientist working with multiple, heterogeneous resources like
  º Databases
  º Knowledge bases
  º Programs
• Task requires co-operation of resources
• Resources in-house or remote makes no difference

Page 4: SciDashboard™

• Long-term vision: scientist’s dashboard
• SciDashboard™ allows scientist to visually:
  º Select resources
  º Connect resources
  º Identify sources and sinks
  º Specify data transformations underway
• C2M first step towards SciDashboard™

Page 5: Co-operating resources

• First problem: format multiplicity
• Format multiplicity is unavoidable
  º Standardisation social process with high stakes
  º No format caters for all needs
• Second problem: combining resources
  º Merging, comparing, deduplicating

Page 6: Format multiplicity

• Chemical example: molecular structure files

[Figure: molecular structure diagram (atom labels O, N, NH)]

Page 7: Molecular structure files

• About 20 formats in daily use, for example:
  º MDL Molfile (MOL)
  º Connection table (CT)
  º Standard Molecular Description file (SMD)
• Almost all formats specify plaintext files with record-field structure
• Delimiters often space and newline characters

Page 8: CT-file ethanol CH3CH2OH

ethanol.ct

3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1


Page 10: CT-file ethanol CH3CH2OH

ethanol.ct

3 2
-0.8667 -0.2500 0.0000 C (1)
0.0000 0.2500 0.0000 C (2)
0.8667 -0.2500 0.0000 O (3)
1 2 1 1
2 3 1 1
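
The record-field structure of the CT example can be read with a few lines of code. The following is a minimal sketch, assuming the simplified layout shown on the slide (a counts line, then one line per atom, then one line per bond) and leaving out the file name line and the atom numbers in parentheses. The names `Atom` and `parse_ct` are hypothetical and not part of C2M.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


def parse_ct(text):
    """Parse the counts line, atom records and bond records of the slide's layout."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    n_atoms, n_bonds = int(lines[0][0]), int(lines[0][1])
    atoms = [Atom(float(x), float(y), float(z), el)
             for x, y, z, el in lines[1:1 + n_atoms]]
    # Each bond record: first atom, second atom, then two further integers
    # whose meaning the slide does not state; they are kept unchanged.
    bonds = [tuple(int(v) for v in rec)
             for rec in lines[1 + n_atoms:1 + n_atoms + n_bonds]]
    return atoms, bonds


ETHANOL_CT = """\
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
"""

atoms, bonds = parse_ct(ETHANOL_CT)
```

Here `atoms` holds the three atom records in file order, so the bond records can refer to them by position (1, 2, 3).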

Page 11: MOL-file ethanol CH3CH2OH

ethanol.mol
ChemDraw03070310372D

3 2 0 0 0 0 0 0 0 0999 V2000
-1.2975 -0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0025 0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 -0.3750 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
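
The counts line of the MOL example shows why a wrapper cannot always split on spaces: in the MDL V2000 Molfile layout the counts line consists of eleven three-character fields followed by a version tag, so `0` and `999` run together as `0999`. The sketch below illustrates fixed-width reading; the function name `parse_counts_line` is hypothetical, and the sample line is rebuilt with three-character padding because the transcript does not preserve the original spacing.

```python
def parse_counts_line(line):
    """Split a V2000 counts line into eleven 3-character integers plus the version tag."""
    fields = [int(line[i:i + 3]) for i in range(0, 33, 3)]
    version = line[33:39].strip()
    return fields, version


# Rebuild the slide's counts line with right-aligned 3-character fields.
values = (3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 999)
counts_line = "".join(f"{v:>3}" for v in values) + " V2000"

counts, version = parse_counts_line(counts_line)
# Whitespace splitting fuses two fields into "0999" and so loses one field.
naive_tokens = counts_line.split()
```

Reading by column position recovers all eleven counts, whereas `naive_tokens` contains one token fewer than the line has fields.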

Page 12: Solving format multiplicity: Wrappers

[Figure: diagram of a user desktop and four wrappers]

Page 13: Wrappers

• Wrapper tools exist such as
  º Chemistry: Babel, ChemDraw
  º Molecular biology: SRS
  º Bibliography management: EndNote, bp
• Disadvantage: adding new format impossible or very difficult
• “Roll your own” wrappers: awk, perl
• Difficult to maintain

Page 14: Wrapper generators

• Basic idea: produce wrapper from high-level description of formats
• Often two-step process: A → R → B with R an internal representation
• Obvious argument: two-step process takes fewer converters than direct conversion
• Disadvantage: R fixed and dedicated
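
The converter-count argument can be made concrete with a small calculation. This sketch assumes that each converter works in one direction only and that every format must be both readable and writable; the function names are illustrative.

```python
def direct_converters(n):
    """Converters needed when each format converts directly to every other format."""
    return n * (n - 1)


def two_step_converters(n):
    """Converters needed via R: one reader (A -> R) and one writer (R -> B) per format."""
    return 2 * n


# With the roughly 20 molecular structure formats mentioned earlier in the deck:
direct = direct_converters(20)
two_step = two_step_converters(20)
```

Under these assumptions 20 formats need 380 direct converters but only 40 readers and writers. The two-step route only pays off with more than three formats, since for three formats both counts equal 6.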

Page 15: Preparing for middleware

• Keyword: modularisation
• Stakeholders are responsible for their own specifications, for example:
  º Content provider offers syntactic format description
  º User determines internal representation
• Internal representation allows combination of resources

Page 16: The C2M system

• C2M: chemical configurable middleware
• Implemented in Quintus Prolog
• Current state: a wrapper generator
• Wrappers produced from high-level specifications of formats and internal representation
• Internal representation chosen by user, if desired per task
• C2M can be extended to middleware

Page 17: Current C2M is …

• a specification language
  º for specifying the format of foreign files
  º for specifying the internal representation
• a programming language
  º for programming wrappers by means of specifications
  º for inserting copious documentation
• a system
  º for producing wrappers and their documentation

Page 18: C2M system overview

[Figure: system overview diagram. Labels: specs, compiler, code generator, documenter, core, modules, runtime system, converter code, docs for humans]

Page 19: File conversion by C2M

[Figure: file in → read → internal repr → write → file out, with a format A spec, an ontology spec and a format B spec above the three stages]
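
The read/write flow in this figure can be sketched as two functions meeting at a shared internal representation. This is an illustration only: the two toy formats, the dictionary used as internal representation, and all function names are assumptions, and the readers and writers are hand-written here rather than generated from specifications as in C2M.

```python
def read_comma_words(text):
    """Hypothetical reader for 'format A': comma-separated words."""
    return {"sentence": [w.strip() for w in text.split(",") if w.strip()]}


def write_space_words(internal):
    """Hypothetical writer for 'format B': words separated by single spaces."""
    return " ".join(internal["sentence"])


def convert(text, read, write):
    """file in -> read -> internal representation -> write -> file out."""
    internal = read(text)
    return write(internal)


converted = convert("the, quick,fox", read_comma_words, write_space_words)
```

Because reader and writer only share the internal representation, either one can be swapped for another format without touching the other.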

Page 20: C2M specifications

• Two kinds of specifications:
  º Specification of internal representation
  º Specification of file format
  each in a file of its own
• Internal representation: ontology
• File format specification: read-only, write-only, or both read and write

Page 21: Language design principles

• Adhere to well-known designs
  º HTML (tags and tag attributes)
  º context-free grammar (as in BNF)
  º functions
• Use or mimic well-known symbols
  º grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF)
  º instantiation: lhs <- funct(arg1, arg2) (←)

Page 22: Ontology

• Frame system
• Tree structure with concepts and attributes
• Three kinds of concepts:
  º concept1 = concept2 concept3 concept4
  º concept1 = repeated(concept2)
  º primitive concepts (leaves)
• Leaves hold information
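
The three kinds of concepts map naturally onto a small tree data structure. The sketch below is one possible modelling, not the C2M implementation (which the deck says is written in Quintus Prolog); the class names and the `molecule` example are hypothetical, and frame attributes are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Primitive:
    """Leaf concept: the place where information is held."""
    name: str


@dataclass
class Repeated:
    """concept1 = repeated(concept2)"""
    name: str
    element: "Concept"


@dataclass
class Composite:
    """concept1 = concept2 concept3 concept4"""
    name: str
    parts: List["Concept"] = field(default_factory=list)


Concept = Union[Primitive, Repeated, Composite]


def leaves(concept):
    """Return the names of all primitive concepts at or below the given concept."""
    if isinstance(concept, Primitive):
        return [concept.name]
    if isinstance(concept, Repeated):
        return leaves(concept.element)
    return [name for part in concept.parts for name in leaves(part)]


sentence = Repeated("sentence", Primitive("word"))
molecule = Composite("molecule", [
    Repeated("atoms", Primitive("atom")),
    Repeated("bonds", Primitive("bond")),
])
```

Calling `leaves(molecule)` walks the tree down to the primitive concepts, which is where a wrapper would store the values read from a file.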

Page 23: Ontology example

<C2M-SPECIFICATION type="ontology"
  name="simple-ont">
<ONTOLOGY>
sentence = repeated(word)
</ONTOLOGY>
</C2M-SPECIFICATION>

Page 24: File format specification

• File format specification: grammar + semantic bindings
• Grammar specifies structure
• System uses grammar to produce parse tree
• Semantic bindings map nodes in parse tree onto concepts in internal representation

Page 25: File format spec example

<C2M-SPECIFICATION type="file-format"
  name="simple-form">
<READGRAM>
....
</READGRAM>
<SBREAD>
....
</SBREAD>
....
</C2M-SPECIFICATION>

Page 26: File format spec: readgram

<READGRAM>
<ULG>
line -> string
line -> sp-string+
sp-string -> spaces string
</ULG>
<LLG>
spaces -> space+
string -> printable-char+
</LLG>
</READGRAM>
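
To show how such a grammar yields a parse tree, here is a minimal hand-written recogniser for the rules as they appear in this transcript. It rests on several assumptions: the ULG and LLG blocks are read as higher-level and character-level rules respectively, `printable-char` is taken to mean a non-blank printable ASCII character, and the tuple shape of the tree is invented for the example. C2M generates its parsers from the specification; this sketch does not.

```python
import re

# Character-level rules (LLG block):
#   spaces -> space+
#   string -> printable-char+   (assumption: non-blank printable ASCII)
SPACES = r" +"
STRING = r"[!-~]+"

# Higher-level rules (ULG block) as transcribed:
#   line -> string
#   line -> sp-string+
#   sp-string -> spaces string
SP_STRING = re.compile(rf"({SPACES})({STRING})")
LINE_STRING = re.compile(rf"{STRING}\Z")
LINE_SP_STRINGS = re.compile(rf"(?:{SPACES}{STRING})+\Z")


def parse_line(text):
    """Return a (label, children) parse tree for one line, or None if it does not parse."""
    if LINE_STRING.match(text):
        return ("line", [("string", text)])
    if LINE_SP_STRINGS.match(text):
        children = [
            ("sp-string", [("spaces", m.group(1)), ("string", m.group(2))])
            for m in SP_STRING.finditer(text)
        ]
        return ("line", children)
    return None


tree = parse_line(" hello world")
```

With the rules exactly as transcribed, a line is either a single string or a sequence of space-prefixed strings, so `"hello world"` without a leading space is rejected.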

Page 27: File format spec: sbread

<SBREAD>
sentence =^ line
word <- identity(string)
</SBREAD>
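
The two bindings can be read as instructions for turning a parse tree into the internal representation. The sketch below is an interpretation, not the C2M semantics: it reads `sentence =^ line` as "the concept sentence corresponds to a line node" and `word <- identity(string)` as "each word is instantiated by applying the identity function to a string node". The tree shape and the dictionary output are hypothetical.

```python
def identity(value):
    """The identity function named in the binding: returns its argument unchanged."""
    return value


def strings_in(node):
    """Yield the text of every 'string' node in a (label, payload) parse tree."""
    label, payload = node
    if label == "string":
        yield payload
    elif isinstance(payload, list):
        for child in payload:
            yield from strings_in(child)


def bind_sentence(line_node):
    """Apply the two bindings: sentence =^ line, word <- identity(string)."""
    if line_node[0] != "line":
        raise ValueError("the sentence binding expects a 'line' node")
    return {"sentence": [{"word": identity(s)} for s in strings_in(line_node)]}


# Hand-written parse tree for the line " hello world" (hypothetical shape).
tree = ("line", [
    ("sp-string", [("spaces", " "), ("string", "hello")]),
    ("sp-string", [("spaces", " "), ("string", "world")]),
])
result = bind_sentence(tree)
```

The `spaces` nodes have no binding and are simply dropped, which is how delimiters disappear on the way into the internal representation.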

Page 28: Claims

C2M is
• sufficiently expressive
• fully declarative
• a literate programming environment (specification and documentation in one)
• easy to learn
• amenable to division of labour

Page 29: Claims (contd.)

• Compared to ChemDraw and the like, C2M:
  º Allows for easy addition of new formats
  º Format specifications can be reused
  º Prepares for true middleware
• Compared to “roll-your-own” wrappers, C2M:
  º Facilitates reuse and adaptation
  º Facilitates extensive documentation

Page 30: To be done (short term)

• Stabilise system
• Experiment
• Provide extensive manual and documentation
• Prepare system for others to experiment
  º But current version implemented in proprietary software platform

Page 31: To be done (long term)

• Test language by means of user surveys
• Develop version 2
• Version x may well be wholly visual
• Embed system in larger environment
  º SciDashboard™
  º “Habitable Interfaces”