Mining Usage Patterns from a Repository of Scientific Workflows
DESCRIPTION
Full paper: http://boole.diiga.univpm.it/paper/sac2012.pdf
In many experimental domains, especially e-Science, workflow management systems are gaining increasing attention as a means to design and execute in-silico experiments involving data analysis tools. As a by-product, a repository of workflows is generated that formally describes experimental protocols and the way different tools are combined inside experiments. In this paper we describe the use of the SUBDUE graph clustering algorithm to discover sub-workflows from a repository. Since sub-workflows represent significant usage patterns of tools, the discovered knowledge can be exploited by scientists to learn by example about design practices, or to retrieve and reuse workflows. Such knowledge ultimately leverages the potential of scientific workflow repositories to become a knowledge asset. A set of experiments is conducted on the myExperiment repository to assess the effectiveness of the approach.
TRANSCRIPT
Mining Usage Patterns from a Repository of Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
DII, Università Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21
Introduction
Process mining techniques allow for extracting knowledge from event logs, i.e. traces of running processes:
- audit trails of a workflow management system
- transaction logs of an enterprise resource planning system
Main applications:
- process discovery: what is really happening?
- conformance check: are we doing what was agreed upon?
- process extension: how can we redesign the process?
Motivation
Little work on how to extract knowledge from a set of models (i.e. process schemas):
- understanding of common patterns of usage (best/worst practices)
- support in process design
- process integration (from multiple sources)
- process optimization/re-engineering
- indexing of processes and their retrieval
Methodology
Process schemas have an inherent graph structure: many graph-mining tools are available
Approach:
1. representation of original processes as graphs
2. use of graph-mining techniques to extract SUBs (substructures, i.e. sub-processes)
Which representation form?
Which specific technique? Graph clustering techniques (sub-processes are clusters) with a hierarchical approach (various levels of abstraction, more/less specificity)
SUBDUE
SUBDUE [Jonyer, 2000]:
- graph-based hierarchical clustering algorithm
- suited for discrete-valued and structured data
- searches for substructures (i.e., subgraphs) that best compress the input graph, according to the Minimum Description Length (MDL) principle
Comments:
- to compress G with a SUB: to replace every occurrence of SUB in G with a single new node
- SUB best compresses G: the number of bits needed to represent G after the compression is minimum
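The compression step can be illustrated with a small sketch. It is not SUBDUE's implementation: the function names, the toy workflows and the `size` measure (vertices plus edges, used here as a crude stand-in for the number of bits in a real MDL encoding) are all assumptions made for illustration, and occurrences are assumed not to overlap.

```python
def compress(vertices, edges, occurrences, sub_label="SUB_1"):
    """Replace each occurrence (a set of vertex ids) with one new node.

    Assumes the occurrences do not overlap.
    """
    vertices = dict(vertices)   # vertex id -> label (copied, input untouched)
    mapping = {}                # replaced vertex id -> id of the new node
    for i, occurrence in enumerate(occurrences):
        new_id = f"{sub_label}#{i}"
        for v in occurrence:
            mapping[v] = new_id
            del vertices[v]
        vertices[new_id] = sub_label
    new_edges = set()
    for src, dst, label in edges:
        s, d = mapping.get(src, src), mapping.get(dst, dst)
        if src in mapping and s == d:
            continue            # edge internal to an occurrence disappears
        new_edges.add((s, d, label))
    return vertices, new_edges


def size(vertices, edges):
    """Crude proxy for description length: number of vertices plus edges."""
    return len(vertices) + len(edges)


# Two toy workflows (hypothetical) sharing the fragment fetch -> align.
vertices = {1: "fetch", 2: "align", 3: "plot", 4: "fetch", 5: "align", 6: "export"}
edges = {(1, 2, "seq"), (2, 3, "seq"), (4, 5, "seq"), (5, 6, "seq")}
occurrences = [{1, 2}, {4, 5}]

c_vertices, c_edges = compress(vertices, edges, occurrences)
before = size(vertices, edges)
# Cost after compression = compressed graph + the SUB definition itself
# (2 vertices and 1 edge in this toy case).
after = size(c_vertices, c_edges) + 3
print(before, after)
```

Under this proxy the SUB is worth keeping only when the compressed graph plus the SUB definition is smaller than the original graph.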
SUBDUE (cont’d)
1. searches for the best SUB (i.e., the one that minimizes the Description Length of the graph)
2. compresses the graph by using the best SUB
Subsequent SUBs may be defined in terms of previously defined SUBs. Iterating this basic step results in a lattice.
SUBDUE (cont’d)
Traditional cluster evaluation is not applicable to hierarchical domains:
quality = interClusterDistance / intraClusterDistance
Desirable features of a good clustering:
- few and big clusters (large coverage, good generality)
- minimal/no overlap (better defined concepts)
New indexes:
- completeness: % of original nodes/edges still present in the final lattice
- representativeness (of a SUB): % of input graphs holding the SUB at least once
- significance: qualitative measure of the meaningfulness of a cluster w.r.t. the domain and application
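The two quantitative indexes can be sketched as follows. This is one possible reading, not the authors' code: each workflow is modelled as a set of labelled edges, task labels are assumed unique within a workflow (so "holding a SUB" reduces to set inclusion rather than subgraph isomorphism), and completeness is computed over edges only. All names and data are hypothetical.

```python
def representativeness(sub, graphs):
    """Percentage of input graphs that hold the SUB at least once."""
    holding = sum(1 for g in graphs if sub <= g)
    return 100.0 * holding / len(graphs)


def completeness(subs, graphs):
    """Percentage of original edges that appear in at least one SUB."""
    covered_edges = set().union(*subs)
    total = sum(len(g) for g in graphs)
    covered = sum(len(g & covered_edges) for g in graphs)
    return 100.0 * covered / total


# Three toy workflows, each a set of (source task, target task) edges.
graphs = [
    {("fetch", "align"), ("align", "plot")},
    {("fetch", "align"), ("align", "export")},
    {("load", "filter")},
]
sub = {("fetch", "align")}

rep = representativeness(sub, graphs)   # held by 2 of the 3 workflows
comp = completeness([sub], graphs)      # 2 of the 5 original edges are covered
print(round(rep, 2), comp)
```

Significance is left out on purpose: the slide describes it as a qualitative judgement, so it has no formula to compute.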
Representation
Approach:
1. representation of original processes as graphs
2. use of graph-mining techniques to extract SUBs
Common operators:
- sequence
- split-AND (parallel split), split-XOR (exclusive choice)
- join-AND (synchronization), join-XOR (simple merge)
How to map the process to the graph? Several choices
Representation models: A rule
- lowest level of compactness
- every operator is represented by a node called operator, linked to another node specifying its type (AND/OR/SEQ) and to its operands
- join/split are distinguishable by the number of in/out-going arcs
Representation models: B rule
- medium level of compactness
- the closest to the original representation
- join/split are replaced by different nodes, one for each kind of operator
Representation models: C rule
- highest level of compactness
- implicit representation of operators
- multiple alternative graphs in case of SPLIT-XOR or JOIN-XOR
- arcs are labeled to preserve information about domain/range nodes
Evaluation
Comments:
- the three representations hold the same information, with no loss
- easy extension with more attributes (actor name/role, info about time/resources, ...):
  - attribute name → an edge
  - value → a node
Experimentation is needed to assess which one performs best w.r.t. the quality of clustering.
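The attribute extension mentioned in the comments (attribute name as an edge, value as a node) could look like the sketch below. The graph encoding, the `add_attributes` helper, the id scheme and the example attributes are assumptions for illustration only.

```python
def add_attributes(vertices, edges, task_id, attributes):
    """Attach attributes to a task: one value node and one named edge each."""
    for name, value in attributes.items():
        value_id = f"{task_id}/{name}"          # hypothetical id scheme
        vertices[value_id] = str(value)         # the value becomes a node label
        edges.add((task_id, value_id, name))    # the attribute name labels the edge


# A single task node, then two illustrative attributes.
vertices = {"t1": "align"}
edges = set()
add_attributes(vertices, edges, "t1", {"actor": "biologist", "duration_min": 12})
print(sorted(vertices.items()))
print(sorted(edges))
```

Because the attributes become ordinary nodes and edges, the same graph-mining step can run unchanged on the enriched graphs.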
myExperiment
Dataset:
- 258 processes (XScufl language for the Taverna framework)
- e-Science WFs for distributed computation over scientific data
- applications: local tools, Web Services, scripts
Setting:
- WF extraction/parsing
- preprocessing
- translation into graphs (A, B, C rules)
- SUBDUE execution
Experimental results
[Figure: fragment of the output lattice; red boxes mark top-level SUBs]
Experimental results
| | A rule | B rule | C rule |
|---|---|---|---|
| Size (nodes) | 5618 | 2871 | 2414 |
| SUBs (num) | 671 | 611 | 580 |
| top SUBs | 2.24% | 52.37% | 52.93% |
| Completeness | 95.41% | 90.90% | 90.27% |
| Representativeness (first 10 top SUBs) | 99.22% | 31.78% | 23.26% |
| Significance of top level clusters | - | + | ++ |
| Execution time (hh:mm) | 03:38 | 00:06 | 00:03 |

Considerations:
- A model: low significance, high execution time
- No definitively best choice between B and C:
  - B model: higher representativeness → indexing
  - C model: higher significance → usage pattern discovery
Use Case
Scenario: a scientist wants to use web service X for DNA alignment
Baseline: browse myExperiment to find every workflow with X:
- too many results
- given a workflow: common practice or exception?
New approach: search for X in the lattice’s SUBs:
1. find all usage patterns for X (i.e., top SUBs containing X)
2. analyse their representativeness
3. choose one of them or browse the lattice in more depth
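The first two steps of the new approach can be sketched as a lookup over the lattice. The `Sub` record, the tool names and the representativeness values below are hypothetical placeholders, not results from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sub:
    name: str
    tools: frozenset          # labels of the tools/tasks appearing in the SUB
    is_top: bool              # True for top-level SUBs of the lattice
    representativeness: float


def usage_patterns(lattice, tool):
    """Top-level SUBs containing the tool, most representative first."""
    hits = [s for s in lattice if s.is_top and tool in s.tools]
    return sorted(hits, key=lambda s: s.representativeness, reverse=True)


# Hypothetical lattice content.
lattice = [
    Sub("SUB_3", frozenset({"fetch_sequence", "X"}), True, 31.0),
    Sub("SUB_7", frozenset({"X", "render_tree"}), True, 12.5),
    Sub("SUB_9", frozenset({"fetch_sequence", "parse"}), True, 40.0),
    Sub("SUB_12", frozenset({"X"}), False, 55.0),
]

patterns = usage_patterns(lattice, "X")
for s in patterns:
    print(s.name, s.representativeness)
```

Step 3 stays with the scientist: pick one of the ranked patterns, or follow a pattern down to the lower-level SUBs it is built from.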
Use case (cont’d)
Applications:
- reference during process design (learn-by-example)
- process retrieval: find every myExperiment WF containing such a pattern
Discussion
A graph-based clustering technique (SUBDUE) to recognize the most frequent/common subprocesses among a set of workflows/process schemas.
Comments:
- several representation alternatives for the application of SUBDUE
- evaluations on a specialized e-Science domain
- useful to find typical patterns and schemas of usage for tools/tasks
Conclusion
Applications:
- recognition of reference processes (common/best/worst practices)
- organization of a process repository (by indexing processes through substructures)
- enterprise integration: find similarities (differences, overlapping, complementariness) in BPs of different companies
Future work:
- extending experimentation, especially with (real) Business Processes
- management of heterogeneity (syntactic/semantic, level of granularity)
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 19 / 21
![Page 51: Mining Usage Patterns from a Repository of Scientific Workflow](https://reader035.vdocuments.mx/reader035/viewer/2022081511/557d5488d8b42aba3d8b4574/html5/thumbnails/51.jpg)
Mining Usage Patterns from a Repositoryof Scientific Workflows
Claudia Diamantini, Domenico Potena, Emanuele Storti
DII, Universita Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30
Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 20 / 21
![Page 61: Mining Usage Patterns from a Repository of Scientific Workflow](https://reader035.vdocuments.mx/reader035/viewer/2022081511/557d5488d8b42aba3d8b4574/html5/thumbnails/61.jpg)
SUBDUE
Subdue uses a variant of beam search (heuristic)
Best SUB minimizes the Description Length of the graph
1 Search the best SUB
  Init: set of SUBs consisting of all uniquely labeled vertices
  extend each SUB in all possible ways by a single edge and a vertex
  order SUBs according to MDL principle
  keep only the top n SUBs
  End: exhaustion of search space or user constraints
2 Compress the graph with the best SUB
Subsequent SUBs may be defined in terms of previously defined SUBs
Iterating this basic step results in a lattice
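
The two steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SUBDUE implementation:

- instances are grouped by their sorted vertex and edge labels, a crude stand-in for a real graph-isomorphism test
- substructures are ranked by a size-based compression ratio rather than the true MDL encoding
- the names `best_substructure`, `compress` and `SUB_1`, as well as the toy graph, are hypothetical

```python
from collections import defaultdict

def vertex_disjoint(instances):
    """Greedily keep instances that share no vertex, in a deterministic order."""
    used, kept = set(), []
    for vs, es in sorted(instances, key=lambda i: (sorted(i[0]), sorted(i[1]))):
        if not vs & used:
            used |= vs
            kept.append((vs, es))
    return kept

def best_substructure(labels, edges, beam_width=3, max_edges=3):
    """Step 1: beam search for the substructure that compresses the graph most.

    labels: {vertex_id: label}; edges: set of directed (u, v) pairs.
    An instance is a pair (frozenset of vertices, frozenset of edges).
    """
    graph_size = len(labels) + len(edges)

    def key(inst):
        # Assumption: sorted labels approximate a canonical form (no isomorphism test).
        vs, es = inst
        return (tuple(sorted(labels[v] for v in vs)),
                tuple(sorted((labels[u], labels[v]) for u, v in es)))

    def value(instances):
        # Size-based proxy for MDL: size(G) / (size(S) + size(G compressed by S)).
        kept = vertex_disjoint(instances)
        sub_size = len(kept[0][0]) + len(kept[0][1])
        removed = sum(len(vs) + len(es) for vs, es in kept)
        return graph_size / (sub_size + graph_size - removed + len(kept))

    # Init: one substructure per unique vertex label.
    beam = defaultdict(set)
    for v in labels:
        inst = (frozenset([v]), frozenset())
        beam[key(inst)].add(inst)

    best = None  # (value, pattern key, vertex-disjoint instances)
    for _ in range(max_edges):  # user constraint on substructure size
        children = defaultdict(set)
        for instances in beam.values():
            for vs, es in instances:
                for u, v in edges:  # extend by one edge (and possibly one vertex)
                    if (u, v) not in es and (u in vs or v in vs):
                        child = (vs | {u, v}, es | {(u, v)})
                        children[key(child)].add(child)
        if not children:  # search space exhausted
            break
        ranked = sorted(children.items(), key=lambda kv: (-value(kv[1]), kv[0]))
        beam = dict(ranked[:beam_width])  # keep only the top n substructures
        top_key, top_instances = ranked[0]
        if best is None or value(top_instances) > best[0]:
            best = (value(top_instances), top_key, vertex_disjoint(top_instances))
    return best

def compress(labels, edges, instances, name="SUB_1"):
    """Step 2: collapse every instance into a single vertex labeled `name`."""
    owner = {v: f"{name}#{i}" for i, (vs, _) in enumerate(instances) for v in vs}
    new_labels = {v: lab for v, lab in labels.items() if v not in owner}
    new_labels.update({sub_id: name for sub_id in owner.values()})
    internal = {e for _, es in instances for e in es}
    new_edges = {(owner.get(u, u), owner.get(v, v)) for u, v in edges - internal}
    return new_labels, new_edges

# Hypothetical toy repository: three workflows stored as one disconnected graph.
labels = {"a1": "fetch", "a2": "blast", "a3": "parse", "a4": "plot",
          "b1": "fetch", "b2": "blast", "b3": "parse", "b4": "export",
          "c1": "fetch", "c2": "align"}
edges = {("a1", "a2"), ("a2", "a3"), ("a3", "a4"),
         ("b1", "b2"), ("b2", "b3"), ("b3", "b4"), ("c1", "c2")}

score, pattern, instances = best_substructure(labels, edges)
print(pattern[1], len(instances))
new_labels, new_edges = compress(labels, edges, instances)
print(len(new_labels), len(new_edges))
```

On this toy input the search selects the chain fetch, blast, parse, which occurs in two workflows. Calling `best_substructure` again on the compressed graph would allow later substructures to contain `SUB_1` vertices, which is how the hierarchy of substructures arises.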