Fuzzy Mining – Adaptive Process Simplification Based on Multi-Perspective Metrics
Christian W. Gunther and Wil M.P. van der Aalst
Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands
{c.w.gunther, w.m.p.v.d.aalst}@tue.nl
Abstract. Process Mining is a technique for extracting process models from execution logs. This is particularly useful in situations where people have an idealized view of reality. Real-life processes turn out to be less structured than people tend to believe. Unfortunately, traditional process mining approaches have problems dealing with unstructured processes. The discovered models are often "spaghetti-like", showing all details without distinguishing what is important and what is not. This paper proposes a new process mining approach to overcome this problem. The approach is configurable and allows for different faithfully simplified views of a particular process. To do this, the concept of a roadmap is used as a metaphor. Just like different roadmaps provide suitable abstractions of reality, process models should provide meaningful abstractions of operational processes encountered in domains ranging from healthcare and logistics to web services and public administration.
1 Introduction
Business processes, whether defined and prescribed or implicit and ad-hoc, drive and support most of the functions and services in enterprises and administrative bodies of today's world. For describing such processes, modeling them as graphs has proven to be a useful and intuitive tool. While modeling is well-established in process design, it is complicated to do for monitoring and documentation purposes. However, especially for monitoring, process models are valuable artifacts, because they allow us to communicate complex knowledge in an intuitive, compact, and high-level form.
Process mining is a line of research which attempts to extract such abstract, compact representations of processes from their logs, i.e. execution histories [1–3, 5–7, 10, 14]. The α-algorithm, for example, can create a Petri net process model from an execution log [2]. In recent years, a number of process mining approaches have been developed, which address the various perspectives of a process (e.g., control flow, social network), and use various techniques to generalize from the log (e.g., genetic algorithms, theory of regions [12, 4]). Applied to explicitly designed, well-structured, and rigidly enforced processes, these techniques are able to deliver an impressive set of information, yet their purpose is somewhat limited to verifying compliant execution. However, most processes in real life have not been purposefully designed and optimized, but have evolved over time or are not even explicitly defined. In such situations, the application of process mining is far more interesting, as it is not limited to re-discovering what we already know, but can be used to unveil previously hidden knowledge.
Over the last couple of years we have gained much experience in applying the tried-and-tested set of mining algorithms to real-life processes. Existing algorithms tend to perform well on structured processes, but often fail to provide insightful models for less structured processes. The phrase "spaghetti models" is often used to refer to the results of such efforts. The problem is not that existing techniques produce incorrect results. In fact, some of the more robust process mining techniques guarantee that the resulting model is "correct" in the sense that reality fits into the model. The problem is that the resulting model shows all details without providing a suitable abstraction. This is comparable to looking at the map of a country where all cities and towns are represented by identical nodes and all roads are depicted in the same manner. The resulting map is correct, but not very suitable. Therefore, the concept of a roadmap is used as a metaphor to visualize the resulting models. Based on an analysis of the log, the importance of activities and of relations among activities is taken into account. Activities and their relations can be clustered or removed depending on their role in the process. Moreover, certain aspects can be emphasized graphically, just like a roadmap emphasizes highways and large cities over dirt roads and small towns. As will be demonstrated in this paper, the roadmap metaphor allows for meaningful process models.
In this paper we analyze the problems traditional mining algorithms have with less-structured processes (Section 2), and use the metaphor of maps to derive a novel, more appropriate approach from these lessons (Section 3). We abandon the idea of performing process mining confined to one perspective only, and propose a multi-perspective set of log-based process metrics (Section 4). Based on these, we have developed a flexible approach for Fuzzy Mining, i.e. adaptively simplifying mined process models (Section 5).
2 Less-Structured Processes – The Infamous Spaghetti Affair
The fundamental idea of process mining is both simple and persuasive: there is a process which is unknown to us, but we can follow the traces of its behavior, i.e. we have access to enactment logs. Feeding those into a process mining technique will yield an aggregate description of that observed behavior, e.g. in the form of a process model.
In the beginning of process mining research, mostly artificially generated logs were used to develop and verify mining algorithms. Later, logs from real-life workflow management systems, e.g. Staffware, could also be successfully mined with these techniques. Early mining algorithms placed high requirements on the quality of log files, e.g. logs were supposed to be complete and limited to events of interest. Yet, most of the resulting problems could be easily remedied by gathering more data, filtering the log, and tuning the algorithm to better cope with problematic data.
While these successes were certainly convincing, most real-life processes are not executed within rigid, inflexible workflow management systems and the like, which enforce correct, predictive behavior. It is the inherent inflexibility of these systems which drove the majority of process owners (i.e., organizations having the need to support processes) to choose more flexible or ad-hoc solutions. Concepts like Adaptive Workflow or Case Handling either allow users to change the process at runtime, or define processes in a somewhat more "loose" manner which does not strictly define a specific path of execution. Yet the most popular solutions for supporting processes do not enforce any defined behavior at all, but merely offer functionality like sharing data and passing messages between users and resources. Examples of such systems are ERP (Enterprise Resource Planning) and CSCW (Computer-Supported Cooperative Work) systems, custom-built solutions, or plain e-mail.
It is obvious that executing a process within such less restrictive environments will lead to more diverse and less-structured behavior. This abundance of observed behavior, however, unveiled a fundamental weakness in most of the early process mining algorithms. When these are used to mine logs from less-structured processes, the result is usually just as unstructured and hard to understand. These "spaghetti" process models do not provide any meaningful abstraction from the event logs themselves, and are therefore useless to process analysts. It is important to note that these "spaghetti" models are not incorrect. The problem is that the processes themselves are really "spaghetti-like", i.e., the model is an accurate reflection of reality.
Fig. 1. Excerpt of a typical “Spaghetti” process model (ca. 20% of complete model).
An example of such a "spaghetti" model is given in Figure 1. It is noteworthy that this figure shows only a small excerpt (ca. 20%) of a highly unstructured process model. It has been mined from machine test logs using the Heuristics Miner, one of the traditional process mining techniques which is most resilient towards noise in logs [14]. Although this result is rather useful, certainly in comparison with other early process mining techniques, it is plain to see that deriving helpful information from it is not easy.
Event classes found in the log are interpreted as activity nodes in the process model. Their sheer number makes it difficult to focus on the interesting parts of the process. The abundance of arcs in the model, which constitutes the actual "spaghetti", introduces an even greater challenge for interpretation. Separating cause from effect, or determining the general direction in which the process is executed, is not possible, because virtually every node is transitively connected to every other node in both directions. This mirrors the crux of flexibility in process execution: when people are free to execute anything in any given order, they will usually make use of that freedom, which renders monitoring business activities an essentially infeasible task.
We argue that the fault for these problems lies neither with less-structured processes, nor with process mining itself. Rather, it is the result of a number of, mostly implicit, assumptions which process mining has historically made, both with respect to the event logs under consideration, and regarding the processes which have generated them. While being perfectly sound in structured, controlled environments, these assumptions do not hold in less-structured, real-life environments, and thus ultimately make traditional process mining fail there.
Assumption 1: All logs are reliable and trustworthy. Any event type found in the log is assumed to have a corresponding logical activity in the process. However, activities in real-life processes may raise a random number of seemingly unrelated events. Activities may also go unrecorded, while other events do not correspond to any activity at all.

The assumption that logs are well-formed and homogeneous is also often untrue. For example, a process found in the log is assumed to correspond to one logical entity. In less-structured environments, however, there are often a number of "tacit" process types which are executed, and thus logged, under the same name.

Also, the idea that all events are raised on the same level of abstraction, and are thus equally important, does not hold in real-life settings. Events on different levels are "flattened" into the same event log, while there is also a large amount of informational events (e.g., debug messages from the system) which need to be disregarded.
Assumption 2: There exists an exact process which is reflected in the logs. This assumption implies that there is one perfect solution out there, which needs to be found. Consequently, the mining result should model the process completely, accurately, and precisely. However, as stated before, spaghetti models are not necessarily incorrect: the models look like spaghetti because they precisely describe every detail of the less-structured behavior found in the log. A more high-level solution, which is able to abstract from details, would thus be preferable.

Traditional mining algorithms have also been confined to a single perspective (e.g., control flow, data), as such an isolated view is supposed to yield higher precision. However, perspectives interact in less-structured processes, e.g. the data flow may complement the control flow, and thus also needs to be taken into account.

In general, the assumption of a perfect solution is not well-suited for real-life application. Reality often differs significantly from theory, in ways that had not been anticipated. Consequently, useful tools for practical application must be explorative, i.e. they must support analysts in tweaking results and thus capitalizing on their knowledge.
We have conducted process mining case studies in organizations like Philips Medical Systems, UWV, Rijkswaterstaat, the Catharina Hospital Eindhoven, the AMC hospital Amsterdam, and the Dutch municipalities of Alkmaar and Heusden. Our experiences in these case studies have shown the above assumptions to be violated in all ways imaginable. Therefore, to make process mining a useful tool in practical, less-structured settings, these assumptions need to be discarded. The next section introduces the main concept of our mining approach, which takes these lessons into account.
3 An Adaptive Approach for Process Simplification
Process mining techniques which are suitable for less-structured environments need to be able to provide a high-level view on the process, abstracting from undesired details. The field of cartography has always been faced with a quite similar challenge, namely to simplify highly complex and unstructured topologies. Activities in a process can be related to locations in a topology (e.g., towns or road crossings) and precedence relations to traffic connections between them (e.g., railways or motorways).
Fig. 2. Example of a road map, annotated with the concepts of abstraction (insignificant roads are not shown), aggregation (parts of the city are merged), customization (the map focuses on the intended use and level of detail), and emphasis (highways are highlighted by size, contrast, and color).
Maps are the solution cartography has come up with for simplifying and presenting complex topologies. When one takes a closer look at maps (such as the example in Figure 2), one can derive a number of valuable concepts from them.
Aggregation: To limit the number of information items displayed, maps often show coherent clusters of low-level detail information in an aggregated manner. One example is cities in road maps, where particular houses and streets are combined within the city's transitive closure (e.g., the city of Eindhoven in Figure 2).
Abstraction: Lower-level information which is insignificant in the chosen context is simply omitted from the visualization. Examples are bicycle paths, which are of no interest in a motorist's map.
Emphasis: More significant information is highlighted by visual means such as color, contrast, saturation, and size. For example, maps emphasize more important roads by displaying them as thicker, more colorful, and more contrasting lines (e.g., motorway "E25" in Figure 2).
Customization: There is no single map for the world. Maps are specialized for a defined local context, have a specific level of detail (city maps vs. highway maps), and a dedicated purpose (interregional travel vs. alpine hiking).
These concepts are universal, well-understood, and established. In this paper we explore how they can be used to simplify and properly visualize complex, less-structured processes. To do that, we need to develop appropriate decision criteria on which to base the simplification and visualization of process models. We have identified two fundamental metrics which can support such decisions: (1) significance and (2) correlation.
Significance, which can be determined both for event classes (i.e., activities) and for binary precedence relations over them (i.e., edges), measures the relative importance of behavior. As such, it specifies the level of interest we have in events, or in their occurring after one another. One example of measuring significance is by frequency, i.e. events or precedence relations which are observed more frequently are deemed more significant.
Correlation, on the other hand, is only relevant for precedence relations over events. It measures how closely related two events following one another are. Examples of measuring correlation include determining the overlap of data attributes associated with two events following one another, or comparing the similarity of their event names. More closely correlated events are assumed to share a large amount of their data, or to have their similarity expressed in their recorded names (e.g., "check customer application" and "approve customer application").
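The paper does not prescribe a concrete formula for name-based correlation; as a minimal sketch, it can be approximated with a generic string-similarity ratio (the use of `difflib` here is our illustrative choice, not part of the framework):

```python
from difflib import SequenceMatcher

def name_correlation(a: str, b: str) -> float:
    """Correlation proxy for two successive events, based on the
    similarity of their recorded names (0.0 = unrelated, 1.0 = identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Events sharing most of their name are considered closely related.
print(name_correlation("check customer application",
                       "approve customer application"))
```

A data-attribute variant would analogously compare the sets of attribute keys attached to the two events.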
Based on these two metrics, which have been defined specifically for this purpose, we can sketch our approach for process simplification as follows.
– Highly significant behavior is preserved, i.e. contained in the simplified model.
– Less significant but highly correlated behavior is aggregated, i.e. hidden in clusters within the simplified model.
– Less significant and lowly correlated behavior is abstracted from, i.e. removed from the simplified model.
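These three rules can be sketched as a threshold-based decision; the cutoff values below are illustrative placeholders (in the actual approach they are user-configurable):

```python
def classify(significance: float, correlation: float,
             sig_cutoff: float = 0.5, corr_cutoff: float = 0.7) -> str:
    """Decide how behavior is treated in the simplified model.
    Thresholds are illustrative, not prescribed by the approach."""
    if significance >= sig_cutoff:
        return "preserve"   # kept as-is in the simplified model
    if correlation >= corr_cutoff:
        return "aggregate"  # hidden inside a cluster node
    return "abstract"       # removed from the model

print(classify(0.9, 0.1))  # -> preserve
print(classify(0.2, 0.8))  # -> aggregate
print(classify(0.2, 0.1))  # -> abstract
```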
This approach can greatly reduce and focus the displayed behavior, by employing the concepts of aggregation and abstraction. Based on such a simplified model, we can employ the concept of emphasis by highlighting more significant behavior.
Fig. 3. Excerpt of a simplified and decorated process model.
Figure 3 shows an excerpt from a simplified process model, which has been created using our approach. Bright square nodes represent significant activities, while the darker octagonal node is an aggregated cluster of three less significant activities. All nodes are labeled with their respective significance, with clusters displaying the mean significance of their elements. The brightness of edges between nodes emphasizes their significance, i.e. more significant relations are darker. Edges are also labeled with their respective significance and correlation values. By either removing or hiding less significant information, this visualization enables the user to focus on the most interesting behavior in the process.
Yet, the question of what constitutes "interesting" behavior can have a number of answers, based on the process, the purpose of analysis, or the desired level of abstraction. In order to yield the most appropriate result, significance and correlation measures need to be configurable. We have thus developed a set of metrics, which can each measure significance or correlation based on different perspectives (e.g., control flow or data) of the process. By influencing the "mix" of these metrics and the simplification procedure itself, the user can customize the produced results to a large degree.
The following section introduces some of the metrics we have developed for significance and correlation in more detail.
4 Log-Based Process Metrics
Significance and correlation, as introduced in the previous section, are suitable concepts for describing the importance of behavior in a process in a compact manner. However, because they represent very generalized, condensed metrics, it is important to measure them in an appropriate manner. Taking into account the wide variety of processes, analysis questions and objectives, and levels of abstraction, it is necessary to make this measurement adaptable to such parameters.
Our approach is based on a configurable and extensible framework for measuring significance and correlation. The design of this framework is introduced in the next subsection, followed by detailed introductions to the three primary types of metrics: unary significance, binary significance, and binary correlation.
4.1 Metrics Framework
An important property of our measurement framework is that for each of the three primary types of metrics (unary significance, binary significance, and binary correlation), different implementations may be used. A metric may either be measured directly from the log (log-based metric), or it may be based on measurements of other, log-based metrics (derivative metric).
When the log contains a large number of undesired events, which occur in between desired ones, actual causal dependencies between the desired event classes may go unrecorded. To counter this, our approach also measures long-term relations, i.e. when the sequence A, B, C is found in the log, we will not only record the relations A → B and B → C, but also the length-2 relationship A → C. We allow measuring relationships of arbitrary length, while the measured value is attenuated, i.e. decreased, with increasing length of the relationship.
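The long-term relation measurement described above can be sketched as follows. This is an illustrative reconstruction, not the Fuzzy Miner's actual code: a trace is assumed to be a list of event-class names, and the geometric attenuation scheme (`attenuation ** (d - 1)`) is one plausible choice, since the paper only states that values decrease with relation length.

```python
from collections import defaultdict

def binary_frequencies(trace, max_distance=3, attenuation=0.5):
    """Count precedence relations up to max_distance steps apart.

    A relation at distance d contributes attenuation**(d-1), so direct
    successors count fully and longer-distance relations count less.
    """
    rel = defaultdict(float)
    for i, a in enumerate(trace):
        for d in range(1, max_distance + 1):
            if i + d < len(trace):
                rel[(a, trace[i + d])] += attenuation ** (d - 1)
    return dict(rel)
```

For the trace A, B, C this records A → B and B → C with full weight, and the length-2 relationship A → C with attenuated weight 0.5.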
4.2 Unary Significance Metrics
Unary significance describes the relative importance of an event class, which will be represented as a node in the process model. As our approach is based on removing less significant behavior, and as removing a node implies removing all of its connected arcs, unary significance is the primary driver of simplification.
One metric for unary significance is frequency significance, i.e. the more often a certain event class was observed in the log, the more significant it is. Frequency is a log-based metric, and is in fact the most intuitive of all metrics. Traditional process mining techniques are built solely on the principle of measuring frequency, and it remains an important foundation of our approach. However, real-life logs often contain a large number of events which are in fact not very significant, e.g. an event which describes saving the process state after every five activities. In such situations, frequency plays a diminished role and can rather distort results.
Another, derivative metric for unary significance is routing significance. The idea behind routing significance is that points at which the process either forks (i.e., split nodes) or synchronizes (i.e., join nodes) are interesting, in that they substantially define the structure of a process. The more the number and significance of a node's predecessors (i.e., its incoming arcs) differ from the number and significance of its successors (i.e., outgoing arcs), the more important that node is for routing in the process. Routing significance is important as an amplifier metric, i.e. it helps separate important routing nodes (whose significance it increases) from those less important.
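The two unary metrics above can be sketched as follows. This is a hedged illustration: the frequency metric follows directly from the text, while the routing formula (imbalance between incoming and outgoing arc weight) is an assumed formulation of the idea described, not the paper's exact equation.

```python
def frequency_significance(log):
    """Unary frequency significance per event class, normalized to [0, 1]."""
    counts = {}
    for trace in log:
        for event in trace:
            counts[event] = counts.get(event, 0) + 1
    top = max(counts.values())
    return {e: c / top for e, c in counts.items()}

def routing_significance(node, in_edges, out_edges):
    """Routing significance of a node: how strongly the total weight of
    its incoming arcs differs from that of its outgoing arcs
    (illustrative formula; the paper describes the idea only)."""
    total_in = sum(in_edges.values())
    total_out = sum(out_edges.values())
    denom = total_in + total_out
    return abs(total_in - total_out) / denom if denom else 0.0
```

A node with one strong incoming arc and several weak outgoing arcs thus scores high, matching the intuition that split and join points shape the process structure.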
4.3 Binary Significance Metrics
Binary significance describes the relative importance of a precedence relation between two event classes, i.e. an edge in the process model. Its purpose is to amplify and to isolate the observed behavior which is supposed to be of the greatest interest. In our simplification approach, it primarily influences the selection of edges which will be included in the simplified process model.
As for unary significance, the log-based frequency significance metric is also the most important implementation for binary significance. The more often two event classes are observed after one another, the more significant their precedence relation.
The distance significance metric is a derivative implementation of binary significance. The more the significance of a relation differs from its source and target nodes' significances, the lower its distance significance value. The rationale behind this metric is that globally important relations are always also the most important relations for their endpoints. Distance significance locally amplifies crucial key relations between event classes, and weakens already insignificant relations. Thereby, it can clarify ambiguous situations in edge abstraction, where many relations “compete” over being included in the simplified process model. Especially in very unstructured execution logs, this metric is an indispensable tool for isolating behavior of interest.
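One plausible formulation of distance significance, consistent with the description above, is to penalize the gap between a relation's significance and the unary significances of its endpoints. The exact equation is an assumption for illustration, not taken from the paper:

```python
def distance_significance(sig_ab, sig_a, sig_b):
    """Distance significance of relation A -> B (illustrative formula).

    sig_ab is the binary significance of the relation; sig_a and sig_b
    are the unary significances of its endpoints. A relation much weaker
    than its endpoints scores low; a relation as strong as its endpoints
    scores high.
    """
    return 1.0 - ((sig_a - sig_ab) + (sig_b - sig_ab)) / (sig_a + sig_b)
```

Under this sketch, a relation matching its endpoints' significance yields 1.0, while a relation of zero significance between two fully significant nodes yields 0.0.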
4.4 Binary Correlation Metrics
Binary correlation measures the distance of events in a precedence relation, i.e. how closely related two events following one another are. Distance, in the process domain, can be equated to the magnitude of context change between two activity executions. Subsequently occurring activities which have a more similar context (e.g., which are executed by the same person or within a short timeframe) are thus evaluated to be more highly correlated. Binary correlation is the main driver of the decision between aggregation and abstraction of less-significant behavior.
Proximity correlation evaluates event classes which occur shortly after one another as highly correlated. This is important for identifying clusters of events which correspond to one logical activity, as these are commonly executed within a short timeframe.
Another feature of such clusters of events occurring within the realm of one higher-level activity is that they are executed by the same person. Originator correlation between event classes is determined from the names of the persons who have triggered two subsequent events. The more similar these names, the more highly correlated the respective event classes. In real applications, user names often include job titles or function identifiers (e.g., “sales John” and “sales Paul”). Therefore, this metric implementation is also a valuable tool for unveiling implicit correlation between events.
Endpoint correlation is quite similar; however, instead of resources it compares the activity names of subsequent events. More similar names will be interpreted as higher correlation. This is important for low-level logs including a large amount of less significant events which are closely related. Most of the time, events which reflect similar tasks are also given similar names (e.g., “open valve13” and “close valve13”), and this metric can unveil these implicit dependencies.
In most logs, events also include additional attributes containing snapshots from the data perspective of the process (e.g., the value of an insurance claim). In such cases, the selection of attributes logged for each event can be interpreted as its context. Thus, the data type correlation metric evaluates event classes where subsequent events share a large number of data types (i.e., attribute keys) as highly correlated. Data value correlation is more specific, in that it also takes the values of these common attributes into account. For this, it uses relative similarity, i.e. small changes of an attribute value will compromise correlation less than a completely different value.
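Two of the correlation metrics above can be sketched compactly. The choice of `difflib.SequenceMatcher` as the string-similarity measure is an assumption for illustration (the paper does not prescribe a specific metric); the data type correlation is modeled here as the Jaccard overlap of attribute keys.

```python
from difflib import SequenceMatcher

def name_correlation(name_a, name_b):
    """String-similarity correlation between two event attributes, e.g.
    originator names or activity names ("open valve13" vs "close valve13")."""
    return SequenceMatcher(None, name_a, name_b).ratio()

def data_type_correlation(attrs_a, attrs_b):
    """Fraction of attribute keys shared by two subsequent events."""
    keys_a, keys_b = set(attrs_a), set(attrs_b)
    if not (keys_a | keys_b):
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)
```

For example, "open valve13" and "close valve13" score a much higher name correlation than two unrelated activity names, and two events logging the same attribute keys yield a data type correlation of 1.0.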
Currently, all implementations for binary correlation in our approach are log-based. The next section introduces our approach for adaptive simplification and visualization of complex process models, which is based on the aggregated measurements of all metric implementations introduced in this section.
5 Adaptive Graph Simplification
Most process mining techniques follow an interpretative approach, i.e. they attempt to map behavior found in the log to typical process design patterns (e.g., whether a split node has AND- or XOR-semantics). Our approach, in contrast, focuses on a high-level mapping of behavior found in the log, while not attempting to discover such patterns. Thus, creating the initial (non-simplified) process model is straightforward: all event classes found in the log are translated to activity nodes, whose importance is expressed by unary significance. For every observed precedence relation between event classes, a corresponding directed edge is added to the process model. This edge is described by the binary significance and correlation of the ordering relation it represents.
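The construction of the initial model can be sketched as follows, assuming a log is a list of traces and each trace a list of event-class names. This is a simplified illustration: here both node and edge weights are plain normalized frequencies, whereas the actual approach attaches the full set of significance and correlation measurements to each element.

```python
from collections import defaultdict

def build_initial_model(log):
    """Translate an event log into the initial, non-simplified model:
    every event class becomes a node, every observed direct precedence
    relation becomes a directed edge; weights are frequencies
    normalized to [0, 1]."""
    nodes = defaultdict(int)
    edges = defaultdict(int)
    for trace in log:
        for i, event in enumerate(trace):
            nodes[event] += 1
            if i + 1 < len(trace):
                edges[(event, trace[i + 1])] += 1
    max_n, max_e = max(nodes.values()), max(edges.values())
    return ({n: c / max_n for n, c in nodes.items()},
            {e: c / max_e for e, c in edges.items()})
```

Applied to a raw log, this yields exactly the dense, unsimplified graph that the three transformation phases described next then reduce.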
Subsequently, we apply three transformation methods to the process model, which successively simplify specific aspects of it. The first two phases, conflict resolution and edge filtering, remove edges (i.e., precedence relations) between activity nodes, while the final aggregation and abstraction phase removes and/or clusters less-significant nodes. Removing edges from the model first is important: due to the less-structured nature of real-life processes and our measurement of long-term relationships, the initial model contains deceptive ordering relations, which do not correspond to valid behavior and need to be discarded. The following sections provide details about the three phases of our simplification approach, given in the order in which they are applied to the initial model.
5.1 Conflict Resolution in Binary Relations
Whenever two nodes in the initial process model are connected by edges in both directions, they are defined to be in conflict. Depending on their specific properties, conflicts may represent one of three possible situations in the process:
– Length-2 loop: Two activities A and B constitute a loop in the process model, i.e. after executing A and B in sequence, one may return to A and start over. In this case, the conflicting ordering relations between these activities are explicitly allowed in the original process, and thus need to be preserved.
– Exception: The process orders A → B in sequence; however, during real-life execution the exceptional case of B → A also occurs. Most of the time, the “normal” behavior is clearly more significant. In such cases, the “weaker” relation needs to be discarded to focus on the main behavior.
– Concurrency: A and B can be executed in any order (i.e., they are on two distinct, parallel paths). The log will most likely record both possible cases, i.e. A → B and B → A, which will create a conflict. In this case, both conflicting ordering relations need to be removed from the process model.
Conflict resolution attempts to classify each conflict as one of these three cases, and then resolves it accordingly. For that, it first determines the relative significance of both conflicting relations.
Fig. 4. Evaluating the relative significance of conflicting relations.
Figure 4 shows an example of two activities A and B in conflict. The relative significance for an edge A → B can be determined as follows.
Definition 1 (Relative significance). Let N be the set of nodes in a process model, and let sig : N × N → R⁺₀ be a relation that assigns to each pair of nodes A, B ∈ N the significance of a precedence relation over them. rel : N × N → R⁺₀ is a relation which assigns to each pair of nodes A, B ∈ N the relative importance of their ordering relation:

rel(A, B) = (1/2) · sig(A, B) / Σ_{X ∈ N} sig(A, X) + (1/2) · sig(A, B) / Σ_{X ∈ N} sig(X, B)
Every ordering relation A → B has a set of competing relations Comp(A → B) = Aout ∪ Bin. This set of competing relations is composed of Aout, i.e. all edges starting from A, and of Bin, i.e. all edges pointing to B (cf. Figure 4). Note that this set also contains the reference relation itself; more specifically, Bin ∩ Aout = {A → B}. By dividing the significance of an ordering relation A → B by the sum of all its competing relations' significances, we get the importance of this relation in its local context.
If the relative significances of both conflicting relations, rel(A, B) and rel(B, A), exceed a specified preserve threshold value, this signifies that A and B are apparently forming a length-2 loop, which is their most significant behavior in the process. Thus, in this case, both A → B and B → A will be preserved.
In case at least one conflicting relation's relative significance is below this threshold, the offset between both relations' relative significances is determined, i.e. ofs(A, B) = |rel(A, B) − rel(B, A)|. The larger this offset value, the more the relative significances of both conflicting relations differ, i.e. one of them is clearly more important. Thus, if the offset value exceeds a specified ratio threshold, we assume that the relatively less significant relation is in fact an exception and remove it from the process model.
Otherwise, i.e. if at least one of the relations has a relative significance below the preserve threshold and their offset is smaller than the ratio threshold, this signifies that both A → B and B → A are relations which are of no greater importance for both their source and target activities. This low, yet balanced relative significance of conflicting relations hints at A and B being executed concurrently, i.e. in two separate threads of the process. Consequently, both edges are removed from the process model, as they do not correspond to factual ordering relations.
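The conflict resolution scheme described above can be sketched as follows. The rel function follows Definition 1; the preserve and ratio threshold defaults are illustrative placeholders, since the paper leaves their values configurable.

```python
def relative_significance(sig, a, b, nodes):
    """rel(A, B) from Definition 1: the significance of A -> B weighted
    against all competing relations leaving A and entering B."""
    out_a = sum(sig.get((a, x), 0.0) for x in nodes)
    in_b = sum(sig.get((x, b), 0.0) for x in nodes)
    return 0.5 * sig.get((a, b), 0.0) / out_a + 0.5 * sig.get((a, b), 0.0) / in_b

def resolve_conflict(sig, a, b, nodes, preserve=0.3, ratio=0.2):
    """Classify a conflict between A -> B and B -> A and return the set
    of edges to keep (illustrative thresholds)."""
    rel_ab = relative_significance(sig, a, b, nodes)
    rel_ba = relative_significance(sig, b, a, nodes)
    if rel_ab >= preserve and rel_ba >= preserve:
        return {(a, b), (b, a)}                  # length-2 loop: keep both
    if abs(rel_ab - rel_ba) >= ratio:            # exception: keep the stronger
        return {(a, b)} if rel_ab > rel_ba else {(b, a)}
    return set()                                 # concurrency: drop both
```

Two isolated nodes whose only edges point at each other are classified as a length-2 loop; two weak, balanced edges between otherwise well-connected nodes are classified as concurrency and both removed.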
5.2 Edge Filtering
Although conflict resolution removes a number of edges from the process model, the model still contains a large number of precedence relations. To infer further structure from this model, it is necessary to remove most of these remaining edges by edge filtering, which isolates the most important behavior. The obvious solution is to remove the globally least significant edges, leaving only highly significant behavior. However, this approach yields sub-optimal results, as it is prone to create small, disparate clusters of highly frequent behavior. Also, in the subsequent aggregation step, highly correlated relations play an important part in connecting clusters, even if they are not very significant.
Therefore, our edge filtering approach evaluates each edge A → B by its utility util(A, B), a weighted sum of its significance and correlation. A configurable utility ratio ur ∈ [0, 1] determines the weight, such that util(A, B) = ur · sig(A, B) + (1 − ur) · cor(A, B). A larger value for ur will preserve more significant edges, while a smaller value will favor highly correlated edges.
Figure 5 shows an example of processing the incoming arcs of a node A. Using a utility ratio of 0.5, i.e. taking significance and correlation equally into account, the utility value is calculated, which ranges from 0.4 to 1.0 in this example.
Filtering edges is performed on a local basis, i.e. for each node in the process model, the algorithm preserves its incoming and outgoing edges with the highest utility value. The decision of which edges get preserved is configured by the edge cutoff parameter co ∈ [0, 1]. For every node N, the utility values of its incoming edges X → N are normalized to [0, 1], so that the weakest edge is assigned 0 and the strongest one 1. All edges whose normalized utility value exceeds co are added to the preserved set. In the example in Figure 5, only two of the original edges are preserved, using an edge cutoff value of 0.4: P → A (with a normalized utility of 1.0) and R → A (normalized utility of 0.58). The outgoing edges are processed in the same manner for each node.

Fig. 5. Filtering the set of incoming edges for a node A.
The edge cutoff parameter determines the aggressiveness of the algorithm, i.e. the higher its value, the more likely the algorithm is to remove edges. In very unstructured processes, where precedence relations are likely to have a balanced significance, it is often useful to use a lower utility ratio, such that correlation will be taken more into account and resolve such ambiguous situations. On top of that, a high edge cutoff will act as an amplifier, helping to distinguish the most important edges.
Note that our edge filtering approach starts from an empty set of precedence relations, i.e. all edges are removed by default. Only if an edge is selected locally for at least one node will it be preserved. This approach keeps the process model connected, while clarifying ambiguous situations. Whether an edge is preserved depends on its utility for describing the behavior of the activities it connects, and not on global comparisons with other parts of the model, with which it does not even interact.
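The per-node filtering step can be sketched as follows, using the significance and correlation values from the Figure 5 example. The edge representation (a dict mapping `(source, target)` to a `(significance, correlation)` pair) is an assumption for illustration.

```python
def filter_incoming(edges, node, ur=0.5, cutoff=0.4):
    """Select the incoming edges of `node` to preserve: utility is a
    weighted sum of significance and correlation, normalized per node
    to [0, 1]; edges at or above the edge cutoff are kept.

    `edges` maps (source, target) -> (significance, correlation)."""
    incoming = {e: ur * s + (1 - ur) * c
                for e, (s, c) in edges.items() if e[1] == node}
    if not incoming:
        return set()
    lo, hi = min(incoming.values()), max(incoming.values())
    span = hi - lo or 1.0  # avoid division by zero when all utilities tie
    return {e for e, u in incoming.items() if (u - lo) / span >= cutoff}
```

With ur = 0.5 and a cutoff of 0.4, the four edges of Figure 5 yield utilities 1.0, 0.55, 0.75, and 0.4, normalized to 1.0, 0.25, 0.58, and 0.0, so only P → A and R → A survive, matching the worked example. Outgoing edges would be processed symmetrically.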
Fig. 6. Example of a process model before (left) and after (right) edge filtering.
Figure 6 shows the effect of edge filtering applied to a small, but very unstructured process. The number of nodes remains the same, while removing an appropriate subset of edges clearly brings structure to the previously chaotic process model.
5.3 Node Aggregation and Abstraction
While removing edges brings structure to the process model, the most effective tool for simplification is removing nodes. It enables the analyst to focus on an interesting subset of activities. Our approach preserves highly correlated groups of less-significant nodes as aggregated clusters, while removing isolated, less-significant nodes. Removing nodes is based on the node cutoff parameter: every node whose unary significance is below this threshold becomes a victim, i.e. it will either be aggregated or abstracted from. The first phase of our algorithm builds initial clusters of less-significant behavior as follows.
– For each victim, find its most highly correlated neighbor (i.e., connected node).
– If this neighbor is a cluster node, add the victim to this cluster.
– Otherwise, create a new cluster node, and add the victim as its first element.
Whenever a node is added to a cluster, the cluster will “inherit” the ordering relations of that node, i.e. its incoming and outgoing arcs, while the actual node will be hidden. The second phase is merging the clusters, which is necessary as most clusters will, at this stage, consist of only one single victim. The following routine is performed to aggregate larger clusters and decrease their number.
– For each cluster, check whether all predecessors or all successors are also clusters.
– If all predecessor nodes are clusters as well, merge with the most highly correlated one and move on to the next cluster.
– If all successors are clusters as well, merge with the most highly correlated one.
– Otherwise, i.e. if both the cluster's pre- and postset contain regular nodes, the cluster is left untouched.
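The first clustering phase can be sketched as follows. This is an illustrative reconstruction: a node already assigned to a cluster stands in for a “cluster node”, and the data structures (a correlation map over unordered node pairs, an adjacency map) are assumptions for this sketch.

```python
def build_clusters(victims, correlation, neighbors):
    """Phase 1 of aggregation: attach every victim (a node whose unary
    significance is below the node cutoff) to a cluster via its most
    highly correlated neighbor.

    Returns a mapping from cluster id to the set of member nodes."""
    cluster_of = {}   # node -> cluster id
    clusters = {}     # cluster id -> set of member nodes
    next_id = 0
    for victim in victims:
        best = max(neighbors[victim],
                   key=lambda n: correlation.get(frozenset((victim, n)), 0.0))
        if best in cluster_of:                   # neighbor already clustered
            cid = cluster_of[best]
        else:                                    # start a new cluster
            cid, next_id = next_id, next_id + 1
            clusters[cid] = set()
        clusters[cid].add(victim)
        cluster_of[victim] = cid
    return clusters
```

Two mutually most-correlated victims thus end up in the same cluster, which the second (merging) phase then grows further under the pre-/postset conditions listed above.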
Fig. 7. Excerpt of a clustered model after the first aggregation phase.
It is important that clusters are only merged if the “victim” has only clusters in its pre- or postset. Figure 7 shows an example of a process model after the first phase of clustering. Cluster A cannot merge with cluster B, as they are also both connected to node X. Otherwise, X would be connected to the merged cluster in both directions, making the model less informative. However, clusters B and C can merge, as B's postset consists only of C. This simplification of the model does not lessen the amount of information, and is thus valid.
The last phase, which constitutes the abstraction, removes isolated and singular clusters. Isolated clusters are detached parts of the process, which are less significant and highly correlated, and which have thus been folded into one single, isolated cluster node. It is obvious that such detached nodes do not contribute to the process model, which is why they are simply removed. Singular clusters consist of only one single activity node. Thus, they represent less-significant behavior which is not highly correlated to adjacent behavior. Singular clusters are undesired, because they do not simplify the model. Therefore, they are removed from the model, while their most significant precedence relations are transitively preserved (i.e., their predecessors are artificially connected to their successors, if such an edge does not already exist in the model).
6 Implementation and Application
We have implemented our approach as the Fuzzy Miner plugin for the ProM framework [8]. All metrics introduced in Section 4 have been implemented, and can be configured by the user. Figure 8 shows the result view of the Fuzzy Miner, with the simplified graph view on the left, and a configuration pane for simplification parameters on the right. Alternative views allow the user to inspect and verify measurements for all metrics, which helps to tune these metrics to the log.
Fig. 8. Screenshot of the Fuzzy Miner, applied to the very large and unstructured log also used for mining the model in Figure 1.
Note that this approach is the result of valuable lessons learnt from a great number of case studies with real-life logs. As such, both the applied metrics and the simplification algorithm have been optimized using large amounts of actual, less-structured data. While it is difficult to validate the approach formally, the Fuzzy Miner has already become one of the most useful tools in case study applications.
For example, Figure 8 shows the result of applying the Fuzzy Miner to a large test log of manufacturing machines (155,296 events in 16 cases, 826 event classes). The approach has been shown to scale well and is of linear complexity. Deriving all metrics from the mentioned log was performed in less than ten seconds, while simplifying the resulting model took less than two seconds on a 1.8 GHz dual-core machine. While this model has been created from the raw logs, Figure 1 has been mined with the Heuristics Miner, after applying log filters [14]. It is obvious that the Fuzzy Miner is able to clean up a large amount of confusing behavior, and to infer and extract structure from what is chaotic. We have successfully used the Fuzzy Miner on various machinery test and usage logs, development process logs, hospital patient treatment logs, and logs from case handling systems and web servers, among others. These are notoriously flexible and unstructured environments, and we hold our approach to be one of the most useful tools for analyzing them so far.
7 Related Work
While parting with some characteristics of traditional process mining techniques, such as absolute precision and describing the complete behavior, the approach described in this paper is obviously based on previous work in this domain, most specifically control-flow mining algorithms [2, 14, 7]. The mining algorithm most related to Fuzzy Mining is the Heuristics Miner, which also employs heuristics to limit the set of precedence relations included in the model [14]. Our approach also incorporates concepts from log filtering, i.e. removing less significant events from the logs [8]. However, the foundation on multi-perspective metrics, i.e. looking at all aspects of the process at once, its interactive and explorative nature, and the integrated simplification algorithm clearly distinguish Fuzzy Mining from all previous process mining techniques.
The adaptive simplification approach presented in Section 5 uses concepts from the domains of data clustering and graph clustering. Data clustering attempts to find related subsets of attributes, based on a binary distance metric inferred upon them [11]. Graph clustering algorithms, on the other hand, are based on analyzing structural properties of graphs, from which they derive partitioning strategies [13, 9]. Our approach, however, is based on a unique combination of analyzing the significance and correlation of graph elements, which are based on a wide set of process perspectives. It integrates abstraction and aggregation, and is also more specialized towards the process domain.
8 Discussion and Future Work
We have described the problems traditional process mining techniques face when applied to large, less-structured processes, as often found in practice. Subsequently, we have analyzed the causes for these problems, which lie in a mismatch between fundamental assumptions of traditional process mining and the characteristics of real-life processes. Based on this analysis, we have developed an adaptive simplification and visualization technique for process models, which is based on two novel metrics, significance and correlation. We have described a framework for deriving these metrics from an enactment log, which can be adjusted to particular situations and analysis questions.
While the fine-grained configurability of the algorithm and its metrics makes our approach universally applicable, it is also one of its weaknesses, as finding the “right” parameter settings can sometimes be time-consuming. Thus, our next steps will concentrate on deriving higher-level parameters and sensible default settings, while preserving the full range of parameters for advanced users. Further work will concentrate on extending the set of metric implementations and improving the simplification algorithm.
It is our belief that process mining, in order to become more meaningful and to become applicable in a wider array of practical settings, needs to address the problems it has with unstructured processes. We have shown that the traditional desire to model the complete behavior of a process in a precise manner conflicts with the original goal, i.e. to provide the user with understandable, high-level information. The success of process mining will depend on whether it is able to balance these conflicting goals sensibly. Fuzzy Mining is a first step in that direction.
Acknowledgements This research is supported by the Technology Foundation STW, applied science division of NWO, and the technology programme of the Dutch Ministry of Economic Affairs.
References
1. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow Mining: A Survey of Issues and Approaches. Data and Knowledge Engineering, 47(2):237–267, 2003.
2. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–1142, 2004.
3. R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs. In Sixth International Conference on Extending Database Technology, pages 469–483, 1998.
4. E. Badouel, L. Bernardinello, and P. Darondeau. The Synthesis Problem for Elementary Net Systems is NP-complete. Theoretical Computer Science, 186(1-2):107–134, 1997.
5. J.E. Cook and A.L. Wolf. Discovering Models of Software Processes from Event-Based Data. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998.
6. A. Datta. Automating the Discovery of As-Is Business Process Models: Probabilistic and Algorithmic Approaches. Information Systems Research, 9(3):275–301, 1998.
7. B.F. van Dongen and W.M.P. van der Aalst. Multi-Phase Process Mining: Building Instance Graphs. In P. Atzeni, W. Chu, H. Lu, S. Zhou, and T.W. Ling, editors, International Conference on Conceptual Modeling (ER 2004), volume 3288 of Lecture Notes in Computer Science, pages 362–376. Springer-Verlag, Berlin, 2004.
8. B.F. van Dongen, A.K. Alves de Medeiros, H.M.W. Verbeek, A.J.M.M. Weijters, and W.M.P. van der Aalst. The ProM Framework: A New Era in Process Mining Tool Support. In G. Ciardo and P. Darondeau, editors, Application and Theory of Petri Nets 2005, volume 3536 of Lecture Notes in Computer Science, pages 444–454. Springer-Verlag, Berlin, 2005.
9. S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
10. J. Herbst. A Machine Learning Approach to Workflow Management. In Proceedings 11th European Conference on Machine Learning, volume 1810 of Lecture Notes in Computer Science, pages 183–194. Springer-Verlag, Berlin, 2000.
11. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264–323, 1999.
12. A.K.A. de Medeiros, A.J.M.M. Weijters, and W.M.P. van der Aalst. Genetic Process Mining: A Basic Approach and its Challenges. In C. Bussler et al., editors, BPM 2005 Workshops (Workshop on Business Process Intelligence), volume 3812 of Lecture Notes in Computer Science, pages 203–215. Springer-Verlag, Berlin, 2006.
13. A. Pothen, H.D. Simon, and K. Liou. Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430–452, 1990.
14. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162, 2003.