![Page 1: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/1.jpg)
On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery
Joos Buijs
Boudewijn van Dongen
Wil van der Aalst
http://www.win.tue.nl/coselog/
![Page 2: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/2.jpg)
Advances in Process Mining
• Many process discovery and conformance checking algorithms and tools are available (cf. the various ProM packages).
• Also commercial software based on these ideas: Disco (Fluxicon), Reflect (Futura/Perceptive), BPMOne (Pallas Athena/Perceptive), ARIS Process Performance Manager (Software AG), Interstage Automated Process Discovery (Fujitsu), QPR ProcessAnalyzer/Analysis (QPR Software), flow (fourspark), Discovery Analyst (StereoLOGIC), etc.
• We applied process mining in over 100 organizations.
PAGE 2
More than 75 people involving more than 50 organizations created the Process Mining Manifesto in the context of the IEEE Task Force on Process Mining.
Available in 13 languages
![Page 3: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/3.jpg)
Example Process Discovery(Vestia, Dutch housing agency, 208 cases, 5987 events)
PAGE 3
![Page 4: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/4.jpg)
Example: Conformance Checking (WOZ objections Dutch municipality, 745 objections, 9583 event, f= 0.988)
PAGE 4
![Page 5: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/5.jpg)
Challenge: Four Competing Quality Criteria
PAGE 5
process discovery
replay fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
![Page 6: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/6.jpg)
Example: one log four models
PAGE 6
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hf
end
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N3 : fitness = +, precision = -, generalization = +, simplicity = +
N2 : fitness = -, precision = +, generalization = -, simplicity = +
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
… (all 21 variants seen in the log)
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
process discovery
replay fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
![Page 7: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/7.jpg)
Model N1
PAGE 7
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
![Page 8: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/8.jpg)
Model N2
PAGE 8
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N2 : fitness = -, precision = +, generalization = -, simplicity = +
![Page 9: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/9.jpg)
Model N3
PAGE 9
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hf
end
N3 : fitness = +, precision = -, generalization = +, simplicity = +
![Page 10: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/10.jpg)
Model N4
PAGE 10
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
… (all 21 variants seen in the log)
![Page 11: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/11.jpg)
Another challenge: Huge search space
PAGE 11
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
M
M
M
M
M
M
MM MMM
MM
M
M
M
M M MM
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
M
M
M
M
M
M
MM MMM
MM
M
M
M
M M MM
M
MMM M
M
MM
M
M
M
M
M
M
MM
M
M
MMM M
M
MM
M
M
M
M
M
MM
MM
M
MMM
MM M
MM M
M
MM
M
M
M
M
M
MM
MM
M
MM
MMM M
M
MM
M
M
M
M
M
MM
MM
M
M
M
M
MM
M
M
M
M
M
M
MM
M
MM
M
M
M
M
M
M
MM
MM
M
M M
MM
M
M
M
M
M
M
M
MM
M
M
M
M
MM
M
M MM M M M
![Page 12: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/12.jpg)
… with just a few interesting candidates
PAGE 12
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
MM
M
MM
MM
M
M
MM
M
M
M
M
M
M
MM MMM
MM
M
M
M
M M MM
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM M
M
MMM
M
M
M
M
M
M
MM
MM
M
M
MM
MMM MM
M
MM
M
M
M
M
M
M
MM
MM
M
M
MM
M
M
M
M
M
M
MMM
MM MM
M
M
M
M M MM
M
MMM M
M
MM
M
M
M
M
M
M
MM
M
M
MMM M
M
MM
M
M
M
M
M
MM
MM
M
MMM
MM M
MM M
M
MM
M
M
M
M
M
MM
MM
M
MM
MMM M
M
MM
M
M
M
M
M
MM
MM
M
M
M
M
MM
M
M
M
M
M
M
MM
M
MM
M
M
M
M
M
M
MM
MM
M
M M
MM
M
M
M
M
M
M
M
MM
M
M
M
M
MM
M
M MM M M M
![Page 13: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/13.jpg)
Two requirements
1. It should be possible to seamlessly balance the different quality criteria based on user-defined preferences.
2. The algorithm should always return a "correct" process model and not waste time on model having deadlocks and other anomalies.
PAGE 13
process discovery
replay fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
![Page 14: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/14.jpg)
Proposal: Evolutionary Tree Miner (ETM)
• Process trees as representation (= limit search space to "good" models).
• Genetic approach (= very flexible)• Fitness function uses all four criteria (= seamlessly
balance the different "forces")
PAGE 14
![Page 15: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/15.jpg)
Representational Bias: Process Trees
• Always sound because of the block structure
• Also Loop and OR operator
A (BA)*
PAGE 15
A
DC
E
→
X
∧
→
B
A
![Page 16: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/16.jpg)
Petri Net Semantics(used for comparison and conformance checking only)
PAGE 16
A
B
B
A
A
B
A
B
Sequence
Exclusive Choice
Loop
Parallellism
Or Choice
A B
![Page 17: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/17.jpg)
Steps of the Genetic ETM Algorithm
PAGE 17
CreateInitial
Population
ReturnBest
Individual
MeasureQuality
ChangePopulation
Stop?
YesYes
NoNo
![Page 18: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/18.jpg)
Population Change
PAGE 18
Population i Population i+1
Elite
Crossover Mutation
Selection
Replace
![Page 19: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/19.jpg)
Four Metrics (see paper)
PAGE 19
process discovery
replay fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
1 = optimal0 = very bad
![Page 20: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/20.jpg)
Example
PAGE 20
B
C
D
E
F
A G
A = send e-mail, B = check credit, C = calculate capacity, D = check system, E = accept, F = reject, G = send e-mail
unknown
![Page 21: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/21.jpg)
Conventional Algorithms (1/3)("best effort" mapping to process trees to allow for comparison)
PAGE 21
alpha miner
ILP miner
language-based region miner
sound
sound
sound
low fitness
low precision
low fitness
… lucky
![Page 22: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/22.jpg)
Conventional Algorithms (2/3)
PAGE 22
heuristic miner
multi-phase miner
unsound
unsound
(relaxed
sound)
![Page 23: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/23.jpg)
Conventional Algorithms (3/3)
PAGE 23
state-based region miner
genetic miner
unsound
sound
![Page 24: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/24.jpg)
Often unsound result and no mechanism to seamlessly balance the four criteria
PAGE 24
unsound models
ignored criteria
![Page 25: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/25.jpg)
Genetic Mining (ETM) While Considering Only One Criterion
PAGE 25
best value possible
for this log
ETM with weight zero to three out of four perspectives.
![Page 26: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/26.jpg)
Considering Replay Fitness and One Other Criterion
PAGE 26ETM with weight zero to two out of four perspectives.
![Page 27: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/27.jpg)
Considering 3 of 4 Criteria
PAGE 27replay fitness needs to have a larger weight
![Page 28: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/28.jpg)
Considering All Four Criteria with Emphasis on Fitness
PAGE 28
fitness has weight 10
![Page 29: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/29.jpg)
Initial Model Versus Discovered Model
PAGE 29
B
C
D
E
F
A G
discovered by ETMsimulated
Discovered model
outperforms initia
l
model with respect
too all crite
ria!
Better than
existing algorithms
(but patience is
needed)!
![Page 30: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/30.jpg)
Real-Life Event Logs
• Event log L0 is the event log used before. L0 contains 100 traces, 590 events and 7 activities.
• Event Log L1 contains 105 traces, 743 events in total, with 6 different activities.
• Event Log L2 contains 444 traces, 3.269 events in total, with 6 different activities.
• Event Log L3 contains 274 traces, 1:582 events in total, with 6 different activities.
PAGE 30
Event logs L1, L2 and L3 are extracted from the information systems of municipalities participating in the CoSeLoG project (http://www.win.tue.nl/coselog/).
![Page 31: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/31.jpg)
Results
PAGE 31
Equal weights for all criteria.
If unsound , the sound behavior is approximated when creating the process tree.
![Page 32: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/32.jpg)
Conclusion
• First algorithm that allows for balancing all four perspectives.
• Genetic algorithm is very flexible, but also very slow.
• Process trees only used internally (choose your favorite representation)
• Future work:− Improve speed
− Distribute PM tasks
− Discover configurable process trees
PAGE 35
process discovery
replay fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
![Page 33: On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery](https://reader034.vdocuments.mx/reader034/viewer/2022051608/545bee60b0af9f12318b459c/html5/thumbnails/33.jpg)
PAGE 36
www.processmining.org
www.win.tue.nl/ieeetfpm/