on the role of fitness, precision, generalization and simplicity in process discovery

Post on 07-Nov-2014

On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery

Joos Buijs

Boudewijn van Dongen

Wil van der Aalst

http://www.win.tue.nl/coselog/

Advances in Process Mining

• Many process discovery and conformance checking algorithms and tools are available (cf. the various ProM packages).

• Also commercial software based on these ideas: Disco (Fluxicon), Reflect (Futura/Perceptive), BPMOne (Pallas Athena/Perceptive), ARIS Process Performance Manager (Software AG), Interstage Automated Process Discovery (Fujitsu), QPR ProcessAnalyzer/Analysis (QPR Software), flow (fourspark), Discovery Analyst (StereoLOGIC), etc.

• We applied process mining in over 100 organizations.

More than 75 people involving more than 50 organizations created the Process Mining Manifesto in the context of the IEEE Task Force on Process Mining.

Available in 13 languages

Example: Process Discovery (Vestia, Dutch housing agency, 208 cases, 5987 events)

Example: Conformance Checking (WOZ objections, Dutch municipality, 745 objections, 9583 events, f = 0.988)

Challenge: Four Competing Quality Criteria

[Figure: process discovery at the center of four quality criteria.]

• replay fitness: “able to replay event log”

• simplicity: “Occam’s razor”

• generalization: “not overfitting the log”

• precision: “not underfitting the log”

Example: one log four models

Activities: a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request.

Event log (1391 traces, 21 variants):

trace                  #
acdeh                455
abdeg                191
adceh                177
abdeh                144
acdeg                111
adceg                 82
adbeh                 56
acdefdbeh             47
adbeg                 38
acdefbdeh             33
acdefbdeg             14
acdefdbeg             11
adcefcdeh              9
adcefdbeh              8
adcefbdeg              5
acdefbdefdbeg          3
adcefdbeg              2
adcefbdefbdeg          2
adcefdbefbdeh          1
adbefbdefdbeg          1
adcefdbefcdefdbeg      1
total               1391

[Figures: Petri nets N1 to N4. N1 and N3 contain all activities a to h between start and end; N2 contains only the sequence a, c, d, e, h; N4 has a separate path for each of the 21 variants seen in the log.]

N1 : fitness = +, precision = +, generalization = +, simplicity = +

N2 : fitness = -, precision = +, generalization = -, simplicity = +

N3 : fitness = +, precision = -, generalization = +, simplicity = +

N4 : fitness = +, precision = +, generalization = -, simplicity = -

Model N1

[Figure: Petri net N1 with activities a to h between start and end, shown next to the event log (21 variants, 1391 traces).]

N1 : fitness = +, precision = +, generalization = +, simplicity = +

Model N2

[Figure: Petri net N2, a single sequence start, a (register request), c (examine casually), d (check ticket), e (decide), h (reject request), end, shown next to the event log (21 variants, 1391 traces).]

N2 : fitness = -, precision = +, generalization = -, simplicity = +
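The low fitness of N2 can be made concrete with a small sketch. It reads N2 as accepting only the sequence a, c, d, e, h (as its figure labels suggest) and N4 as accepting exactly the 21 variants in the log. The measure used here, the share of traces a model can replay completely, is a simplification chosen for illustration; it is not the replay fitness metric of the paper, and the function names are hypothetical.

```python
from collections import Counter

# Trace variants and frequencies copied from the event log on the slides.
LOG = Counter({
    "acdeh": 455, "abdeg": 191, "adceh": 177, "abdeh": 144, "acdeg": 111,
    "adceg": 82, "adbeh": 56, "acdefdbeh": 47, "adbeg": 38, "acdefbdeh": 33,
    "acdefbdeg": 14, "acdefdbeg": 11, "adcefcdeh": 9, "adcefdbeh": 8,
    "adcefbdeg": 5, "acdefbdefdbeg": 3, "adcefdbeg": 2, "adcefbdefbdeg": 2,
    "adcefdbefbdeh": 1, "adbefbdefdbeg": 1, "adcefdbefcdefdbeg": 1,
})


def share_of_replayable_traces(log, accepts):
    """Fraction of traces (weighted by frequency) that the model accepts."""
    total = sum(log.values())
    fitting = sum(count for trace, count in log.items() if accepts(trace))
    return fitting / total


def n2_accepts(trace):
    # Assumption: N2 allows only the single sequence a, c, d, e, h.
    return trace == "acdeh"


def n4_accepts(trace):
    # Assumption: N4 enumerates exactly the variants seen in the log.
    return trace in LOG


if __name__ == "__main__":
    print(sum(LOG.values()), "traces in", len(LOG), "variants")
    print("N2:", round(share_of_replayable_traces(LOG, n2_accepts), 3))
    print("N4:", share_of_replayable_traces(LOG, n4_accepts))
```

Under these assumptions N2 replays 455 of the 1391 traces, while N4 replays every trace in the log but says nothing about behavior outside it.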

Model N3

[Figure: Petri net N3 with activities a to h between start and end, shown next to the event log (21 variants, 1391 traces).]

N3 : fitness = +, precision = -, generalization = +, simplicity = +

Model N4

[Figure: Petri net N4 with one separate path from start to end per trace variant, for example acdeh, adceh, acdeg, adceg, abdeh, adbeh and abdeg, shown next to the event log (21 variants, 1391 traces).]

N4 : fitness = +, precision = +, generalization = -, simplicity = -

… (all 21 variants seen in the log)

Another challenge: Huge search space

[Figure: the space of candidate process models, each drawn as an "M".]

… with just a few interesting candidates

[Figure: the same space of candidate models, each drawn as an "M".]

Two requirements

1. It should be possible to seamlessly balance the different quality criteria based on user-defined preferences.

2. The algorithm should always return a "correct" process model and not waste time on models that have deadlocks or other anomalies.

Proposal: Evolutionary Tree Miner (ETM)

• Process trees as representation (= limit search space to "good" models).

• Genetic approach (= very flexible).

• Fitness function uses all four criteria (= seamlessly balance the different "forces").

Representational Bias: Process Trees

• Always sound because of the block structure

• Also Loop and OR operator

A (BA)*

[Figure: example process tree; labels extracted from the slide: A, D, C, E, X, B, A.]

Petri Net Semantics (used for comparison and conformance checking only)

[Figure: Petri net translation of each process tree operator applied to two activities A and B: Sequence, Exclusive Choice, Loop, Parallelism, Or Choice.]
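The operators above can be illustrated with a minimal process tree sketch. It is not the ETM data structure: the `Node` class, the helper names, the restriction to single-character activity labels, the two-child loop (body, redo) and the `max_loops` bound are all assumptions made to keep the example small. The function enumerates the traces a tree allows, which shows why a block-structured tree cannot deadlock: every operator simply composes the languages of its children.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import Set, Tuple


@dataclass(frozen=True)
class Node:
    op: str                              # "leaf", "seq", "xor", "and", "or", "loop"
    label: str = ""                      # single-character activity (leaves only)
    children: Tuple["Node", ...] = ()


def leaf(label): return Node("leaf", label)
def seq(*kids): return Node("seq", children=kids)
def xor(*kids): return Node("xor", children=kids)
def par(*kids): return Node("and", children=kids)
def or_(*kids): return Node("or", children=kids)
def loop(body, redo): return Node("loop", children=(body, redo))


def shuffle(x: str, y: str) -> Set[str]:
    """All interleavings of two traces that keep each trace's own order."""
    if not x:
        return {y}
    if not y:
        return {x}
    return ({x[0] + rest for rest in shuffle(x[1:], y)}
            | {y[0] + rest for rest in shuffle(x, y[1:])})


def shuffle_all(languages) -> Set[str]:
    result = {""}
    for lang in languages:
        result = {w for a in result for b in lang for w in shuffle(a, b)}
    return result


def language(node: Node, max_loops: int = 2) -> Set[str]:
    """Traces allowed by the tree, with loops repeated at most max_loops times."""
    kids = [language(child, max_loops) for child in node.children]
    if node.op == "leaf":
        return {node.label}
    if node.op == "seq":
        return {"".join(parts) for parts in product(*kids)}
    if node.op == "xor":
        return set().union(*kids)
    if node.op == "and":
        return shuffle_all(kids)
    if node.op == "or":                  # any non-empty subset of the children
        out: Set[str] = set()
        for size in range(1, len(kids) + 1):
            for subset in combinations(kids, size):
                out |= shuffle_all(subset)
        return out
    if node.op == "loop":                # body (redo body)*
        body, redo = kids
        out, current = set(body), set(body)
        for _ in range(max_loops):
            current = {w + r + b for w in current for r in redo for b in body}
            out |= current
        return out
    raise ValueError(f"unknown operator: {node.op}")


if __name__ == "__main__":
    a, b = leaf("A"), leaf("B")
    print(sorted(language(loop(a, b))))  # the A (BA)* example, bounded
```

With `max_loops=2` the loop over A and B yields A, ABA and ABABA, a bounded view of the A (BA)* language from the process tree slide.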

Steps of the Genetic ETM Algorithm

[Flowchart: Create Initial Population → Measure Quality → Stop? If No: Change Population, then Measure Quality again. If Yes: Return Best Individual.]

Population Change

[Figure: Population i becomes Population i+1 through Elite, Selection, Crossover, Mutation and Replace.]
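The loop and the population change can be sketched as a generic evolutionary algorithm. This is a minimal illustration, not the ETM implementation: `evolve` and its parameters are hypothetical names, tournament selection is an assumed selection scheme, and the toy individuals are bit strings rather than process trees.

```python
import random
from typing import Callable, Tuple

Bits = Tuple[int, ...]


def evolve(create: Callable, quality: Callable, crossover: Callable,
           mutate: Callable, *, rng: random.Random, pop_size: int = 20,
           elite_size: int = 2, generations: int = 50, target: float = 1.0):
    """Generic loop: create, measure, stop?, change population, repeat."""
    population = [create(rng) for _ in range(pop_size)]       # create initial population
    for _ in range(generations):
        ranked = sorted(population, key=quality, reverse=True)  # measure quality
        if quality(ranked[0]) >= target:                      # stop?
            break
        elite = ranked[:elite_size]                           # elite is copied unchanged
        offspring = []
        while len(offspring) < pop_size - elite_size:
            # Selection: two small tournaments pick the parents (an assumption).
            parent1 = max(rng.sample(ranked, 2), key=quality)
            parent2 = max(rng.sample(ranked, 2), key=quality)
            child = mutate(crossover(parent1, parent2, rng), rng)
            offspring.append(child)
        population = elite + offspring                        # replace
    return max(population, key=quality)                       # return best individual


# Toy problem standing in for process trees: maximise the share of 1-bits.
def create_bits(rng: random.Random) -> Bits:
    return tuple(rng.randint(0, 1) for _ in range(12))


def ones_ratio(bits: Bits) -> float:
    return sum(bits) / len(bits)


def one_point_crossover(a: Bits, b: Bits, rng: random.Random) -> Bits:
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(bits: Bits, rng: random.Random) -> Bits:
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


if __name__ == "__main__":
    best = evolve(create_bits, ones_ratio, one_point_crossover, flip_one_bit,
                  rng=random.Random(0))
    print(best, ones_ratio(best))
```

Because the elite is carried over unchanged, the best quality in the population never decreases from one generation to the next.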

Four Metrics (see paper)

[Figure: the four quality criteria (replay fitness, precision, generalization, simplicity).]

1 = optimal, 0 = very bad

Example

[Figure: process model over the activities A to G.]

A = send e-mail, B = check credit, C = calculate capacity, D = check system, E = accept, F = reject, G = send e-mail

unknown

Conventional Algorithms (1/3) ("best effort" mapping to process trees to allow for comparison)

alpha miner, ILP miner, language-based region miner

sound, sound, sound

low fitness, low precision, low fitness

… lucky

Conventional Algorithms (2/3)

heuristic miner, multi-phase miner

unsound, unsound (relaxed sound)

Conventional Algorithms (3/3)

state-based region miner, genetic miner

unsound, sound

Often an unsound result, and no mechanism to seamlessly balance the four criteria

unsound models

ignored criteria

Genetic Mining (ETM) While Considering Only One Criterion

[Figure annotation: best value possible for this log]

ETM with weight zero assigned to three out of four perspectives.

Considering Replay Fitness and One Other Criterion

ETM with weight zero assigned to two out of four perspectives.

Considering 3 of 4 Criteria

Replay fitness needs to have a larger weight.

Considering All Four Criteria with Emphasis on Fitness

fitness has weight 10
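One way to read the weighting in these experiments is as a weighted average of the four metric values, each between 0 and 1. The sketch below assumes that reading; the function name, the weight of 1 for the three other criteria and the example scores are hypothetical and are not taken from the slides.

```python
CRITERIA = ("fitness", "precision", "generalization", "simplicity")


def overall_quality(scores, weights):
    """Weighted average of the four quality values (each in [0, 1])."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[c] * scores[c] for c in CRITERIA) / total_weight


# Hypothetical scores for one candidate model (illustrative values only).
scores = {"fitness": 0.9, "precision": 0.8, "generalization": 0.6, "simplicity": 1.0}

only_fitness = {"fitness": 1, "precision": 0, "generalization": 0, "simplicity": 0}
equal = {c: 1 for c in CRITERIA}
# Fitness weight 10 as on the slide; weight 1 for the others is an assumption.
fitness_heavy = {"fitness": 10, "precision": 1, "generalization": 1, "simplicity": 1}

if __name__ == "__main__":
    for name, w in [("only fitness", only_fitness), ("equal", equal),
                    ("fitness x10", fitness_heavy)]:
        print(name, round(overall_quality(scores, w), 3))
```

Setting a weight to zero removes that criterion from the search entirely, which corresponds to the experiments that consider only one, two or three of the four perspectives.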

Initial Model Versus Discovered Model

[Figure: two models over the activities A to G, one labeled "simulated" and one labeled "discovered by ETM".]

Discovered model outperforms initial model with respect to all criteria!

Better than existing algorithms (but patience is needed)!

Real-Life Event Logs

• Event log L0 is the event log used before. L0 contains 100 traces, 590 events and 7 activities.

• Event log L1 contains 105 traces, 743 events in total, with 6 different activities.

• Event log L2 contains 444 traces, 3,269 events in total, with 6 different activities.

• Event log L3 contains 274 traces, 1,582 events in total, with 6 different activities.

Event logs L1, L2 and L3 are extracted from the information systems of municipalities participating in the CoSeLoG project (http://www.win.tue.nl/coselog/).

Results

Equal weights for all criteria.

If a discovered model is unsound, the sound behavior is approximated when creating the process tree.

Conclusion

• First algorithm that allows for balancing all four perspectives.

• Genetic algorithm is very flexible, but also very slow.

• Process trees only used internally (choose your favorite representation)

• Future work:

− Improve speed

− Distribute PM tasks

− Discover configurable process trees

www.processmining.org

www.win.tue.nl/ieeetfpm/
