Workflow Mining: A stepwise approach for extracting Event logs from Corporate Network Data

G.P. Demmenie∗, M.V. Dignum∗, J. van den Berg∗, J.L.M. Vrancken∗, and M. Israel†

January 30, 2011

Abstract Repeated process execution patterns, and deviations from them, are useful for identifying fraudulent corporate behaviour. Not all organisations have workflow management systems that provide the so-called event logs on which workflow mining techniques rely to give insight into those processes, but corporate network data is often available. However, no methods are available to extract event logs from corporate network data. The stepwise approach presented in this paper makes it possible to extract that information in four steps: 1) indexing, 2) invoice number mining, 3) document discovery and 4) activity extraction. Though validation of the resulting workflow model is difficult, the stepwise approach makes it possible to extract the information needed to create event logs from corporate network data, as demonstrated in a case study evaluation.

Keywords Workflow, Corporate Network Data, Event Logs, Workflow Mining, Workflow Models.

1 Introduction

Repeated process execution patterns, and deviations from them, are useful for identifying fraudulent corporate behaviour. Currently, information from confiscated computer systems is indexed and then searched manually for keywords and names; it would be much more revealing to see the repeated execution of a certain process and look for the various execution paths. When this shows deviating patterns, it might indicate deviation from work instructions and possible fraud. The literature describes several techniques to mine workflow models that visualize the various execution paths of a process. Such techniques, like the α-algorithm [19, 8, 22, 23, 16] or HeuristicsMiner [21], are often based on the availability of so-called event logs. These event logs are produced by workflow management systems, for instance ERP systems like SAP, PeopleSoft and Oracle, and show which tasks are performed, when, and to which execution of the process those tasks belong.

∗Faculty Technology, Policy and Management, Delft University of Technology, Netherlands
†Kecida, Netherlands Forensic Institute, Netherlands

Even though not all organisations have such systems in place, many tasks within processes are still supported by digital documents. These documents are stored on the various computer systems within the organisation. The combined images, or copies, of the storage devices in those computer systems of a corporation thus harbour information about the processes within the organisation. We call these combined images the Corporate Network Data, and from that data we want to extract the information needed to mine the workflow models and provide insight into the repeated execution of processes.

Existing literature does not provide methods to extract information from the Corporate Network Data and format it such that the existing workflow mining techniques can be used to create workflow models. In this paper, we introduce a stepwise approach to extract event logs, which are the input for the workflow mining techniques, from the Corporate Network Data we have at our disposal.

The results from this research show that it is indeed possible to extract the information to construct event logs and use existing workflow mining techniques to create workflow models.

The paper is organised as follows. Section 2 explains why Corporate Network Data does not lend itself directly to workflow mining and describes both the input and the output of this research. In Section 3 we present a method to extract and format the information needed to mine workflows using existing methods. The approach is applied in a case study, of which we present the results in Section 4. In Section 6 we discuss the approach. Finally, in Section 7 we present the conclusions and directions for future research.

2 Preliminaries

Usually workflow mining is performed on so-called event logs [5, 6, 7, 9, 10, 12, 13, 14, 17, 18, 19], while only a few techniques enable the creation of such event logs from unstructured data, such as emails, so that they can use input other than just the event logs [11, 15].

In order to use workflow mining techniques, the right information needs to be extracted from the Corporate Network Data and formatted such that the existing workflow mining applications can perform their analysis on it. To provide the proper information and formatting we first need to describe the information needed to create event logs.

In this section the input and the output of this research are described.

2.1 Corporate Network Data

Corporate Network Data (CND) is the data collected within an organisation from its various computer systems. An exact copy is taken from the storage devices of all the computer systems in an organisation. This copy includes all files that are stored on computers, servers and removable storage devices.

Corporate Network Data can be seen as a dataset from which we aim to extract the financial transaction workflow information. Documents such as invoices, trial balances and lists of outstanding orders are the type of objects to look for. We assume that there is no prior knowledge about the organisation, such that all the information must be extracted from the dataset itself.

2.2 Event logs

When a system logs several attributes of tasks or activities at the time each task finishes, one could speak of a log. Examples of attributes to be stored are the name of the person executing the task or the time it took him or her to do it. Event logs, though, have specific characteristics that make them useful for workflow mining. The building blocks of an event log are the events and the notion of cases.

Definition An Event is an executed task or activity.

Definition A Case is a list of events with causal relations, hence the order in which they are recorded in the case is important.

Now that the building blocks for an event log are defined, the actual event log can be defined.

Definition An Event Log is a set of cases where the cases can be intertwined, but the order of events within a case needs to be preserved.

So what is added to the simple log is the notion of cases, each of which represents a single execution of a workflow, hence the executions of the various tasks that should be performed for that workflow. A workflow also prescribes the order in which tasks should be executed, therefore the order of execution of the tasks within a case should also be recorded. This does not have to be done with timestamps; it is enough to be able to assume that a task recorded earlier in the log was also executed before tasks recorded later in the log. When several cases are recorded in one log, that log is called an event log. It is not necessary that the cases are strictly separated in the log; they may be intertwined as long as the order within the cases is preserved. This means that if two cases are executed at the same time, the events of the two cases can be recorded serially and thus get mixed, while the order within each case still stays intact.

CaseID  Activity
1       A
2*      A
3       A
3       B
1       B
1       C
2*      C
4       A
2*      B
2*      D
5       A
4       C
1       D
3       C
3       D
4       B
5       E
5       D
4       D

Table 1: An event log where case 2 is highlighted (marked with *) and the order within the case is important

In Table 1 an example of an event log is shown in which case 2 is highlighted. It is clearly visible that various cases have mingled, but the assumption is made that the order within the cases is preserved. Case 2 can be described as Case 2 = {ACBD}, the tasks performed where CaseID = 2 in the order in which Table 1 records them, while the shortest form to represent the complete event log, or all its contained cases, would be {ABCD, ACBD, AED}. These are all the lists of tasks performed within the five cases, maintaining the order in which they appear in the log.
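As an illustration of how cases are recovered from an intertwined log, the following minimal sketch groups the rows of Table 1 by case identifier while preserving the recorded order. The names `LOG` and `traces_per_case` are ours and purely illustrative.

```python
from collections import defaultdict

# Rows of Table 1 in the order in which they were recorded: (case ID, activity).
LOG = [
    (1, "A"), (2, "A"), (3, "A"), (3, "B"), (1, "B"), (1, "C"), (2, "C"),
    (4, "A"), (2, "B"), (2, "D"), (5, "A"), (4, "C"), (1, "D"), (3, "C"),
    (3, "D"), (4, "B"), (5, "E"), (5, "D"), (4, "D"),
]


def traces_per_case(log):
    """Group activities by case ID, keeping the order in which they were logged."""
    cases = defaultdict(list)
    for case_id, activity in log:
        cases[case_id].append(activity)
    return {case_id: "".join(acts) for case_id, acts in cases.items()}


traces = traces_per_case(LOG)
# The shortest representation of the log: the distinct traces over all cases.
distinct_traces = sorted(set(traces.values()))
print(traces)
print(distinct_traces)
```

Because the grouping relies only on the position of each row in the log, no timestamps are needed, which matches the assumption made above.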

Many other attributes can be added, for instance the actors performing the task, which makes it possible to mine organisational structures [3, 20]. Timestamps make it possible to omit the explicit order in which the tasks are executed, as the order can then be reconstructed from the timestamps. Timestamps also add information in the form of throughput times of a workflow and make it possible to use different mining algorithms [4]. If there is a start and an end timestamp, even the throughput time of the task itself can be traced.
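A small sketch of how timestamps can replace the explicit order: the events of one case are sorted on their timestamp, after which the trace and the throughput time of the case follow directly. The events and dates below are invented for illustration only.

```python
from datetime import datetime

# Hypothetical timestamped events of a single case, deliberately out of order.
events = [
    ("B", datetime(2011, 1, 3, 9, 0)),
    ("A", datetime(2011, 1, 1, 9, 0)),
    ("D", datetime(2011, 1, 8, 9, 0)),
    ("C", datetime(2011, 1, 5, 9, 0)),
]

# Reconstruct the order of execution from the timestamps.
ordered = sorted(events, key=lambda event: event[1])
trace = "".join(activity for activity, _ in ordered)

# Throughput time of the case: time between its first and its last event.
throughput = ordered[-1][1] - ordered[0][1]
print(trace, throughput)
```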

Requirements The most basic attributes that are needed to construct event logs are as follows.

1. Case Identifier, the identifier that shows to which execution of the process the event belongs.

2. Activity Description, the description of the task that is performed in that event.

3. Order, the order in which the events occur.

3 The Stepwise Approach

In this section, we describe the Stepwise Approach as depicted in Figure 1. The figure shows the four distinct steps needed to extract the information from the Corporate Network Data to construct an event log. In between the steps the various intermediate deliverables are shown in the form of sets that need to be created. The approach assumes the availability of the Corporate Network Data and ends when all information to construct event logs is available.


The starting point of this research is that Corporate Network Data is available. We call this set $F_{cnd}$. Depending on how the organisation works, this set encompasses images of all the data stored on workstations, servers and possibly removable storage devices.

3.2 STEP 1: Indexing

Indexing greatly reduces the time needed to search for files in the subsequent steps. The indexing information needed per file is defined below.

Definition The index of a file is $i_f = (N_f, W_f, id_f, fn_f, t_f)$

Where,

• $i_f$ is the index of a file $f$,

• $N_f$ is the set of numbers available in $f$,

• $W_f$ is the set of words available in $f$,

• $id_f$ is an identifier to uniquely identify $f$,

• $fn_f$ is the filename of $f$,

• $t_f$ is the timestamp associated with $f$.

Figure 1: Visual representation of the stepwise approach (elements shown: Research Start, Corporate Network Data, 1. Indexing, 2. Invoice Number Mining with Collection and Validation, 3. Document Discovery, 4. Activity Extraction, Event Log, Workflow Mining, Workflow Model, Research Goal; intermediate sets $I$, $S$, $S_{inv}$ and $F_{inv}$)

An index is created for each file; the set of all indices is defined below.

Definition $I$ is the set of indices such that $I = \{ i_f \mid f \in F_{cnd} \}$
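To make the definition concrete, the sketch below builds such an index for files whose text content is already available. It is only an illustration under our own assumptions: the class `FileIndex`, the function `build_index` and the tokenisation rules are hypothetical and do not describe how the indexing was actually implemented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FileIndex:
    numbers: frozenset   # N_f: numbers available in the file
    words: frozenset     # W_f: words available in the file
    file_id: str         # id_f: unique identifier of the file
    filename: str        # fn_f: filename
    timestamp: str       # t_f: timestamp associated with the file


def build_index(file_id, filename, timestamp, text):
    """Build i_f for one file from its (already extracted) text content."""
    numbers = frozenset(int(n) for n in re.findall(r"\b\d+\b", text))
    # Words: runs of letters only (assumption: digits and underscores excluded).
    words = frozenset(w.lower() for w in re.findall(r"\b[^\W\d_]+\b", text))
    return FileIndex(numbers, words, file_id, filename, timestamp)


# Hypothetical stand-in for F_cnd: (id, filename, timestamp, text content).
files = [("f1", "orders.xls", "2010-01-05T10:00", "Invoice 100234 for order 17")]
indices = [build_index(*f) for f in files]   # the set I
print(indices[0])
```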

3.3 STEP 2: Invoice Number Mining

Finding the invoice numbers is needed to comply with the first requirement, finding a case identifier, as described in Section 2.2. The invoice number is the identifier for the transactions we are after, hence we need to find the invoice numbers.

Finding invoice numbers is a two-step approach. First, numbers that might possibly be invoice numbers need to be collected. Second, the numbers found need to be verified.

3.3.1 Invoice number collection

Collecting the possible invoice numbers means that a search has to be done in the locations where invoice numbers are expected to be found. Limiting the search to the files containing the word ‘invoice’, or a synonym in the language the company works in, already effectively limits the set of files to search through. Including only the document file types that are used to support the expected tasks further narrows the search space towards an environment in which numbers are more likely to be invoice numbers. Invoice numbers usually have a specific pattern, so it pays to examine a few files thoroughly and search for hints that limit the scope of possible invoice numbers further. When all this has been done, the limited set needs to be searched for the possible invoice numbers. The set of numbers that results from this step contains only natural numbers and is called the set $S$.
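The collection step can be sketched as a filter followed by number extraction. The keyword list, the file extensions and the function name `collect_candidates` below are assumptions made for illustration; in practice they depend on the organisation and its language.

```python
import re

# Assumptions for illustration: 'invoice' plus a Dutch synonym, and a few
# document file types that are expected to support the financial tasks.
KEYWORDS = {"invoice", "factuur"}
DOC_EXTENSIONS = (".doc", ".docx", ".xls", ".xlsx")


def collect_candidates(files):
    """Return the set S of natural numbers found in keyword-bearing documents.

    files: iterable of (filename, text content) pairs.
    """
    candidates = set()
    for filename, text in files:
        if not filename.lower().endswith(DOC_EXTENSIONS):
            continue  # not a document type of interest
        words = set(re.findall(r"[a-z]+", text.lower()))
        if not words & KEYWORDS:
            continue  # no invoice context in this file
        candidates.update(int(n) for n in re.findall(r"\b\d+\b", text))
    return candidates


sample_files = [
    ("a.xls", "Invoice 100234 total 500"),
    ("b.txt", "invoice 999"),
    ("c.doc", "meeting notes 12345"),
]
S = collect_candidates(sample_files)
print(S)
```

Note that the amount 500 still ends up in the set: restricting the context reduces noise but does not remove it, which is why the validation described next is needed.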

3.3.2 Invoice number validation

For validation of the invoice numbers we make three assumptions about how invoice numbers are established. First, we assume that invoice numbers always form a range. Second, we assume that the first occurrence of a lower number should be before that of higher numbers. Third, a sample of the found invoice numbers should be looked up in the files they originate from and checked to appear in the context of an invoice number. The set of numbers for which this is supposed to be true is the set of invoice numbers, which we call $S_{inv}$.

The assumption that invoice numbers should be a range of numbers follows from Dutch regulation that demands that invoice numbers are a range without missing numbers [1]. This can also be represented as follows:

Definition $\forall x \in \mathbb{N} : \min(S_{inv}) \le x \le \max(S_{inv}) \Rightarrow x \in S_{inv}$
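For a finite set of natural numbers this definition is equivalent to saying that the number of elements equals the width of the range. A minimal sketch of that check follows; the function name is ours.

```python
def is_consecutive_range(numbers):
    """True if every natural number between min and max is present in the set."""
    s = set(numbers)
    return len(s) == max(s) - min(s) + 1


print(is_consecutive_range({1000, 1001, 1002, 1003}))  # no missing numbers
print(is_consecutive_range({1000, 1001, 1003}))        # 1002 is missing
```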

Due to the same Dutch regulations we also argue that numbers are not only in a range, but also increase over time. This leads to the concept that lower numbers generally should appear earlier in time than higher numbers. So we assume the following:

Definition Let $I_x$ be the set of indices such that $\forall x \in S_{inv} : I_x = \{ i_f \mid x \in N_f \}$

It follows that for an invoice number $x$ we collect all indices $i_f$ for which $x \in N_f$; this results in a set of indices we call $I_x$. To find out when $x$ was recorded for the first time, we look at all timestamps $t_f$ in $I_x$ and select the earliest one. If we do the same for $y$ and $x < y$, then the earliest timestamp for $x$ should be earlier in time than that of $y$.
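The check described above can be sketched as follows. The index is simplified here to pairs of a number set and a timestamp, and all data are invented. As an assumption of this sketch, two numbers that first appear at the same moment (for instance in the same file) are not counted as a violation.

```python
from datetime import date


def first_seen(index, x):
    """Earliest timestamp among the indices whose number set contains x (I_x)."""
    return min(t for numbers, t in index if x in numbers)


def order_violations(index, invoice_numbers):
    """Successive pairs (x, y), x < y, where y is first seen earlier than x.

    Checking successive pairs suffices: if first appearances never decrease
    between neighbours, they never decrease between any two numbers.
    """
    ordered = sorted(invoice_numbers)
    firsts = {x: first_seen(index, x) for x in ordered}
    return [(x, y) for x, y in zip(ordered, ordered[1:]) if firsts[y] < firsts[x]]


# Hypothetical simplified index: (N_f, t_f) per file.
index = [
    ({100, 101}, date(2010, 1, 5)),
    ({101, 102}, date(2010, 1, 9)),
    ({100}, date(2010, 2, 1)),
    ({103}, date(2010, 1, 2)),
]
print(order_violations(index, {100, 101, 102, 103}))
```

In this invented example the number 103 first appears before 102, so the pair is reported and would call for closer inspection of the file in question.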

When it is found that both assumptions have been met for a set of numbers found earlier, and at least a sample of the numbers is found in the right context, it is safe to say that it is highly probable that the numbers represent invoice numbers.

3.4 STEP 3: Document Discovery

Document discovery is about finding all the documents that contain information about invoices. To be able to find those documents, a search for the invoice numbers ($S_{inv}$) in all the files of the Corporate Network Data needs to be performed. If the search is done through the complete set of files, using the index, a lot of noise is generated. Limiting the dataset to only the subset where invoice numbers are expected to be found helps decrease the amount of noise. Requiring that the documents contain the word ‘invoice’, or a synonymous word in the language of the documents, avoids searching in documents that are unrelated to the transaction processes.

To have the appropriate data available at a later stage we use the following definition of a document. The attributes mentioned need to be captured for later use.

Definition Let a document be $d = \{ s_{inv}, fn_d, l_d, t_d, u_d \}$

Where

• $s_{inv} \in S_{inv}$ is the invoice number found in $d$,

• $fn_d$ is the filename of $d$,

• $l_d$ is the stored location of $d$,

• $t_d$ is the timestamp associated with $d$,

• $u_d$ is the username associated with $d$.

The information extracted from the found documents makes it possible to comply partly with the second and fully with the third requirement. The filename recorded as $fn_d$ provides the basis for finding the activity; this will be refined in the last step. The timestamp $t_d$ makes it possible to establish the order of the activities within the case, which we need for the third requirement.
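The document definition and the discovery step can be sketched as below. The `Document` class mirrors the five attributes of the definition; the dictionary layout of the indexed files and the function `discover_documents` are assumptions of this sketch. Since a document carries a single invoice number, a file that contains several invoice numbers yields one record per number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    invoice_number: int   # s_inv
    filename: str         # fn_d
    location: str         # l_d
    timestamp: str        # t_d
    username: str         # u_d


def discover_documents(indexed_files, invoice_numbers, keyword="invoice"):
    """Return one Document per (file, invoice number) found in keyword files."""
    docs = []
    for f in indexed_files:
        if keyword not in f["words"]:
            continue  # outside the invoice context, likely noise
        for number in sorted(f["numbers"] & invoice_numbers):
            docs.append(Document(number, f["filename"], f["location"],
                                 f["timestamp"], f["username"]))
    return docs


# Hypothetical indexed files.
indexed_files = [
    {"filename": "outstanding.xls", "location": "server1/finance",
     "timestamp": "2010-01-05", "username": "jdoe",
     "words": {"invoice", "outstanding"}, "numbers": {1000, 1001, 555}},
    {"filename": "notes.doc", "location": "ws3/home",
     "timestamp": "2010-01-06", "username": "asmith",
     "words": {"meeting"}, "numbers": {1000}},
]
docs = discover_documents(indexed_files, {1000, 1001, 1002})
print(docs)
```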

3.5 STEP 4: Activity Extraction

We assume that each extracted document supports a specific task in the workflow. The most rudimentary way to extract the activities is to look at the naming of the files. Most organisations have a naming scheme for the products or supporting documents of their processes. This might not always be enough to determine a sensible name for the activity a file supports. In that case in-depth inspection of the files is needed to extract the supported task.
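One rudimentary way to exploit a naming scheme is to normalise filenames and group files that share the same normalised name. The normalisation rules below (drop the extension, digits and separators) and the filenames are assumptions for illustration; a real naming scheme may need different rules, and the resulting groups would still have to be labelled by inspection.

```python
import re
from collections import defaultdict


def activity_key(filename):
    """Normalise a filename to a candidate activity name."""
    stem = filename.rsplit(".", 1)[0].lower()   # drop the extension
    stem = re.sub(r"\d+", "", stem)             # drop dates, week numbers, counters
    stem = re.sub(r"[\W_]+", " ", stem)         # unify separators
    return stem.strip()


def group_by_activity(filenames):
    """Group filenames that normalise to the same candidate activity name."""
    groups = defaultdict(list)
    for name in filenames:
        groups[activity_key(name)].append(name)
    return dict(groups)


# Hypothetical filenames.
groups = group_by_activity([
    "Outstanding_2010-03.xls",
    "outstanding 2010-04.xls",
    "Profit and Loss 2010.xls",
])
print(groups)
```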

3.6 Event Logs

To align this work with workflow mining research and to be able to use the tools resulting from that research, there is a need to create so-called event logs.

We have now gathered all the information needed to create event logs as described in Section 2:

1. The case identifier, an identifier that is unique for each time the workflow is executed.

2. The activity, the task being performed in this event.

3. The order of the events, which can be deduced.

The identifier is found in the second step, where the invoice numbers are extracted and verified. The activity can partly be derived from the filename as described in step four, but in some cases the content of the files needs to be examined thoroughly to find the task being supported by the file. The order of the activities can be deduced from the time associated with the file extracted in step three.

Because all the information to build an event log is available, it is now straightforward to create that event log and produce the input for the available workflow mining techniques. At this point the existing literature can be used again.
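A minimal sketch of this final assembly: events consisting of a case identifier, an activity and a timestamp are sorted in time and written as delimited text. The column names and the CSV layout are our own choice for illustration; the file would still need to be converted to whatever input format the chosen workflow mining tool expects.

```python
import csv
import io


def build_event_log(events):
    """Sort (case_id, activity, timestamp) events by time, then by case."""
    return sorted(events, key=lambda e: (e[2], e[0]))


def to_csv(rows):
    """Serialise the event log as delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["case_id", "activity", "timestamp"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical events; ISO-formatted timestamps sort correctly as strings.
events = [
    (2, "Invoice", "2010-01-05T10:00"),
    (1, "Outstanding", "2010-01-07T09:00"),
    (1, "Invoice", "2010-01-04T12:00"),
]
rows = build_event_log(events)
log_text = to_csv(rows)
print(log_text)
```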

3.6.1 Workflow mining

Using the event logs it is possible to start mining workflows, either by using the ProM framework by van der Aalst et al., or possibly other tools. The resulting workflow models can then be examined in detail to see which workflows are the norm and which deviate from them. They can also be compared with the original design of the workflow to see whether it conforms to the actual workflow or not.

4 Case Study

For this case study the Stepwise Approach designed in Section 3 and depicted in Figure 1 is used to extract the information from Corporate Network Data to create event logs. The input for the approach, the Corporate Network Data, is the captured data from an international company.


The Corporate Network Data consists of an exact copy of various hard disks that came from the computer systems of an international company. This encompasses 300 Gigabyte of data in total. The systems from which the hard disks were copied include servers, workstations and laptops. The set comprises twelve physically separate systems in total.

The corporate network data is taken from those twelve systems. Three of the captured systems are file servers and the rest are workstations. The complete set of data consists of an exact copy of various hard disks that came from the computer systems of an international company. In addition, a large number of paper documents were scanned, digitised using OCR¹ and thereby made searchable using computerised techniques. The total encompasses 300 Gigabyte of data with about 3.6 million files. This whole set of files is called $F_{cnd}$.

We do not always need the complete $F_{cnd}$, hence in the following steps we will create three subsets, each useful for a particular step, as depicted in Figure 2. In step two we create a set of files from which to extract the invoice numbers, $F_n$, such that $F_n \subset F_{cnd}$. In step three we create two sets: one is used to search for documents ($F_e$), such that $F_e \subset F_{cnd}$, and the other is the result of step three of the approach and contains all the found documents that have something to do with invoices ($F_{inv}$), such that $F_{inv} \subset F_e$.

Figure 2: File datasets and their relations (sets shown: $F_{cnd}$, $F_e$, $F_{inv}$ and $F_n$)

4.2 STEP 1: Indexing

This step was done for us by the team that operates Xiraf. Xiraf is a forensic processing and analysis system that makes it possible to search digital evidence [2], developed together with the Netherlands Forensic Institute. Xiraf extracts information like file metadata, document properties, email, chat log records, browser history records, etcetera. We only used the file metadata and document properties options. The indexing performed by Xiraf makes it possible for us to quickly find documents with specific keywords and extract the metadata of those files, or search in the content of the files for other clues. The total number of objects to search through now encompasses 9.6 million objects that are quickly searchable for content or metadata.

¹ Optical Character Recognition

4.3 STEP 2: Invoice Number Mining

As the approach prescribes, this part takes two steps. First, we need to collect as many invoice numbers as possible. We assume that there is no access to the financial administration, thus the numbers should be found in the data available to us in the Corporate Network Data. Second, we need to verify the found invoice numbers so that we know with some certainty that we are actually modelling the right process and not a lot of noise.

During these two sub-steps three different sets of numbers will be created; their relations are depicted in Figure 3. In the collection step a set of natural numbers $S$ is created. Then, during the validation step, first the set $S'_{inv}$ is created such that $S'_{inv} \subset S$; this set is not yet the complete set of invoice numbers needed. $S'_{inv}$ is transformed into $S_{inv}$ in the final stages of this step.

Figure 3: Reference numbers (sets shown: $S$, $S'_{inv}$ and $S_{inv}$)

4.3.1 Invoice number collection

While in search of as many invoice numbers as possible, a common problem arises. Although we want to retrieve all invoice numbers (recall), we also want the numbers we find to indeed all be invoice numbers (precision). From a data-analysis perspective it is more important that the numbers found are really invoice numbers, as noise will complicate further steps in the process. Due to the structure of the dataset and the composition of the invoice numbers it is hard to have a high recall and at the same time a high precision. Numbers are abundant in the available dataset, and thus if high recall is requested there will be a lot of noise in the found invoice numbers. This leads to complications in further steps in the research and adds unnecessary, time-consuming queries in a later stage of the process.

Reducing the size of the dataset Due to the composition of the numbers we are after, it is known that the dataset will contain a large amount of noise. Numbers occur in many files and attributes of files on computers, thus the dataset needs to be reduced to only the relevant set of data. Numbers have many different meanings depending on the context they are found in. They can be sizes of files, amounts of money transferred in various currencies, IDs used by the computer, telephone numbers, dates and of course invoice numbers. By trying to reduce the dataset to known contexts it is possible to curb the number of meanings a number can have.

Several contextual demarcations are made. First, only MS Office documents are included in the dataset. Initial exploration of the data showed that many processes include entering invoice information in spreadsheets and word processing files. Second, the set of documents is reduced to only the documents including the word ‘invoice’. This sets the context in which the numbers ‘live’. Invoices still contain various numbers, including amounts of money, VAT numbers, invoice numbers and shipment numbers, but this set is more focused on the numbers we want to extract. This means that the probability that a number we find is actually an invoice number is much larger than if we were searching in the full dataset. Third, the initial exploration of the data showed that most of the files used for keeping track of invoices and orders are not over 1 MB in size, so larger files are excluded.

So, in order to reduce the possible contexts a number is found in, the initial dataset is reduced to only files with the attributes mentioned above. The resulting set of files is called $F_n$.

Defining the invoice number Within the reduced set of files the plan is to extract numbers, but numbers can have various lengths and might include special characters like dashes, commas or full stops. Based on inspection of the invoice numbers in several files in $F_n$ we have decided to restrict our search to numbers following the structure of the following regular expression:

'\b[1-9][0-9]{5,8}\b'

The components of the regular expression are defined as follows:

\b

Matches an empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

[1-9]

Matches a single numeric character with a value between 1 and 9 inclusive.

[0-9]{5,8}

Matches a string of 5 to 8 numeric characters, each with a value between 0 and 9 inclusive.

This means that the expression includes numbers with a length of 6 to 9 characters if they appear as a ‘word’, broadly meaning that they should have a non-alphanumeric, non-underscore character before and after the sequence of digits. The sequence of numeric characters must not be separated by any non-numeric characters. Also, the number must not start with a 0.
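The behaviour of the expression can be checked with Python's `re` module, whose `\b` has the word-boundary semantics described above. The sample text is invented; it is chosen so that each rejected number fails for a different reason.

```python
import re

INVOICE_PATTERN = re.compile(r"\b[1-9][0-9]{5,8}\b")

sample = ("Invoice 100234, tel 0612345678, ref A1234567, "
          "VAT 123456789, total 12345 / 1234567890")
matches = INVOICE_PATTERN.findall(sample)
print(matches)
# 100234      accepted: 6 digits, delimited by a space and a comma
# 0612345678  rejected: starts with a 0
# A1234567    rejected: no word boundary between the letter and the digits
# 123456789   accepted: 9 digits
# 12345       rejected: only 5 digits
# 1234567890  rejected: 10 digits
```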

$F_n$ is a subset of the Corporate Network Data ($F_{cnd}$). It reduces the full set to only files that contain digital documents that contain the word ‘invoice’ and are smaller than 1 Megabyte in size. This set is meant for finding the reference numbers we need in order to find the transactions.

Mining the invoice number In the actual invoice mining the above two concepts are used. Using the API to connect to Xiraf, all documents from $F_n$ are retrieved and their content is searched for matches of the regular expression described above.

This results in the first set that does not consist of files, but merely of numbers: $S$. This set holds the numbers mined from $F_n$. It holds around 1.5 million numbers, and after deleting duplicates $S$ still contains 207 thousand unique numbers.

4.3.2 Invoice number validation

As described earlier, numbers can represent a variety of different concepts. While the context has been curbed already, still many explanations for the numbers found can come up. Thus it is important to find out how likely it is that a found number is indeed what we hope it is: an invoice number. The method described in Section 3 prescribes that the numbers should confirm the following three hypotheses.

1. We expect them to be consecutive ranges.

2. We expect higher numbers to first appear at a later time than the lower numbers of a range.

3. We should be able to find these numbers in the context of invoices when actually looking at the content of some of the documents.

Ranges To identify ranges we have decided to look at the gaps between every two successive numbers in $S$. With the numbers in ascending order, we looked at the distance between successive numbers. If this is 1, there is no gap between this number and the next; if it is bigger, there is such a gap. To represent the results we have plotted this distance between numbers in Figure 4. Because of the high variance between the different distances we have shown the distances on a logarithmic scale, where the scale in this picture is the colour of the points. The scale is represented in heatmap colours, where dark blue means that the distance to the next value is rather low, while lighter blue up to red represents larger distances between successive numbers.

Figure 4: Distance between next number (both axes enumerate the values, with ticks from 0 to 400)

The axes of the figure are just the enumeration of the values. We have found 207,000 numbers, and the colour of the points in the heatmap shows the gap between one number and the next, starting at the bottom left, with the highest number found at the top right. Dark blue ‘bands’ that represent low gaps between successive values are clearly visible between y = 140 and y = 150.

Using this figure we have identified a candidate successive range ($S'_{inv}$) and zoomed in on it. Due to the collection method we know we will probably not find a fully consecutive range of numbers in $S'_{inv}$. We have also seen that, although the range is very nearly successive, it is still not consecutive. Since we know that, in the Netherlands, an organisation is required by law to use a complete consecutive range, we decided that instead of using only the numbers we found, we inject the missing numbers into our range $S'_{inv}$ and thus create our invoice number set $S_{inv}$. For reasons of confidentiality we decided to represent our range with values between 0 and 7000.
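Injecting the missing numbers amounts to replacing the candidate range by the full integer range between its minimum and maximum. A minimal sketch, with invented numbers and a function name of our own:

```python
def inject_missing(candidate_range):
    """Return the completed range and the numbers that had to be injected."""
    low, high = min(candidate_range), max(candidate_range)
    full = set(range(low, high + 1))
    return full, sorted(full - set(candidate_range))


# Hypothetical near-consecutive candidate range.
s_inv, injected = inject_missing({1000, 1001, 1003, 1004, 1006})
print(sorted(s_inv))
print(injected)
```

Keeping track of the injected numbers separately makes it possible to see later which cases were actually observed in the data and which were added on the basis of the regulation.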

Increasing over time Our second hypothesis about invoice numbers is that, if they are consecutive, lower numbers should appear earlier in time than higher numbers. To check this we have plotted the appearances of numbers in time in Figure 5.

Figure 5: Numbers plotted in time with a linear line showing that higher numbers appear later in time than lower numbers

The figure plots all occurrences of the numbers against the time of the file in which they are found. So if a number is found in several files, there can be several dots along the horizontal axis. If one file, or several files with the same date, harbour more than one number, there will be multiple dots along the vertical axis. When looking at the leftmost observation of an invoice number, which is the first appearance of that number, it is clearly visible that the red linear line increases over time and all observations but one are under that line. Hence higher numbers appear later in time than lower numbers.

There is only one noisy observation: around invoice number 1900 we observed one early appearance of a number. Close inspection of the actual file showed that the number in that context was not an invoice number, but a telephone number.

In-depth file analysis The third hypothesis was that close inspection of a sample of the files in which numbers are found, and interpretation of the context of the numbers, should show that the numbers are indeed invoice numbers. We inspected various files in which we found numbers, and in all of them the context showed that the numbers found were indeed invoice numbers.

4.4 STEP 3: Document Discovery

To find the documents that will represent our activities later in the research, we now use $S_{inv}$ to find all documents that harbour one of the invoice numbers. We want to find all the files that can be linked to numbers from $S_{inv}$, but we also want a number found in such a file to be, again, an invoice number. To make sure of this we decided to apply approximately the same strategy as during the invoice mining step. This time we did not restrict on file size, so the only restrictions are, first, that it must be an MS Office document, since those are used to support some of the tasks related to financial transactions, and second, that it must include the word ‘invoice’. The resulting set is called $F_e$.

The extraction of the workflow information means that we need to extract all the files or documents that contain information about the transactions we are looking for.

A file containing an invoice number is seen as an event, because we assume that when a file is saved some event or activity linked to a transaction has taken place.

So for each invoice number found in the previous step of extracting invoice numbers, we now try to extract all the events linked to it. For this we use $S_{inv}$ as input and $F_e$ as the set of files to search in. Using the API from Xiraf we can easily extract all files containing numbers from $S_{inv}$.

The resulting set we call $F_{inv}$, which is a subset of $F_e$ and overlaps partly with $F_n$. It contains all files that contain one or more of the invoice numbers from $S_{inv}$. The number of documents discovered is 618 and they contain between 1 and 6100 numbers each.

Because building this dataset takes quite a long time, the information is extracted from the Xiraf system and the metadata is stored in a database.

The stepwise approach prescribes the format of the found documents to be defined as:

Definition Let a document be $d = \{ s_{inv}, fn_d, l_d, t_d, u_d \}$ (see Section 3.4)

The mapping of the actual data found in our dataset to the format prescribed in the definition of a document is shown in Table 2.

Attribute    File Attribute Used
$s_{inv}$    Invoice Number
$fn_d$       Path (partly)
$l_d$        Path (partly)
$t_d$        lastSavedOn
$u_d$        lastAuthor

Table 2: The mapping from the definition of documents to the available information.

This information is directly stored as raw data ina database, then we transformed it into a dataware-house. The facts are then describing the fact thatan event has taken place. The dimensions are time,user, location and files.

The following steps use the lastSavedOn value as the time, as that is the best available predictor of when the process took place. For the user we chose lastAuthor, as that is probably the person who performed the task. Two different pieces of information are retrieved from the path of a file. First, it tells us on which system the file was found. Second, the filename is later used to group similar files and to name the activity the file supports.
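The mapping of Table 2 can be sketched as a small record type. This is an illustration under assumptions, not the scripts used in the case study: it assumes that the first component of a path names the system the file was found on, that the last component is the filename, and that lastSavedOn is available as an ISO 8601 string. The names `Document` and `to_document` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath

@dataclass(frozen=True)
class Document:
    s_inv: str      # invoice number found in the file
    fn_d: str       # filename, taken from the path
    l_d: str        # location, taken from the path
    t_d: datetime   # lastSavedOn
    u_d: str        # lastAuthor

def to_document(invoice_number, path, last_saved_on, last_author):
    """Build one document record from the metadata of a file."""
    parts = PurePosixPath(path).parts
    return Document(
        s_inv=invoice_number,
        fn_d=parts[-1],   # assumption: last path component is the filename
        l_d=parts[0],     # assumption: first path component names the system
        t_d=datetime.fromisoformat(last_saved_on),
        u_d=last_author,
    )

# Hypothetical sample values.
d = to_document("20100317", "pc07/finance/outstanding_week12.xls",
                "2010-03-22T09:15:00", "jdoe")
```

A file that contains several invoice numbers would yield one such record per number, since each record represents one event for one case.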

4.5 STEP 4: Activity extraction

We assumed that we would find quite a few files that are alike, not only in format but also in name. Most organisations apply some sort of naming scheme for the files that are vital to their processes. Once a list of all the files found is made, the files should be grouped into sets of similar files. We did this by looking for similarly named files. The 618 files found could be grouped into about 100 groups, some containing only one file and others up to sixty. For some groups, files were examined and labelled by an expert. Some groups were discarded as noise, for instance where a file contained a few numbers that, judging by the context, turned out to be phone numbers.
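Grouping by similar names could be supported by a simple normalisation of the filenames. The sketch below is one possible heuristic and not the procedure followed in the case study: it assumes that variants of the same report differ only in case, digits (dates, week numbers) and separators. The function names and sample filenames are hypothetical.

```python
import re
from collections import defaultdict

def name_key(filename):
    """Normalise a filename so that variants of one report share a key."""
    stem = filename.rsplit(".", 1)[0].lower()   # drop the extension
    stem = re.sub(r"\d+", "", stem)             # drop dates and week numbers
    stem = re.sub(r"[^a-z]+", " ", stem)        # separators become one space
    return stem.strip()

def group_by_name(filenames):
    """Group filenames by their normalised key, keeping input order."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name_key(name)].append(name)
    return dict(groups)

# Hypothetical filenames.
groups = group_by_name([
    "Outstanding_2010-03-01.xls",
    "outstanding 2010-03-08.xls",
    "Profit and Loss 2010.xls",
])
```

The resulting keys are only candidate activity names; deciding which groups are real activities and which are noise would still require inspection by an expert.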

Confidentiality issues make it impossible to provide the full list of activities, but we give a few examples. First, there is the activity we called ‘Outstanding’, which represents the listing of all outstanding, thus not yet paid, orders. This activity was performed on a weekly basis for quite some time during the time window used. The activity ‘Profit and Loss’ also caught our eye: the profit and loss balance was created often in the time window we looked at.

After labelling all files with activities, all the information needed to create event logs is available.

5 Results

5.1 Event Logs

The data resulting from the four steps comprises all information needed to create the event logs we are after. Per event, which is a document, we have the invoice number to represent the case ID. The activity is extracted from the name and the content of the document, and the order is deduced from the timestamps available per document. Storing this information in a simple text file gives us the event log. In our case this is an event log with 6367 cases and a total of 194524 events in a timeframe of just over a year.
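Writing such a text file can be sketched as follows. The column names and the CSV layout are assumptions made for illustration; the input format actually required by the mining tool is not covered here. Events are ordered per case by timestamp, which is how the order of activities is deduced.

```python
import csv
import io
from datetime import datetime

def build_event_log(documents):
    """Order (case_id, activity, timestamp) triples per case by time."""
    rows = sorted(documents, key=lambda d: (d[0], d[2]))
    return [(case, activity, ts.isoformat()) for case, activity, ts in rows]

def write_event_log(rows, handle):
    """Write the event rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["case_id", "activity", "timestamp"])  # assumed layout
    writer.writerows(rows)

# Hypothetical events: invoice number, activity label, lastSavedOn.
docs = [
    ("20100317", "profit and loss", datetime(2010, 4, 1, 10, 0)),
    ("20100317", "outstanding", datetime(2010, 3, 22, 9, 15)),
    ("20100318", "outstanding", datetime(2010, 3, 22, 9, 15)),
]
log = build_event_log(docs)
buffer = io.StringIO()   # stands in for a file opened for writing
write_event_log(log, buffer)
```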

5.2 Workflow mining

Using the ProM tool from van der Aalst et al., the event log is mined and the resulting workflow model is shown in Figure 6. The mining was done using the HeuristicsMiner algorithm [21]. HeuristicsMiner was chosen because it can cope with noise and offers the possibility to show or hide that noise, which matters because this noise is probably exactly where possible fraud will be visible.
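HeuristicsMiner [21] handles noise by scoring each pair of activities with a dependency measure computed from how often one activity directly follows the other: (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), and |a>a| / (|a>a| + 1) for a loop from an activity to itself. The sketch below computes this measure on a toy set of traces; it is an illustration of the measure only, not of the ProM implementation, and the trace data is hypothetical.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts

def dependency(counts, a, b):
    """Dependency measure a => b as defined for HeuristicsMiner [21]."""
    ab = counts[(a, b)]
    if a == b:
        return ab / (ab + 1)          # loop from an activity to itself
    ba = counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)

# Hypothetical traces, one list of activities per case.
traces = [
    ["outstanding", "outstanding", "budget"],
    ["outstanding", "budget"],
    ["budget", "outstanding"],
]
df = directly_follows(traces)
```

Relations that are observed only a few times, or about as often in both directions, receive a low value, so a threshold on this measure decides which arcs are shown and which are treated as noise.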

[Figure 6 graphic. Nodes, with the number printed at each node: ArtificialStartTask 6367; profit and loss 59609; account 18401; creditor list 1235; cashflow 465; debtor 181; outstanding 10875; income check 112; void item 98; urgent payment 172; cost analysis 333; international 37; international revenues 4; cashflow forecast 24899; budget 42066; ArtificialEndTask 6367; losses 46; audit 2; debtor balance 341; per exact 1867; analysis 7451; cost by supplier 4607; revenues 471; debtor ageing 91; trail balance 283; early warning report 457; transfer schedule 199; extraordinary costs 4; payment schedule 4; itl payment 12; investment schedule 99; debtor claimed 8; advances to creditors 6; int report 2; forecast 18; turnover by debtors 2710; financial review 824; outstanding international 866; debtor list 531; aged debt analysis 1755; supplier export 67; outstanding suppliers 519; early warning budget 29; aging 34.]

Figure 6: The resulting Workflow model.

The resulting model is quite complex, with many connections, loops and shortcuts. We have not fully interpreted the model, but we saw some interesting behaviour during a simulation of it. Using the ProM tool we created a simulation of the mined model. This simulation visualizes all the cases in the event log and shows how they travel through the model. It appears that when, for instance, an invoice enters the activity ‘outstanding’, which is the activity in which all outstanding invoices are collected, it often loops back to that same activity a considerable number of times.

It is also quite interesting that there are indeed a few paths within the model that are taken far more often than others.
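Both observations can also be checked directly on the event log, without simulation. The following sketch assumes the log has already been turned into one ordered list of activities per case; it counts how often each complete path occurs and how often an activity loops back to itself within a case. The function names and the sample traces are hypothetical.

```python
from collections import Counter

def variant_counts(traces):
    """Count how many cases follow each distinct sequence of activities."""
    return Counter(tuple(trace) for trace in traces.values())

def self_loops(trace, activity):
    """Number of times `activity` is directly followed by itself."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a == b == activity)

# Hypothetical traces keyed by invoice number.
traces = {
    "20100317": ["outstanding", "outstanding", "outstanding", "budget"],
    "20100318": ["outstanding", "budget"],
    "20100319": ["outstanding", "budget"],
}
variants = variant_counts(traces)
loops = self_loops(traces["20100317"], "outstanding")
```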


5.3 Validation

Validating the actual workflow model is not possible, since there is no access to the organisation the data originates from. Even if it were possible to interview the employees, it would be rather difficult to extract the model by that means. This is one of the reasons research into mining workflows started.

The invoice numbers themselves, however, have been validated as actual invoice numbers using the steps defined in the stepwise approach. The files in Fn have also been validated to indeed contain invoice numbers.

5.4 Evaluation

We have extracted the case ID from the Corporate Network Data. We have also extracted the activities from the filenames and the content of files, and we have been able to extract the order in which the activities take place within the cases using the timestamps of files.

We have shown that with this stepwise approach we have bridged the gap between the availability of Corporate Network Data and the possibility to apply workflow mining techniques. We did so by applying one of those techniques to the output of our stepwise approach, which resulted in a workflow model. What we cannot say is how well the workflow model represents the real processes within the organisation the Corporate Network Data originates from, as discussed in Section 5.3.

5.5 Conclusion

With this case study we wanted to show the applicability of the designed stepwise approach. The evaluation shows that we have been able to extract event logs from the Corporate Network Data and mine a workflow model. Thus the extraction of event logs from Corporate Network Data, and workflow mining using those event logs, has been demonstrated.

6 Discussion

The ambition level of the initial problem definition turned out to be too high for the time available to perform the research. The initial idea was to find fraudulent invoices using the assumption that cases following a workflow that occurs infrequently compared to other workflows are suspicious. The time needed to prepare the data for workflow mining turned out to be much longer than expected, and the road to get there non-existent. This led to the research being ended before the initial goal could be reached. Still, it did lay a new road for preparing the poorly structured Corporate Network Data for workflow mining.

The most important feature of this research is that it makes it possible, with some reservations, to mine workflows regarding financial transactions from a vast amount of poorly structured data. We have shown how to structure the data into a known structure called event logs; from there, proven methods can be used to extract workflows that give insight into the paths a specific item, in this case the invoice, travels.

Although we have taken the invoice as a means to demonstrate this principle, it could also be argued that, for instance, tracking the workflows around contracts is possible. If all the documents surrounding a contract can be found, and if there is an identifier that shows which documents belong to a specific case, the same approach can be taken to find the workflows around contracts. This approach can be used in e-discovery, where investigators try to find fraudulent activities.

Preprocessing the data to make it fit the input format of the workflow mining software was difficult. We have developed some scripts that help, but those are not generic enough to be used on datasets other than the one used in this research.

7 Conclusions

The problem was to bridge the gap between the availability of Corporate Network Data and the possibility to apply workflow mining techniques in order to get a wider view on processes within an organisation. The format and information needed as input for the existing workflow mining techniques, event logs, cannot be taken directly from the Corporate Network Data. Our stepwise approach is designed to make it possible to extract all information needed to create those event logs. Those event logs can then be used as input for the workflow mining techniques.

We have shown in Section 3 that we developed a stepwise approach to distill event logs from Corporate Network Data by applying four distinct steps.

First, indexing the Corporate Network Data. Second, finding the invoice numbers hidden amongst the data. Third, finding the documents that contain the found invoice numbers. Fourth, extracting the activities supported by the documents found.

This gives the input needed to create the event logs, so that we can apply proven methods to extract the workflow model around the invoices.

In Section 4 we have shown how to apply the stepwise approach to construct an event log from Corporate Network Data, and how the resulting event logs can actually be used as input for existing and proven methods to develop a workflow model. The case study shows that the approach can be applied to corporate network data and that it results in event logs which produce a workflow model.

The resulting workflow model, though, is very hard to verify, as we did not have access to the employees of the organisation where the data originated and because it is inherently difficult to validate the real workflows even if access to the employees had been possible. Hence the resulting workflow model is not verified.

Although the validation is not rock solid, the designed approach looks promising enough to be researched further and adapted.

As already noted, the research has not reached its initial goal. Further research could be done, both towards enriching the method itself and towards applications of the method to everyday problems.

7.1 Future directions

A possible direction for future research is to add the monetary value represented by the invoice to the invoice numbers; one could then search for fraud by looking at whether and when the value of the invoice changed. If the users are mapped to real persons, it may well show that one case is, for instance, handled by a different user, which in a search for fraudulent invoices could point to an anomaly in the usual workflow. It could also give more insight into the roles users fulfil in the organisation, as Ang et al. show in their research [3].

If one can really identify the ‘norm’ in the process, it would be interesting to identify deviating patterns, being patterns that happen only sporadically. Assuming that frequently occurring patterns are legitimate, one could then focus on the deviating patterns that are found. That would give more insight into whether a given transaction was legitimate or not.
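One simple way to operationalise this idea is to treat every path followed by only a small share of the cases as deviating. The sketch below does so under two assumptions that are not part of this research: that a pattern is the complete sequence of activities of a case, and that a share of at most 5% counts as sporadic. The function name, the threshold and the sample data are hypothetical.

```python
from collections import Counter

def deviating_cases(traces, max_share=0.05):
    """Return the cases whose activity sequence is followed by few cases.

    traces: dict mapping a case ID to its ordered list of activities.
    max_share: assumed cut-off; variants at or below this share of all
    cases are treated as deviating.
    """
    variants = Counter(tuple(t) for t in traces.values())
    total = len(traces)
    rare = {v for v, n in variants.items() if n / total <= max_share}
    return sorted(case for case, t in traces.items() if tuple(t) in rare)

# Hypothetical log: twenty cases follow the usual path, one does not.
traces = {f"case-{i:02d}": ["outstanding", "budget"] for i in range(20)}
traces["case-20"] = ["budget", "outstanding"]
suspicious = deviating_cases(traces)
```

A case flagged this way is only a candidate for closer inspection; an infrequent path is not by itself evidence of fraud.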

Fraud detection is one of the possibilities of e-discovery, but this stepwise approach can find more than the workflows around invoices. If the right identifier is found for other processes, the same approach can be used to find, for instance, the workflows around contracts.


References

[1] Wet op de omzetbelasting 1968, hoofdstuk VI, afdeling 4, artikel 35a 1.b. Accessed 05-10-2010.

[2] W. Alink, R. Bhoedjang, P. Boncz, and A. de Vries. Xiraf - xml-based indexing and querying for digital forensics. Digital Investigation, 3(Supplement 1):50–58, 2006. The Proceedings of the 6th Annual Digital Forensic Research Workshop (DFRWS ’06).

[3] G. Ang, Y. Yang, Z. Ming, J.-L. Zhang, and Y.-W. Wang. Organizational structure mining based on workflow logs. pages 455–459, 2009.

[4] M. Berlingerio, F. Pinelli, M. Nanni, and F. Giannotti. Temporal mining for interactive workflow data analysis. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–118, New York, NY, USA, 2009. ACM.

[5] J. Cook and A. Wolf. Discovering models of software processes from event-based data. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998.

[6] J. E. Cook and A. L. Wolf. Software process validation: quantitatively measuring the correspondence of a process to a model. ACM Trans. Softw. Eng. Methodol., 8:147–176, April 1999.

[7] A. de Medeiros, A. Guzzo, G. Greco, W. van der Aalst, A. Weijters, B. Van Dongen, and D. Sacca. Process mining based on clustering: A quest for precision. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4928 LNCS:17–29, 2008.

[8] A. de Medeiros, B. van Dongen, W. van der Aalst, and A. Weijters. Process mining for ubiquitous mobile systems: An overview and a concrete algorithm. In L. Baresi, S. Dustdar, H. Gall, and M. Matera, editors, Ubiquitous Mobile Information and Collaboration Systems, volume 3272 of Lecture Notes in Computer Science, pages 151–165. Springer Berlin / Heidelberg, 2005.

[9] F. S. Esfahani, M. A. A. Murad, M. N. Sulaiman, and N. I. Udzir. Using process mining to business process distribution. In SAC ’09: Proceedings of the 2009 ACM symposium on Applied Computing, pages 2140–2145, New York, NY, USA, 2009. ACM.

[10] W. Gaaloul, K. Gaaloul, S. Bhiri, A. Haller, and M. Hauswirth. Log-based transactional workflow mining. Distrib. Parallel Databases, 25:193–240, June 2009.

[11] L. Geng, S. Buffett, B. Hamilton, X. Wang, L. Korba, H. Liu, and Y. Wang. Discovering structured event logs from unstructured audit trails for workflow mining. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5722 LNAI:442–452, 2009.

[12] S. Goedertier, J. de Weerdt, D. Martens, J. Vanthienen, and B. Baesens. Process discovery in event logs: An application in the telecom industry. Applied Soft Computing, 2010.

[13] M. Hammori, J. Herbst, and N. Kleiner. Interactive workflow mining–requirements, concepts and implementation. Data & Knowledge Engineering, 56(1):41–63, 2006. Business Process Management.

[14] S. He, T. Lv, and B. Huang. A new process mining algorithm of workflow. In 2009 International Conference on Industrial and Information Systems (IIS 2009), 24–25 April 2009, pages 83–85, Haikou, 2009.

[15] N. Kushmerick and T. Lau. Automated email activity management: An unsupervised learning approach. pages 67–74, 2005.

[16] J. Li, D. Liu, and B. Yang. Process mining: Extending α-algorithm to mine duplicate tasks in process logs. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4537 LNCS:396–407, 2007.

[17] W. van der Aalst, B. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A. Weijters. Workflow mining: a survey of issues and approaches. Data Knowl. Eng., 47:237–267, November 2003.

[18] W. van der Aalst and A. Weijters. Process mining: a research agenda. Computers in Industry, 53(3):231–244, 2004.

[19] W. van der Aalst, T. Weijters, and L. Maruster. Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–1142, 2004.

[20] Z. Weidong, D. Weihui, W. Anhua, and F. Xiaochun. Role-activity diagrams modeling based on workflow mining. volume 4, pages 301–305, 2009.

[21] A. Weijters, W. van der Aalst, and A. A. de Medeiros. Process mining with the heuristicsminer-algorithm. BETA Working Paper Series, WP 166, Eindhoven University of Technology, Eindhoven, 2006.

[22] L. Wen, W. van der Aalst, J. Wang, and J. Sun. Mining process models with non-free-choice constructs. Data Mining and Knowledge Discovery, 15:145–180, 2007. 10.1007/s10618-007-0065-y.

[23] L. Wen, J. Wang, W. M. van der Aalst, B. Huang, and J. Sun. Mining process models with prime invisible tasks. Data & Knowledge Engineering, 2010.
