
Data Quality Checklist for Process Mining

Data for process mining can come from many different places. One of the big advantages of process mining is that it is not specific to one kind of system. Any workflow or ticket system, ERP system, data warehouse, click-stream, legacy system, and even data that was collected manually in Excel can be analyzed as long as a Case ID, an Activity, and a Timestamp column can be identified [1,2].

However, most of that data was not originally collected for process mining purposes. And especially data that was manually entered can always contain errors. How do you make sure that errors in the data will not jeopardize your analysis results?

Data quality is an important topic for any data analysis technique: If you base your analysis results on data, you have to make sure the data is sound and correct. Otherwise, the results that you get will be wrong. But there are some challenges regarding data quality that are specific to process mining. Many of these challenges revolve around problems with timestamps. In fact, you could say that timestamps are the Achilles heel of data quality in process mining. But timestamps are not the only problem.

In this article, we give you a data quality checklist. Before you start with the actual process mining analysis, you can use this checklist to make sure that your data is correct and suitable for process mining. Furthermore, for each of the steps on the checklist we provide a detailed guide that describes the problem and shows how to fix it. You can download Disco from our website (http://fluxicon.com/disco/) to follow along with the instructions.

Data quality is very important, because if you show a business person results that are wrong due to data problems, you can lose their trust in process mining forever. And if you base your own decisions on data that is wrong, then you run the risk of drawing the wrong conclusions.

Even if you think your data is correct, you can run through the checklist to make sure you are right.

We hope this will help you to be successful in your own process mining initiatives.

Your friends from Fluxicon

Fluxicon, Bomanshof 259, 5611 NS Eindhoven, T +31-(0)62-436-4201, [email protected], www.fluxicon.com

Data Quality Checklist

1. No errors during import.

2. No gaps in the timeline.

3. Expected amount of data.

4. Expected distribution of attribute values. No unexpected empty values.

5. No cases with unexpected number of steps.

6. Expected timeframe. No unexpected long throughput times.

7. No unexpected ordering of sample cases. No unexpected flows in the process map.

8. Data validation session with process/domain expert done.

9. Documented all quality issues and data questions.

10. If you had to exclude data due to data quality problems, is the remaining set still representative?




Part I: Formatting Issues




1. No errors during import.

The first check is to pay attention to any errors that you get in Disco during the import. In many situations, errors stem from improperly formatted CSV files, because writing good CSV files is harder than one may think [3].

For example, the delimiting character (“,” “;” “|” etc.) cannot be used in the content of a field without proper escaping. So, if your file is using the “,” delimiter to separate the columns, and some of the activities have names that contain a comma (like in the example below), then proper CSV requires that the “File report, notify customer” activity is enclosed in quotes to indicate that the “,” is part of the name.

    Case ID, Activity
    case1, Register claim
    case1, Check
    case1, File report, notify customer

Or your file may have fewer columns in some rows compared to others (see below).

If Disco encounters a formatting problem, it gives you the following error message with the sad triangle and also tries to indicate in which line the problem occurs (see below).

In most cases, Disco will still import your data and you can take a first look at it, but make sure to go back and investigate the problem before you continue with any serious analysis.




We recommend opening the file in a text editor and looking around the indicated line number (a bit before and afterwards, too) to see whether you can identify the root cause.
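If the file is too large to inspect by hand, a small script can point to the suspicious lines. The following is a minimal sketch, not part of Disco: it assumes a comma-delimited file and treats the most frequent column count as the expected one. The function name `find_suspicious_rows` and the sample data are illustrative assumptions.

```python
import csv
import io
from collections import Counter


def find_suspicious_rows(csv_text, delimiter=","):
    """Return (line_number, column_count) for every row whose column
    count differs from the most common column count in the file."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # reader.line_num is the number of the last physical line read for a row.
    rows = [(reader.line_num, len(row)) for row in reader if row]
    if not rows:
        return []
    expected = Counter(count for _, count in rows).most_common(1)[0][0]
    return [(line, count) for line, count in rows if count != expected]


# The unquoted comma in the last activity name produces a third column.
sample = (
    "Case ID,Activity\n"
    "case1,Register claim\n"
    "case1,Check\n"
    "case1,File report, notify customer\n"
)
print(find_suspicious_rows(sample))  # [(4, 3)]
```

For a real file, the text could be read with `open(path, newline="")` and passed to the same function.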


How to fix:

Occasionally, the formatting problems have no impact on your data (for example, an extra comma at the end of some of the lines in your file). Or the number of lines impacted is so small that you choose to ignore it.

But in most cases you do need to fix it. Sometimes, it is enough to use “Find and Replace” in Excel to remove or replace a delimiter character in the content of your cells and export a new, cleaned CSV file that you then import. However, in most cases it will be easiest to point out the problem that you found to the person who extracted the data for you and ask them to give you a new file that avoids the problem.
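If you or your colleague generate the extract with a script, letting a CSV library do the quoting avoids the delimiter problem altogether. The sketch below uses Python's standard `csv` module with the example rows from above. The variable names are illustrative, and the snippet writes to an in-memory buffer rather than a real file.

```python
import csv
import io

rows = [
    ["Case ID", "Activity"],
    ["case1", "Register claim"],
    ["case1", "Check"],
    ["case1", "File report, notify customer"],
]

# csv.writer quotes fields that contain the delimiter (QUOTE_MINIMAL is the default).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original two columns in every row.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The last line of the written text is `case1,"File report, notify customer"`, so the comma inside the activity name is no longer mistaken for a column separator.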



Part II: Missing Data




2. No gaps in the timeline.

After you have imported your data, you can check the timeline in the ‘Events over time’ statistics to see whether there are any unusual gaps in the amount of data over your log timeframe.

The picture below shows an example where we had concatenated three separate files into one file before importing it in Disco. Clearly, something went wrong, and apparently all of the data from the second file is missing (see below).


How to fix:

If you made a mistake in the data pre-processing step, you can go back and make sure you include all the data there.

If you have received the data from someone else, you need to go back to that person and ask them to fix it.

If you have no way of obtaining new data, it is best to focus on an uninterrupted part of the data set (in the example above, that would be just the first or just the third part of the data). You can do that using the Timeframe filter in Disco.
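The same check can be scripted before the import. The following sketch looks for stretches without any events; the threshold of seven days and the sample dates are assumptions that you would adapt to your own process (a process that pauses over the holidays will have legitimate gaps).

```python
from datetime import date, timedelta


def find_gaps(event_dates, max_gap_days=7):
    """Return (last_day_before_gap, first_day_after_gap) pairs for every
    stretch longer than max_gap_days without any events."""
    days = sorted(set(event_dates))
    gaps = []
    for previous, current in zip(days, days[1:]):
        if (current - previous).days > max_gap_days:
            gaps.append((previous, current))
    return gaps


# Hypothetical data: January and March are present, February is missing.
events = [date(2015, 1, 1) + timedelta(days=i) for i in range(31)]
events += [date(2015, 3, 1) + timedelta(days=i) for i in range(31)]
print(find_gaps(events))
```

Each reported pair marks the boundaries of an uninterrupted part of the data set, which is useful when choosing the range for a timeframe restriction.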



3. Expected amount of data.

You should have an idea about (roughly) how many rows or cases of data you are importing. Take a look at the overview statistics to see whether they match up.

For example, the picture below shows a screenshot of the overview statistics of the BPI Challenge 2013 [4] data set. Can you see anything wrong with it? In fact, the total number of events is suspiciously close to the old Excel limit of 65,536 rows. And this is what happened: In one of the data preparation steps, the data (which had several hundred thousand rows) was opened with an old Excel version and saved again.

Of course, this is a bit more subtle than an obvious gap in the timeline and missing data can have all kinds of reasons. For some systems or databases, a very large data extract is aborted half-way without anyone noticing.


How to fix:

If you are missing data, you must find out whether you lost it in a data pre-processing step or in the data extraction phase.

If you have received the data from someone else, you need to go back to that person and ask them to fix it.

If you have no way of obtaining new data, try to get a good overview about which part of the data you got. Is it random? Was the data sorted and you got the first X rows? How does this impact your analysis possibilities (see also Chapter 10)? Some of the BPI Challenge submissions [4] noticed that something was strange and analyzed the data pattern to better understand what was missing.



Because missing data can be so subtle, it is a very good idea to have a sense of how much data you are expecting before you start with the import (ask the person who gives you the data how they structured their query).
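A simple sanity check on the row count can be automated. The sketch below compares the actual number of rows with the expected number and flags counts that sit right at a well-known spreadsheet row limit (65,536 rows for the old .xls format and 1,048,576 rows for .xlsx). The tolerance of 5% and the margin of two rows are assumptions, not fixed rules.

```python
# Well-known worksheet row limits of spreadsheet formats.
ROW_LIMITS = {
    65_536: "Excel 97-2003 (.xls) row limit",
    1_048_576: "Excel 2007+ (.xlsx) row limit",
}


def check_row_count(actual_rows, expected_rows, tolerance=0.05):
    """Return a list of warnings about the number of imported rows."""
    warnings = []
    if abs(actual_rows - expected_rows) > tolerance * expected_rows:
        warnings.append(
            f"expected about {expected_rows} rows, got {actual_rows}"
        )
    for limit, description in ROW_LIMITS.items():
        # The header row may or may not be counted, so allow a small margin.
        if 0 <= limit - actual_rows <= 2:
            warnings.append(f"row count is at the {description}")
    return warnings


# Hypothetical numbers: several hundred thousand rows were expected.
for warning in check_row_count(actual_rows=65_535, expected_rows=300_000):
    print(warning)
```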




4. Expected distribution of attribute values. No unexpected empty values.

Similarly, you should have an idea of the kind of attributes that you expect in your data. Did you request the data for all call center service requests for the Netherlands, Germany, and France from one month, but the volumes suggest that the data you got is mostly from the Netherlands?

Another thing to watch out for is empty values in your attributes. For example, the resource attribute statistics in the screenshot below show that 23% of the steps have no resource attached at all.

Empty values can also be normal. Talk to a process domain expert (see also Chapter 8) and someone who knows the information system to understand the meaning of the missing values in your situation.


How to fix:

If you have unexpected distributions, this could be a hint that you are missing data and you should go back to the pre-processing and extraction steps to find out why.

If you have empty attribute values, often these values are really missing and were never recorded well in the first place. It is not uncommon to discover data quality issues in your original data source during the process mining analysis, because nobody may have looked at that data the way you do.

Make sure you understand how these missing attribute values impact your analysis possibilities (see also Chapter 10). By showing the potential benefits of analyzing the data, you are creating an incentive for improving the data quality (and, therefore, increasing the analysis possibilities) over time.
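Both checks, the distribution of attribute values and the share of empty values, come down to counting. The following sketch computes the share of each value in an attribute column and reports empty strings and missing values under the label `<empty>`. The label, the function name, and the sample data are illustrative assumptions.

```python
from collections import Counter


def value_distribution(values):
    """Return {value: share}; empty strings and None count as '<empty>'."""
    cleaned = [v.strip() if isinstance(v, str) else v for v in values]
    counts = Counter(
        v if v not in ("", None) else "<empty>" for v in cleaned
    )
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical country column: mostly the Netherlands, although three
# countries were requested.
countries = ["NL"] * 8 + ["DE"] + ["FR"]
print(value_distribution(countries))

# Hypothetical resource column with missing entries.
resources = ["Anna", "", "Ben", None]
print(value_distribution(resources)["<empty>"])  # 0.5
```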



5. No cases with unexpected number of steps.

As a next check, you should look out for cases with a very high number of steps (see below). In the shown example, the callcenter data from the Disco demo logs [5] was imported with the Customer ID configured as the case ID.

What you find is that while a total of 3231 customer cases had up to a maximum of 30 steps, there is one case (Customer 3) that had a total of 583 steps over a timeframe of two months. That cannot be quite right, can it?

To investigate this further, you can right-click the case ID in the table and copy it to the clipboard (see below).




You can then paste the case ID from the clipboard to the search field in the Cases view to bring up that case (see below).

It turns out there are a lot of short inbound calls that were coming in at short intervals. A check with a domain expert confirms that this is not a real customer, but some kind of default customer ID that is assigned by the Siebel CRM system if no customer was created or associated by the callcenter agent (for example, because it was not necessary or because the customer hung up before the agent could capture their contact information). In fact, although there is technically a case ID associated, this is an example of missing case IDs. The real cases (the actual customers that called) are not captured. This will have an impact on your analysis. For example, analyzing the average number of steps per customer with this dummy customer in it will give you wrong results. You encounter similar problems if the case ID field is empty for some of your events (they will all be grouped into one case with the ID “empty”).


How to fix:

You can simply remove the cases with such a large number of steps in Disco (see below). Make sure you keep track of how many events you are removing from the data and how representative your remaining dataset still is after doing that.



To remove the very long case from the callcenter log above, you can add a Performance filter and change it to the ‘Number of events’ metric (see below).

You can then restrict the range and tick the ‘Apply filters permanently’ option to create a new reference data set:




The result will be a new log with the very long case removed and the filter permanently applied (you have a clean start):
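Outside of Disco, the same check can be sketched in a few lines: count the events per case, flag cases above a threshold, and record which share of the events would be removed. The threshold of 30 and the sample data mirror the callcenter example above but are otherwise assumptions.

```python
from collections import Counter


def cases_with_many_events(case_ids, threshold):
    """Return {case_id: number_of_events} for cases above the threshold."""
    counts = Counter(case_ids)
    return {case: n for case, n in counts.items() if n > threshold}


def share_of_events_removed(case_ids, removed_cases):
    """Return the fraction of all events that belong to the removed cases."""
    removed = sum(1 for case in case_ids if case in removed_cases)
    return removed / len(case_ids)


# Hypothetical case ID column: one default customer ID collects 583 events.
case_ids = ["Customer 1"] * 3 + ["Customer 2"] * 5 + ["Customer 3"] * 583
suspicious = cases_with_many_events(case_ids, threshold=30)
print(suspicious)  # {'Customer 3': 583}
print(share_of_events_removed(case_ids, suspicious))
```

Keeping the removed share next to the cleaned data set makes it easier to judge later whether the remaining data is still representative.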




Part III: Timestamp Problems




6. Expected timeframe. No unexpected long throughput times.

Now we are coming to the Achilles heel of process mining: Timestamps. The whole process discovery and the performance analysis are based on the timestamps in your data. So, if the timestamps are wrong, you are getting into problems. In the following, we will look at some typical problems and what you can do about them.

One typical example that you will almost certainly encounter is so-called zero timestamps, or other kinds of default timestamps that are given by the system (often, because an empty value is not an option, some default timestamp is given, similar to the system-assigned default customer ID before). These default timestamps then take the form of 1 January 1900 (see below), the Unix epoch timestamp 1 January 1970, or some future timestamp (like 2100) given by the programmer of your process management system to catch some kind of edge case.

Other reasons can be typos in manually entered timestamps (this is often an explanation if you find, for example, 2023 timestamps in your data).

You should know what timeframe you are expecting for your data set and then verify that the earliest and latest timestamps match the expected time period. Be aware that if you do not address a problem like the 1900 timestamp in the picture above, you will get case throughput times of more than 100 years!




To remove cases that fall outside of the expected date range, you can use the Timeframe filter in Disco. Again, it is recommended to apply the filter permanently, because you want to use the outcome as the new basis for your further analysis and do not expect to come back later to change it (see below).


How to fix:

Similar to the overly long cases (due to missing case IDs) from Chapter 5, you can remove cases that fall outside of your expected range from the data set directly in Disco and create a cleaned copy for further analysis (see below).

Again, make sure you keep track of how many cases you are removing from the data and how representative your remaining dataset still is after doing that. You may also want to communicate your findings back to the system administrator to find out how these timestamp problems can be avoided in the future.
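The idea behind removing cases outside of the expected timeframe can be sketched as follows. This is an illustration of the principle and not a description of how Disco implements its filter: a case is kept only if all of its timestamps fall into the expected range. The event tuples and the range are hypothetical.

```python
from datetime import datetime


def split_by_timeframe(events, start, end):
    """Split (case_id, timestamp) events into cases that lie completely
    inside [start, end] and cases that need to be reviewed."""
    by_case = {}
    for case_id, timestamp in events:
        by_case.setdefault(case_id, []).append(timestamp)
    keep, review = {}, {}
    for case_id, stamps in by_case.items():
        inside = all(start <= ts <= end for ts in stamps)
        (keep if inside else review)[case_id] = stamps
    return keep, review


# Hypothetical events: case2 contains a 1900 default timestamp.
events = [
    ("case1", datetime(2015, 1, 5, 9, 0)),
    ("case1", datetime(2015, 1, 6, 10, 0)),
    ("case2", datetime(1900, 1, 1)),
    ("case2", datetime(2015, 1, 7, 8, 0)),
]
keep, review = split_by_timeframe(
    events, datetime(2014, 1, 1), datetime(2015, 12, 31, 23, 59, 59)
)
print(sorted(keep), sorted(review))  # ['case1'] ['case2']

# The throughput time of case2 would span more than 100 years.
print(max(review["case2"]) - min(review["case2"]))
```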



7. No unexpected ordering in cases. No unexpected process flows.

One of the biggest problems with wrong timestamps is that they mess with the ordering relationships of your process. In process mining, the transitions between the activities in your process are determined by the sequence of the events in each case.

For example, if you look at another case with a 1900 timestamp (see below), then you see that the ‘Import…’ activity is the first one in the sequence, because the order of the activities in each case is determined by the timestamps (if two activities have the same timestamp, the original order in the file is preserved).

For all other variants the ‘STP: Generieke Hoofdworkflow’ activity is the first step in the process. The wrong ordering is not just reflected in the specific case sequence, but also shows up in the process map as an additional path with a very long average duration (see below).




Because the sequence of the activities in the cases, variants, and process map is determined by the timestamps, it is very important that you validate them with a process domain expert (see also Chapter 8).

Here are some more reasons why timestamps can show up in the wrong order and what to do about it.

Errors during import

Disco automatically detects most timestamps without you having to do anything. However, especially if you are dealing with timestamp problems, it is worth verifying that the timestamps are correctly configured during import.

You can do that by going back to the import screen, either by re-importing your data or by clicking the ‘Reload’ button in the project view. If you select the timestamp column, you can press the ‘Pattern…’ button in the top-right to see a few original timestamps (as they are in your file) and a preview of how Disco interprets them (in green, on the right side).

Another timestamp problem that can result from mistakes during the import step is that you may have accidentally configured some columns as timestamps that are not actually timestamp columns in the sense of a process mining timestamp (but, for example, indicate the birthday of the customer). As a consequence, activities can show up in parallel although they are in reality not happening at the same time.




Different granularity in timestamps

If your data comes from different systems, it can happen that you have different granularities for the timestamps. For example, in one system timestamps may be recorded at the level of seconds while in another one you only have a date but no time.

This can lead to unwanted ordering of activities, because the event with the date timestamp is interpreted with the time “00:00:00” and, therefore, will always show up before activities from the other system (that have a time) if they happen on the same day (even though in reality they may have occurred afterwards).

An example of such a mixed-granularity data set can be seen in the anonymized process below:


How to fix (errors during import):

If you find that the preview does not pick up the timestamps correctly, configure the right pattern for your timestamp column in the import screen. You can empty the ‘Pattern’ field and start typing the right pattern (use the legend on the right, and for more advanced patterns see the Java date pattern reference [5] for the precise notation). The green preview will be updated while you type, so that you can check whether you now have it right.

Also make sure that only the right columns are configured as timestamps: For each column, the current configuration is shown in the header. Look through all your columns and make sure only your actual timestamp columns are showing the little clock symbol that indicates the timestamp configuration.

Then, press the ‘Start import’ button again.
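Why the pattern matters can be illustrated with a date that is valid under two different patterns. The sketch below uses Python's `datetime.strptime`, whose directives (`%d`, `%m`, and so on) use a different notation from the Java patterns that Disco expects; it only demonstrates the ambiguity, and the sample values are made up.

```python
from datetime import datetime

raw = "03/04/2015 10:15"

# The same text is a valid timestamp under both patterns.
day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M")
month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M")
print(day_first.date(), month_first.date())  # 2015-04-03 2015-03-04


def parses_with(values, pattern):
    """Return True if every value can be parsed with the given pattern."""
    for value in values:
        try:
            datetime.strptime(value, pattern)
        except ValueError:
            return False
    return True


# Checking several values helps: day 25 cannot be a month.
samples = ["03/04/2015 10:15", "25/04/2015 08:00"]
print(parses_with(samples, "%d/%m/%Y %H:%M"))  # True
print(parses_with(samples, "%m/%d/%Y %H:%M"))  # False
```

This is also the reason to look at more than the first preview row: a wrong pattern may parse the first few timestamps without any error and still swap day and month.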



Different clocks

If your data comes from different systems, it can also happen that you get into problems due to the fact that these different systems run on different internal clocks. During the process mining analysis of an intervention management process, this was the case because parts of the data came from the mobile devices of the service employees, which had different clocks from the server [6].

Time of logging not the actual time

Some timestamp problems stem from the fact that the recorded timestamps do not actually reflect the time at which an activity was performed.

For example, a doctor may be walking around all day, speaking with patients, writing prescriptions, etc. And then at the end of the day she sits down in her office and writes up the performed tasks for the administrative system.


How to fix (different granularity in timestamps):

If you know the right sequence of the activities, it can make sense to ensure they are sorted correctly (Disco will respect the order in the file for same-date activities) and then initially analyze the process flow on the coarsest level of granularity. This will help you to be less distracted by those wrong orderings and to get a first overview of the process flows on that level.

You can do that by leaving out the hours, minutes and seconds from your timestamp configuration during import in Disco (just keep the date part).

Later on, you can go into the detailed analysis of parts of the process, where you bring up the level of detail back to the more fine-grained level to see how much time was spent between these different steps.
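The effect of mixed granularities, and of reducing all timestamps to the date, can be reproduced with a stable sort. In the sketch below, the event list is hypothetical and its order represents the (correct) order in the file. Sorting by the full timestamp moves the date-only event to the front, whereas sorting by the date alone preserves the file order for events on the same day, because Python's `sorted` is stable.

```python
from datetime import datetime

# (activity, timestamp) in file order; hypothetical data from two systems.
events = [
    ("Receive order", datetime(2015, 3, 2, 9, 30)),  # system A: with time
    ("Approve order", datetime(2015, 3, 2, 0, 0)),   # system B: date only
    ("Ship order", datetime(2015, 3, 3, 14, 5)),     # system A: with time
]

# Full timestamps: the date-only event is read as 00:00:00 and comes first.
by_full_timestamp = [a for a, _ in sorted(events, key=lambda e: e[1])]
print(by_full_timestamp)
# ['Approve order', 'Receive order', 'Ship order']

# Date only: same-day events keep their original order in the file.
by_date_only = [a for a, _ in sorted(events, key=lambda e: e[1].date())]
print(by_date_only)
# ['Receive order', 'Approve order', 'Ship order']
```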

How to fix (different clocks):

This is a really tricky problem and needs to be fixed in the original data or when merging the data from the different systems (by taking the clock offset into account).

If you cannot fix the problem in your data, then you might want to exclude cases that are wrong (for example, by excluding cases that show the wrong sequences). Be careful if you do this and keep track of how many cases you are excluding to see if the data are still representative (see also Chapter 10).
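If the offset between the clocks is known, or can be estimated, the correction during merging amounts to shifting the timestamps of each source onto one reference clock. The following sketch assumes a constant, measured offset per source; the source names, the offset of seven minutes, and the events are hypothetical. Clocks that drift over time would need a more careful treatment.

```python
from datetime import datetime, timedelta

# Assumed offsets (source clock minus server clock): the mobile devices
# in this made-up example run seven minutes behind the server.
CLOCK_OFFSETS = {
    "server": timedelta(0),
    "mobile": timedelta(minutes=-7),
}


def to_reference_time(timestamp, source):
    """Shift a timestamp onto the server clock using the source offset."""
    return timestamp - CLOCK_OFFSETS[source]


# (activity, source, recorded timestamp)
events = [
    ("Dispatch technician", "server", datetime(2014, 3, 1, 10, 0)),
    ("Arrive on site", "mobile", datetime(2014, 3, 1, 9, 58)),
]

raw_order = [a for a, _, _ in sorted(events, key=lambda e: e[2])]
corrected = sorted(
    ((a, to_reference_time(ts, src)) for a, src, ts in events),
    key=lambda e: e[1],
)
corrected_order = [a for a, _ in corrected]

print(raw_order)        # ['Arrive on site', 'Dispatch technician']
print(corrected_order)  # ['Dispatch technician', 'Arrive on site']
```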



Another example of a logged time that differs from the actual time is that someone records a check as completed right now, although it was actually done a week ago (and they simply forgot to enter it into the system).


How to fix (time of logging not the actual time):

By understanding the nature of the process and validating your data with a domain expert (see Chapter 8) you can take this into account for your analysis.

For example, if the doctor writes up her activities at the end of the day, you will know that it does not make sense to analyze her activities on the minute-level (even if the recorded timestamps carry the minutes, technically). If you know that certain activities are entered manually and you see high error rates, you can discard them as not being reliable enough.



Part IV: Data Validation Session




8. Data validation session with process/domain expert done.

Because trust that is lost by showing results based on wrong data can often never be won back, we highly recommend planning a data validation session with a process expert as a part of the analysis phase in your project (see below).

You can set this up as a preparation step for the actual analysis. Communicate that the purpose of the session is explicitly not yet to analyze the process, but to ensure that the data quality is good before you proceed with the analysis itself.

Ideally, you can ask both a domain expert and a data expert to participate in the session, but especially the input of the domain expert is needed here to spot problems in the data from the perspective of the process owner for whom you are performing the analysis in the end (you can book a separate meeting with a data expert to walk through your questions). Ideally, you have access to the source system to look up individual cases together if needed.

With respect to the validation session with the domain expert, we recommend the following:

• Start by explaining briefly what process mining is. Show up to a maximum of 5 slides and consider giving a very short demo with a clean and simple example. Unless they have recently participated in a presentation about process mining, you should assume that they either do not know what process mining is at all or only have a vague idea.

• Then, restate the purpose of the session and explain that you want to validate the data with them and collect potential issues and questions on the way.

• Consider asking them to draw a (very rough) process map of the process from their perspective with up to a maximum of 7 steps at a flip-chart or whiteboard. This will be useful as a reference point, when you are trying to understand the meaning of certain process steps later on in the meeting.

• Show them the data in raw format (for example, in Excel) and explain where you got the data and how it was extracted. Point out the Case ID, Activity, and Timestamp columns that you are using.

• Then, import the data in front of their eyes and go over the summary information (showing the timeframe of the data, the attributes, etc.) and after looking at the process map inspect the top variants and look at example cases together with them. Ask them: “Does this make sense to you?” and write down any issues that they mention.




• If you find strange patterns in the process behavior, filter the data to get to some example cases for further context. Simplify the process map if needed [7] and interactively look into the issues that you find together. Try to find answers to questions right in the session if possible and otherwise write them up as action points.

• Look up a few cases in the original system together (many systems allow you to search by case number, or customer number, and inspect the history of an individual case) if you can and compare them with the case sequences that you find in Disco to see whether they match up as expected.

• You may find that the process expert brings up questions about the process that are relevant for the analysis itself. This is great and you should write them down, but do not get side-tracked by the analysis and steer the session back to your data quality checklist to make sure you get all questions answered.

• In some cases, it can be an option to go even further and follow a few cases in real life (by observing and speaking with the people performing actual activities in the process). This is great and can help with mapping the logged data to the real process for your analysis. Make sure that you write down the case IDs and timestamps of the activities that you are observing, so that you are able to find them again in the data extract afterwards and see how they were recorded.




Part V: Document and Assess




9. Documented all quality issues and data questions.

Not just in the data validation session, but all along from starting to look at the data yourself, make sure you record all questions about the data and potential data quality issues that you find.

You can categorize the problems (see [8] for a description of 27 classes of event log quality issues) or simply walk through the checklist and write down all the issues in a Word document or Excel file.

Issue No.   Description                          Actions                            Status
1           1900 timestamps found in data set    Removed 2 (out of 12,345) cases    done
2
3
4




10. Make sure the remaining data set is still representative.

Finally, also keep track of which of your original process questions may be affected by the data quality issues that you found and the actions that you have taken, or intend to take, to fix them.

Save the intermediate steps of your data cleaning actions and label them properly, so that you can understand what you did later on. Work with copies when applying filters in Disco, and also save your Disco project files along with the project documentation.

If you have to remove part of the data due to data quality problems, carefully check how much data you are excluding and which other effects the cleaning can have on the data. If the data ends up not being representative anymore, because it only reflects a small part of the process, then either discard the results or make sure that you clearly communicate the data basis that you are using when presenting the results.
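One way to check the effect of the cleaning on the data is to compare the distribution of an important attribute before and after. The sketch below reports the attribute value whose share changed the most. It is only one possible indicator, the function names and sample data are assumptions, and it expects a non-empty original data set.

```python
from collections import Counter


def shares(values):
    """Return {value: share of all values}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def largest_shift(before, after):
    """Return (value, shift) for the attribute value whose share changed
    the most between the original and the cleaned data set."""
    share_before, share_after = shares(before), shares(after)
    shifts = {
        value: share_after.get(value, 0.0) - share_before[value]
        for value in share_before
    }
    value = max(shifts, key=lambda v: abs(shifts[v]))
    return value, shifts[value]


# Hypothetical country attribute per case: the cleaning removed 18 of the
# 20 French cases, so France is now underrepresented.
before = ["NL"] * 50 + ["DE"] * 30 + ["FR"] * 20
after = ["NL"] * 50 + ["DE"] * 30 + ["FR"] * 2
value, shift = largest_shift(before, after)
print(value, round(shift, 3))
```

A large shift for one value is a hint that the cleaned data set reflects only part of the process and that this limitation should be stated when presenting the results.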





Was this guide useful to you?

Please get in touch via [email protected] and let us know how it worked for you and which other data issues you encountered! We would love to hear from you.

The Data Quality Guide was created as part of the bi-monthly Process Mining News initiative (you can register for free at http://fluxicon.com/s/pmnews to make sure you do not miss any future editions).

The Process Mining News is brought to you by Fluxicon. Founded in 2009 by Dr. Anne Rozinat and Dr. Christian W. Günther, Fluxicon has been at the forefront of the process mining movement ever since. Our process mining software Disco is based on proven scientific research, and loved by professionals worldwide for setting the gold standard in performance and user experience.

As the most experienced process mining team in industry, Fluxicon supplies dozens of large companies, many of them in the Global Fortune 100 and Fortune 500 ranks.

Fluxicon also organizes the annual process mining conference, Process Mining Camp (www.processminingcamp.com), helps raise the visibility of process mining as a new data analysis method through numerous invited talks and articles, and supports more than 250 universities through the Fluxicon Academic Initiative (http://fluxicon.com/academic/).

References

[1] Anne Rozinat. Data Requirements for Process Mining, Fluxicon blog, 2012. URL: http://fluxicon.com/blog/2012/02/data-requirements-for-process-mining/

[2] Fluxicon. Data Extraction Guide for Process Mining, Disco User’s Guide. URL: http://fluxicon.com/disco/files/Disco-User-Guide.pdf

[3] TBurette DevBlog. So You Want To Write Your Own CSV code?, 2014. URL: http://tburette.github.io/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/

[4] BPI Challenge 2013. URL: http://www.win.tue.nl/bpi/2013/challenge

[5] Java SimpleDateFormat pattern reference and examples. URL: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

[6] Walter Vanherle. Case study: Process Mining to Improve the Intervention Management Process at a Security Services Company, Fluxicon blog, 2014. URL: http://fluxicon.com/blog/2014/03/case-study-process-mining-to-improve-the-intervention-management-process-at-a-security-services-company/

[7] Anne Rozinat. Managing Complexity in Process Mining, Fluxicon blog, 2015. URL: http://fluxicon.com/blog/2015/03/managing-complexity-in-process-mining-quick-simplification-methods/

[8] JC Bose, Ronny Mans, and Wil van der Aalst. Wanna Improve Process Mining Results? It’s High Time We Consider Data Quality Issues Seriously. BPMCenter Report 02 2013. URL: http://bpmcenter.org/wp-content/uploads/reports/2013/BPM-13-02.pdf