automated correlation discovery for semi-structured business processes

Building a Smarter Planet

Automated Correlation Discovery for Semi-Structured Business Processes

DEBS 2011

Szabolcs Rozsnyai, Aleksander Slominski, Geetika T. Lakshmanan


Agenda

• Motivation
• Big Picture and Context
• Related Work
• Algorithm (with Examples)
  – Data Pre-Processing
  – Statistics Calculation
  – Determining Correlation Candidates
• Screenshots of prototype application
• Conclusion & Future Work


Motivation

• Event-producing systems are
  – distributed,
  – changing rapidly,
  – federated,
  – loosely coupled,
  – generating huge numbers of events.

• Correlating events requires a lot of knowledge about the source systems and their data.

We present a novel algorithm to automatically determine correlation rules for the purposes of monitoring, discovery, and other applications.


Solution Overview

1. Correlation rules are common identifiers defined as a correspondence between the attributes of two different types.
   – Correlation rule example: A.x = B.y, where A and B are types and x and y are attributes.

2. The correlation rules are determined by a unique combination of statistics applied to event attributes. Several attribute statistics are taken into account to improve the precision of the correlation candidate detection and to calculate a confidence score.

3. The algorithm requires no prior knowledge about the structure of artifacts (the event format could be anything, e.g. XML) or about the data types of their attributes, nor does it require a normalized organization of artifacts.

4. The confidence score quantifies the significance of a correlation rule.

5. Correlation rules discovered by our algorithm can be used either at runtime to group related artifacts together, such as events belonging to a process instance, or to create a graph of relationships that enables querying and walking the paths of relationships.
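A correlation rule of the form A.x = B.y can be sketched as a small data structure. This is only an illustration: the class name `CorrelationRule`, the method `matches`, and the representation of events as plain dictionaries with a "type" key are assumptions, not part of the presented prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrelationRule:
    """A.x = B.y: attribute x of type A corresponds to attribute y of type B."""
    left_type: str
    left_attr: str
    right_type: str
    right_attr: str

    def matches(self, a: dict, b: dict) -> bool:
        # Events are assumed to be dicts with a "type" key plus arbitrary attributes.
        if a.get("type") != self.left_type or b.get("type") != self.right_type:
            return False
        left = a.get(self.left_attr)
        # A missing attribute never correlates.
        return left is not None and left == b.get(self.right_attr)


rule = CorrelationRule("OrderReceived", "OrderId", "ShipmentCreated", "OrderId")
order = {"type": "OrderReceived", "OrderId": "166635", "Product": "ProductA"}
shipment = {"type": "ShipmentCreated", "ShipmentId": "253355", "OrderId": "166635"}
other = {"type": "ShipmentCreated", "ShipmentId": "253356", "OrderId": "166636"}

print(rule.matches(order, shipment))  # the two events share OrderId 166635
print(rule.matches(order, other))     # different OrderId
```

At runtime, such a rule could be applied to pairs of incoming events to group those that belong to the same process instance.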







Big Picture and Context







Related Work

• Motahari Nezhad et al. (HP)
  – Their approach mainly takes instance-based measures into account to determine the “interestingness” of correlation pairs (and groups of pairs).
  – That is, they first prune the large space of potential correlation pairs using techniques similar to DePauw et al. and then correlate the data with this large set of correlation rules to generate various correlated instances. They then apply certain statistics to the instances to determine whether the correlation rules make sense.

• DePauw et al. (IBM)
  – The work by DePauw et al. has at its core a certain similarity to our algorithm. For instance, we also take the notion of Indexable and Mappable Paths into account, but mainly to reduce the problem space of candidate-pair permutations that need to be checked against each other for potential correlations. In our algorithm this step is optional; without it, every attribute of a type is matched against every attribute of another type.
  – In addition, our correlation algorithm takes several attribute-based statistics into account to improve the precision of the correlation candidate detection and also calculates a confidence score based on those statistics.

• CORDS (IBM) is a tool that uses statistical methods to discover correlations and soft functional dependencies between database columns and produces a dependency graph to improve the performance of query optimizers.
  – In the database world, detailed knowledge about the data is available, defined either in the schema or in metadata. This means that there are defined relations and attributes whose types (e.g. integer, string, timestamp, …) are known.
  – A key difference between our algorithm and other approaches is that it does not assume that artifacts are grouped together in a normalized schema, nor does it have any metadata that describes an artifact's attributes.







Overview

Our algorithm for correlation discovery is divided into three major steps:

• Data Pre-Processing.
  – The first step of the correlation discovery process is to load and integrate the data into a data store (e.g. database, cloud storage, etc.) that is then used to calculate statistics and determine correlation candidates.

• Statistics Calculation.
  – After the data has been loaded and integrated into the internal representation, various statistics, mainly on attribute values, are calculated and stored in a data structure that allows fast access.

• Determining Correlation Candidates.
  – In the last step the correlation discovery algorithm determines correlation pairs with a certain confidence value based on the previously calculated statistics.


Data Pre-Processing

• Raw events are stored in a data store.
• Attributes of events are extracted (the method of extraction is out of scope here).
• Events have a type assigned (e.g. OrderReceived, ShipmentCreated, TransportStarted, …).

Figure (labelled CommonAlias): each raw event is stored with its event type and its extracted event attributes.

Raw event:        Key = 32123…, Timestamp = 2011-01-01T09:35:52.50, Type = OrderReceived, Raw = <OrderReceived…
Event attributes: DateTime = 2011-01-01T09:35:52.50, OrderId = 166635, Product = ProductA, …

Raw event:        Key = 213131…, Timestamp = 2011-01-01T09:40:54.50, Type = ShipmentCreated, Raw = <ShipmentCreated…
Event attributes: DateTime = 2011-01-01T09:31:52.50, ShipmentId = 253355, OrderId = 166635, …
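The stored representation can be sketched as follows. This is a minimal sketch, not the prototype's data model: the names `StoredEvent` and `columns_by_type`, the keys "k1"/"k2", and the raw payload strings are hypothetical placeholders, and attribute extraction is assumed to have happened already.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StoredEvent:
    """One stored row: the raw event plus its extracted attributes."""
    key: str          # hypothetical storage key
    timestamp: str
    type: str         # event type, e.g. "OrderReceived"
    raw: str          # raw payload; its format is not interpreted here
    attributes: dict = field(default_factory=dict)


def columns_by_type(events):
    """Collect the values of every attribute, grouped by (event type, attribute name)."""
    columns = defaultdict(list)
    for event in events:
        for name, value in event.attributes.items():
            columns[(event.type, name)].append(value)
    return dict(columns)


events = [
    StoredEvent("k1", "2011-01-01T09:35:52.50", "OrderReceived", "<OrderReceived/>",
                {"DateTime": "2011-01-01T09:35:52.50", "OrderId": "166635", "Product": "ProductA"}),
    StoredEvent("k2", "2011-01-01T09:40:54.50", "ShipmentCreated", "<ShipmentCreated/>",
                {"DateTime": "2011-01-01T09:31:52.50", "ShipmentId": "253355", "OrderId": "166635"}),
]
columns = columns_by_type(events)
print(columns[("ShipmentCreated", "OrderId")])
```

Grouping values per (type, attribute) gives the statistics step one list of values per attribute to work on, without requiring a schema.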


Statistics Calculation 1/2


Statistics Calculation 2/2

• Attribute Cardinality (Index): Contains a map of each value and how often each of those values occurred.

• Card: The number of different values of the attribute.

• Cnt: The total number of instances in which the attribute occurs. As the data structure does not rely on a defined schema, an attribute may not occur in every instance.

• AvgAttributeLength: The average length of the attribute's values. This is an indicator of the potential uniqueness of a value: a long value may be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is likely to occur in other types and thus to form a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artifacts.

• InferencedType: Defines the type of an attribute. The type is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates: the chances that two attributes correlate are very low if their types differ, for example if one of them contains mostly alphanumeric values. The type is determined with a tolerance of 0.9 (e.g. at least 90% of the values must be numeric); we refer to this parameter as Phi. The following type distinctions are currently supported:
  – Numeric or Alphanumeric
  – Timestamp/DateTime
  – Boolean
  – Description text

• NoOfNumeric: Depending on the InferencedType, contains the number of values that are of a numeric type.

• NoOfAlphaNum: Depending on the InferencedType, contains the number of values that are of an alphanumeric type.
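The value-based statistics can be computed in a single pass over an attribute's values. The following is a minimal sketch; the function name `attribute_statistics` and the choice to treat `None` as a missing attribute are assumptions.

```python
from collections import Counter


def attribute_statistics(values):
    """Compute Index, Card, Cnt and AvgAttributeLength for one attribute."""
    # Instances in which the attribute is missing (None) are not counted.
    present = [str(v) for v in values if v is not None]
    index = Counter(present)  # value -> number of occurrences
    cnt = len(present)
    return {
        "Index": dict(index),
        "Card": len(index),
        "Cnt": cnt,
        "AvgAttributeLength": sum(len(v) for v in present) / cnt if cnt else 0.0,
    }


# Product column of the OrderReceived example
product = ["ProductA", "ProductB", "ProductC", "ProductD", "ProductA"]
stats = attribute_statistics(product)
print(stats["Card"], stats["Cnt"], stats["AvgAttributeLength"])  # 4 5 8.0
```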


Example

OrderReceived
DateTime                OrderId  Product   Amount  DeliveryUntil           CustomerId
2011-01-01T09:35:52.50  166635   ProductA  10      2011-01-10T23:59:59.00  46546546
2011-01-01T09:40:54.50  166636   ProductB  2       2011-01-10T23:59:59.00  41231234
2011-01-01T09:41:51.30  166637   ProductC  1       2011-01-10T23:59:59.00  46123123
2011-01-01T09:43:32.50  166638   ProductD  7       2011-01-10T23:59:59.00  72123123
2011-01-01T09:43:42.50  166639   ProductA  2       2011-01-10T23:59:59.00  12312544

ShipmentCreated
DateTime                ShipmentId  OrderId  Carrier
2011-01-01T09:31:52.50  253355      166635   IntlAirCargo
2011-01-01T09:41:54.50  253356      166636   InTimeTruck Ltd
2011-01-01T09:42:51.30  253357      166637   InTimeTruck Ltd
2011-01-01T09:44:32.50  253358      166638   IntlAirCargo
2011-01-01T09:44:42.50  253359      166639   IntlAirCargo

TransportStarted
DateTime                TransportId  ShipmentId  StartLocation
2011-01-02T07:00:00.00  8889994      253355      New York
2011-01-02T07:00:00.00  8889995      253356      New York
2011-01-02T07:00:00.00  8889996      253357      New York
2011-01-02T07:00:00.00  8889997      253358      New York
2011-01-02T07:00:00.00  8889998      253359      New York

TransportEnded
DateTime                TransportId  ShipmentId  EndLocation
2011-01-02T14:35:52.50  8889994      253355      Miami
2011-01-02T15:41:54.50  8889995      253356      Washington D.C.
2011-01-02T11:42:51.30  8889996      253357      Boston
2011-01-01T12:44:32.50  8889997      253358      Baltimore
2011-01-01T11:33:42.50  8889998      253359      Chicago


Example - Index

OrderReceived       DateTime  OrderId  Product       Amount   DeliveryUntil  CustomerId
Index               <<Map>>   <<Map>>  <<Map>>       <<Map>>  <<Map>>        <<Map>>
Card                5         5        4             4        1              5
Cnt                 5         5        5             5        5              5
AvgAttributeLength  22        6        8             1.2      22             6
InferencedType      DateTime  Numeric  Alphanumeric  Numeric  DateTime       Numeric
NoOfNumeric         0         5        0             5        0              5
NoOfAlphaNumeric    0         0        5             0        0              0

Product_Index
Value     Cardinality
ProductA  2
ProductB  1
ProductC  1
ProductD  1

The attribute cardinality (i.e. Index) contains a map of each value and how often each of those values occurred.


Example - Card

Card determines the number of different values for the attribute.

Callout in the figure: 4 unique values


Example - Cnt

Cnt represents the total number of instances in which the attribute occurs. As the data structure does not rely on a defined schema, an attribute may not occur in every instance.

Cnt = 5. For certain attributes the number might be smaller, as they can be null or missing.


Example - AvgAttributeLength

AvgAttributeLength is calculated as the average length of the attribute's values. This is an indicator of the potential uniqueness of a value: a long value may be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is likely to occur in other types and thus to form a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artifacts.


Example - InferencedType

InferencedType determines the data type of an attribute. The type is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates: the chances that two attributes correlate are very low if their types differ, for example if one of them contains mostly alphanumeric values. The type is determined with a tolerance of 0.9 (e.g. at least 90% of the values must be numeric); we refer to this parameter as Phi.
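Type inference with the tolerance parameter Phi can be sketched as below. The order of the checks, the timestamp format string, the Boolean literals, and the fallback to Alphanumeric are assumptions; the detection of description texts is omitted because the slides do not say how it is done.

```python
import re
from datetime import datetime

NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")


def is_datetime(value: str) -> bool:
    # Assumed timestamp layout, matching the example data (e.g. 2011-01-01T09:35:52.50).
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
        return True
    except ValueError:
        return False


def infer_type(values, phi=0.9):
    """Return the first type that at least a fraction phi of the values satisfies."""
    values = [str(v) for v in values if v is not None]
    if not values:
        return "Alphanumeric"
    checks = [
        ("DateTime", is_datetime),
        ("Boolean", lambda v: v.lower() in ("true", "false")),
        ("Numeric", lambda v: NUMERIC.fullmatch(v) is not None),
    ]
    for name, predicate in checks:
        if sum(1 for v in values if predicate(v)) / len(values) >= phi:
            return name
    return "Alphanumeric"  # assumed fallback when no other type reaches phi


print(infer_type(["166635", "166636", "166637", "166638", "166639"]))  # Numeric
print(infer_type(["ProductA", "ProductB", "ProductC"]))                # Alphanumeric
```

With Phi = 0.9, a column with nine numeric values and one stray alphanumeric value is still classified as Numeric, which is the kind of deviation the tolerance is meant to absorb.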


Example – The rest of the types…

ShipmentCreated     DateTime  ShipmentId  OrderId  Carrier
Index               <<Map>>   <<Map>>     <<Map>>  <<Map>>
Card                5         5           5        3
Cnt                 5         5           5        5
AvgAttributeLength  22        6           6        13.2
InferencedType      DateTime  Numeric     Numeric  Alphanumeric
NoOfNumeric         0         5           5        0
NoOfAlphaNumeric    0         0           0        5

TransportStarted    DateTime  TransportId  ShipmentId  StartLocation
Index               <<Map>>   <<Map>>      <<Map>>     <<Map>>
Card                1         5            5           1
Cnt                 5         5            5           5
AvgAttributeLength  22        7            6           8
InferencedType      DateTime  Numeric      Numeric     Alphanumeric
NoOfNumeric         0         5            5           0
NoOfAlphaNumeric    0         0            0           5

TransportEnded      DateTime  TransportId  ShipmentId  EndLocation
Index               <<Map>>   <<Map>>      <<Map>>     <<Map>>
Card                5         5            5           1
Cnt                 5         5            5           5
AvgAttributeLength  22        7            6           8.4
InferencedType      DateTime  Numeric      Numeric     Alphanumeric
NoOfNumeric         0         5            5           0
NoOfAlphaNumeric    0         0            0           5


Determining Correlation Candidates

• The confidence score of correlation candidates is determined by the following three parameters with a default set of weights.
  – Set Difference. The set difference determines the difference between two correlation candidates and is assigned a weight of 60%.
  – Difference between AvgAttributeLength. The difference between the lengths of values of two correlation candidates is assigned a weight of 20%.
  – LevenshteinDistance. The Levenshtein distance between attribute names is assigned a weight of 20%.


Difference Set 1/2

• The first confidence score is calculated by creating the difference set of all permutations of pairs of all attribute candidates.

• To reduce the search space of candidates we apply an approach similar to [1][2]: we first determine the so-called Highly Indexable Attributes for each type and then the Mappable Attributes to form pair candidates.

• Highly Indexable Attribute
  A Highly Indexable Attribute is an attribute that is potentially unique for each instance of a type. It is determined by the following condition:

  Card / Cnt > Alpha  and  AvgAttributeLength > Epsilon

  – Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused, for instance, by duplicates.
  – Epsilon is an additional parameter that defines the minimum average length of an attribute.

• Mappable Attribute
  The Mappable Attribute can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs more than Gamma times, the attribute is unlikely to be a correlation candidate.

  { x_i | x < Gamma }, where x is the cardinality of a value and i is an attribute of a type

  – Gamma is a threshold parameter that can be set experimentally and customized to the application scenario based on knowledge of the artifacts.

[1] I. Ilyas, V. Markl, P. Haas, P. Brown. (2004). CORDS: Automatic discovery of correlations and soft functional dependencies.
[2] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, and U. Leser. (2009). A Machine Learning Approach to Foreign Key Discovery, (WebDB).


Example – Determining Highly Indexables

Card / Cnt > Alpha  and  AvgAttributeLength > Epsilon

• Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused, for instance, by duplicates.

• Epsilon is an additional parameter that defines the minimum average length of an attribute.

Calculate Card / Cnt for each attribute of OrderReceived:

DateTime  OrderId  Product  Amount  DeliveryUntil  CustomerId
1         1        0.8      0.8     0.2            1


Example – Determining Highly Indexables

Card / Cnt > Alpha  and  AvgAttributeLength > Epsilon

Card / Cnt for each attribute of OrderReceived:

DateTime  OrderId  Product  Amount  DeliveryUntil  CustomerId
1         1        0.8      0.8     0.2            1

• Card / Cnt > Alpha, where Alpha = 0.9
• AvgAttributeLength > Epsilon, where Epsilon = 5
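The check for Highly Indexable Attributes can be sketched with the thresholds from this example (Alpha = 0.9, Epsilon = 5). The function name `is_highly_indexable` is an assumption; the input tuples are the Card, Cnt and AvgAttributeLength values of OrderReceived from the statistics table.

```python
def is_highly_indexable(card, cnt, avg_length, alpha=0.9, epsilon=5):
    """Card / Cnt > Alpha and AvgAttributeLength > Epsilon."""
    if cnt == 0:
        return False  # an attribute that never occurs cannot be indexable
    return card / cnt > alpha and avg_length > epsilon


# (Card, Cnt, AvgAttributeLength) per attribute of OrderReceived
order_received = {
    "DateTime": (5, 5, 22),
    "OrderId": (5, 5, 6),
    "Product": (4, 5, 8),
    "Amount": (4, 5, 1.2),
    "DeliveryUntil": (1, 5, 22),
    "CustomerId": (5, 5, 6),
}
indexable = [name for name, stats in order_received.items() if is_highly_indexable(*stats)]
print(indexable)  # ['DateTime', 'OrderId', 'CustomerId']
```

DeliveryUntil shows why both conditions are needed: its values are long (22 characters) but not unique (Card / Cnt = 0.2), so it is rejected.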


Example – Determining Mappables

The Mappable Attribute can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs more than Gamma times, the attribute is unlikely to be a correlation candidate.

{ x_i | x < Gamma }, where x is the cardinality of a value and i is an attribute of a type

Card < Gamma, where Gamma = 10

For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer can certainly have more than 10 orders).

ShipmentCreated     DateTime  ShipmentId  OrderId  Carrier
Index               <<Map>>   <<Map>>     <<Map>>  <<Map>>
Card                5         5           5        3
Cnt                 5         5           5        5
AvgAttributeLength  22        6           6        13.2
InferencedType      DateTime  Numeric     Numeric  Alphanumeric
NoOfNumeric         0         5           5        0
NoOfAlphaNumeric    0         0           0        5
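One possible reading of the Mappable Attribute check is sketched below: an attribute is kept if no single value occurs Gamma times or more. This interpretation of "cardinality of a value" as the per-value count in the Index, and the function name `is_mappable`, are assumptions.

```python
from collections import Counter


def is_mappable(values, gamma=10):
    """Keep the attribute if every value occurs fewer than gamma times."""
    index = Counter(v for v in values if v is not None)
    if not index:
        return False  # no values, nothing to correlate
    return max(index.values()) < gamma


carrier = ["IntlAirCargo", "InTimeTruck Ltd", "InTimeTruck Ltd", "IntlAirCargo", "IntlAirCargo"]
print(is_mappable(carrier))            # True: no value occurs 10 times or more
print(is_mappable(["New York"] * 12))  # False: one value occurs 12 times
```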


Difference Set 2/2

• After determining all the Indexable and Mappable Attributes of all types, the next step is to find candidate pairs of attributes that potentially correlate with each other.

• Therefore a difference set A/B = {x | x ∈ A and x ∉ B} is created for all permutations of attribute candidates A and B.

• The size of A/B must not exceed a certain threshold for the pair to be taken into account:

  |A/B| <= DiffTreshold

• Candidate pairs of the permutations are excluded if their types, based on the previously determined InferencedType, do not match.
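The difference-set test can be sketched as follows. The slides report the set difference as a percentage, so this sketch assumes that |A/B| is normalized by the number of distinct values in A; the function names and the parameter `diff_threshold` (standing in for DiffTreshold) are assumptions as well.

```python
def set_difference_ratio(a_values, b_values):
    """Fraction of the distinct values of A that do not occur in B."""
    a, b = set(a_values), set(b_values)
    if not a:
        return 1.0  # assumed convention: an empty attribute shares nothing
    return len(a - b) / len(a)


def is_candidate(a_values, b_values, diff_threshold=0.95):
    return set_difference_ratio(a_values, b_values) <= diff_threshold


order_ids = ["166635", "166636", "166637", "166638", "166639"]           # OrderReceived.OrderId
shipment_order_ids = ["166635", "166636", "166637", "166638", "166639"]  # ShipmentCreated.OrderId
shipment_ids = ["253355", "253356", "253357", "253358", "253359"]        # ShipmentCreated.ShipmentId

print(set_difference_ratio(order_ids, shipment_order_ids))  # 0.0
print(set_difference_ratio(order_ids, shipment_ids))        # 1.0
```

Note that the ratio is not symmetric: A/B and B/A can differ, which is why permutations rather than combinations of attributes are considered.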


Example

• Indexable Attributes
  – OrderReceived
    • DateTime
    • OrderId
    • CustomerId
  – ShipmentCreated
    • DateTime
    • ShipmentId
    • OrderId
  – TransportStarted
    • TransportId
    • ShipmentId
  – TransportEnded
    • DateTime
    • TransportId
    • ShipmentId

• Mappable Attributes
  – To reduce the complexity of the examples, every attribute in our scenario is considered a Mappable Attribute, as the total number of instances is lower than the threshold.

DateTime attributes are excluded, as timestamps are of a type that is not suitable for correlation pairs. This also applies to booleans and description texts.


Example – DifferenceSet for all Permutations

A/B = {x | x ∈ A and x ∉ B}
|A/B| <= DiffTreshold, where DiffTreshold = 0.95

Candidate pair                                         SetDiff
OrderReceived.OrderId = ShipmentCreated.ShipmentId     100%
OrderReceived.OrderId = ShipmentCreated.OrderId        0%
OrderReceived.OrderId = TransportStarted.TransportId   100%
OrderReceived.OrderId = TransportStarted.ShipmentId    100%
…

Resulting candidates of correlation pairs with 100% overlap:

Candidate pair                                              SetDiff
OrderReceived.OrderId = ShipmentCreated.OrderId             0%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId    0%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId      0%
TransportStarted.TransportId = TransportEnded.TransportId   0%
TransportEnded.TransportId = TransportStarted.TransportId   0%


Example – DifferenceSet for all Permutations

A/B = {x | x ∈ A and x ∉ B}
|A/B| <= DiffTreshold

• A difference often occurs, especially when processes are not completed, have been prematurely terminated or aborted, or when events are not always generated because of decision forks. Bear in mind that this is a very simplified example!

• Pairs that are associative are removed!

• In this case every pair has the same type. In practice this is not the case! Pairs whose attributes are not of the same type are excluded from the permutation set, so their difference set is not calculated.

SetDiff of the five resulting candidate pairs: 0%, 0%, 0%, 0%, 0%


Difference between AvgAttributeLength

• The second weighting factor for the confidence is the difference between the AvgAttributeLength of the two correlation candidates.

• If the average attribute lengths of the two candidates differ strongly, the attributes are unlikely to share a significant relationship.


Example – AvgAttributeLength

Values for the five candidate pairs:

SetDiff        0%  0%  0%  0%  0%
AvgAttrLength  0   0   0   0   0


LevenshteinDistance

• The last variable that influences confidence weighting is the Levenshtein distance between the names of two attributes.

• It is common that attribute names from different sources might have the same or comparable names if they have the same meaning.

• For example, in one system the attribute that contains the identifier for an order is named OrderId and in the other it is named order-id.
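The Levenshtein distance between two attribute names can be computed with the standard dynamic-programming recurrence. This is a generic textbook implementation, not code from the prototype; lowercasing the names before comparison is shown only as an optional assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("OrderId", "order-id"))                  # 3 (two case changes, one hyphen)
print(levenshtein("OrderId".lower(), "order-id".lower()))  # 1 (only the hyphen)
```

Normalizing case before comparing is one way to make names from different systems look more alike; whether the presented algorithm does so is not stated in the slides.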


Example – LevenshteinDistance

Values for the five candidate pairs:

SetDiff              0%  0%  0%  0%  0%
AvgAttrLength        0   0   0   0   0
LevenshteinDistance  0   0   0   0   0


Example – Weight Calculation

Parameter            Weight  Values for the five candidate pairs
SetDiff              60%     0%    0%    0%    0%    0%
AvgAttributeLength   20%     0     0     0     0     0
LevenshteinDistance  20%     0     0     0     0     0

Confidence                   100%  100%  100%  100%  100%

The weights are adjustable!
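The weighted combination can be sketched as follows. The slides give the weights (60%, 20%, 20%) and show that three zero differences yield a confidence of 100%, but not how each component is normalized; this sketch assumes every component has already been scaled to the range [0, 1], where 0 means identical, and turns it into a similarity before weighting. The function name `confidence` is hypothetical.

```python
def confidence(set_diff, avg_len_diff, name_distance, weights=(0.6, 0.2, 0.2)):
    """Weighted confidence in [0, 1]; all inputs are assumed normalized to [0, 1]."""
    components = (set_diff, avg_len_diff, name_distance)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("components must be normalized to [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # A difference of 0 contributes the full weight, a difference of 1 contributes nothing.
    return sum(w * (1.0 - c) for w, c in zip(weights, components))


print(round(confidence(0.0, 0.0, 0.0) * 100))  # 100, as in the example above
print(round(confidence(1.0, 0.0, 0.0) * 100))  # 40: only the two 20% weights remain
```

Because the weights are parameters, a deployment that distrusts attribute names could, for instance, lower the third weight and raise the first.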







Correlation Discovery


Correlation Discovery Refinement







Evaluation, Conclusion and Future Work

Export compliance regulation

• Wide range of heterogeneous systems
  – Order Management,
  – Document Management,
  – E-Mail,
  – Export Violation Detection Services,
  – Workflow-supported human-driven interactions (Process Management System).
• 24 EventTypes
• 95 Attributes

Precision: 99.56%

Precision = No.of.RelevantCorrelationRules / (No.of.RelevantCorrelationRules + FalsePositives) * 100

False positive example: correlation by “orderVolume”. Its values are always of a similar size and the attribute has a minimum length.
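The precision formula translates directly into code. The counts used below are hypothetical and serve only to show the arithmetic; the slides report the resulting precision (99.56%) but not the underlying numbers of rules.

```python
def precision_percent(relevant_rules: int, false_positives: int) -> float:
    """Relevant rules as a percentage of all discovered rules."""
    total = relevant_rules + false_positives
    if total == 0:
        raise ValueError("no correlation rules were discovered")
    return relevant_rules / total * 100


# Hypothetical counts, not taken from the evaluation
print(round(precision_percent(99, 1), 2))  # 99.0
```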


THANK YOU!

Questions?
