Building a Smarter Planet
Automated Correlation Discovery for Semi-Structured Business Processes
DEBS 2011
Szabolcs Rozsnyai, Aleksander Slominski, Geetika T. Lakshmanan
Agenda
• Motivation
• Big Picture and Context
• Related Work
• Algorithm (with Examples)
  – Data Pre-Processing
  – Statistics Calculation
  – Determining Correlation Candidates
• Screenshots of prototype application
• Conclusion & Future Work
Motivation
• Event-producing systems are
  – distributed,
  – changing rapidly,
  – federated,
  – loosely coupled,
  – generating huge numbers of events.
• Correlating events requires a lot of knowledge about the source systems and their data.
We present a novel algorithm that automatically determines correlation rules for monitoring, discovery, and other applications.
Solution Overview
1. Correlation rules are common identifiers, defined as a correspondence between the attributes of two different types.
   – Correlation rule example: A.x = B.y, where A and B are types and x and y are attributes.
2. The correlation rules are determined by a unique combination of statistics applied to event attributes. Several attribute statistics are taken into account to improve the precision of the correlation candidate detection and to calculate a confidence score.
3. The algorithm requires no knowledge about the structure of artifacts (the event format could be anything, e.g. XML) or the data types of their attributes, nor does it require a normalized organization of artifacts.
4. The confidence score defines the significance of a correlation rule.
5. Correlation rules discovered by our algorithm can be used at runtime to group related artifacts together (such as events belonging to a process instance), or to create a graph of relationships that enables querying and walking the paths of relationships.
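To illustrate how a rule of the form A.x = B.y could group related artifacts at runtime, here is a minimal Python sketch. The in-memory event layout and the `correlate` helper are assumptions made for illustration only; they are not the prototype's API.

```python
from collections import defaultdict

# Hypothetical in-memory events: (event type, attribute dict).
events = [
    ("OrderReceived", {"OrderId": "166635", "Product": "ProductA"}),
    ("OrderReceived", {"OrderId": "166636", "Product": "ProductB"}),
    ("ShipmentCreated", {"ShipmentId": "253355", "OrderId": "166635"}),
    ("ShipmentCreated", {"ShipmentId": "253356", "OrderId": "166636"}),
]

def correlate(events, rule):
    """Group events that satisfy a rule of the form A.x = B.y.

    rule is ((type_a, attr_x), (type_b, attr_y)); the result maps each
    shared value to the events of type A and type B that carry it.
    """
    (type_a, attr_x), (type_b, attr_y) = rule
    groups = defaultdict(list)
    for event_type, attrs in events:
        if event_type == type_a and attr_x in attrs:
            groups[attrs[attr_x]].append((event_type, attrs))
        elif event_type == type_b and attr_y in attrs:
            groups[attrs[attr_y]].append((event_type, attrs))
    # Keep only values that were seen on both sides of the rule.
    return {
        value: members
        for value, members in groups.items()
        if {t for t, _ in members} == {type_a, type_b}
    }

rule = (("OrderReceived", "OrderId"), ("ShipmentCreated", "OrderId"))
groups = correlate(events, rule)
```

Each entry of `groups` then holds the events of one order, which is the kind of grouping the rule OrderReceived.OrderId = ShipmentCreated.OrderId is meant to produce.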
Big Picture and Context
Related Work
• Motahari Nezhad et al. (HP)
  – Their approach mainly takes instance-based measures into account to determine the “interestingness” of correlation pairs (and groups of pairs).
  – That is, they first prune the large space of potential correlation pairs using techniques similar to DePauw et al., then correlate the data with this large set of correlation rules to generate various correlated instances. Finally, they apply certain statistics to the instances to determine whether the correlation rules make sense.
• DePauw et al. (IBM)
  – The work by DePauw et al. has at its core a certain similarity to our algorithm. For instance, we also take the notion of Indexable and Mappable Paths into account, but mainly to reduce the problem space of candidate-pair permutations that need to be checked against each other for potential correlations. In our algorithm this step is optional; without it, every attribute of a type is matched against every attribute of another type.
  – In addition, our correlation algorithm takes several attribute-based statistics into account to improve the precision of the correlation candidate detection, and calculates a confidence score based on those statistics.
• CORDS (IBM) is a tool that uses statistical methods to discover correlations and soft functional dependencies between database columns, producing a dependency graph that improves the performance of query optimizers.
  – In the database world, detailed knowledge about the data is available, defined either in the schema or in metadata. That is, relations and attributes are defined, and the attribute types (e.g. integer, string, timestamp, …) are known.
  – A key difference between our algorithm and other approaches is that it does not assume that artifacts are grouped together in a normalized schema, nor does it have any metadata describing an artifact's attributes.
Overview
Our algorithm for correlation discovery is divided into three major steps:
• Data Pre-Processing.
  – The first step of the correlation discovery process is to load and integrate the data into a data store (e.g. database, cloud storage, etc.) that is then used to calculate statistics and determine correlation candidates.
• Statistics Calculation.
  – After the data has been loaded and integrated into the internal representation, various statistics, mainly on attribute values, are calculated and stored in a fast-access data structure.
• Determining Correlation Candidates.
  – In the last step, the correlation discovery algorithm determines correlation pairs with a certain confidence value, based on the previously calculated statistics.
Data Pre-Processing
• Raw events are stored in a data store.
• Attributes of events are extracted (the method of extraction is out of scope here).
• Each event has a type assigned (e.g. OrderReceived, ShipmentCreated, TransportStarted, …).

CommonAlias (Type holds the event type, Raw holds the raw event)
Key      Timestamp               Type             Raw
32123…   2011-01-01T09:35:52.50  OrderReceived    <OrderReceived…
213131…  2011-01-01T09:40:54.50  ShipmentCreated  <ShipmentCreated…

Event attributes of the OrderReceived event
DateTime                OrderId  Product   …
2011-01-01T09:35:52.50  166635   ProductA  …

Event attributes of the ShipmentCreated event
DateTime                ShipmentId  OrderId  …
2011-01-01T09:31:52.50  253355      166635   …
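The storage layout above could be modelled roughly as follows. This is only a sketch: the `StoredEvent` class, the `columns_by_type` helper, the keys "k1"/"k2" and the shortened raw payloads are placeholders invented for illustration, since the slides do not show the prototype's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class StoredEvent:
    key: str            # storage key of the event (placeholder values below)
    timestamp: str      # time the event was recorded
    event_type: str     # e.g. "OrderReceived"
    raw: str            # unparsed payload
    attributes: dict = field(default_factory=dict)  # extracted name -> value

def columns_by_type(events):
    """Collect, per event type, the list of values seen for each attribute."""
    columns = defaultdict(lambda: defaultdict(list))
    for event in events:
        for name, value in event.attributes.items():
            columns[event.event_type][name].append(value)
    return columns

store = [
    StoredEvent("k1", "2011-01-01T09:35:52.50", "OrderReceived", "<OrderReceived...",
                {"DateTime": "2011-01-01T09:35:52.50", "OrderId": "166635",
                 "Product": "ProductA"}),
    StoredEvent("k2", "2011-01-01T09:40:54.50", "ShipmentCreated", "<ShipmentCreated...",
                {"DateTime": "2011-01-01T09:31:52.50", "ShipmentId": "253355",
                 "OrderId": "166635"}),
]
columns = columns_by_type(store)
```

Grouping values per type and attribute is a convenient starting point for statistics that are computed attribute by attribute.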
Statistics Calculation 1/2
Statistics Calculation 2/2
• Attribute Cardinality (Index): contains a map of each value and how often it occurred.
• Card: the number of different values for the attribute.
• Cnt: the total number of instances in which the attribute occurs. Because the data structure does not rely on a defined schema, an attribute may not occur in every instance.
• AvgAttributeLength: the average length of the attribute's values. This is an indicator of the potential uniqueness of a value: a long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is likely to occur in other types and thus form a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet is never used for correlating artefacts.
• InferencedType: the type of an attribute. The type is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates. An attribute whose values are mostly alphanumeric is very unlikely to correlate with an attribute of a different type.
  – The type is determined with a threshold of 0.9 (e.g. min. 90% of the values must be numeric); we refer to this parameter as Phi.
  – Currently the following type distinctions are supported:
    - Numeric or Alphanumeric
    - Timestamp/DateTime
    - Boolean
    - Descriptiontext
• NoOfNumeric: depending on the InferencedType, the number of values that are of a numeric type.
• NoOfAlphaNum: depending on the InferencedType, the number of values that are of an alphanumeric type.
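The value-based statistics can be sketched in a few lines of Python. The function name `attribute_statistics` and the use of `None` for a missing attribute are assumptions for illustration; type inference is left out here.

```python
from collections import Counter

def attribute_statistics(values):
    """Compute Index, Card, Cnt and AvgAttributeLength for one attribute.

    values holds the attribute's value for every instance of a type;
    None marks instances in which the attribute does not occur.
    """
    present = [str(v) for v in values if v is not None]
    index = Counter(present)                  # value -> number of occurrences
    cnt = len(present)                        # instances carrying the attribute
    avg_len = sum(len(v) for v in present) / cnt if cnt else 0.0
    return {
        "Index": dict(index),
        "Card": len(index),                   # number of different values
        "Cnt": cnt,
        "AvgAttributeLength": avg_len,
    }

# Product values of five hypothetical OrderReceived events.
product = ["ProductA", "ProductB", "ProductC", "ProductD", "ProductA"]
stats = attribute_statistics(product)
```

For these five values the sketch yields four distinct values out of five occurrences, with every value eight characters long.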
Example

OrderReceived
DateTime                OrderId  Product   Amount  DeliveryUntil           CustomerId
2011-01-01T09:35:52.50  166635   ProductA  10      2011-01-10T23:59:59.00  46546546
2011-01-01T09:40:54.50  166636   ProductB  2       2011-01-10T23:59:59.00  41231234
2011-01-01T09:41:51.30  166637   ProductC  1       2011-01-10T23:59:59.00  46123123
2011-01-01T09:43:32.50  166638   ProductD  7       2011-01-10T23:59:59.00  72123123
2011-01-01T09:43:42.50  166639   ProductA  2       2011-01-10T23:59:59.00  12312544

ShipmentCreated
DateTime                ShipmentId  OrderId  Carrier
2011-01-01T09:31:52.50  253355      166635   IntlAirCargo
2011-01-01T09:41:54.50  253356      166636   InTimeTruck Ltd
2011-01-01T09:42:51.30  253357      166637   InTimeTruck Ltd
2011-01-01T09:44:32.50  253358      166638   IntlAirCargo
2011-01-01T09:44:42.50  253359      166639   IntlAirCargo

TransportStarted
DateTime                TransportId  ShipmentId  StartLocation
2011-01-02T07:00:00.00  8889994      253355      New York
2011-01-02T07:00:00.00  8889995      253356      New York
2011-01-02T07:00:00.00  8889996      253357      New York
2011-01-02T07:00:00.00  8889997      253358      New York
2011-01-02T07:00:00.00  8889998      253359      New York

TransportEnded
DateTime                TransportId  ShipmentId  EndLocation
2011-01-02T14:35:52.50  8889994      253355      Miami
2011-01-02T15:41:54.50  8889995      253356      Washington D.C.
2011-01-02T11:42:51.30  8889996      253357      Boston
2011-01-01T12:44:32.50  8889997      253358      Baltimore
2011-01-01T11:33:42.50  8889998      253359      Chicago
Example - Index

OrderReceived       DateTime  OrderId  Product       Amount   DeliveryUntil  CustomerId
Index               <<Map>>   <<Map>>  <<Map>>       <<Map>>  <<Map>>        <<Map>>
Card                5         5        4             4        1              5
Cnt                 5         5        5             5        5              5
AvgAttributeLength  22        6        8             1.2      22             6
InferencedType      DateTime  Numeric  Alphanumeric  Numeric  DateTime       Numeric
NoOfNumeric         0         5        0             5        0              5
NoOfAlphaNumeric    0         0        5             0        0              0

Product_Index
Value     Cardinality
ProductA  2
ProductB  1
ProductC  1
ProductD  1

The attribute cardinality (i.e. Index) contains a map of each value and how often it occurred.
Example - Card
• Card determines the number of different values for the attribute.
• Product_Index contains 4 unique values, so Card = 4 for Product.
Example - Cnt
• Cnt represents the total number of instances in which the attribute occurs. Because the data structure does not rely on a defined schema, an attribute may not occur in every instance.
• Here Cnt = 5 for every attribute. For certain attributes the number might be smaller, as they can be null or missing.
Example - AvgAttributeLength
• AvgAttributeLength is calculated as the average length of the attribute's values.
• It is an indicator of the potential uniqueness of a value: a long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is likely to occur in other types and thus form a correlation.
• The indicator can also be misleading, since a textual description may be very long and in fact unique, yet is never used for correlating artefacts.
Example - InferencedType
• InferencedType determines the data type of an attribute. The type is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates. An attribute whose values are mostly alphanumeric is very unlikely to correlate with an attribute of a different type.
• The type is determined with a threshold of 0.9 (e.g. min. 90% of the values must be numeric); we refer to this parameter as Phi.
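A rough Python sketch of type inference with the Phi threshold follows. The regular expression, the timestamp format string and the order of the checks are assumptions chosen to fit the example data; the Descriptiontext category is omitted because the slides do not say how it is detected.

```python
import re
from datetime import datetime

NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")

def is_timestamp(value):
    """True if value parses in the timestamp format used by the example data."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
        return True
    except ValueError:
        return False

def infer_type(values, phi=0.9):
    """Return the first type that at least a share phi of the values satisfies."""
    checks = [
        ("Boolean", lambda v: v.lower() in ("true", "false")),
        ("DateTime", is_timestamp),
        ("Numeric", lambda v: NUMERIC.fullmatch(v) is not None),
    ]
    values = [str(v) for v in values]
    total = len(values)
    for name, predicate in checks:
        if total and sum(1 for v in values if predicate(v)) / total >= phi:
            return name
    return "Alphanumeric"   # fallback when no stricter type reaches phi

order_id_type = infer_type(["166635", "166636", "166637", "166638", "166639"])
product_type = infer_type(["ProductA", "ProductB", "ProductC", "ProductD", "ProductA"])
```

With Phi = 0.9, a column of nine numeric values and one stray string would still be classified as Numeric, which is the tolerance the threshold is meant to provide.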
Example – The rest of the types…

ShipmentCreated     DateTime  ShipmentId  OrderId  Carrier
Index               <<Map>>   <<Map>>     <<Map>>  <<Map>>
Card                5         5           5        3
Cnt                 5         5           5        5
AvgAttributeLength  22        6           6        13.2
InferencedType      DateTime  Numeric     Numeric  Alphanumeric
NoOfNumeric         0         5           5        0
NoOfAlphaNumeric    0         0           0        5

TransportStarted    DateTime  TransportId  ShipmentId  StartLocation
Index               <<Map>>   <<Map>>      <<Map>>     <<Map>>
Card                1         5            5           1
Cnt                 5         5            5           5
AvgAttributeLength  22        7            6           8
InferencedType      DateTime  Numeric      Numeric     Alphanumeric
NoOfNumeric         0         5            5           0
NoOfAlphaNumeric    0         0            0           5

TransportEnded      DateTime  TransportId  ShipmentId  EndLocation
Index               <<Map>>   <<Map>>      <<Map>>     <<Map>>
Card                5         5            5           1
Cnt                 5         5            5           5
AvgAttributeLength  22        7            6           8.4
InferencedType      DateTime  Numeric      Numeric     Alphanumeric
NoOfNumeric         0         5            5           0
NoOfAlphaNumeric    0         0            0           5
Determining Correlation Candidates
• The confidence score of correlation candidates is determined by the following three parameters, with a default set of weights:
  – Set Difference. The difference set between the values of the two correlation candidates; assigned a weight of 60%.
  – Difference between AvgAttributeLength. The difference between the average value lengths of the two correlation candidates; assigned a weight of 20%.
  – LevenshteinDistance. The Levenshtein distance between the attribute names; assigned a weight of 20%.
Difference Set 1/2
• The first confidence score is calculated by creating the difference set for all permutations of pairs of attribute candidates.
• To reduce the search space of candidates, we apply an approach similar to [1][2]: we first determine so-called Highly Indexable Attributes for each type, and then Mappable Attributes, to form pair candidates.
• Highly Indexable Attribute: an attribute that is potentially unique for each instance of a type. It is determined by the following formula:
    Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon
  – Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused, for instance, by duplicates.
  – Epsilon is an additional parameter that defines the minimum average length of an attribute.
• Mappable Attribute: a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that a value occurring more often than this threshold is unlikely to be a correlation candidate:
    { x_i | x_i < Gamma },  where x is the cardinality of a value and i is an attribute of a type
  – Gamma is a threshold parameter that can be set experimentally and customized to the application scenario based on knowledge of the artefacts.

[1] I. Ilyas, V. Markl, P. Haas, P. Brown. (2004). CORDS: Automatic discovery of correlations and soft functional dependencies.
[2] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, U. Leser. (2009). A Machine Learning Approach to Foreign Key Discovery. WebDB.
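The Highly Indexable test can be written as a small predicate. In this sketch the function name and the tuple layout are assumptions; the statistics are the OrderReceived values from the example tables, and Alpha = 0.9 and Epsilon = 5 are the values used in the worked example.

```python
def is_highly_indexable(card, cnt, avg_length, alpha=0.9, epsilon=5):
    """Card / Cnt > Alpha and AvgAttributeLength > Epsilon."""
    return cnt > 0 and card / cnt > alpha and avg_length > epsilon

# (Card, Cnt, AvgAttributeLength) per OrderReceived attribute, as listed
# in the example statistics table.
order_received = {
    "DateTime": (5, 5, 22),
    "OrderId": (5, 5, 6),
    "Product": (4, 5, 8),
    "Amount": (4, 5, 1.2),
    "DeliveryUntil": (1, 5, 22),
    "CustomerId": (5, 5, 6),
}

indexable = [
    name
    for name, (card, cnt, avg_length) in order_received.items()
    if is_highly_indexable(card, cnt, avg_length)
]
```

Product and Amount fail the uniqueness ratio (0.8), DeliveryUntil has a single value, and Amount is additionally too short, so only DateTime, OrderId and CustomerId remain.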
Example – Determining Highly Indexables

Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon, where Alpha = 0.9 and Epsilon = 5

OrderReceived       DateTime  OrderId  Product  Amount  DeliveryUntil  CustomerId
Card                5         5        4        4       1              5
Cnt                 5         5        5        5       5              5
Card / Cnt          1         1        0.8      0.8     0.2            1
AvgAttributeLength  22        6        8        1.2     22             6
Example – Determining Mappables
• The Mappable Attribute can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that a value occurring more often than this threshold is unlikely to be a correlation candidate.
    { x_i | x_i < Gamma },  where x is the cardinality of a value and i is an attribute of a type
• Here: cardinality of a value < Gamma, where Gamma = 10.
• For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer may well have more than 10 orders).
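One possible reading of the Mappable test is sketched below: an attribute counts as mappable when none of its values occurs Gamma times or more. The slides leave open whether the filter applies per attribute or per value, so this interpretation and the function name are assumptions.

```python
from collections import Counter

def is_mappable(values, gamma=10):
    """True if every value of the attribute occurs fewer than gamma times."""
    index = Counter(values)   # value -> cardinality of that value
    return all(count < gamma for count in index.values())

carrier = ["IntlAirCargo", "InTimeTruck Ltd", "InTimeTruck Ltd",
           "IntlAirCargo", "IntlAirCargo"]
carrier_mappable = is_mappable(carrier)

# Hypothetical counter-example: one customer placing twelve orders.
customer_ids = ["C-1"] * 12
customer_mappable = is_mappable(customer_ids)
```

The second case shows the caveat from the slide: a threshold that suits orders per shipment would wrongly rule out a customer identifier that legitimately repeats many times.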
Difference Set 2/2
• Once all Indexable and Mappable Attributes of all types have been determined, the next step is to find candidate pairs of attributes that potentially correlate with each other.
• To this end, a difference set A \ B = { x | x ∈ A ∧ x ∉ B } is created for all permutations of attribute candidates A and B.
• A \ B must not exceed a certain threshold in order to be taken into account:
    |A \ B| <= DiffThreshold
• Candidate pairs are excluded if their types, based on the previously determined InferencedType, do not match.
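The pairing step can be sketched as follows. Expressing the difference as a fraction of A's distinct values (so that it can be compared with a threshold such as 0.95) and skipping pairs within the same type are assumptions of this sketch, as are the function names and the dictionary layout.

```python
from itertools import permutations

def relative_difference(a, b):
    """Share of A's distinct values that never occur in B (|A minus B| / |A|)."""
    a, b = set(a), set(b)
    return len(a - b) / len(a) if a else 1.0

def candidate_pairs(attributes, types, diff_threshold=0.95):
    """attributes: {(type, attr): values}; types: {(type, attr): InferencedType}."""
    pairs = []
    for left, right in permutations(attributes, 2):
        if left[0] == right[0]:            # only pair attributes of different types
            continue
        if types[left] != types[right]:    # exclude InferencedType mismatches
            continue
        diff = relative_difference(attributes[left], attributes[right])
        if diff <= diff_threshold:
            pairs.append((left, right, diff))
    return pairs

order_ids = ["166635", "166636", "166637", "166638", "166639"]
shipment_ids = ["253355", "253356", "253357", "253358", "253359"]
attributes = {
    ("OrderReceived", "OrderId"): order_ids,
    ("ShipmentCreated", "ShipmentId"): shipment_ids,
    ("ShipmentCreated", "OrderId"): order_ids,
}
types = {key: "Numeric" for key in attributes}
pairs = candidate_pairs(attributes, types)
```

On this data only the two OrderId columns survive, once in each direction, each with a difference of 0.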
Example
• Indexable Attributes
  – OrderReceived: DateTime, OrderId, CustomerId
  – ShipmentCreated: DateTime, ShipmentId, OrderId
  – TransportStarted: TransportId, ShipmentId
  – TransportEnded: DateTime, TransportId, ShipmentId
• Mappable Attributes
  – To keep the examples simple, every attribute in our scenario is considered a Mappable Attribute, as the total number of instances is lower than the threshold.
• DateTime attributes are excluded: timestamps are not suitable for correlation pairs. The same applies to booleans and description texts.
Example – DifferenceSet for all Permutations

A \ B = { x | x ∈ A ∧ x ∉ B },  |A \ B| <= DiffThreshold,  DiffThreshold = 0.95

Candidate pair                                              SetDiff
OrderReceived.OrderId = ShipmentCreated.ShipmentId          100%
OrderReceived.OrderId = ShipmentCreated.OrderId             0%
OrderReceived.OrderId = TransportStarted.TransportId        100%
OrderReceived.OrderId = TransportStarted.ShipmentId         100%
…                                                           …

Resulting candidates of correlation pairs with 100% overlap:

Candidate pair                                              SetDiff
OrderReceived.OrderId = ShipmentCreated.OrderId             0%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId    0%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId      0%
TransportStarted.TransportId = TransportEnded.TransportId   0%
TransportEnded.TransportId = TransportStarted.TransportId   0%
• A difference often occurs, especially when processes are not completed, have been prematurely terminated or aborted, or when events are not always generated because of decision forks. Bear in mind that this is a very simplified example!
• Pairs that are associative (the same pair in reverse order, i.e. A.x = B.y and B.y = A.x) are removed!
• In this case every pair has the same type. In practice this is not the case! Pairs that are not of the same type are excluded from the permutation set, and thus their difference set is not calculated.
Difference between AvgAttributeLength

• The second weighting factor for the confidence is the difference between the AvgAttributeLength of the two correlation candidates.
• A large difference between the average attribute lengths suggests that the two attributes are unlikely to share a significant relationship.
Example – AvgAttributeLength

Candidate pair                                              SetDiff  AvgAttrLength
OrderReceived.OrderId = ShipmentCreated.OrderId             0%       0
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId    0%       0
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId      0%       0
TransportStarted.TransportId = TransportEnded.TransportId   0%       0
TransportEnded.TransportId = TransportStarted.TransportId   0%       0
LevenshteinDistance
• The last variable that influences confidence weighting is the Levenshtein distance between the names of two attributes.
• It is common that attribute names from different sources might have the same or comparable names if they have the same meaning.
• For example, in one system the attribute that contains the identifier for an order is named OrderId and in the other it is named order-id.
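For reference, the Levenshtein distance between two attribute names can be computed with the standard dynamic-programming recurrence. This is a textbook sketch; whether the prototype normalizes case or separators before comparing names is not stated, so the lower-casing in the second call is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))        # distances from "" to b[:j]
    for i, char_a in enumerate(a, start=1):
        current = [i]                         # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,              # delete char_a
                current[j - 1] + 1,           # insert char_b
                previous[j - 1] + cost,       # substitute or match
            ))
        previous = current
    return previous[-1]

raw_distance = levenshtein("OrderId", "order-id")
normalized_distance = levenshtein("OrderId".lower(), "order-id".lower())
```

Comparing the names as written costs three edits (two case changes and the hyphen); after lower-casing only the hyphen remains.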
Example – LevenshteinDistance

Candidate pair                                              SetDiff  AvgAttrLength  LevenshteinDistance
OrderReceived.OrderId = ShipmentCreated.OrderId             0%       0              0
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId    0%       0              0
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId      0%       0              0
TransportStarted.TransportId = TransportEnded.TransportId   0%       0              0
TransportEnded.TransportId = TransportStarted.TransportId   0%       0              0
Example – Weight Calculation

Weights (adjustable): SetDiff 60%, AvgAttrLength 20%, LevenshteinDistance 20%

Candidate pair                                              SetDiff  AvgAttrLength  LevenshteinDistance  Confidence
OrderReceived.OrderId = ShipmentCreated.OrderId             0%       0              0                    100%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId    0%       0              0                    100%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId      0%       0              0                    100%
TransportStarted.TransportId = TransportEnded.TransportId   0%       0              0                    100%
TransportEnded.TransportId = TransportStarted.TransportId   0%       0              0                    100%
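The slides give the weights but not the exact scoring formula. One way to combine them, shown here purely as an assumption, is to normalize each factor to a penalty between 0 (perfect agreement) and 1, and to sum the weighted complements.

```python
# Default weights from the slides; the combination rule below is assumed.
WEIGHTS = {"set_diff": 0.6, "avg_length": 0.2, "levenshtein": 0.2}

def confidence(penalties, weights=WEIGHTS):
    """Weighted confidence in [0, 1].

    penalties maps each factor to a value in [0, 1], where 0 means the two
    attributes agree perfectly on that factor. How the raw length difference
    and name distance are scaled into [0, 1] is left to the caller.
    """
    return sum(weight * (1.0 - penalties[name]) for name, weight in weights.items())

# All three factors are 0 for every pair in the example.
perfect = confidence({"set_diff": 0.0, "avg_length": 0.0, "levenshtein": 0.0})

# Hypothetical pair in which half of A's values are missing from B.
partial = confidence({"set_diff": 0.5, "avg_length": 0.0, "levenshtein": 0.0})
```

Under this assumed rule, the example pairs score 1.0 (100%), in line with the Confidence values above, while the hypothetical pair drops to 0.7 because the set difference carries 60% of the weight.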
Correlation Discovery
Correlation Discovery Refinement
Evaluation, Conclusion and Future Work
• Export compliance regulation
• Wide range of heterogeneous systems
  – Order Management
  – Document Management
  – E-Mail
  – Export Violation Detection Services
  – Workflow-supported human-driven interactions (Process Management System)
• 24 EventTypes
• 95 Attributes
• Precision: 99.56%
  – Precision = No.of.RelevantCorrelationRules / (No.of.RelevantCorrelationRules + FalsePositives) * 100
• False positive example: correlation by “orderVolume”
  – The values are always of similar size and the attribute meets the min. length.
THANK YOU!
Questions?