Project Report: Message Type Extraction from Log Files


1. INTRODUCTION

A log file is a file that records the events that take place while an

operating system or other software runs. Event logs record events taking place in the execution

of a system in order to provide an audit trail that can be used to understand the activity of the

system and to diagnose problems. They are essential to understand the activities of complex

systems, particularly in the case of applications with little user interaction (such as server

applications). System administrators can trawl through event logs to detect critical alerts and

failures and fix them.[2]

1.1 Supercomputer Logs

Figure 1.1: Sample entries of a log file.

Supercomputers generate event logs with millions of entries. This makes it difficult for an

administrator to identify possible alerts manually. Hence, there is a need to streamline the

process of detecting alerts. To this end, message types must be extracted.

A message type is basically a cluster of similar event descriptions. Message type descriptions are

the templates on which the individual unstructured messages in any event log are built.

Extraction of message types serves several purposes. Message types can serve as abstractions for

compression of log files, visualization of alerts occurring in the system and simplifying the

search procedure in log files.


1.2 Message Type Extraction

Message types can be defined with the help of the following example:

Consider the following events of a log file

Figure 1.2: Entries of an event log.

The entries can be grouped under the cluster “generating *”: among all the entries, the token “generating” remains constant while the other token varies, and the varying token is hence generalized using the wildcard “*”. At the outset, message type extraction is thus a classic clustering problem. Many existing tools do pursue this objective, but with either poor accuracy or poor performance. To this end, a novel multi-level clustering technique is proposed which scales well on large datasets and provides accurate message type descriptions [1]. This model is extended to show the

visualization of the alerts in the file. Event logs generated by applications that run on a system

consist of independent lines of text data, which contain information that pertains to events that

occur within a system. This makes them an important source of information to system

administrators in fault management and for intrusion detection and prevention. With regard to

autonomic systems, these two tasks are important cornerstones for self-healing and self-

protection, respectively. Therefore, to move towards the goal of building systems that are capable

of self-healing and self-protection, an important step would be to build systems that are capable

of automatically analyzing the contents of their log files, in addition to measured system metrics

to provide useful information to the system administrators.

1.3 Motivation

Extraction of message types makes it possible to abstract the unstructured content of event

logs, which constitutes a key challenge to achieving fully automatic analysis of system logs.


Message type descriptions are the templates on which the individual unstructured messages in

any event log are built. Message types can abstract the contents of system logs. We can therefore

use them to obtain more concise and compact representations of log entries. This leads to

memory and space savings. Each unique message type can be assigned an Identifier Index (ID),

which in turn can be used to index historical system logs leading to faster searches. The building

of computational models on the log data, which usually requires the input of structured data, can

be facilitated by the initial extraction of message type information. Message types are used to

impose structure on the unstructured messages in the log data before they are used as input into

the model building algorithm. Visualization is an important component of the analysis of large

data sets. Visualization of the contents of systems logs can be made more meaningful to a human

observer by using message types as a feature of the visualization. For the visualization to be

meaningful to a human observer, the message types must be interpretable. This fact provides a

strong incentive for the production of message types that have meaning to a human observer.

1.4 Objective

The objective is to create a tool that takes a log file as input, performs preprocessing on it,

and then performs the message type extraction. Using the results obtained from it, statistical tools

like pie charts and runtime curves are used. Among the message types, ones which are alerts are

taken and monitored by viewing the messages in those clusters.

1.5 Scope

◦ Works on 3 datasets (as of now) but can be extended to other log files after making

minor modifications

◦ Displays runtime curves for a given dataset and also pie charts showing the alerts.

1.6 Organization of the Report

Chapter 1 gives an overview of the project problem statement. It includes an introduction to the domain, and the motivation, scope and objective of the project.

Chapter 2 presents the literature survey of the project, where the terminologies involved, previous work and information about third-party tools used are discussed.

Chapter 3 gives an overview of the system required for the project, where the problem specification, the various modules and their functionalities, and the software and hardware requirements are mentioned.

Chapter 4 gives the design of the project, where the UML diagrams involved in the project are specified.

Chapter 5 describes the implementation of the project. This includes the classes and methods used and the algorithms implemented, with examples.

Chapter 6 includes the results of the project execution and their analysis.

Chapter 7 gives the conclusions and future work related to the project.


2. LITERATURE SURVEY

2.1 Previous Work

Data clustering as a technique in data mining or machine learning is a process whereby

entities are sorted into groups called clusters, where members of each cluster are similar to each

other and dissimilar from members of other groups. Clustering can be useful in the interpretation

and classification of data sets too large to analyze manually. Clustering therefore can be a useful

first step in the automatic analysis of event logs. If each textual line in an event log is considered

a data point and its individual words considered attributes, then the clustering task reduces to one

in which similar log messages are grouped together.

2.1.1 Existing Techniques

While several algorithms like CLIQUE[3], CURE, and MAFIA[5] have been designed for

clustering high dimensional data, these algorithms are still not quite suitable for log files because

an algorithm suitable for clustering event logs needs to not just be able to deal with high-

dimensional data, but it also needs to be able to deal with data with different attribute types. On

the other hand, SLCT and Loghound are two algorithms that were designed specifically for automatically clustering log files and discovering event formats. Because both SLCT and

Loghound are similar to the Apriori algorithm, they require the user to provide a support

threshold value as input.

2.1.2 SLCT

SLCT works through a three-step process. It first identifies the frequent words (words that occur more frequently than a support threshold value), or 1-item sets, from the data. It then extracts the combinations of these 1-item sets that occur in each line of the data set. These 1-item set combinations are cluster candidates. Finally, those cluster candidates that occur more frequently than the support value are selected as the clusters in the data set. Risto Vaarandi's SLCT uses an algorithm specifically designed to detect word clusters in log messages. It makes three passes through the data to accomplish this objective. A hash counting all words and their position in the line is generated on the first pass through the data (“the dog ran” is hashed into three keys: 1.the, 2.dog, 3.ran). Words having a support less than s are then pruned from the hash, and a new hash of message word clusters is generated during a second pass through the data (the messages “the dog ran” and “the deer ran” would generate a key of 1.the 2.* 3.ran for s=2; the second word is the wild card “*” since dog and deer only appear once). An optional third pass can be performed in which wild card positions are refined with constant heads or tails if possible (in our example, 2.* becomes 2.d* because both dog and deer begin with d). The resulting word clusters and their supports are output, and any lines not matching any word cluster are saved to a separate file for review (“outlier” lines).[4]
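The first two passes of this scheme can be sketched in C# as follows. This is a minimal illustration of the positional word-counting idea rather than Vaarandi's implementation; the method names and the support parameter s are our own:

using System;
using System.Collections.Generic;

class SlctSketch
{
    // Pass 1: count each (position, word) pair across all messages.
    static Dictionary<string, int> CountPositionalWords(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            var tokens = line.Split(' ');
            for (int i = 0; i < tokens.Length; i++)
            {
                string key = (i + 1) + "." + tokens[i];   // e.g. "1.the"
                counts.TryGetValue(key, out int c);
                counts[key] = c + 1;
            }
        }
        return counts;
    }

    // Pass 2: rewrite a line as a cluster candidate, keeping positional
    // words with support >= s and replacing the rest with the wildcard "*".
    static string ToClusterCandidate(string line, Dictionary<string, int> counts, int s)
    {
        var tokens = line.Split(' ');
        var parts = new string[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
        {
            string key = (i + 1) + "." + tokens[i];
            parts[i] = counts[key] >= s ? key : (i + 1) + ".*";
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        var lines = new[] { "the dog ran", "the deer ran" };
        var counts = CountPositionalWords(lines);
        foreach (var line in lines)
            Console.WriteLine(ToClusterCandidate(line, counts, s: 2));
        // Both lines map to the candidate "1.the 2.* 3.ran".
    }
}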

2.1.3 Loghound

Loghound on the other hand discovers frequent patterns from event logs by utilizing a

frequent item set mining algorithm, which mirrors the Apriori algorithm more closely than SLCT

because it works by finding item sets which may contain more than 1 word up to a maximum

value provided by the user. With both SLCT and Loghound, lines that do not match any of the

frequent patterns discovered are classified as outliers. The shortcomings of Loghound and SLCT

are two fold. Firstly, they both focus on finding only frequent message patterns in log data but

not infrequent patterns. While this might suffice most times, it may sometimes be necessary to

also find infrequent patterns for analysis. Infrequent patterns may be more interesting to find in

applications such as anomaly detection. Secondly comes the issue of semantics. Patterns found

by Loghound and SLCT are all valid but may not necessarily make sense to a human observer.

This observation becomes relevant if the patterns found will be used in a visualization tool such

as LogView. It is therefore important to extend the work of tools like Loghound and SLCT by

designing an algorithm that will allow the discovery of infrequent patterns and also patterns that

are meaningful to a human observer.

2.2 Datasets

In this work, log files of 3 different supercomputers – LA HPC-1, Blue Gene/P and Blue Gene/L – are used for evaluation.


2.2.1 Blue Gene

Blue Gene is an IBM project aimed at designing supercomputers that can reach operating

speeds in the PFLOPS (petaFLOPS) range, with low power consumption. The project created

three generations of supercomputers, Blue Gene/L, Blue Gene/P, and Blue Gene/Q. Blue Gene

systems have often led the TOP500 and Green500 rankings of the most powerful and most power

efficient supercomputers, respectively. Blue Gene systems have also consistently scored top

positions in the Graph500 list. The project was awarded the 2009 National Medal of Technology

and Innovation. The Blue Gene/P dataset[6] used in the current work has 4.7 million entries,

while the Blue Gene/L dataset has 1.7 million entries. The entries have been collected over a six

month period. The dataset size is large enough to pose a data mining problem. The data consists

of RAS log messages collected over a period of 6 months on the Blue Gene/P Intrepid system at Argonne National Laboratory.

Each message in the log contains 15 fields as follows: RECID, MSG_ID, COMPONENT,

SUBCOMPONENT, ERRCODE, SEVERITY, EVENT_TIME, FLAGS, PROCESSOR, NODE,

BLOCK, LOCATION, SERIALNUMBER, ECID, MESSAGE.

2.2.2 LA HPC-1

The HPC-1 is a supercomputer located at the Los Alamos National Laboratory. Its dataset

has a total of 0.4 million entries.

2.3 C#

C# (pronounced as see sharp) is a multi-paradigm programming language encompassing

strong typing, imperative, declarative, functional, procedural, generic, object-oriented (class-

based), and component-oriented programming disciplines. It was developed by Microsoft within

its .NET initiative and later approved as a standard by Ecma (ECMA-334) and ISO (ISO/IEC

23270:2006). C# is one of the programming languages designed for the Common Language

Infrastructure. C# is built on the syntax and semantics of C++, allowing C programmers to take

advantage of .NET and the common language runtime. C# is intended to be a simple, modern,

general-purpose, object-oriented programming language. Its development team is led by Anders

Hejlsberg. The most recent version is C# 5.0, which was released on August 15, 2012. C# is the

programming language that most directly reflects the underlying Common Language


Infrastructure (CLI). Most of its intrinsic types correspond to value-types implemented by the

CLI framework. However, the language specification does not state the code generation

requirements of the compiler: that is, it does not state that a C# compiler must target a Common

Language Runtime, or generate Common Intermediate Language (CIL), or generate any other

specific format. Theoretically, a C# compiler could generate machine code like traditional

compilers of C++ or Fortran. Some notable features of C# that distinguish it from C and C++ (and Java, where noted) are listed below; a short illustrative sketch follows the list:

• C# supports strongly typed implicit variable declarations with the keyword var, and

implicitly typed arrays with the keyword new[] followed by a collection initializer.

• Meta programming via C# attributes is part of the language. Many of these attributes

duplicate the functionality of GCC's and VisualC++'s platform-dependent preprocessor

directives.

• Like C++, and unlike Java, C# programmers must use the keyword virtual to allow

methods to be overridden by subclasses.

• Extension methods in C# allow programmers to use static methods as if they were

methods from a class's method table, allowing programmers to add methods to an object

that they feel should exist on that object and its derivatives.

• The type dynamic allows for run-time method binding, allowing for JavaScript-like

method calls and run-time object composition.

• C# has support for strongly-typed function pointers via the keyword delegate.

• Like the Qt framework's pseudo-C++ signal and slot, C# has semantics specifically

surrounding publish-subscribe style events, though C# uses delegates to do so.

• C# offers Java-like synchronized method calls, via the attribute

[MethodImpl(MethodImplOptions.Synchronized)], and has support for mutually-

exclusive locks via the keyword lock.

• The C# language does not allow for global variables or functions. All methods and

members must be declared within classes. Static members of public classes can

substitute for global variables and functions.

• Local variables cannot shadow variables of the enclosing block, unlike C and C++.


• A C# namespace provides the same level of code isolation as a Java package or a C++

namespace, with very similar rules and features to a package.

• C# supports a strict Boolean data type, bool. Statements that take conditions, such as

while and if, require an expression of a type that implements the true operator, such as

the boolean type. While C++ also has a boolean type, it can be freely converted to and

from integers, and expressions such as if(a) require only that a is convertible to bool,

allowing a to be an int, or a pointer. C# disallows this "integer meaning true or false"

approach, on the grounds that forcing programmers to use expressions that return

exactly bool can prevent certain types of programming mistakes common in C or C++

such as if (a = b) (use of assignment = instead of equality ==).

• In C#, memory address pointers can only be used within blocks specifically marked as

unsafe, and programs with unsafe code need appropriate permissions to run. Most

object access is done through safe object references, which always either point to a

"live" object or have the well-defined null value; it is impossible to obtain a reference to

a "dead" object (one that has been garbage collected), or to a random block of memory.

An unsafe pointer can point to an instance of a value-type, array, string, or a block of

memory allocated on a stack. Code that is not marked as unsafe can still store and

manipulate pointers through the System.IntPtr type, but it cannot dereference them.

• Managed memory cannot be explicitly freed; instead, it is automatically garbage

collected. Garbage collection addresses the problem of memory leaks by freeing the

programmer of responsibility for releasing memory that is no longer needed.
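A minimal sketch illustrating a few of the features listed above (implicitly typed locals with var, a strongly typed delegate, and the lock keyword); all names are purely illustrative:

using System;

class FeatureSketch
{
    // A strongly typed function pointer declared with the delegate keyword.
    delegate int Transform(int x);

    static readonly object Gate = new object();
    static int total;

    static void Main()
    {
        var numbers = new[] { 1, 2, 3 };   // implicitly typed array
        Transform square = x => x * x;     // delegate instance

        foreach (var n in numbers)
        {
            lock (Gate)                    // mutually exclusive section
            {
                total += square(n);
            }
        }
        Console.WriteLine(total);          // prints 14
    }
}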

2.4 WPF

Windows Presentation Foundation (or WPF) is a graphical subsystem for rendering user

interfaces in Windows-based applications by Microsoft. WPF, previously known as "Avalon",

was initially released as part of .NET Framework 3.0. Rather than relying on the older GDI

subsystem, WPF uses DirectX. WPF attempts to provide a consistent programming model for

building applications and separates the user interface from business logic. It resembles similar

XML-oriented object models, such as those implemented in XUL and SVG. WPF employs

XAML, an XML-based language, to define and link various interface elements. WPF


applications can also be deployed as standalone desktop programs, or hosted as an embedded

object in a website. WPF aims to unify a number of common user interface elements, such as

2D/3D rendering, fixed and adaptive documents, typography, vector graphics, runtime

animation, and pre-rendered media. These elements can then be linked and manipulated based on

various events, user interactions, and data bindings. WPF runtime libraries are included with all

versions of Microsoft Windows since Windows Vista and Windows Server 2008. Users of

Windows XP SP2/SP3 and Windows Server 2003 can optionally install the necessary libraries.

2.5 XAML

Extensible Application Markup Language (XAML) is a declarative XML-based language

developed by Microsoft that is used for initializing structured values and objects. It is available

under Microsoft's Open Specification Promise. The acronym originally stood for Extensible

Avalon Markup Language - Avalon being the code-name for Windows Presentation Foundation

(WPF). XAML is used extensively in .NET Framework 3.0 & .NET Framework 4.0

technologies, particularly Windows Presentation Foundation (WPF), Silverlight, Windows

Workflow Foundation (WF) and Windows Runtime XAML Framework and Windows Store

apps. In WPF, XAML forms a user interface markup language to define UI elements, data

binding, eventing, and other features. In WF, workflows can be defined using XAML. XAML

can also be used in Silverlight applications, Windows Phone apps and Windows Store apps.

XAML elements map directly to Common Language Runtime object instances, while XAML

attributes map to Common Language Runtime properties and events on those objects. XAML

files can be created and edited with visual design tools like Microsoft Expression Blend,

Microsoft Visual Studio, and the hostable Windows Workflow Foundation visual designer. They

can also be created and edited with a standard text editor, a code editor like XAMLPad, or a

graphical editor like Vector Architect. Anything that is created or implemented in XAML can be

expressed using a more traditional .NET language, such as C# or Visual Basic.NET. However, a

key aspect of the technology is the reduced complexity needed for tools to process XAML,

because it is based on XML. Consequently, a variety of products are emerging, particularly in the

WPF space, which create XAML-based applications. As XAML is simply based on XML,

developers and designers are able to share and edit content freely amongst themselves without


requiring compilation. Since it is strongly linked to the .NET Framework 3.0 technologies, the

only fully compliant implementation at present is Microsoft's.

2.6 WPF Toolkit

WPF Toolkit is a collection of WPF controls, components and utilities for creating

next generation Windows applications. It provides controls for creating pie charts, bar charts,

histograms etc. Like other WPF controls, they use XAML for creation and specifying their

properties.


3. SYSTEM ANALYSIS

3.1 Terminology

To understand the problem at hand, it is important to define the following terms.

3.1.1 Event log

A text-based audit trail of events that occur within the system or application processes on a

computer system.

3.1.2 Event

An independent line of text within an event log which details a single occurrence on the

system. An event typically contains not only a message but other fields of information like a

Date, Source, and Tag. For message type extraction, we are only interested in the message field

of the event. This is why events are sometimes referred to in the literature as messages. In Figure

3.1, the first five fields (delimited by whitespace) represent the Timestamp, Host, Class, Facility,

and Severity of each event. We omit these types of fields from the message type extraction

process as they are already sufficiently structured. However, they are still useful for further log

analysis, e.g., the Timestamp and Host fields for time series analysis of the unique message types

extracted.

Figure 3.1: General structure of a log entry.


3.1.3 Token

A single word delimited by white space within the message field of an event. The tokens and

the relationship between them is considered while clustering the events.

3.1.4 Event size

The number of individual tokens in the “message” field of an event. The event size is one of the heuristics used while clustering the logs.

3.1.5 Message Type

These are the message fields of entries within an event log that were produced by the same print statement; all messages produced by the same print statement belong to the same event cluster. Due to the

subjectivity of determining what constitutes a message type, it is possible that a human observer

might consider messages produced by a single message type as belonging to different message

types or treat messages produced by different print statements as belonging to the same message

type. It is also possible that the same print statement is present in different parts of the code,

producing different messages types with the same message type description. However, we

consider these scenarios as relatively rare, so we will use this definition for the sake of

simplicity.

3.1.6 Constant Token

A token within the message field of an event which is not represented by a wildcard value in

its associated message type description.

3.1.7 Variable Token

A token within the message field which is represented by a wild card (“*”) and is part of the

message type description.


3.2 Problem

The problem can be defined as follows: given a log file L consisting of entries {L1, L2, L3, ...}, we need to extract message types M = {M1, M2, M3, ...}, where each Mi represents a unique non-empty subset of L. Among all such subsets, we need to find the largest and represent them visually using charts.

3.3 Modules

There are mainly 3 modules in the application.

3.3.1 Pre-processing Module

In this module, the user can select the log file as input and perform pre-processing. Pre-

processing removes all the irrelevant attributes and only keeps the message part of the entries.

3.3.2 Clustering Module

In this module, the pre-processed dataset is given to the mining algorithms. There are 4 algorithms which are called sequentially, one after another – Partition by Event Size, Partition by Token Position, Partition by Bijection, and Message Type Extraction. Each algorithm generates a

series of partitions that are given to the next algorithm. At the end of all the algorithms, the

required message types are generated. This module is space and time intensive and involves the

use of several data structures.
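The sequential hand-off between the four algorithms can be outlined as below. This is only a skeleton of the module's control flow; the method names mirror those listed in Chapter 5, and the bodies shown here are placeholders:

using System.Collections.Generic;

// Skeleton of the clustering module: each algorithm produces partitions
// that are handed to the next one.
class ClusteringPipelineSketch
{
    static List<List<string>> PartitionByEventSize(List<string> messages)
        => new List<List<string>> { messages };   // placeholder body

    static List<List<string>> PartitionByTokenPosition(List<List<string>> partitions)
        => partitions;                            // placeholder body

    static List<List<string>> PartitionByBijection(List<List<string>> partitions)
        => partitions;                            // placeholder body

    static List<string> ExtractMessageTypes(List<List<string>> partitions)
        => new List<string>();                    // placeholder body

    static void Main()
    {
        var messages = new List<string> { "generating core.2275", "generating core.852" };
        var types = ExtractMessageTypes(
            PartitionByBijection(
                PartitionByTokenPosition(
                    PartitionByEventSize(messages))));
        // "types" would hold one message type description per final partition.
    }
}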

3.3.3 Visualization module

This module involves taking the message types extracted by the Clustering module and

applying visualization techniques to better understand the results. This involves using pie charts

and curves for runtime analysis of operations.

3.4 Requirements

Requirements are categorized into two types – Hardware and Software requirements.


3.4.1 Hardware Requirements

▪ Pentium Dual-Core Processor

▪ 2 GB RAM

▪ 5 GB storage space

3.4.2 Software Requirements

▪ .NET 4.5 SDK and runtime

▪ WPF toolkit (for visualization)

▪ Visual Studio 2013

▪ Windows 8

▪ StarUML (for design)

▪ LTFViewer (for viewing the datasets)


4. SYSTEM DESIGN

4.1 Design

Design can be described with the help of UML diagrams.

4.1.1 Use Case diagram

Figure 4.1: Use case diagram of the system

The only actor involved here is the administrator – who is interested in accessing the message

types so that he can get to the root of the problems in the system.

The use cases involved in Figure 4.1 are:

• Preprocess – calls the pre-processing module to remove irrelevant attributes from the

selected dataset.

• Extract Message Types – calls the clustering module to perform IPLoM on the pre-


processed dataset.

• View Clusters – the extracted message types and messages belonging to each type can be

viewed in a hierarchical manner.

• View Graphs – the administrator can view the most frequently occurring alerts using a pie

chart and also see the run-time performance of the algorithms.

4.1.2 Class Diagram

The class diagram shows the relationships and abstractions involved in mining the log

files. The various classes in the Figure 4.2 are:

• MainWindow – this is the main GUI where the user selects the dataset and starts the

operations.

• Partition – this encompasses the algorithms which preprocess and then cluster the entries

in the dataset. It is invoked in the MainWindow class and hence has an association

relationship.

• TokenPosition – each object of this class stores the tokens occurring in corresponding

token position and calculates cardinality required for the Partition By Token algorithm.

• Pair – objects of this class are also used in Partition By Token algorithm. Each object

stores the token position chosen and the number of unique tokens in that position.

• BijectionPair – objects of this class used in Partition By Search for Bijection. Each object

considers two token positions and stores tokens occurring in those positions and

determines whether there exists a bijective relationship.

• Bijection – This object stores the bijective relationship between two positions (if it exists)

along with the set of unique tokens in each position.

• NonBijection – if there exists no bijective relationship in a partition, then this object is

used to create outlier and place the entries in them.

• MessageType – each object stores final value for each token position after clustering.

• Window1 – This generates the pie charts which shows classification of messages

according to event size and the most frequently occurring alerts.

• Window2 – this window shows each message type and messages belonging to it.


Figure 4.2: Class diagram


4.1.3 Activity Diagrams

Two activity diagrams are shown in Figure 4.3 and Figure 4.4.

Figure 4.3: Activity diagram for Partition by event


Figure 4.4: Activity diagram for Partition by Token operation

Figure 4.3 shows the activity diagram for Partition by event operation. Figure 4.4 shows

the activity diagram for Partition by token operation. The operations are as explained in the

methodology.


4.1.4 Sequence Diagram

Figure 4.5: Sequence diagram

Figure 4.5 shows the sequence diagram. An instance of MainWindow starts the process

by creating the Partition object p, which in turn sequentially invokes operations such as partitionByEventSize(), partitionByTokenPosition(), etc. p then indicates the completion of the clustering. The MainWindow object w then invokes the visualization module and the cluster viewer separately.


4.1.5 Component diagram

Figure 4.6: Component diagram

Figure 4.6 shows the component diagram, which shows interaction between groups of

classes. The initial GUI and event handlers are grouped into MainWindow component. The

Partition component is the group of classes which accomplish the clustering. It consists of

several classes like TokenPosition, Bijection, NonBijection, MessageTypes etc. It realizes the

required interface IPLOM. Visualization component encompasses the set of classes required for

displaying the statistics and charts. The MainWindow has connectors to Partition and

Visualization modules.


5. IMPLEMENTATION

5.1 Methodology

The algorithms involved in the clustering module are illustrated below:

5.1.1 Partition by Event Size

The pre-processed data set is read line by line, and messages with the same event size are grouped into one partition. Thus, several partitions are created. The reasoning behind this is that messages belonging to the same message type have the same number of tokens, as illustrated in Figure 5.1.

Figure 5.1: Illustration of partition by event size, creating 3 partitions for 3 messages

The first step of the partitioning process works on the assumption that log messages that have the

same message type description are likely to have the same event size. For this reason, IPLoM’s

first step uses the event size heuristic to partition the log messages. By partition, we mean

nonoverlapping groupings of the messages. Additional heuristic criteria are used in the remaining

steps to further divide the initial partitions. The partitioning process induces a hierarchy of

maximum depth 4 on the messages and the number of nodes on each level is data dependent.

Consider the cluster description “Connection from *,” which contains three tokens. It can be

intuitively concluded that all the instances of this cluster, e.g., “Connection from

255.255.255.255” and “Connection from 0.0.0.0” would also contain the same number of tokens.

By partitioning our data first by event size, we are taking advantage of the property of most

cluster instances of having the same event size. Therefore, the resultant partitions of this heuristic

are likely to contain the instances of the different clusters, which have the same event size.


Sometimes, it is possible that clusters with events of variable size exist in the event log. Since

IPLoM assumes that messages belonging to the same cluster should have the same number of

tokens or event size, this step of the algorithm would separate such clusters. This does not occur

too often, and variable size message types can still be found by postprocessing IPLoM’s results.

The process of finding variable size message types can be computationally expensive.

Nevertheless, performing this process on the templates produced by IPLoM rather than on the

complete log would require less computation.
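A minimal sketch of this step, assuming whitespace-delimited tokens; the grouping key is simply the token count:

using System;
using System.Collections.Generic;

class EventSizeSketch
{
    // Group messages by their event size (number of whitespace-delimited tokens).
    static Dictionary<int, List<string>> PartitionByEventSize(IEnumerable<string> messages)
    {
        var partitions = new Dictionary<int, List<string>>();
        foreach (var msg in messages)
        {
            int size = msg.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
            if (!partitions.TryGetValue(size, out var bucket))
                partitions[size] = bucket = new List<string>();
            bucket.Add(msg);
        }
        return partitions;
    }

    static void Main()
    {
        var messages = new[]
        {
            "Connection from 255.255.255.255",
            "Connection from 0.0.0.0",
            "Command has been aborted"
        };
        foreach (var kv in PartitionByEventSize(messages))
            Console.WriteLine($"event size {kv.Key}: {kv.Value.Count} message(s)");
        // Both "Connection from *" instances land in the size-3 partition.
    }
}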

5.1.2 Partition By Token Position

At this point, each partition of the log data contains log messages, which are of the same

size and can therefore be viewed as n-tuples, with n being the event size of the log messages in

the partition. This step of the algorithm works on the assumption that the column with the least

number of variables (unique words) is likely to contain words, which are constant in that position

of the message type descriptions that produced them. Our heuristic is therefore to find the token

position with the least number of unique values and further split each partition using the unique

values in this token position, i.e., each resultant partition will contain only one of those unique

values in the token position discovered, as can be seen in the example outlined in Figure 5.2 .

Figure 5.2 : Selecting the token position for partitioning in partition by token position operation.

The algorithm is elaborated in pseudo code below:

Algorithm 2. IPLoM Step 2: Selects the token position with the lowest cardinality and then

separates the lines in the partition based on the unique values in the token position. Backtracks

on partitions with lines that fall below the partition support threshold.

Input: Collection of log file partitions from Step-1.


Real number PST as partition support threshold. {Range for PST is assumed to be between 0 and

1.}

Output: Collection of log file partitions derived at Step-2

1: for every log file partition do {Assume lines in each partition have same event size.}

2: Determine token position P with lowest cardinality with respect to set of unique tokens.

3: Create a partition for each token value in the set of unique tokens that appear in position P.

4: Separate contents of partition based on unique token values in token position P into separate

partitions.

5: end for

6: for each partition derived at Step-2 do {}

7: if PSR < PST then

8: Add lines from partition to Outlier partition

9: end if

10: end for

11: Return() {Output is collection of pruned new partitions}

The memory requirement of unique token counting is a potential concern with the

algorithm. While the problem of unique token counting is not specific to IPLoM, it is believed

that IPLoM has an advantage in this respect. Since IPLoM partitions the database, only the

contents of the partition being handled need be stored in memory. This greatly reduces the

memory requirements of the algorithm. Moreover, other workarounds can be implemented to

further reduce the memory requirements. For example, in this Step 2 of the algorithm, by

determining an upper bound (UB) on the lowest token count in Step 1, we can drastically reduce

the memory requirements of this step, further counts of unique tokens in any token position that

exceeds the upper bound can be eliminated.

Despite the fact that we use the token position with the least number of unique tokens, it

is still possible that some of the values in the token position might actually be variables in the

original message type descriptions. While an error of this type may have little effect on Recall, it

could adversely affect Precision. To mitigate the effects of this error, a partition support ratio


(PSR) for each partition produced could be introduced. The PSR of a partition is the ratio of the number of lines it contains to the number of lines in the original partition from which it was derived. We can then define a partition support ratio

threshold (PST). We group any partition with a PSR that falls below the PST into one partition

(Algorithm 2). The intuition here is that a child partition that is produced using a variable token

value may not have enough lines to exceed a certain percentage (the partition support ratio

threshold) of the log messages in the parent partition. It should be noted that this threshold is not

necessary for the algorithm to function and is only introduced to give the system administrators

the flexibility to influence the partitioning based on expert knowledge they may have and avoid

errors in the partitioning process.

Figure 5.3: Equation for calculating partition support ratio.
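The core of this step can be sketched as below, assuming each partition holds tokenized messages of equal event size; the PST backtracking described above is omitted for brevity:

using System;
using System.Collections.Generic;
using System.Linq;

class TokenPositionSketch
{
    // Split one equal-size partition on the token position with the
    // lowest cardinality (fewest unique tokens).
    static Dictionary<string, List<string[]>> PartitionByTokenPosition(List<string[]> partition)
    {
        int eventSize = partition[0].Length;

        // Token position with the fewest unique values.
        int bestPos = Enumerable.Range(0, eventSize)
            .OrderBy(p => partition.Select(msg => msg[p]).Distinct().Count())
            .First();

        // One child partition per unique token in that position.
        return partition.GroupBy(msg => msg[bestPos])
                        .ToDictionary(g => g.Key, g => g.ToList());
    }

    static void Main()
    {
        var partition = new List<string[]>
        {
            new[] { "Connection", "from", "255.255.255.255" },
            new[] { "Connection", "from", "0.0.0.0" },
            new[] { "Session", "from", "0.0.0.0" }
        };
        foreach (var kv in PartitionByTokenPosition(partition))
            Console.WriteLine($"token '{kv.Key}': {kv.Value.Count} line(s)");
        // Position 2 ("from") has cardinality 1, so all lines stay together.
    }
}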

5.1.3 Partition By Search For Bijection

Consider the messages below as an example log partition.

Command has completed successfully

Command has been aborted

Command has been aborted

Command has been aborted

Command failed on starting.

This partition has event size equal to 4. We need to select two token positions to perform

the search for bijection on. The first token position has one unique token, {Command}. The

second token position has two unique tokens, {has, failed}. The third token position has three

unique tokens, {completed, been, on}. While the fourth token position has three unique tokens,

{successfully, aborted, starting}. We notice in this example that token count 3 appears most

frequently, twice, once in position 3 and once in position 4. The heuristic would therefore select

token positions 3 and 4 in this example. In the third and final partitioning step, we partition by

searching for bijective relationships between the set of unique tokens in two token positions

selected using a heuristic as described in below algorithm.

Algorithm 3. IPLoM Step 3: Selects the two token positions and then separates the lines in the

partition based on the relational mappings of unique values in the token positions. Backtracks on


partitions with lines that fall below the partition support threshold.

Input: Collection of partitions from Step 2. {Partitions of event size 1 or 2 are not processed

here}

Real number CT as cluster goodness threshold. {Range for CT is assumed to be between 0-1.}

Output: Collection of partitions derived at Step-3.

1: for every log file partition do

2: if CGR >= CT then {See (2)}

3: Add partition to collection of output partitions

4: Move to next partition.

5: end if

6: Determine token positions using heuristic as P1 and P2. {Heuristic is explained in the text. We

assume token position P1 occurs before P2.}

7: Determine mappings of unique token values P1 in respect of token values in P2 and vice

versa.

8: if mapping is 1-1 then

9: Create partitions for event lines that meet each 1-1

relationship.

10: else if mapping is 1-M or M-1 then

11: Determine variable state of M side of relationship.

12: if variable state of M side is CONSTANT then

13: Create partitions for event lines that meet relationship.

14: else {variable state of M side is VARIABLE}

15: Create new partitions for unique tokens in M side of the relationship.

16: end if

17: else {mapping is M-M}

18: All lines that meet M-M relationships are placed in one partition.

19: end if

20: end for

21: for each partition derived at Step-3 do {}

22: if PSR < PST then


23: Add lines from partition to Outlier partition

24: end if

25: end for

26: Return() {Output is collection of pruned new partitions}

To summarize the steps of the heuristic, we first determine the number of unique tokens

in each token position of a partition. We then determine the most frequently occurring token

count among all the token positions. This value must be greater than 1. The token count that

occurs most frequently is likely indicative of the number of message types that exist in the

partition. If this is true, then a bijective relationship should exist between the tokens in the token

positions that have this token count. Once the most frequently occurring token count value is

determined, the token positions chosen will be the first two token positions, which have a token

count value equivalent to the most frequently occurring token count. A bijective function is a 1-1

relation that is both injective and surjective. When a bijection exists between two elements in the

sets of tokens, this usually implies that a strong relationship exists between them and log

messages that have these token values in the corresponding token positions are separated into a

new partition.

Figure 5.4: Searching for bijective relationship between two token positions
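The mapping analysis between the two chosen positions can be sketched as below; only the 1-1 test is shown, while the 1-M, M-1, and M-M handling and the PST backtracking are omitted:

using System;
using System.Collections.Generic;

class BijectionSketch
{
    // Test whether the tokens in positions p1 and p2 stand in a 1-1
    // (bijective) relationship across all lines of the partition.
    static bool IsBijection(List<string[]> partition, int p1, int p2)
    {
        var forward = new Dictionary<string, string>();    // p1 token -> p2 token
        var backward = new Dictionary<string, string>();   // p2 token -> p1 token
        foreach (var msg in partition)
        {
            string a = msg[p1], b = msg[p2];
            if (forward.TryGetValue(a, out var fb) && fb != b) return false;   // 1-M
            if (backward.TryGetValue(b, out var ba) && ba != a) return false;  // M-1
            forward[a] = b;
            backward[b] = a;
        }
        return true;
    }

    static void Main()
    {
        var partition = new List<string[]>
        {
            new[] { "Command", "has", "completed", "successfully" },
            new[] { "Command", "has", "been", "aborted" },
            new[] { "Command", "has", "been", "aborted" },
            new[] { "Command", "failed", "on", "starting." }
        };
        // Positions 3 and 4 (indices 2 and 3) are chosen by the heuristic.
        Console.WriteLine(IsBijection(partition, 2, 3));   // True
    }
}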



5.1.4 Extraction of Message Types

In this step of the algorithm, partitioning is complete and we assume that each partition

represents a cluster, i.e., every log message in the partition was produced using the same line

format. A message type description or line format consists of a line of text where constant values

are represented literally and variable values are represented using wildcards. This is done by

counting the number of unique tokens in each token position of a partition. If a token position

has only one value then it is considered a constant value in the line format, while if it is more

than one then it is considered a variable. Since our goal is to find all message types that may

exist in an event log or ensure that the presence of every message type contained in an event log

is reflected in the message types produced, we are not concerned about the occurrence of

“outliers” interfering with the formats produced at this step. Hence, we set the threshold for

determining a variable token position as any token position with more than one unique token.

Figure 5.5: Extracting the message types by generalizing tokens in each position
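A minimal sketch of this final step, assuming a cluster arrives as tokenized lines of equal length:

using System;
using System.Collections.Generic;
using System.Linq;

class ExtractSketch
{
    // Derive the message type description of a cluster: a token position
    // with one unique value stays literal, otherwise it becomes "*".
    static string ExtractMessageType(List<string[]> cluster)
    {
        int eventSize = cluster[0].Length;
        var parts = new string[eventSize];
        for (int p = 0; p < eventSize; p++)
        {
            var unique = cluster.Select(msg => msg[p]).Distinct().ToList();
            parts[p] = unique.Count == 1 ? unique[0] : "*";
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        var cluster = new List<string[]>
        {
            new[] { "generating", "core.2275" },
            new[] { "generating", "core.852" }
        };
        Console.WriteLine(ExtractMessageType(cluster));   // prints "generating *"
    }
}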

5.2 Implementation

5.2.1 Packages

In C#, packages are referred to as namespaces. The default namespace is System. Some

of the namespaces used in this work are:

• System – the default namespace, classes like String

• System.Collections.Generic – data structures like Dictionary, List

• System.Windows.Controls – WPF controls

• System.Windows.Documents

• System.IO – the classes for reading and writing files


• System.Diagnostics – for calculating runtime, StopWatch is used.

• Microsoft.Windows.Controls

• System.ComponentModel – BackgroundWorker is used for memory intensive operations

5.2.2 Classes and Methods

Some of the user-defined classes used here are:

• Partition – this class encompasses the methods required for IPLOM. Some of the methods

are:

◦ preprocess() - removes the irrelevant attributes and only keeps the message part of the

log file

◦ partitionByEvent() - splits the preprocessed log file into several partitions (files) based

on event size.

◦ partitionByToken() - splits the partitions generated in above step into more partitions

based on the heuristic explained already.

◦ partitionByBijection() - splits the partitions generated in above step into more

partitions based on another heuristic.

◦ extractMessageTypes() - each partition thus generated is generalized to derive labels,

which are called message types.

• TokenPosition – each object represents a token position and holds a list of all unique

tokens in that position.

• BijectionPair – each object is used to trace mappings between two token positions and

determine if a bijective relationship exists.

• Bijection – if a bijection exists, then this object stores the properties of that relationship.

• NonBijection – if no bijection exists, then this object stores details of that partition to

indicate it as an outlier.

• Pair – in the partitionByToken() operation, it stores the position found and the cardinality.

• MainWindow – this hosts the GUI which appears when the program launches. It mainly

consists of event handlers and background threads for running the clustering module.


6. RESULTS AND ANALYSIS

6.1 Data Sets

We use 3 log files of supercomputers as datasets to discover the clusters and analyze

them. Table 6.1 shows an overview of the datasets.

Name of Dataset    No. of entries    Size of file
Bluegene/P         4,713,498         ~725 MB
Bluegene/L         1,695,371         ~1 GB
LA HPC-1           433,448           ~32 MB

Table 6.1: Overview of Datasets

6.2 Screenshots

In this section, we'll discuss the clustering process for each data set with screenshots.


6.2.1 Initial GUI

Figure 6.1: GUI of the program

Figure 6.1 shows the initial GUI presented to the user. Any one of the 3 datasets can be

selected and preprocessing is performed. On clicking “Preprocess”, the program starts reading

the file related to the dataset. Each line is split into several strings using the space delimiter. Only the relevant attributes are picked and stored entry-by-entry in a new file.


6.2.2 Preprocessing

Figure 6.2: Snapshot of raw log file

Figure 6.2 shows a snapshot of the HPC raw log file with several attributes. After clicking

on the Preprocess button as shown in Figure 6.1, preprocessing starts. In each HPC log file, each

field in an entry is separated from the next by a comma. The first 6 attributes are removed and only

the message is extracted and written to a new file.
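A minimal sketch of this preprocessing step for the HPC log format, assuming the message is everything after the first six comma-delimited fields (our reading of the description above; the file names are illustrative):

using System;
using System.IO;

class PreprocessSketch
{
    // Keep only the message field of each log entry by dropping the
    // first six comma-delimited attributes.
    static void Preprocess(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Split into at most 7 parts so that commas inside the
                // message itself are preserved.
                var fields = line.Split(new[] { ',' }, 7);
                if (fields.Length == 7)
                    writer.WriteLine(fields[6]);
            }
        }
    }

    static void Main() => Preprocess("hpc.log", "hpc_preprocessed.log");
}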


Figure 6.3: HPC log file after preprocessing

After preprocessing, the dataset changes from Figure 6.2 to Figure 6.3. It shows the

preprocessed HPC log file with only the event description. The file size is reduced from 32 MB

to 12 MB. This also helps improve the runtime for the mining algorithm.


6.2.3 Options for visualization

Figure 6.4: Options for visualization

As shown in Figure 6.4, after extracting the message types, there are options to:

• Partition By Token – runtime analysis – this gives a graph that plots the values of time

taken by the Partition By Token Position operation and its variation with event size.

• Distribution of messages – event size – This shows a pie chart which classifies messages

according to their event sizes.

• Partition By Bijection vs Event Size – this shows a graph that plots the values of time

taken by the operation with respect to event size.

• Most frequently occurring alerts – This is the most important result of the whole process.

It gives a pie chart indicating messages which are alerts and the largest among them.


6.2.4 Bluegene/P

Figure 6.5: Runtime analysis of Partition by Token operation for Bluegene/P

Figure 6.5 shows the variation of runtime of Partition by token operation with respect to

event size for Bluegene/P dataset. The runtime is given in milliseconds along the Y-axis. X-axis

shows the various event sizes. The maximum time is taken by messages with event size 2. When

ranges are considered, event sizes 1-5 consume most of the CPU time. Some event sizes do not occur

in the dataset and hence have 0 as their runtime. The maximum event size is 68.


Figure 6.6: Pie chart of distribution of messages

Figure 6.6 shows a pie chart which gives the distribution of messages in the dataset based

on their event sizes. The statistics are summarized in Table 6.2.

Event size    %        Size
[1-5]         63.78    3006066
[5-10]        19.91    938610
[10-20]       9.36     450607
[20-50]       6.66     287913
>50           0.39     18297

Table 6.2: Distribution of messages – Bluegene/P

As observed in Table 6.2, the majority of the messages are in the [1-5] range and hence consume more CPU time in clustering.


Figure 6.7: Runtime analysis of Partition by bijection operation

Figure 6.7 shows a similar runtime curve for Partition by search for bijection operation

for the Bluegene/P dataset. Again, peaks are observed for event sizes [1-5] and 25, because the majority of messages in the dataset belong to those event sizes. The X-axis

indicates the event sizes and Y-axis indicates the runtimes in milliseconds. For other event sizes,

the runtime is either small or negligible.


Figure 6.8: Most frequently occurring alerts – Bluegene/P

Figure 6.8 shows the most frequently occurring alerts for the Bluegene/P dataset. They can be summarized as in Table 6.3.

Message type                                                                          %        Count
Ciod: Error loading *invalid or missing program image. No such file or directory     27.20%   152735
* total interrupts. * critical input interrupts. * microseconds total spent on
critical input interrupts, * microseconds max time in a critical input interrupt.    24.06%   135092
Program interrupt: * *                                                               10.31%   57784
* TLB error interrupt                                                                8.27%    46416
Instruction cache parity error corrected                                             18.36%   105924
Data storage interrupt                                                               11.31%   63493

Table 6.3: Most frequently occurring alerts for Bluegene/P

By referring to Table 6.3, the system administrator can select the given clusters and

actually analyze the messages under each of the alert clusters.


Figure 6.9: Window showing clusters

Figure 6.9 shows each cluster in the top list and its corresponding elements in the lists below.

This makes it simple for a system administrator to narrow down the message he/she might be

searching for. The first list box shows the list of message types extracted. On double-clicking on

any of the message types, the corresponding messages of the cluster are shown in a text area

below. The text area can display a maximum of 50 messages at a time. If there are more

messages in the cluster, the administrator can view them by clicking on a Next button which is

provided.
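A minimal sketch of such paging using LINQ, assuming the cluster's messages are held in a List<string>; the page size of 50 matches the description above:

using System;
using System.Collections.Generic;
using System.Linq;

class PagingSketch
{
    const int PageSize = 50;

    // Return one page of a cluster's messages for display.
    static IEnumerable<string> GetPage(List<string> messages, int page)
        => messages.Skip(page * PageSize).Take(PageSize);

    static void Main()
    {
        var cluster = Enumerable.Range(1, 120).Select(i => $"message {i}").ToList();
        Console.WriteLine(GetPage(cluster, 2).Count());   // last page holds 20 messages
    }
}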


6.2.5 Bluegene/L dataset

Figure 6.10: Runtime analysis for partition by token for Bluegene/L

Figure 6.10 shows the variation of runtime versus event size for partition by token

operation for Bluegene/L dataset. Unlike the other two datasets, the majority of the time is spent

on messages of event sizes 13 and 60, unlike Bluegene/P, which peaks at event sizes 1-5; this is due to the different message distribution. Again, X-axis represents event

sizes and Y-axis represents time in milliseconds.


Figure 6.11: Distribution of messages with event sizes for Bluegene/L dataset

Figure 6.11 shows the distribution of the messages for Bluegene/L dataset. Most of the

messages lie in the [10-20] size range; otherwise, the dataset is fairly evenly distributed over the other event sizes. Each sector represents one of the pre-defined ranges – [1-5], [5-10], [10-20], [20-50] and >50.

Event size    %        Size
[1-5]         22.83    386995
[5-10]        11.28    191162
[10-20]       48.94    778920
[20-50]       10.7     181434
>50           9.25     156859

Table 6.4: Distribution of messages – Bluegene/L

Table 6.4 shows the fraction of messages present for each of the pre-defined labels. The maximum fraction is seen in the [10-20] range, as is also visible in the chart.


Figure 6.12: Runtime analysis of Partition by bijection for Bluegene/L

Figure 6.12 shows the variation of runtime for Partition by bijection. The peak occurs at

event size 60. The runtime greatly increases with event sizes as more tokens have to be parsed

and stored in TokenPosition objects and more prospective bijective relationships must be

considered. Therefore, runtimes are higher for larger event sizes, especially if the number of messages for that event size is larger than for others. Messages with lower event sizes have fewer prospective bijective relationships that need to be searched; hence the low values.


Figure 6.13: Most frequently occurring alerts for Bluegene/L dataset

Figure 6.13 shows the most frequently occurring alerts for Bluegene/L dataset, with each

alert indicated by a unique colour. Their statistics can be summarized in the table below.

Message type                                                                  %        Count
Instruction cache parity error analysis: Tag bit * * *                        19.21%   85119
L1 * cache parity error has occurred in TAG bit * * *                         13.24%   58671
Receiver errors: Node * * * had * correctable errors in the * direction       21.87%   96900
Single symbol error(s): DDR Controller * failing SDRAM address * BPC pin *
transfer * bit * BPC module pin * compute trace * DRAM chip * DRAM pin *      9.29%    41155
Cache parity error analysis: Tag bit * * *                                    29.52%   130802
* symbol error count. Controller * chipselect 0                               6.88%    30494

Table 6.5: Most frequently occurring alerts for Bluegene/L

Table 6.5 shows the most frequently occurring alerts for Bluegene/L with statistics. Each

of these labels can be searched in the “View Message Types” window with the messages.


6.2.6 LA HPC-1

Figure 6.14: Runtime analysis of Partition By Token for LA HPC-1

Figure 6.14 shows the variation of runtime of the Partition by token operation. The peak is seen at event size 1, and the runtime is negligible for messages with event size > 20. This can be understood from the message distribution for HPC-1: there are very few messages with event size > 50. Compared to the other datasets, this one has the fewest entries, and hence clustering completes much quicker as I/O takes less time. Again, X-axis denotes event sizes and Y-axis denotes

milliseconds.


Figure 6.15: Distribution of messages – LA HPC-1

Figure 6.15 shows the distribution of messages for LA HPC-1. Most of the messages lie in the [1-5] range. More can be inferred from Table 6.6. Messages here come under only 3 labels – [1-5], [5-10] and [10-20]. There are very few messages with event sizes > 20.

Event size    %        Size
[1-5]         83.2     360658
[5-10]        13.98    66610
[10-20]       2.72     11811
[20-50]       0        5
>50           0        10

Table 6.6: Distribution of messages – LA HPC-1

Table 6.6 shows that an overwhelming majority of the messages are in the [1-5] range. This has a major influence on the runtime.


Figure 6.16: Runtime analysis of Partition By Bijection for LA-HPC-1

Figure 6.16 shows the runtime performance of Partition by bijection with respect to event

size. As expected, the peaks occur at [1-5]; for other values, the time is negligible, in fact almost zero. This is expected because the majority of the messages are found within that range. Again,

X-axis denotes the event sizes and Y-axis denotes the runtimes in milliseconds. This operation in

total takes around 8 seconds for the dataset.


Figure 6.17: Most frequently occurring alerts for LA HPC-1

Figure 6.17 shows the most frequently occurring alerts for LA HPC-1. These are summarized in Table 6.7.

Alert description                           %       Count
Linkerror interval expired                  56.26   84894
Link errors remain current                  9.04    13645
Temperature * exceeds warning threshold     0.62    941
Link error on broadcast tree *              0.77    1167
Warning                                     32.96   49741
Psu failure                                 0.34    517

Table 6.7: Most frequently occurring alerts for HPC-1

Table 6.7 lists the most frequently occurring alerts for HPC-1 with their statistics. All of these alerts have event sizes < 5. The alerts can be inspected in a separate window along with the messages belonging to each label.


Figure 6.18: Window to view HPC clusters

Figure 6.18 shows a window with cluster labels; when a user clicks on a label, its corresponding messages appear in the text area below. This allows for more efficient searching. The text area shows a maximum of 50 messages at a time, because the text area control could crash if thousands of messages were rendered at once. If more messages are available, they can be viewed by clicking the Next button provided.
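A minimal sketch of this paging scheme is shown below. It assumes a WinForms form with a TextBox named messageArea, a Button named nextButton and an in-memory list clusterMessages; these are hypothetical names for illustration, not the project's actual identifiers.

// Sketch of the 50-messages-per-page viewer (hypothetical identifiers).
// Assumes: using System; using System.Collections.Generic;
//          using System.Linq; using System.Windows.Forms;
private const int PageSize = 50;
private int currentPage = 0;
private List<string> clusterMessages = new List<string>();

private void ShowPage()
{
    // Slice out at most PageSize messages for the current page.
    var page = clusterMessages.Skip(currentPage * PageSize).Take(PageSize);
    messageArea.Text = string.Join(Environment.NewLine, page);
    // Offer a next page only while messages remain.
    nextButton.Enabled = (currentPage + 1) * PageSize < clusterMessages.Count;
}

private void nextButton_Click(object sender, EventArgs e)
{
    currentPage++;
    ShowPage();
}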


6.2.7 Partitions generated

Figure 6.19: Partitions generated after 1st step

Figure 6.19 shows the partitions (files) generated after the Partition by event size operation. For each event size there exists one file; the image above is for the HPC dataset. Since there are a lot of files, significant time is spent in reading and writing. Files are named using the notation “partition<event size>”. These files are subsequently read by the next operation.


Figure 6.20: Partitions generated during 2nd step

Figure 6.20 shows the partitions generated after the Partition by token position operation. These are larger in number and are spawned from the partitions of the 1st step. The naming notation for these files is “partition<event size>_<token position 1>_<token position 2>”, where the two token positions are derived from the algorithm. This notation simplifies both the programming and general understanding.


Figure 6.21: Partitions generated during 3rd step

Figure 6.21 shows the partitions generated after the Partition by bijection operation; these include outliers, i.e. non-bijective partitions, as well. The naming notation for these files is “partition<event size>_<token position 1>_<token position 2>_<bijection 1>_<bijection 2>_<unique id>”. Bijection 1 and bijection 2 indicate a bijective mapping between the two mentioned positions. An additional string is appended to mark outliers. These files are then used for message type extraction.
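The three notations can be summarized by the helpers below. This is an illustrative sketch with hypothetical parameter names; the project's actual code (see the appendix) builds these names inline through string concatenation.

// Illustrative builders for the three partition-file notations (hypothetical helpers).
static string Step1Name(int eventSize)
{
    return "partition" + eventSize + ".log";                   // e.g. partition12.log
}

static string Step2Name(int eventSize, int pos1, int pos2)
{
    return "partition_" + eventSize + "_" + pos1 + "_" + pos2 + ".log";
}

static string Step3Name(int eventSize, int pos1, int pos2,
                        int bijection1, int bijection2, int id, bool outlier)
{
    string name = "partition_" + eventSize + "_" + pos1 + "_" + pos2 +
                  "_" + bijection1 + "_" + bijection2 + "_" + id;
    if (outlier)
        name += "_outlier";   // extra marker appended for non-bijective partitions
    return name + ".log";
}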


7. CONCLUSIONS & FUTURE WORK

7.1 Conclusions

• Due to the size and complexity of the sources of information used by system administrators in fault management, it has become imperative to find ways to manage these sources automatically.

• Application logs are one such source. This work is based on a novel algorithm for

message type extraction from log files, IPLoM. So far, there is no standard approach to

tackling this problem in the literature.

• Message types are semantic groupings of system log messages. They are important to

system administrators, as they aid their understanding of the contents of log files.

Administrators become familiar with message types over time and through experience.

• This work provides a way of finding these message types automatically. In conjunction

with the other fields in an event (host names, severity), message types can be used for

more detailed analysis of log files.

• Through a 3-step hierarchical partitioning process, IPLoM partitions log data into its respective clusters. In its fourth and final stage, IPLoM produces message type descriptions or line formats for each of the clusters produced (see the sketch after this list).

• IPLoM is able to find message clusters whether or not their instances are frequent. We demonstrate that IPLoM produces cluster descriptions that match human judgement more closely than those of SLCT, Loghound, and Teiresias. Our results show that a specialized algorithm such as IPLoM can significantly improve the abstraction level of the unstructured message types extracted from the data.

• Message types are fundamental units in any application log file. Determining what

message types can be produced by an application accurately and efficiently is therefore a

fundamental step in the automatic analysis of log files.

• Message types, once determined, not only provide groupings for categorizing and summarizing log data, which simplifies further processing steps like visualization or mathematical modeling, but also provide a way of labeling the individual terms (distinct word and position pairs) in the data.
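The four stages can be pictured with the outline below. PartitionByTokenPosition and PartitionByBijection correspond to the methods listed in the appendix, while PartitionByEventSize and ExtractMessageTypes are placeholder names standing in for the remaining stages; this is a sketch of the flow, not an additional implementation.

// Outline of the four IPLoM stages (two method names are placeholders).
void RunIPLoM(string logFile)
{
    PartitionByEventSize(logFile);   // step 1: one partition per token count
    PartitionByTokenPosition();      // step 2: split on the most constant token column
    PartitionByBijection();          // step 3: split on 1:1 token relationships
    ExtractMessageTypes();           // step 4: one template per partition;
                                     //         constant tokens kept, variables become *
}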


7.2 Future work

Future work on IPLoM will involve using the information derived from its results in other automatic log analysis tasks that help with fault management. Work can also be done to optimize the time and space complexity of the algorithms. Furthermore, large log files can be compressed using the message types as abstractions, as sketched below.
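As a rough illustration of the compression idea, and not part of the current implementation, each log line could be replaced by a template id plus its variable tokens. All names here are hypothetical.

// Hypothetical template-based encoding. For example, with the template
// "connection closed by *" stored under id 17, the line
// "connection closed by node-42" would be stored as "17|node-42".
// Assumes: using System.Collections.Generic;
static string Encode(string message, Dictionary<string, int> templateIds)
{
    string[] tokens = message.Split(' ');
    foreach (KeyValuePair<string, int> entry in templateIds)
    {
        string[] pattern = entry.Key.Split(' ');
        if (pattern.Length != tokens.Length)
            continue;                       // template must have the same event size

        List<string> variables = new List<string>();
        bool match = true;
        for (int i = 0; i < pattern.Length && match; i++)
        {
            if (pattern[i] == "*")
                variables.Add(tokens[i]);   // wildcard position: keep the token
            else
                match = pattern[i] == tokens[i];
        }
        if (match)
            return entry.Value + "|" + string.Join(" ", variables.ToArray());
    }
    return message;                         // no template matched: keep the raw line
}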


REFERENCES

[1]. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “A Lightweight Algorithm for Message Type Extraction in System Application Logs,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 11, pp. 1921-1936, November 2012.

[2]. J. Stearley, “Towards Informatic Analysis of Syslogs,” Proc. IEEE Int'l Conf. Cluster Computing, pp. 309-318, 2004.

[3]. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, “Detecting Large-Scale System Problems by Mining Console Logs,” SOSP '09: Proc. ACM SIGOPS 22nd Symp. Operating Systems Principles, pp. 117-132, 2009.

[4]. R. Vaarandi, “A Data Clustering Algorithm for Mining Patterns from Event Logs,” Proc. IEEE Workshop IP Operations and Management, pp. 119-126, 2003.

[5]. S. Goil, H. Nagesh, and A. Choudhary, “MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets,” Technical Report, Northwestern Univ., 1999.

[6]. J. Stearley, “Sisyphus Log Data Mining Toolkit,” http://www.cs.sandia.gov/sisyphus, Jan. 2009.


APPENDIX

Code Snippets:

Partition by token position method:

public void PartitionByTokenPosition()
{
    TokenPosition[] tp = new TokenPosition[1000];
    StreamReader[] sr = new StreamReader[maxEventSize + 1];
    p = new Pair[maxEventSize + 1];
    Stopwatch[] stw = new Stopwatch[100];
    try
    {
        int tokenSplitPosition = -1, numberTokens = -1;

        // Open one reader per non-empty partition produced by the
        // Partition by event size step.
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                sr[i] = new StreamReader(new FileStream(
                    destination + @"\PartitionsByEvent\partition" + i + ".log",
                    FileMode.Open, FileAccess.Read));
                stw[i] = new Stopwatch();
            }
        }

        for (int i = 1; i < 100; i++)
            tp[i] = new TokenPosition();

        // For each partition, count the unique tokens at every position and
        // pick the position with the fewest unique tokens as the split point.
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                stw[i].Start();
                String s;
                int currentMax = i;
                while ((s = sr[i].ReadLine()) != null)
                {
                    if (i == 1)
                    {
                        tp[1].add(s);   // single-token messages: the line is the token
                    }
                    else
                    {
                        String[] splitString = s.Split(' ');
                        for (int j = 0; j < splitString.Length; j++)
                            tp[j + 1].add(splitString[j]);
                    }
                }

                tokenSplitPosition = 1;
                numberTokens = tp[1].getNumberOfTokens();
                for (int k = 2; k <= currentMax; k++)
                {
                    if (numberTokens > tp[k].getNumberOfTokens())
                    {
                        numberTokens = tp[k].getNumberOfTokens();
                        tokenSplitPosition = k;
                    }
                }

                p[i] = new Pair();
                p[i].tokenpos = tokenSplitPosition;
                p[i].tokens = tp[tokenSplitPosition].getTokens();

                for (int j = 1; j <= maxEventSize; j++)
                    tp[j].clear();
                stw[i].Stop();
            }
        }

        // Create one output file per unique token at the split position.
        StreamWriter[,] swr = new StreamWriter[100, 5000];
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                if (i == 1)
                {
                    String[] tok = p[i].tokens;
                    stw[i].Start();
                    for (int j = 0; j < tok.Length; j++)
                    {
                        swr[1, j] = new StreamWriter(new FileStream(
                            destination + @"\PartitionByToken\partition_" + i + "_" + 1 + "_" + (j + 1) + ".log",
                            FileMode.Append, FileAccess.Write));
                        swr[1, j].AutoFlush = true;
                    }
                    stw[i].Stop();
                }
                else
                {
                    String[] tok = p[i].tokens;
                    int pos = p[i].tokenpos;
                    stw[i].Start();
                    for (int j = 0; j < tok.Length && tok[j] != ""; j++)
                    {
                        swr[i, j] = new StreamWriter(new FileStream(
                            destination + @"\PartitionByToken\partition_" + i + "_" + pos + "_" + (j + 1) + ".log",
                            FileMode.Append, FileAccess.Write));
                        swr[i, j].AutoFlush = true;
                    }
                    stw[i].Stop();
                }
            }
        }

        // Re-read each step-1 partition and route every message to the file
        // of the token it carries at the split position.
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                if (i == 1)
                {
                    String[] tok = p[i].tokens;
                    String s;
                    stw[i].Start();
                    StreamReader sre = new StreamReader(new FileStream(
                        destination + @"\PartitionsByEvent\partition1.log",
                        FileMode.Open, FileAccess.Read));
                    while ((s = sre.ReadLine()) != null)
                    {
                        for (int j = 0; j < tok.Length; j++)
                            if (tok[j].Equals(s))
                            {
                                swr[i, j].WriteLine(s);
                                break;
                            }
                    }
                    stw[i].Stop();
                    runtimes1[i] = stw[i].ElapsedMilliseconds;
                }
                else
                {
                    int tokensplitpos = p[i].tokenpos;
                    String[] tokens = p[i].tokens;
                    stw[i].Start();
                    using (StreamReader sr1 = new StreamReader(new FileStream(
                        destination + @"\PartitionsByEvent\partition" + i + ".log",
                        FileMode.Open, FileAccess.Read)))
                    {
                        String s;
                        while ((s = sr1.ReadLine()) != null)
                        {
                            String[] splitString = s.Split(' ');
                            String a = splitString[tokensplitpos - 1];
                            for (int j = 0; j < tokens.Length; j++)
                                if (tokens[j].Equals(a))
                                {
                                    swr[i, j].WriteLine(s);
                                    break;
                                }
                        }
                    }
                    stw[i].Stop();
                    runtimes1[i] = stw[i].ElapsedMilliseconds;
                }
            }
        }

        // Close all writers (null check added: some slots are never opened).
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                    if (swr[i, j] != null)
                        swr[i, j].Close();
            }
        }
    }
    catch (Exception e)
    {
        MessageBox.Show(e.ToString());
    }
}

Partition by bijection method:

public void PartitionByBijection()
{
    Stopwatch[] stw = new Stopwatch[100];
    runtimes2 = new long[100];
    for (int i = 0; i < 100; i++)
        runtimes2[i] = 0;
    try
    {
        StreamReader[,] str = new StreamReader[100, 5000];
        TokenPosition[] tp = new TokenPosition[1000];
        b = new Bijection[100, 5000];
        bp = new BijectionPair[100, 5000];
        NonBijection[,] nb = new NonBijection[100, 5000];

        // Open a reader for every step-2 partition (event sizes >= 3).
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                stw[i] = new Stopwatch();
                stw[i].Start();
                int tokenpos = p[i].tokenpos;
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                {
                    str[i, j] = new StreamReader(new FileStream(
                        destination + @"\PartitionByToken\partition_" + i + "_" + tokenpos + "_" + (j + 1) + ".log",
                        FileMode.Open, FileAccess.Read));
                    b[i, j] = new Bijection();
                    bp[i, j] = new BijectionPair();
                }
                stw[i].Stop();
            }
        }

        for (int i = 1; i < 100; i++)
            tp[i] = new TokenPosition();

        // Decide, for every step-2 partition, whether its two most variable
        // token positions are in a 1:1 (bijective) relationship.
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                stw[i].Start();
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                {
                    String s;
                    while ((s = str[i, j].ReadLine()) != null)
                    {
                        String[] split = s.Split(' ');
                        for (int k = 0; k < split.Length; k++)
                            tp[k + 1].add(split[k]);
                    }

                    int mostFreqCount = getMostFrequentCount(tp, i);
                    bool flag = false;
                    if (mostFreqCount == -1)
                    {
                        // No dominant token count: treat as non-bijective and pick
                        // the first two positions with more than one unique token.
                        int pos1 = 0;
                        b[i, j].setBijection(false);
                        nb[i, j] = new NonBijection();
                        for (int k = 1; k <= i; k++)
                        {
                            if (pos1 == 2)
                            {
                                flag = true;
                                break;
                            }
                            if (tp[k].getNumberOfTokens() > 1)
                            {
                                if (pos1 == 0)
                                    b[i, j].setFirstTokenPosition(k);
                                else
                                    b[i, j].setSecondTokenPosition(k);
                                pos1++;
                            }
                        }
                    }
                    if (flag == false)
                    {
                        int[] pos = getTokenPositions(mostFreqCount, tp, i);
                        b[i, j].setBijection(true);
                        if (pos != null)
                        {
                            b[i, j].setFirstTokenPosition(pos[0]);
                            b[i, j].setSecondTokenPosition(pos[1]);
                            b[i, j].setMostFrequentCount(mostFreqCount);
                        }
                    }
                    for (int l = 1; l <= i; l++)
                        tp[l].clear();
                }
                stw[i].Stop();   // fix: stop once per event size, after all its partitions
            }
        }

        // Record the observed (first, second) token pairs for each partition.
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                stw[i].Start();
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        if (pos1 != 0 && pos1 != pos2)
                        {
                            String s;
                            str[i, j].BaseStream.Position = 0;
                            str[i, j].DiscardBufferedData();   // fix: required after seeking
                            while ((s = str[i, j].ReadLine()) != null)
                            {
                                String[] split = s.Split(' ');
                                bp[i, j].add(split[pos1 - 1], split[pos2 - 1]);
                            }
                        }
                    }
                    else
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        String s;
                        str[i, j].BaseStream.Position = 0;
                        str[i, j].DiscardBufferedData();
                        while ((s = str[i, j].ReadLine()) != null)
                        {
                            String[] split = s.Split(' ');
                            nb[i, j].add(split[pos1 - 1], split[pos2 - 1]);
                        }
                    }
                }
                stw[i].Stop();
            }
        }

        // Create one output file (and a lookup key) per bijective token pair.
        StreamWriter[] swtr = new StreamWriter[10000];
        dict = new Dictionary<String, int>();
        int e = 0;
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                int pos = p[i].tokenpos;
                stw[i].Start();
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        Dictionary<String, String> dt = bp[i, j].getBijections();
                        Dictionary<String, bool> dtb = bp[i, j].getOutliers();
                        foreach (KeyValuePair<String, bool> kvp in dtb)
                        {
                            if (kvp.Value)
                            {
                                String s1 = kvp.Key;
                                q = s1;
                                String s2;
                                dt.TryGetValue(s1, out s2);
                                if (s2 != null)
                                {
                                    swtr[e] = new StreamWriter(new FileStream(
                                        destination + @"\PartitionByBijection\partition_" + i + "_" + pos + "_" + pos1 + "_" + pos2 + "_" + e + ".log",
                                        FileMode.Append, FileAccess.Write));
                                    swtr[e].AutoFlush = true;
                                    if (!dict.ContainsKey(s1 + s2 + pos1 + pos2 + pos + i))
                                        dict.Add(s1 + s2 + pos1 + pos2 + pos + i, e);
                                    e++;
                                }
                            }
                        }
                    }
                }
                stw[i].Stop();
            }
        }

        // Route each line either to its bijective pair's file or, for
        // non-bijective relations, to a separate outlier file.
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                stw[i].Start();
                int pos = p[i].tokenpos;
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        if (pos1 > 0 && pos2 > pos1)
                        {
                            str[i, j].BaseStream.Position = 0;
                            str[i, j].DiscardBufferedData();
                            Dictionary<String, String> dt = bp[i, j].getBijections();
                            Dictionary<String, bool> dtb = bp[i, j].getOutliers();
                            String s;
                            while ((s = str[i, j].ReadLine()) != null)
                            {
                                String[] split = s.Split(' ');
                                bool w = false;
                                String s1 = split[pos1 - 1];
                                String s2 = split[pos2 - 1];
                                q = s1 + "_" + s2 + "_" + i + "_" + pos + "_" + pos1 + "_" + pos2;
                                if (dtb.ContainsKey(s1))
                                    dtb.TryGetValue(s1, out w);
                                if (w)
                                {
                                    int c;
                                    dict.TryGetValue(s1 + s2 + pos1 + pos2 + pos + i, out c);
                                    swtr[c].WriteLine(s);
                                }
                            }
                        }
                    }
                    else
                    {
                        String s;
                        str[i, j].BaseStream.Position = 0;
                        str[i, j].DiscardBufferedData();
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        while ((s = str[i, j].ReadLine()) != null)
                        {
                            String[] split = s.Split(' ');
                            String first = split[pos1 - 1];
                            String second = split[pos2 - 1];
                            Dictionary<String, List<String>> dt = nb[i, j].getMappings();
                            List<String> l;
                            dt.TryGetValue(first, out l);
                            if (l == null)
                                continue;   // fix: guard against unmapped tokens
                            foreach (String k in l)
                            {
                                if (second.Equals(k))
                                {
                                    using (StreamWriter sw = new StreamWriter(new FileStream(
                                        destination + @"\PartitionByBijection\partition_" + i + "_" + j + "_" + pos + "_" + pos1 + "_" + pos2 + "_outlier_" + first.Substring(0, 1) + "_" + k.Substring(0, 1) + ".log",
                                        FileMode.Append, FileAccess.Write)))
                                        sw.WriteLine(s);
                                    break;
                                }
                            }
                        }
                    }
                }
                stw[i].Stop();
            }
        }

        for (int i = 3; i <= maxEventSize; i++)
            if (events[i])
                runtimes2[i] = stw[i].ElapsedMilliseconds;

        for (int i = 0; i < e; i++)
            swtr[i].Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
        MessageBox.Show(q);
    }
}
