Università degli Studi di Firenze
Dipartimento di Matematica e Informatica
Engine fault diagnosis based on novelty detection
Author:
Leonardo Torres Hansa
Director:
Fabio Schoen, Ph.D.
Florence, February 2014
Contents

1 Introduction 2
2 Fault prevention in industrial machinery 3
3 Novelty detection 5
  3.1 Applications 7
  3.2 Novelty detection techniques 9
4 ARMA modelling for time series 10
  4.1 Stationary stochastic processes 11
  4.2 ARMA models 13
  4.3 Automatic modelling 15
    4.3.1 Identification of the model 15
    4.3.2 Estimation of the parameters 16
5 Support Vector Machines 18
6 Resolution proposed 24
  6.1 Preparation of the data 25
  6.2 Implementation in R 26
7 Results 38
  7.1 Cross-validation 38
  7.2 Unsupervised problem 40
8 Conclusions 42
R codes 44
References 50
1 Introduction
In industrial machinery, it is essential to be able to predict, over a short time horizon, whether an engine will fail within that horizon, so that engineers can react and fix any issue that has been detected.
For decades, engineers have been concerned with automatic failure detection in engines. Vibration analysis, in particular, has played the leading role in this field [27], [34], [33]. However, a definitive procedure is not yet established [20].
For the purpose of real-time analysis of the engines, a new device has been developed. It works as follows: connected to the tank containing the oil, it extracts data from it. Specifically, we have worked with 15 channels, which provide us with the quantities of different chemical elements dissolved in the oil: iron, chromium, nickel, molybdenum, aluminium, lead, copper, magnesium, phosphorus, zinc, boron, silicon and sodium.
Every minute the device transmits all these quantities online. The goal of this project is to analyse the information provided by the device and decide whether the engine under surveillance is in good condition. If it is not, the engineers should proceed to change the oil or the engine, as they see fit.
Previous work on engine fault diagnosis shows that Support Vector Machines (SVMs) are widely used in this field; examples can be found in [1] or [20]. However, these works deal with data that is, to some degree, different from ours (e.g., vibration, sound, voltage, temperature). Moreover, from a more general point of view, this is a novelty detection problem. In that framework too, SVMs are a frequent choice for classification, as can be read in [6], [12], [35], [3], [23], [29] or [15]. The intuition for why one-class SVMs (OC-SVMs) are a good approach to anomalous trajectory detection lies in the nature of the procedure. The sequences associated with each channel are first transformed into fixed-dimension feature vectors. Then the training data are clustered using the OC-SVM. This is how a hypervolume
in the feature space containing the normal trajectories is detected. Identifying anomalous states of an engine is then just a matter of checking whether a new piece of the sequence falls outside the computed hypervolume.
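Although our implementation (section 6) is written in R, the pipeline just described can be sketched schematically. The following Python sketch is only an illustration: it replaces the OC-SVM boundary with a simple centroid-plus-radius region, and uses the window mean and standard deviation as stand-in features (the actual project uses ARMA parameters); both substitutions are assumptions of this example, not the thesis's method.

```python
import math
import random

def features(window):
    """Map a fixed-length window of a channel sequence to a feature vector.
    Mean and standard deviation are a simple stand-in for the ARMA-parameter
    features adopted later in the report."""
    n = len(window)
    mean = sum(v for v in window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    return (mean, math.sqrt(var))

def fit_boundary(train_windows):
    """Stand-in for the OC-SVM training phase: enclose the training feature
    vectors in the smallest ball centred at their centroid."""
    feats = [features(w) for w in train_windows]
    m = len(feats)
    centroid = (sum(f[0] for f in feats) / m, sum(f[1] for f in feats) / m)
    radius = max(math.dist(f, centroid) for f in feats)
    return centroid, radius

def is_novel(window, centroid, radius):
    """Testing phase: flag a window whose features fall outside the region."""
    return math.dist(features(window), centroid) > radius

# Simulated "normal" windows around level 10 with small noise.
rng = random.Random(42)
normal = [[10.0 + rng.gauss(0.0, 0.5) for _ in range(20)] for _ in range(50)]
centroid, radius = fit_boundary(normal)
print(is_novel([13.0] * 20, centroid, radius))  # a clearly shifted window
```

The structure (feature extraction, one-class training phase, membership test) is the one described above; only the boundary-fitting step differs from a real OC-SVM.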
Accordingly, although our problem belongs to motor fault diagnosis, we have opted for a more general novelty detection approach. Concretely, following some of the steps in [7], the input to the SVM has been the parameters of a fitted ARMA model.
The rest of the report is structured as follows. Section 2 reviews some examples of engine diagnostics. Since we have chosen to frame the problem within novelty detection, section 3 presents some of its approaches. We then explain the particular mathematical techniques required: section 4 covers the theory behind ARMA modelling, the first tool we implemented, and section 5 focuses on the theory behind One-Class SVMs, the main technique used for our forecasting. Every detail of the procedure used during the project is explained in section 6, and its results are presented in section 7. We end the report with some conclusions, an appendix with the R code employed, and the consulted bibliography.
2 Fault prevention in industrial machinery
In [2] the authors introduce an engine diagnosis method based on Dempster-Shafer evidence theory. They treat engine diagnostics as a multi-sensor fusion problem, in which each channel is a piece of evidence, so that the main goal is learning how to use all these pieces as a whole in order to explain the condition of the engine. The authors thus aim to acquire reliable information about potential faults by incorporating complementary sensors. In general, this consists of extracting features from multiple sensors and deciding which scheme should be used to represent them. However, several complementary sensors can provide conflicting data, and the quality of the decisions can therefore decrease. The challenge is how to detect conflicts among the sensors and how to fuse their decisions into one coherent decision; evidence theory is used to address this issue.
We see the idea of multi-sensor data fusion again in [1]. This paper proposes a hybrid method combining Support Vector Machines and Short-Time Fourier Transform (STFT) techniques. In a first phase, the signal obtained from the different channels is separated by the STFT according to its frequency level and amplitude. Next, system faults are modelled as changes in the sensor gain, with magnitude given by a non-linear function of the measurable output and input signals. An adaptive time-based observer is proposed in order to monitor the system for unanticipated sensor failures. A fusion block then combines the sensor data. Finally, the information provided by the previous two steps, together with the fused data, is used to train the SVM classifier for fault-predictive system modelling.
In [26] we find another example of a data fusion strategy. The paper proposes a decision fusion system for fault diagnosis which integrates data sources from different types of sensors and the decisions of multiple classifiers. First, non-commensurate sensor data sets are combined using relativity theory. The channels they use are two types of vibration and three of current, and these data must be fused. Usually, the classes assigned to a vector differ across classifiers trained on the same data set; using relativity theory, the authors remark that the outputs can also change for different data sets classified by the same classifier. The generated decision vectors are then selected based on a correlation measure of the classifiers, in order to find an optimal sequence for classifier fusion, one that leads to the best fusion performance. As a result, an optimal team of classifiers, containing class information from both the vibration and the current signals, is formed to improve classification accuracy. Finally, a multi-agent classifier fusion algorithm is employed as the core of the whole fault diagnosis system. The efficiency of the proposed system was demonstrated through fault diagnosis of induction motors.
Vibration analysis again takes the leading role in fault diagnostics in [20]. In this paper, the authors focus on bearing faults, arguing that they are the most frequent type of fault in induction engines. The procedure followed begins by simulating faults with a Machinery Fault Simulator (MFS), a tool for simulating various types of induction motor faults, initially fitted with a healthy motor and a motor with faulted
bearings of the same specification. After the data acquisition, performed for both the healthy motor and the motor with faulted bearings under the same running conditions, they proceeded with the signal processing, carried out by the Continuous Wavelet Transform (CWT). The methods applied for the diagnosis of faults are support vector machines and artificial neural networks, whose classification results are finally compared.
3 Novelty detection
The problem of anomaly or novelty detection has been studied within diverse research areas and application domains. Many anomaly detection techniques have been developed specifically for certain application domains, while others are more generic. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behaviour [5].
The applications of novelty detection range over music segmentation [15], speech recognition [22], intrusion detection in computer networks [6], detection of forbidden U-turn trajectories on urban roads [29], epileptic seizure prediction [7] and electrocardiogram time series [12].
The importance of novelty detection lies in the fact that it allows anomalies to be prevented in a wide variety of systems, so that proper measures can be taken soon enough.
We can define novelties as patterns in data that do not conform to a well-defined notion of normal behaviour. Let us illustrate this idea. Figure 1 shows a simple two-dimensional example. The data has two normal regions, N1 and N2, since most observations lie in these two regions. Points which are sufficiently far away from these regions, namely points o1 and o2, and the points in region O3, are anomalies.
Anomalies or novelties might appear in the data for a variety of reasons, such as malicious activity (e.g., credit card fraud, cyber-intrusion, terrorist activity) or the breakdown of a system, but all of these reasons have a common characteristic: they are interesting for the analysis.
Figure 1: A simple example of anomalies in a two-dimensional data set

In abstract terms, the main novelty detection approach consists of defining a region representing normal behaviour and declaring any observation in the data which does not belong to this normal region an anomaly. What challenges arise within such a naïve approach?
First of all, a normal region encompassing every possible normal behaviour has to be defined, a task which may not be easy. In addition, the boundary between expected and anomalous behaviour is often imprecise, so an anomalous observation lying close to the boundary can actually be normal, and vice versa. Note that, when an anomaly is the result of malicious actions, the malicious adversaries may adapt themselves to make the anomalous observations appear normal. Bear in mind also that expected behaviour may keep evolving, so the current notion of expected behaviour might not be sufficiently representative in the future. Take into account, as well, that labelled data for training and validating mathematical models are not always available, which radically complicates the evaluation of the analysis. Finally, the data may contain noise that tends to resemble the actual anomalies and is hence difficult to distinguish and remove.
Because of the difficulty of the task, most existing anomaly detection techniques solve a specific formulation of the problem. The formulation is induced by various factors, such as the nature of the data, the availability of labelled data or the type of anomalies to be detected. Researchers have adopted concepts from diverse disciplines such as statistics, machine learning, data mining, information theory and spectral theory, and have applied them to specific problem formulations.
Some of the aspects that have to be considered when dealing with novelty detection are the nature of the data, the data labels (if any), the desired format of the output and the type of anomalies being looked for. For the latter, we can give some definitions:
Point anomalies. If an individual data instance can be considered anomalous with respect to the rest of the data, the instance is termed a point anomaly.

Contextual anomalies. If a data instance is anomalous in a specific context, but not otherwise, it is termed a contextual or conditional anomaly. E.g., a temperature of 0 °C might be expected in winter in Paris but, in summer, it should be considered an anomaly.

Collective anomalies. If a collection of related data instances is anomalous with respect to the entire data set, it is termed a collective anomaly. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous.
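The contextual case can be made concrete with the temperature example above. In the following Python sketch the per-season ranges are invented purely for illustration; they are not real climatological values.

```python
# Illustrative values only: these per-season "expected" ranges are invented
# for the example, not taken from any real climatology.
EXPECTED_RANGE = {          # context -> (low, high) in degrees Celsius
    "winter": (-10.0, 12.0),
    "summer": (10.0, 38.0),
}

def is_contextual_anomaly(temp_c, season):
    """A reading is a contextual anomaly if it is unusual *for its context*,
    even though the same value may be perfectly normal in another context."""
    low, high = EXPECTED_RANGE[season]
    return not (low <= temp_c <= high)

print(is_contextual_anomaly(0.0, "winter"))  # False: 0 °C is expected in winter
print(is_contextual_anomaly(0.0, "summer"))  # True: same value, anomalous context
```

The same value is normal or anomalous depending only on its context, which is exactly what distinguishes a contextual anomaly from a point anomaly.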
3.1 Applications
A way of dividing the fields in which anomaly detection can be utilised is to consider the following properties of each particular framework:
• The notion of anomaly or novelty
• Nature of the data
• Challenges of the anomaly detection
• Existing techniques for solving the problem
Let us see some examples which will help to illustrate this.
In intrusion detection, the goal is the detection of malicious activity in a computer-related system. These malicious activities, or intrusions, are interesting from a computer security perspective. The key challenge for anomaly detection in this domain is the huge volume of data, which typically arrives as a stream and therefore requires online analysis. Another issue arising from the large input size is the false alarm rate.
Fraud detection refers to the detection of criminal activities occurring in commercial organisations such as banks, credit card companies, insurance agencies, cell phone companies or the stock market. The malicious users might be actual customers of the organisation or might be posing as customers. Fraud occurs when these users consume the resources provided by the organisation in an unauthorised way. The organisations are interested in immediate detection of such fraud to prevent economic losses; the term activity monitoring is used for this general approach to fraud detection. The typical approach of anomaly detection techniques is to maintain a usage profile for each customer and monitor the profiles to detect any deviations. Some examples of fraud detection problems are credit card fraud, mobile phone fraud, insurance claim fraud and insider trading fraud (the use of non-public inside information in stock markets).
Anomaly detection in the medical and public health domains typically works with patient records. The data can contain anomalies for several reasons, such as an abnormal patient condition, instrumentation errors or recording errors. Several techniques have also focused on detecting disease outbreaks in a specific area. Novelty detection is thus a very critical problem in this domain and requires a high degree of accuracy. The data typically consist of records with several different types of features, such as patient age, blood group or weight, and might also have temporal as well as spatial aspects. Most current novelty detection techniques in this domain aim at detecting anomalous records (point anomalies). Typically the labelled data belong to healthy patients, hence most techniques adopt a semi-supervised approach. Another form of data handled by anomaly detection techniques in this domain is time series data, such as electrocardiograms (ECG) and electroencephalograms (EEG) [7]. The
most challenging aspect of the anomaly detection problem in this domain is the cost of a classification error, which is usually very high.
Industrial units suffer damage due to continuous usage and normal wear and tear. Such damage needs to be detected early to prevent further escalation and losses. The data in this domain are usually referred to as sensor data, because they are recorded by different sensors and collected for analysis. Novelty detection techniques have been extensively applied in this domain to detect such damage. Industrial damage detection can be further classified into two areas: one dealing with defects in mechanical components, such as motors and engines, and the other dealing with defects in physical structures. The former is also referred to as system health management, and it is the field we are most interested in for this project.
3.2 Novelty detection techniques
We can consider novelty detection as a branch of classification problems (it is not the only approach, but it is the one most related to this project). Classification is used to learn a model, or classifier, from a set of data instances (the training phase). These data instances can be labelled (supervised classification) or unlabelled (unsupervised). A test instance is then classified into one of the classes using the learnt model (the testing phase). Classification-based novelty detection techniques operate with the same two-phase method. Classification problems can be grouped into two main categories: multi-class and one-class classification. For this project, we are interested in the latter.
One-class classification based novelty detection techniques assume that all training instances have the same class label; that is, instances either belong to one single class or they do not. Such techniques learn a discriminative boundary around the normal instances using a one-class classification algorithm, e.g., one-class SVMs. Any test instance that does not fall within the learnt boundary is declared anomalous (a novelty or outlier).
As examples of frequent classification-based novelty detection techniques we can mention neural networks, Bayesian networks and support vector machines.
4 ARMA modelling for time series
Definition 1. We define a time series as a sequence of N observations, taken from one or several variables, which are chronologically ordered and equidistant in time.

When dealing with univariate time series, i.e., time series built from a single variable, we denote them as sequences of the form

{x_t}_{t=1}^{N},

where N is the length (or size) of the time series and, for a given instant t (1 ≤ t ≤ N), x_t is the observation measured at t. All N observations can be written as a column vector x = (x_1, ..., x_N)^T.
On the other hand, we can also work with multivariate time series, which can be represented as

{x_t}_{t=1}^{N}.

The vector x_t = (x_{t1}, x_{t2}, ..., x_{tM})^T is the observation taken at instant t, t ∈ {1, ..., N}, and N is the length of the sequence. All N observations may be represented by an N × M matrix:

        ⎡ x_11  x_12  · · ·  x_1M ⎤
    X = ⎢ x_21  x_22  · · ·  x_2M ⎥    (4.1)
        ⎢  ⋮     ⋮     ⋱      ⋮   ⎥
        ⎣ x_N1  x_N2  · · ·  x_NM ⎦

with rows x_1^T, x_2^T, ..., x_N^T, where x_{tj} is the observation of variable j at instant t, for all t = 1, ..., N and j = 1, ..., M.
The main task we are interested in is building a mathematical model which helps explain these observations and identifies a pattern in order to forecast. The starting point when elaborating a model for a time series is to consider the sequence as a particular finite realisation of a stochastic process.
Definition 2. We define a stochastic process as a sequence of random variables, chronologically ordered and equidistant in time. A stochastic process may be univariate or multivariate.

Formally, it is a mapping

X : Ω × T −→ S
    (ω, t) ↦ X(ω, t)

S is called the space of states, with S ⊂ Z+ or S ⊂ R. T represents discrete time (T = {0, 1, 2, ...}) or continuous time (T = [0, ∞)).
Univariate stochastic processes will be represented as {X_t}, t = 0, ±1, ±2, ..., where X_t is a random variable referring to the measurable quantity observed by the stochastic process at instant t.

When dealing with multivariate processes, we denote them {X_t}, t = 0, ±1, ±2, ..., where X_t = (X_{t1}, X_{t2}, ..., X_{tM})^T (M ≥ 2) is a random vector referring to an observation of the system at t.
From now on, we focus on univariate stochastic processes, since they are the ones we have worked with.

A stochastic process is not completely described unless we know the distribution functions

F(X_{t1}, X_{t2}, ..., X_{tN})   for all t_1, t_2, ..., t_N ∈ Z, N ∈ N.

Such a goal cannot be achieved unless we assume some relaxations.
4.1 Stationary stochastic processes
Definition 3. We say that a stochastic process {X_t} is strictly stationary when, for every n ≥ 1 instants t_1 < t_2 < ... < t_n from its history, the joint probability distribution of (X_{t1}, ..., X_{tn})^T and that of (X_{t1+h}, ..., X_{tn+h})^T, for all h = ±1, ±2, ..., are the same.

A stochastic process {X_t} such that E[X_t] < ∞ for all t = 0, ±1, ±2, ... is said to be first-order weak-sense stationary when E[X_t] is constant for all t = 0, ±1, ±2, ...
A stochastic process {X_t} such that E[X_t²] < ∞ for all t = 0, ±1, ±2, ... is said to be second-order weak-sense stationary when:

• E[X_t] and Var[X_t] are constant for all t = 0, ±1, ±2, ...

• Cov[X_t, X_{t+k}] depends at most on k ∈ Z and is independent of t.
We say that a stochastic process {X_t} is Gaussian when, for every n ≥ 1 instants t_1 < t_2 < ... < t_n from its history, the joint probability distribution of (X_{t1}, X_{t2}, ..., X_{tn})^T is an n-variate Normal distribution.

Unless we state otherwise, second-order weak-sense stationarity will suffice for our work.

Recall that the mean of the process is denoted µ_X = E[X_t] and the variance σ²_X = Var[X_t] = E[(X_t − µ_X)²].
Definition 4. Given a stationary process {X_t}, its k-order autocovariance (k > 0) is

γ_k = Cov[X_t, X_{t+k}] = E[(X_t − µ_X)(X_{t+k} − µ_X)].

Remark 1. Notice that the k-order autocovariance γ_k does not depend on t.
Definition 5. Given a stationary process {X_t}, its k-order simple autocorrelation (k > 0) is defined as

ρ_k = Cov[X_t, X_{t+k}] / (√Var[X_t] √Var[X_{t+k}]) = γ_k / γ_0    (4.2)

If we consider the sequence {ρ_k : k = 1, 2, ...} as a function of k, we call it the simple autocorrelation function (ACF).
Definition 6. The k-order partial autocorrelation (k > 0) of a stationary process {X_t} is represented as φ_kk and is defined by the regression

X̃_t = φ_k1 X̃_{t−1} + φ_k2 X̃_{t−2} + ... + φ_kk X̃_{t−k} + U_t    (4.3)

where X̃_{t−i} = X_{t−i} − µ_X (i = 0, 1, ..., k) and U_t is independent of X̃_{t−i} for all i ≥ 1.
4.2 ARMA models
Definition 7. A stationary stochastic process {X_t} admits an autoregressive moving average model of order (p, q) (ARMA(p, q)) when

X_t = µ + φ_1 X_{t−1} + φ_2 X_{t−2} + ... + φ_p X_{t−p} + a_t − θ_1 a_{t−1} − θ_2 a_{t−2} − ... − θ_q a_{t−q}    (4.4)

for every t = 0, ±1, ±2, ..., where a_t ∼ IID(0, σ²_a) and µ, φ_1, φ_2, ..., φ_p, θ_1, θ_2, ..., θ_q are such that every root of

1 − φ_1 x − φ_2 x² − ... − φ_p x^p = 0    (4.5)

lies outside the unit circle. This is called the stationarity condition.
Definition 8. An ARMA(p, q) model given by (4.4) is invertible when every root of

1 − θ_1 x − θ_2 x² − ... − θ_q x^q = 0    (4.6)

lies outside the unit circle (invertibility condition).
We also define the lag operator (denoted B or L) as

B X_t = X_{t−1},   B^d X_t = X_{t−d}   (d ≥ 2 integer)    (4.7)

where X_t may be a random or real variable referring to an instant t.
We can then rewrite expression (4.4) as

φ(B) X_t = µ + θ(B) a_t,    (4.8)

where

φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p    (4.9)

is the autoregressive polynomial of the model and

θ(B) = 1 − θ_1 B − θ_2 B² − ... − θ_q B^q    (4.10)

is the moving average polynomial of the model.
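As an illustration, a realisation of the recursion (4.4) can be simulated directly by drawing Gaussian innovations. The project's own code is in R; the following is only a Python sketch, with the burn-in length and parameter values chosen arbitrarily for the example.

```python
import random

def simulate_arma(phi, theta, mu=0.0, sigma=1.0, n=500, burn=100, seed=0):
    """Generate a realisation of the ARMA recursion
        X_t = mu + sum_i phi_i X_{t-i} + a_t - sum_j theta_j a_{t-j},
    with a_t ~ N(0, sigma^2). The first `burn` values are discarded so the
    start-up transient does not contaminate the sample."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    x, a = [], []
    for _ in range(n + burn):
        a_t = rng.gauss(0.0, sigma)
        x_t = mu + a_t
        x_t += sum(phi[i] * x[-1 - i] for i in range(min(p, len(x))))
        x_t -= sum(theta[j] * a[-1 - j] for j in range(min(q, len(a))))
        a.append(a_t)
        x.append(x_t)
    return x[burn:]

# ARMA(1,1) with phi_1 = 0.6 and theta_1 = 0.3 (stationary and invertible,
# since both roots, 1/0.6 and 1/0.3, lie outside the unit circle).
series = simulate_arma([0.6], [0.3], n=1000)
print(len(series))  # 1000
```

Simulated series like this one are useful for checking an estimation procedure against known parameter values before applying it to real channel data.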
Stationarity and invertibility When a stationary process {X_t} admits an ARMA(p, q) model written as in (4.8), its unconditional expected value µ_X can be obtained as follows:

E[φ(B) X_t] = µ + E[θ(B) a_t]   (the last term is 0)
(1 − φ_1 − φ_2 − ... − φ_p) E[X_t] = µ
E[X_t] = µ / (1 − φ_1 − φ_2 − ... − φ_p)
µ_X = µ / φ(1)

where φ(1) is the value of the autoregressive polynomial at B = 1. Therefore, (4.8) can be rewritten as

φ(B)(X_t − µ_X) = θ(B) a_t,   or equivalently   φ(B) X̃_t = θ(B) a_t,    (4.11)

where X̃_t = X_t − E[X_t] = X_t − µ_X for every t = 0, ±1, ±2, ...
Theorem (Wold's Theorem). The stationarity condition (4.5) guarantees that the coefficients ψ_0, ψ_1, ψ_2, ... of the infinite-order polynomial

ψ(B) = θ(B)/φ(B) = 1 + ψ_1 B + ψ_2 B² + ... = Σ_{i=0}^{∞} ψ_i B^i   (ψ_0 = 1)    (4.12)

satisfy Σ_{i=0}^{∞} |ψ_i| < ∞, which is a sufficient condition for X̃_t = ψ(B) a_t to be a stationary process.
Theorem. The invertibility condition (4.6) guarantees that the coefficients π_1, π_2, ... of the infinite-order polynomial

π(B) = φ(B)/θ(B) = 1 − π_1 B − π_2 B² − ... = −Σ_{i=0}^{∞} π_i B^i   (π_0 = −1)    (4.13)

satisfy Σ_{i=0}^{∞} |π_i| < ∞, so that when we write (4.11) as π(B) X̃_t = a_t, X̃_t is a stationary process such that

φ_kk −→ 0 as k → ∞
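The ψ-weights of (4.12) can be computed numerically by equating coefficients in φ(B)ψ(B) = θ(B). The following Python sketch implements that recursion; it is an illustration, not part of the thesis's R code.

```python
def psi_weights(phi, theta, k):
    """First k+1 coefficients psi_0..psi_k of psi(B) = theta(B)/phi(B),
    with phi(B) = 1 - sum_i phi_i B^i and theta(B) = 1 - sum_j theta_j B^j.
    Matching coefficients of B^j in phi(B) psi(B) = theta(B) gives
        psi_j = sum_{i=1..min(j,p)} phi_i psi_{j-i} - theta_j   (j >= 1),
    with psi_0 = 1 and theta_j = 0 for j > q."""
    p, q = len(phi), len(theta)
    psi = [1.0]
    for j in range(1, k + 1):
        val = sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, p) + 1))
        if j <= q:
            val -= theta[j - 1]
        psi.append(val)
    return psi

# Pure AR(1): the expansion collapses to psi_j = phi^j.
print(psi_weights([0.5], [], 3))  # [1.0, 0.5, 0.25, 0.125]
```

For a stationary model the computed weights decay, which is the absolute summability the theorem asserts.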
4.3 Automatic modelling
A common obstacle when dealing with ARIMA models is that the order selection process may be considered subjective and difficult to apply. But it does not have to be: there have been several attempts to automate ARIMA modelling over the last 25 years.
In [16], a method to identify the order of an ARMA model for stationary series is proposed. The idea is that the innovations can be obtained by fitting a long autoregressive model to the data, after which the likelihood of potential models is computed via a series of standard regressions. The asymptotic properties of the procedure were established under very general conditions. Years later, an extension of this automatic identification procedure was implemented in the software TRAMO and SEATS [14]. For a given series, the algorithm attempts to find the model with the minimum BIC.
4.3.1 Identification of the model
A non-seasonal ARIMA(p,d,q) process is given by
φ(B)Xt = µ+ θ(B)at, (4.14)
where at is a white noise process with mean zero and variance σ2, B is the lag
operator, and φ(z) and θ(z) are polynomials of order p and q respectively. To
ensure causality and invertibility, it is assumed that φ(z) and θ(z) have no roots
for |z| < 1, as seen in the previous section.
The main task in automatic ARIMA forecasting is selecting an appropriate model order, namely p, q and d. If d is known, the orders p and q can be selected via an information criterion such as the AIC:

AIC = −2 log(L) + 2(p + q + k)    (4.15)

where k = 1 if µ ≠ 0 and 0 otherwise, and L is the maximised likelihood of the model fitted to the differenced data

(1 − B)^d x_t.

The likelihood of the full model for x_t is not actually defined, so the values of the AIC for different levels of differencing d are not comparable. One solution to this
difficulty is presented in [11] and is implemented in the arima() function in R [30], which underlies the main function we have used in this project, auto.arima(). In this approach, the initial values of the time series (before the observed values) are assumed to have mean zero and a large variance. However, choosing d by minimising the AIC under this approach tends to lead to over-differencing.
An alternative approach to choosing d is the use of unit-root tests. Most unit-root tests are based on a null hypothesis that a unit root exists, which biases results towards more differences rather than fewer [17]. For example, variations of the Dickey-Fuller test [10] assume there is a unit root at lag 1. Instead, in [17] unit-root tests based on a null hypothesis of no unit root are preferred. For non-seasonal data, ARIMA(p, d, q) models are considered where d is selected via successive KPSS unit-root tests [21]. Briefly: the data are tested for a unit root; if the test result is significant, the differenced data are tested for a unit root; and so on. The procedure stops when a first insignificant result is obtained.
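The control flow of this successive-testing scheme can be sketched as follows. The stationarity test is left pluggable: in the setting above it would be a KPSS test, while the crude lag-1 autocorrelation check used in the demo below is only a stand-in invented for this Python illustration.

```python
def select_d(x, has_unit_root, d_max=2):
    """Skeleton of the successive unit-root testing scheme: difference the
    series until the test no longer indicates a unit root (or d_max is hit).
    `has_unit_root` is any predicate implementing the test; in the thesis
    setting this would be a KPSS test."""
    d = 0
    while d < d_max and has_unit_root(x):
        x = [b - a for a, b in zip(x, x[1:])]  # apply (1 - B) once
        d += 1
    return d

def crude_test(x):
    """Stand-in predicate for the demo only: treat a very strong lag-1
    sample autocorrelation as evidence of a unit root."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x) or 1.0
    r1 = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / denom
    return r1 > 0.9

trend = [0.5 * t for t in range(200)]  # linear trend: one difference removes it
print(select_d(trend, crude_test))     # 1
```

Swapping `crude_test` for a proper KPSS implementation reproduces the procedure described in the text without changing the surrounding loop.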
Once d is selected, the next step is the selection of the values of p and q by
minimizing the AIC, as mentioned above.
4.3.2 Estimation of the parameters
We proceed now with the estimation of the parameters µ, φ_i (i = 1, ..., p) and θ_i (i = 1, ..., q). Let us denote the sample autocorrelation function as

r_k = Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄) / Σ_{t=1}^{n} (x_t − x̄)²,    (4.16)

which provides us with an estimate of the true autocorrelation ρ_k.
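Formula (4.16) translates directly into code. A small Python sketch (the project itself uses R, where acf() plays this role):

```python
def sample_acf(x, k):
    """Sample autocorrelation r_k as in (4.16):
    r_k = sum_{t=1}^{n-k} (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

print(sample_acf([1, 2, 3, 4, 5], 1))  # 0.4
```

Note that the denominator always sums over all n terms, so r_0 = 1 by construction.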
For estimating the mean factor (or drift) µ we can take

x̄ = (1/n) Σ_{t=1}^{n} x_t
Let us also calculate the variance of this estimator when the process is stationary:

Var(x̄) = (γ_0/n) Σ_{k=−n+1}^{n−1} (1 − |k|/n) ρ_k    (4.17)
        = (γ_0/n) [1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρ_k]    (4.18)

The variance is inversely proportional to the sample size n.
For the estimation of the rest of the parameters, we will focus on the least
squares method, which is the one implemented in the R tools that we have used
[13].
Least squares estimation The AR(1) model

x_t − µ = φ(x_{t−1} − µ) + a_t

can be viewed as a regression with independent variable x_{t−1} and dependent variable x_t. Thus, we can estimate φ by minimising the sum of squares

S*(φ, µ) = Σ_{t=2}^{n} [(x_t − µ) − φ(x_{t−1} − µ)]².    (4.19)
We will call this function S* the conditional sum of squares function. Minimising it we get

µ̂ = (Σ_{t=2}^{n} x_t − φ Σ_{t=2}^{n} x_{t−1}) / ((n − 1)(1 − φ)),    (4.20)

which for large values of the sample size n can be approximated as

µ̂ ≈ (x̄ − φ x̄) / (1 − φ) = x̄.    (4.21)

We finally substitute x̄ for µ in (4.19) and obtain, again for large n,

φ̂ = Σ_{t=2}^{n} (x_t − x̄)(x_{t−1} − x̄) / Σ_{t=2}^{n} (x_{t−1} − x̄)²    (4.22)
For AR(p) we again have

µ̂ = x̄
The estimators of φ_1, ..., φ_p are given by the Yule-Walker equations. For example, for an AR(2) model we should solve

r_1 = φ_1 + r_1 φ_2    (4.23)
r_2 = r_1 φ_1 + φ_2    (4.24)

in order to obtain φ̂_1 and φ̂_2.
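For the AR(2) case, the system (4.23)-(4.24) can be solved in closed form. A small Python sketch:

```python
def yule_walker_ar2(r1, r2):
    """Solve the AR(2) Yule-Walker system
        r1 = phi1 + r1*phi2
        r2 = r1*phi1 + phi2
    in closed form for (phi1, phi2)."""
    phi2 = (r2 - r1 ** 2) / (1 - r1 ** 2)
    phi1 = r1 * (1 - r2) / (1 - r1 ** 2)
    return phi1, phi2

phi1, phi2 = yule_walker_ar2(0.5, 0.4)
print(phi1, phi2)  # phi1 ≈ 0.4, phi2 ≈ 0.2
```

Substituting the solution back into the two equations confirms it satisfies the system exactly.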
Estimating the parameters of an MA(q) model gives more difficulty and forces us to resort to numerical techniques to estimate θ_1, θ_2, ..., θ_q.

Let us try to estimate the parameter θ_1 of an MA(1) model

x_t = a_t − θ_1 a_{t−1}

Recall that if this model is invertible, then it can be rewritten as an AR(∞) model

x_t = a_t − θ_1 x_{t−1} − θ_1² x_{t−2} − ...
We have reached a regression model again. However, in this case we are dealing with an infinite-order regression that is non-linear in θ_1. We will not be able to minimise the sum of squares function

S*(θ_1) = Σ a_t²

analytically, so we will have to use numerical methods, as already mentioned.
In general, for computing the parameters of MA(q) and ARMA(p, q) models, we have to use numerical techniques to solve the sum of squares equations.
5 Support Vector Machines
Given a set of observations drawn from a certain distribution, we may be interested in finding a region which best fits this set, so that if a new piece of data lies in this region we can say (with some probability) that it follows the same distribution as the others, and if it does not, we can say that it does not.
Support Vector Machines (SVMs) comprise several mathematical techniques whose final purpose is the classification or regression of data. They are based on statistical learning theory and risk minimisation and were initially proposed by Vapnik et al. [35] for linear problems. Kernel methods [3] were later used to extend SVM training to nonlinear problems.
We have been especially interested in One-Class SVMs. For this kind of SVM, the data used for training are considered to belong to a single probability distribution. The target is to find its support, so that outliers and anomalies will be discarded, which makes the OC-SVM a proper tool for novelty detection.
The basis of SVMs arises from statistical learning theory applied to linear classification. Let us consider two-class hyperplane classifiers in some dot product space H,

(w · x) + b = 0,  w, x ∈ H,  b ∈ R  (5.1)

with an associated decision function f : Rn → {±1} (where n is the dimension in which the data are expressed),

f(x) = sgn((w · x) + b).
This situation is represented in figure 2. We define the margin as the distance, along the direction of w, between the two hyperplanes that are parallel to the classification plane and pass through the points of each class nearest to it (figure 2). The objective is finding the optimal classifier, in the sense that it has maximum margin.

In order to express this as an optimization problem, let us denote the class of the observations which belong to the original distribution by y = 1 and the ones which do not by y = −1. Recall that we have been provided with a training data set (xi, yi) ∈ Rn × {±1}, i = 1, . . . , m, where m is the number of training data. Now we can write the optimization problem as
Figure 2: Separation of two classes by classification hyperplanes. The optimal classification function is found by maximizing the distance, in the direction of w, between the two dashed hyperplanes, which is called the margin.
min_w  (1/2)‖w‖²
subject to  yi((w · xi) + b) ≥ 1,  i = 1, . . . , m

where yi ∈ {−1, +1} is the label of each training element. Lagrange multipliers αi ≥ 0 can help solve the problem, leading to the dual optimization problem
max_α  W(α) = ∑i αi − (1/2) ∑i,j αiαjyiyj(xi · xj)
subject to  αi ≥ 0, i = 1, . . . , m,  ∑i αiyi = 0  (5.2)
which can be solved by the KKT method. Now the decision function is redefined as

f(x) = sgn(∑i yiαi(x · xi) + b)  (5.3)

where the αi are solutions of (5.2) and b can be calculated knowing that, for any xi lying on the margin borders, the following equation must be satisfied:

yi((w · xi) + b) − 1 = 0.
These points lying on the margin borders are called support vectors. The αi associated with them take nonzero values. Therefore the final solution (5.3) is defined only in terms of a small subset of the training data.
The main drawbacks of this approach are:
• It only solves linear problems.
• A dot product is required in the space where the data are defined.
In order to overcome both drawbacks, a map Φ : X → H from the nonempty set of the original input data X to a dot product space H should be defined. We will call H the feature space, and the problem will have a linear solution in it.
Let us notice that, in the dual problem (5.2) and in the decision function (5.3), the only operations performed in H are dot products. If we have a formula for dot products in H, there is no need to do any explicit computation in H when training the model. This formula for the dot product is called a kernel and is defined as

k(x, x′) = Φ(x) · Φ(x′),  x, x′ ∈ X.  (5.4)
With this notation, the decision function in (5.3) can be rewritten as

f(x) = sgn(∑i yiαik(x, xi) + b)  (5.5)
where α is a solution of the optimization problem

max_α  W(α) = ∑i αi − (1/2) ∑i,j αiαjyiyjk(xi, xj)
subject to  αi ≥ 0, i = 1, . . . , m,  ∑i αiyi = 0.  (5.6)
Recall that we are working with an unsupervised problem, i.e., unlabelled data, so before applying this approach to our case a proper transformation is necessary. Let P be an unknown probability distribution for a set of unlabelled measurements. OC-SVMs can be used to estimate an appropriate region in the space X which contains the majority of the data drawn from P, leaving outliers outside the region, if possible.
In the SVM framework, this idea works as follows. We wish to maximize the distance from the decision hyperplane in the feature space H to the origin, while a small fraction of the data, the outliers, falls between the hyperplane and the origin. In terms of a minimization problem we can write

min_{w,ξ,b}  (1/2)‖w‖² + (1/(νn)) ∑i ξi − b
subject to  (w · Φ(xi)) ≥ b − ξi,  ξi ≥ 0  (5.7)

where xi ∈ X (i = 1, . . . , n) are n training observations in the data space X, Φ : X → H is the function mapping the vectors xi into the feature space H, and (w · Φ(x)) − b = 0 is the decision hyperplane in H. Outliers are linearly penalized by the slack variables ξi, weighted by the parameter ν ∈ (0, 1]. Applying again the Lagrange multiplier method, the problem (5.7) becomes:
min_α  (1/2) ∑i,j αiαjk(xi, xj)
subject to  0 ≤ αi ≤ 1/(νn),  ∑i αi = 1.  (5.8)
Quadratic programming techniques can be used to solve this problem, thus obtaining the values αi. Following the notation from (5.7), w is given by

w = ∑i αiΦ(xi)  (5.9)

and for every Φ(xi) such that αi ≠ 0, the following equations are satisfied:

b − ξi = (w · Φ(xi)) = ∑j αjk(xi, xj)  (5.10)

with ξi > 0 for outliers and ξi = 0 for support vectors lying on the decision plane. We can finally define the decision function in the data space X as
f(x) = sgn((w · Φ(x)) − b) = sgn(∑i αik(x, xi) − b).  (5.11)
Recall that for every xi lying strictly within the support region we have αi = 0. Thus a large number of training vectors do not contribute to the definition of the decision function (5.11). For the rest of the vectors, namely the support vectors, two options may occur:
• If the vector xi lies on the decision hyperplane, αi will be such that 0 < αi < 1/(νn).
• If the vector xi is an outlier, αi = 1/(νn).
Let us now analyze, intuitively, the meaning of the parameter ν, which is going to play a main role when tuning the number of outliers we are willing to accept. From the objective function in (5.7), it can be seen that

ν → 0  ⟹  1/(νn) → ∞,

that is, the outliers' penalization factor grows to infinity: no outliers would be allowed.
On the other hand, the case ν = 1 allows a single solution. Because of the constraints in (5.8), we would have αi = 1/n for all i, its maximum value, so that all vectors would be identified as outliers.
In general, we can say that ν is:
• an upper bound on the fraction of margin errors;
• a lower bound on the fraction of support vectors relative to the total number of training examples.
A proof for these properties is shown in [31].
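These bounds can be observed numerically with the same e1071 one-class SVM that is employed later in section 6.2. The toy data below (a standard-normal cloud) are an assumption for the illustration, not the thesis data: the fraction of training points flagged as outliers grows with ν.

```r
# Effect of nu on a one-class SVM: a larger nu flags a larger
# fraction of the training data as outliers.
library(e1071)
set.seed(3)
train <- matrix(rnorm(200 * 2), ncol = 2)
frac <- sapply(c(0.05, 0.5), function(nu) {
  m <- svm(train, y = NULL, type = "one-classification",
           kernel = "radial", nu = nu)
  mean(!predict(m, train))   # fraction of training points outside the region
})
frac
```

The second fraction comes out clearly larger than the first, in line with ν acting as an upper bound on the fraction of margin errors.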
6 Resolution proposed
As we said in section 1, we have decided to implement an OC-SVM in order to decide whether a given engine is in proper condition or not. The input we have used is not the information extracted directly from the oil of the engine, but the parameters of an ARMA model fitted to the observation of several states of the engine over time. These states provide us with several time series, which have been modelled.
Specifically, we have worked with the fifteen different channels mentioned in section 1. Some of them have been plotted in figure 3.
Figure 3: Plots of the evolution of the quantities of some of the elements we
have dealt with.
These 2305-node sequences are synthetic data extrapolated from real analyses. The online analysis tool is not yet available, so this is the data set we have worked with.
Though we have dealt with an unsupervised problem, we have followed the idea developed in [7]. We have divided all the data into different time windows. All the time windows have the same length, l, and they overlap, so that when a new piece of data arrives from the device, we use it together with the l − 1 nodes before it in order to decide whether the engine is still working correctly. For each channel and each time window, an ARMA model has been fitted. Its parameters will be the input for the SVM: one time window will be one observation in the data set used to train and validate the SVM.
Let us remark that l-node time windows imply that, in order to decide whether the engine is in good or bad condition at a given moment t, we will use information from the last l nodes, namely instants. If we collect information every minute, only information from the last l minutes is used. Since this is a real-time problem, we have had to take windows short enough so that the classification can be done in a short period of time, but long enough to take advantage of as much past information as possible. Thus, we have worked with l = 200 and l = 150, since they gave the best forecasting results, as checked by cross-validating the data as explained below.
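The per-window feature extraction just described can be sketched for a single channel as follows. This is only an illustration: the series is simulated, and a shorter length and l = 100 (rather than the l = 150 or 200 used in the text) are chosen to keep the example fast; the full version, for all channels and with error diagnostics, is listed in the R codes appendix.

```r
# Fit an ARMA model on each overlapped window of one simulated channel
# and keep phi1 and theta1 (0 when the component is absent).
library(forecast)
set.seed(4)
series <- as.numeric(arima.sim(model = list(ar = 0.4, ma = 0.3), n = 130))
l <- 100
nw <- length(series) - l               # number of overlapped windows
feats <- t(sapply(1:nw, function(i) {
  fit <- auto.arima(ts(series[i:(i + l - 1)]), seasonal = FALSE)
  cf <- coef(fit)
  c(phi1   = if ("ar1" %in% names(cf)) unname(cf["ar1"]) else 0,
    theta1 = if ("ma1" %in% names(cf)) unname(cf["ma1"]) else 0)
}))
dim(feats)                             # one row per window: (phi1, theta1)
```

Each row of `feats` is one observation for the SVM, exactly as one time window is one observation in the data set described above.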
6.1 Preparation of the data
We have worked with a data matrix whose rows represent the state of the engine
in a given time window.
The input data for the SVM consist of the parameters of the ARMA models for these channels for each time window (2305 − l windows in total). Though the order of the model may reach the third level or higher, we have had to decide how many parameters will be used as input. Our first consideration was working with the parameters up to the second order, which are usually the most significant in ARMA modelling [17]. We would then have had, for each channel, a parameter for the AR(1) component, another one for the AR(2) component, and likewise for MA(1), MA(2) and the intercept; i.e., five columns per channel (5 × 15 = 75 independent variables). Since we are dealing with a linear model, whenever one of these parameters is not included, we take it as 0 in the model. However, as we explain below, we have finally decided to work only with the parameters φ1 and θ1, namely the first AR and MA components. Therefore, we have worked with 30 variables, an important reduction of the dimension of the problem which, as we will see, has still allowed a proper classification.
From the information provided by the engineers, we know that the first 700 observations, or the first 700 − l time windows, belong to correct states of the engine used to take the data, so these have been used for training the model. Since we know that there are no novelties or anomalies in this set, setting up one of the parameters of the SVM is simple, as we will explain in section 6.2.
6.2 Implementation in R
We have implemented the whole project in R. In particular, the ARMA modelling has been developed with the function auto.arima, provided by the package ‘forecast’, and the SVM-based classification with the function svm, given by the package ‘e1071’. We now give a few details about the use of these functions.
ARMA in R
The main input needed by the auto.arima function is a vector defined as a time series in R. Specifically, we have one time series per channel and per time window. All these sequences have been modelled with auto.arima, which estimates the parameters with the least squares method, minimizing the residual sum of squares, as we saw in section 4.3.2 [13]. We have specified the non-seasonal behaviour of the time series: for this project, no seasonal behaviour has been taken into account. Let us bear in mind that the data are extracted every minute, so it seems reasonable to consider no seasonality in the model.
Though no extra input has been necessary for the function, the final result, namely the object created by the R function auto.arima, is important. Not only does it give us the parameters of the ARMA model, which will be the input for the SVM, but it also allows us to calculate the residuals.
The mean of the residuals, for each channel and each window, is around 5%, so the fit that has been obtained seems acceptable. We have tested other models built from longer time windows, up to 200 nodes per time window, and the mean of the residuals is usually around 5% as well. So we conclude that taking time windows longer than the ones from our first considerations does not improve the fitted values.
The goal of the SVM has been the detection of anomalies in the parameters of these ARMA models. The idea extracted from [7] is that novelties in the state of the engines will be properly identified by means of the parameters φ1, φ2, θ1, θ2, . . . Figures 4–10 show how these parameters evolve over time.
In the first place, the time windows between, approximately, the 800th and the 1000th stand out among the rest, since the variance of the associated parameters seems to decrease in this period. This can also be observed, to a lesser extent, in the period around the 300th time window. The fact that these changes occur for several channels suggests that ARMA modelling, as a tool for representing these time series, is a consistent method. It is also observable that, for the last time windows, each parameter φ or θ seems to keep its sign, positive or negative; namely, the modelling seems to be very stable in this period.
Figure 4: Evolution of the parameters φ1 and θ1. (a) Iron; (b) Chromium.
Figure 5: Evolution of the parameters φ1 and θ1. (a) Nickel; (b) Molybdenum.
Figure 6: Evolution of the parameters φ1 and θ1. (a) Aluminium; (b) Lead.
Figure 7: Evolution of the parameters φ1 and θ1. (a) Copper; (b) Silver.
Figure 8: Evolution of the parameters φ1 and θ1. (a) Calcium; (b) Magnesium.
Figure 9: Evolution of the parameters φ1 and θ1. (a) Phosphorus; (b) Zinc.
Figure 10: Evolution of the parameters φ1 and θ1. (a) Boron; (b) Silicon.
Figure 11: Relation between φ1 and θ1. Four examples
We can also take a brief look at the relations between some of these independent variables, on which the SVM bases its decision about whether the conditions of the engine are good or bad. In figures 11–13 we can see, for a given channel, relations between some of the parameters of the ARMA models.
Let us remark on the number of cases for which the parameters φi or θj are zero, for some i and some j, an event that takes place when no AR or MA components are computed. It is also worth mentioning the relation shown in figures 11 and 13, where the compared parameters seem to follow a linear trend. Visually, it is noticeable that some points do not follow this regression, which may be related to the appearance of anomalies.
Figure 12: Relation between φ2 and θ2. Four examples
Figure 13: Relation between φ1 and θ2. Four examples
We finally present some plots which can help in understanding the distribution of the independent variables φ1, θ1, . . . Recall that an ARMA(p,q) model will have p + q + 1 parameters. If p = 0, the ARMA model will have no AR part, and it can be written as an MA(q) model. However, in our datamart this case is represented as if there were AR components φi = 0 (i = 1, . . . , p); we have an analogous case when q = 0. Therefore, for plotting the histograms (figures 14 and 15) we have eliminated the cases where φi = 0 or θj = 0, for any i ∈ {1, . . . , p} and any j ∈ {1, . . . , q}.
From the distribution of each parameter, we can emphasize the general idea that they have neither heavy tails nor a very narrow peak, features that would be related to a high kurtosis. The skewness does not seem high either, but visually we cannot say that we are dealing with symmetric distributions. In any case, we should be aware of the great number of changes in the values of the parameters, even though, according to the light tails, these changes are not large.
Figure 14: Histograms of the φ1 variable. (a) Lead; (b) Copper; (c) Silver; (d) Calcium.
Figure 15: Histograms of the θ1 variable. (a) Lead; (b) Copper; (c) Silver; (d) Calcium.
SVM in R
In general terms, we have worked with the R function svm, providing the training data set and the type of SVM we wanted to use, namely one-class classification.
For this type of SVM, the most used kernel function, defined generally in (5.4), is the radial-basis one, as the authors of [29] explain, mainly because of the role of the Euclidean distance, which gives an intuitive interpretation to the function. Recall that a radial-basis kernel function is defined as

k(x, y) = exp(−γ‖x − y‖²).  (6.1)
Since this is the kernel function we have used, we explain how to choose the parameters ν and γ. The parameter γ is defined as

γ = 1/(2σ²),

where σ represents the scale factor at which the data should be clustered [29]. Though a standard way of estimating γ consists of setting it to the inverse of the number of dimensions [25], cross-validating the data is a common method for performing the estimation [6], [9], [32], [19]. The difficulty we had to face was that the unsupervised nature of the problem would not allow us to carry out the cross-validation. To solve the issue, we decided to use those first 700 observations, which were known to be good, for the cross-validation. Thus, we were able to tune γ and finally apply the implemented model.
This cross-validation has been done as follows. From the 700 observations classified as correct, we have randomly chosen 70%. With this set, we have trained an OC-SVM model and have tested it on the other 30% of the observations. We have proceeded in this way for several candidate values of γ: 0.01, 0.02, 0.03, 0.04, 0.05, . . . , 0.1. However, we considered that the number of independent variables was too high in relation to the number of observations used for the cross-validation. Because of this, we decided to keep just the most significant variables, φ1 and θ1 for each channel, as mentioned above. Actually, including more independent variables in the data set led us to low-quality results. For this reason, we have finally decided to work only with those 30 independent variables.
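The tuning loop just described can be sketched as follows. The real input is the datamart of ARMA parameters for the first 700 known-good windows; here a random placeholder matrix of the same shape (700 rows, 30 columns) stands in for it, so the accuracies obtained are illustrative only and do not match table 1.

```r
# Gamma tuning: train a one-class SVM on a random 70% of the
# known-good observations, measure accuracy on the remaining 30%.
library(e1071)
set.seed(5)
datamart <- as.data.frame(matrix(rnorm(700 * 30), ncol = 30))  # placeholder data
sel <- sample(1:nrow(datamart), floor(0.7 * nrow(datamart)))
train <- datamart[sel, ]
valid <- datamart[-sel, ]
acc <- sapply(seq(0.01, 0.1, by = 0.01), function(g) {
  m <- svm(x = train, y = NULL, type = "one-classification",
           kernel = "radial", nu = 0.1, gamma = g)
  mean(predict(m, valid))   # fraction of validation windows accepted as good
})
round(acc, 3)
```

The γ value with the best validation accuracy (subject to the overfitting caveat discussed in section 7.1) is the one retained for the final model.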
Under this new setting, the optimal estimate for γ has been 0.03.
This cross-validation has also been used to study the optimal length of the time windows. We have been able to check that short time windows, like 50 or 100 nodes long (in time, 50 or 100 minutes), though quite useful in real-time terms, did not give good enough percentages of correct SVM classification (recall that the error of the ARMA models was not high). Therefore, these lengths have been dismissed; instead, 150-node and 200-node time windows have been considered.
The other parameter associated with the OC-SVM model is ν. We have to remember that 0 ≤ ν ≤ 1, as well as its interpretation, explained in section 5: it is both an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors relative to the total number of training examples. Recall that in the training data set, the first 700 nodes, there is no anomaly. This allows us to take ν as low as we wish; however, recall that it cannot be 0, as seen in section 5. There is also one consideration worth taking into account. It is true that, for training, we have no outliers or novelties, but this will not hold in the future, not even in the rest of the data used to check the forecasts of the model. Taking too low a value of ν may lead us to overfitting. To avoid this, we have chosen to set ν = 0.1, so that the type of error in which an engine in bad condition is classified as a good case will be penalized.
7 Results
7.1 Cross-validation
Regarding the cross-validation carried out for setting the γ parameter, we can make some comments about the results. In table 1 we present the percentages of correctly classified observations in the training and validation phases of the procedure, for the 200-node-per-window case (ν = 0.1).
γ       0.01    0.02    0.03    0.04    0.05   0.06   0.1
T (%)   89.428  89.428  88.86   89.143  82.86  78.57  61.42
V (%)   94.0    92.67   90.67   89.3    80.0   73.33  45.33

Table 1: Training and validation percentages for the cross-validation phase. 200-node time windows, ν = 0.1.

We have considered that those high percentages in the validation phase may be due to overfitting, and that is why we have preferred not to take γ < 0.03. With more comparisons, we see that γ > 0.04 does not lead to good results. Table 2 shows the 150-node-per-window, 100-node-per-window and 50-node-per-window cases, respectively (ν = 0.1).
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
150  T (%)   90.389  90.129  89.35   86.49   84.15   76.36   64.93
     V (%)   93.3    92.12   87.87   77.57   65.45   58.18   32.12
100  T (%)   90.23   89.28   89.047  89.5    87.38   82.14   72.62
     V (%)   74.4    68.3    58.89   43.3    34.4    28.33   8.89
50   T (%)   90.088  90.528  89.20   89.867  85.9    83.25   71.36
     V (%)   60.204  56.12   48.98   34.69   24.489  15.816  1.02

Table 2: Training and validation percentages for the cross-validation phase. 150-node, 100-node and 50-node time windows, ν = 0.1.
We can see that in some cases the accuracy is especially low, down to 1.02%. This means that the model classifies a great number of instances as bad cases. Let us remark that this happens not only when the length of the time windows decreases, which can be related to the use of a smaller quantity of information, but also as γ increases. According to what we explained in section 5, when γ increases, the final decision of the model, given by the function f in (5.11), will always be the same, since the kernel function (6.1) will tend to 0.
7.2 Unsupervised problem
We can finally present some of the results achieved with the final model, the one which should be put into practice.
As we have already mentioned in section 6.1, the training data for this model have been the first 700 − l time windows, where l is the window length in nodes, 150 or 200. The rest of them have been classified with the model obtained by SVM techniques. Again, the parameters were set to ν = 0.1 and γ = 0.03.
Because of the unsupervised nature of the problem, there is no way of evaluating whether the forecasts obtained with the SVM are right or wrong. What we have done is compare predictions from different SVMs (table 3), whose training data sets have been built from 150-node and 200-node time windows.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
150  T (%)   89.818  90.545  90.364  89.273  86.182  82.0    64.545
     V (%)   76.698  73.084  67.414  58.442  47.289  38.131  16.635
200  T (%)   90.2    90.0    89.6    87.8    85.8    80.0    63.0
     V (%)   78.816  76.137  71.153  61.184  50.093  39.065  14.143

Table 3: Percentages of observations classified as good for several γ values. 150-node and 200-node time window cases, ν = 0.1.
We have been able to check that our first considered model, the one associated with 150-node time windows, is the one for which more observations are classified as bad cases (if γ = 0.03). Specifically, 67.414% of the instances have been classified as good. This leads us to think that in this way we will minimize the type I error, i.e., minimize the number of false positive classifications. Namely, if the model classifies a time window as good, there will be more certainty that it is actually good than if the model had been built with longer time windows. This approach seems more reliable than the opposite one, which would allow a higher number of false positive cases, since the latter could lead to situations in which an engine is classified as correct when it is not, so that the engineers would not be prepared for the eventual failure.
Again, as explained in section 7.1, the higher γ is, the lower the studied percentage is.
On the other hand, the percentage of time windows classified as good for the 200-node-per-window case (γ = 0.03) has been 71.153%. So, as we have already mentioned, if we prefer to be prudent, in order to prevent classifying bad engines as good ones, choosing 150-node time windows will solve the issue.
As a curiosity, we present in table 4 the percentages analogous to the previous ones, now associated with the problems with 50-node and 100-node time windows.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
50   T (%)   90.0    88.93   88.15   89.076  86.307  84.615  73.846
     V (%)   67.103  61.931  56.137  47.227  37.321  29.034  10.218
100  T (%)   89.67   89.67   89.167  88.3    88.0    86.167  73.17
     V (%)   64.424  61.059  53.769  44.611  35.264  26.67   8.972

Table 4: Percentages of observations classified as good for several γ values. 50-node and 100-node time window cases, ν = 0.1.
It is true that these cases could be more convenient, since less past information is needed, but as we commented in section 7.1, the quality of the forecasting decreases dangerously.
Last but not least, another way of minimizing the type I error is redefining the parameters of the SVM-based model. Recall the interpretation of ν given in section 5: raising it forces the model to flag a larger fraction of the data as outliers. We know that we have no anomaly in the training set, but we can take a higher ν, say 0.2, in order to force the model to classify more time windows as bad ones.
When working with 150-node time windows, the percentage of time windows classified as correct is then 55.327%. For the 200-node case, the percentage is 59.813%. As expected, more instances are classified as incorrect, so this can also be a proper approach if the priority is minimizing the type I error.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
50   T (%)   80.0    80.39   80.46   80.92   80.77   81.08   74.62
     V (%)   58.505  55.887  51.028  43.239  36.14   28.41   10.27
100  T (%)   79.83   79.83   79.67   79.67   79.33   80.33   72.167
     V (%)   55.264  51.028  45.856  40.124  33.208  26.04   8.97
150  T (%)   80.0    79.82   80.0    78.55   77.818  78.727  68.727
     V (%)   59.127  58.131  55.327  51.339  45.109  38.006  16.63
200  T (%)   80.0    79.4    79.8    80.6    80.0    77.8    65.4
     V (%)   66.85   64.299  59.813  55.26   48.598  39.065  14.14

Table 5: Percentages of observations classified as good for several γ values. 50-node, 100-node, 150-node and 200-node time window cases, ν = 0.2.
8 Conclusions
In this project, a new approach for engine fault detection has been presented. From a kind of data not very common in this setting, we have proposed a mathematical model able to forecast the decay of the machinery, by developing a method for dealing with the data based on several novelty detection techniques, some of them applied not only in engine fault diagnosis but also in more general situations.
Let us remember that ARMA modelling has been central to the preparation of the data and has led us to good approximations of it. This methodology has provided us with the proper values to use as input for the SVM-based classification model.
Because of the unsupervised environment, namely the absence of knowledge about whether the observations used as examples were cases of correct or wrong functioning, we have not been able to assess the problem in terms of sensitivity and specificity. In order to address this limitation, we have decided to prioritize the reduction of false positives, namely the observations classified as correct when they belong to bad functioning.
If new, already classified data are obtained, so that the problem becomes supervised, the reliability of the model will increase, since we will have techniques to check more precisely the behaviour of the classifications, say with ROC curves or lift charts. Moreover, a larger data base, in number of observations, could let us work with more variables extracted from the parameters of the ARMA models, which could possibly reveal new aspects not relevant in this small data base.
Until the arrival of labelled data, the introduced model can help us handle the classification, taking advantage of knowing how to tune its parameters depending on the importance that we wish to give to the type I error.
R codes
Function used for building the time windows
chunks <- function(data, overlapped, nodes, nc, tam) {
  # chunks() divides the time horizon into different pieces of length tam.
  # overlapped = TRUE indicates that the pieces will be overlapped.
  if (overlapped) {
    nw <- nodes - tam                  # number of windows
  } else {
    nw <- floor(nodes / tam)
  }
  # each window will contain tam nodes from nc channels
  window <- array(0, dim = c(nw, nc, tam))
  for (i in 1:nw) {
    for (j in 1:nc) {
      ind <- 0
      if (overlapped) {
        for (k in i:(i + tam - 1)) {
          ind <- ind + 1
          window[i, j, ind] <- data[[j]][k]       # overlapped: index k - i + 1
        }
      } else {
        for (k in ((i - 1) * tam + 1):(i * tam)) {
          ind <- ind + 1
          window[i, j, ind] <- data[[j]][k]       # non-overlapped: index k - (i-1)*tam
        }
      }
    }  # channels loop
  }  # windows loop
  list(a = nw, b = window)
}
Function used for ARMA modelling
extract_feat_AR <- function(nc, nw, timeseries, tam) {
  # Fit an ARMA model per channel and per window, and collect the
  # coefficients together with several residual-based error measures.
  auxiliar <- vector("list", nc)
  RERCM  <- matrix(0, nrow = nc, ncol = nw)
  EAM    <- matrix(0, nrow = nc, ncol = nw)
  RECM   <- matrix(0, nrow = nc, ncol = nw)
  ERM    <- matrix(0, nrow = nc, ncol = nw)
  LogLik <- matrix(0, nrow = nc, ncol = nw)
  AIC    <- matrix(0, nrow = nc, ncol = nw)
  for (i in 1:nc) {
    for (w in 1:nw) {
      model <- auto.arima(ts(timeseries[w, i, ], start = 1, end = tam),
                          D = 0, seasonal = FALSE)
      auxiliar[[i]][[w]] <- as.list(model$coef)
      RERCM[i, w]  <- sqrt(sum((model$residuals / model$x)^2) / tam)
      EAM[i, w]    <- sum(abs(model$residuals)) / tam
      RECM[i, w]   <- sqrt(sum(model$residuals^2) / tam)
      ERM[i, w]    <- sum(abs(model$residuals) / model$x) / tam
      LogLik[i, w] <- model$loglik
      AIC[i, w]    <- model$aic
    }
  }
  list(features = auxiliar,
       RERCM = RERCM, EAM = EAM,
       RECM = RECM, ERM = ERM,
       LogLik = LogLik, AIC = AIC)
}
Function used for preparing the input for the SVM
bu i ld datamart <− f unc t i on ( nc=nc ,nw=nw, c o e f f s ) coefAR1 <− matrix (0 , nrow=nc , nco l=nw)
d i f f coe fAR1 <− matrix (0 , nrow=nc , nco l=nw−1)scoefAR1 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc )
– 45 –
R codes
f o r ( j in 1 :nw) i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ ar1 ) )
coefAR1 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ ar1
di f f coe fAR1 [ i , ] <− coefAR1 [ 1 , 2 : nw]−coefAR1 [ 1 , 1 : ( nw−1) ]scoefAR1 [ i , ] <− smoothed ( par=coefAR1 [ i , ] , j j=j j )
coefAR2 <− matrix (0 , nrow=nc , nco l=nw)
d i f f coe fAR2 <− matrix (0 , nrow=nc , nco l=nw−1)scoefAR2 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ ar2 ) ) coefAR2 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ ar2
di f f coe fAR2 [ i , ] <− coefAR2 [ 1 , 2 : nw]−coefAR2 [ 1 , 1 : ( nw−1) ]scoefAR2 [ i , ] <− smoothed ( par=coefAR2 [ i , ] , j j=j j )
coefMA1 <− matrix (0 , nrow=nc , nco l=nw)
di f fcoefMA1 <− matrix (0 , nrow=nc , nco l=nw−1)scoefMA1 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ma1) ) coefMA1 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ma1
dif fcoefMA1 [ i , ] <− coefMA1 [ 1 , 2 : nw]−coefMA1 [ 1 , 1 : ( nw−1) ]scoefMA1 [ i , ] <− smoothed ( par=coefMA1 [ i , ] , j j=j j )
coefMA2 <− matrix (0 , nrow=nc , nco l=nw)
di f fcoefMA2 <− matrix (0 , nrow=nc , nco l=nw−1)scoefMA2 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ma2) ) coefMA2 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ma2
– 46 –
R codes
dif fcoefMA2 [ i , ] <− coefMA2 [ 1 , 2 : nw]−coefMA2 [ 1 , 1 : ( nw−1) ]scoefMA2 [ i , ] <− smoothed ( par=coefMA2 [ i , ] , j j=j j )
c o e f i n t <− matrix (0 , nrow=nc , nco l=nw)
d i f f c o e f i n t <− matrix (0 , nrow=nc , nco l=nw−1)s c o e f i n t <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ i n t e r c e p t ) ) c o e f i n t [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ i n t e r c e p t
e l s e i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ d r i f t ) )
c o e f i n t [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ d r i f t
d i f f c o e f i n t [ i , ] <− c o e f i n t [ 1 , 2 : nw]− c o e f i n t [ 1 , 1 : ( nw−1) ]s c o e f i n t [ i , ] <− smoothed ( par=c o e f i n t [ i , ] , j j=j j )
datamart <- matrix(0, nrow=nw, ncol=nc*5+2)
datamart[, 1] <- 1:dim(datamart)[1]           # observation number
datamart[, dim(datamart)[2]] <- +1            # target
for (canal in 1:nc) {
  datamart[, (5*(canal-1)+2)] <- coefAR1[canal, ]
  datamart[, (5*(canal-1)+3)] <- coefAR2[canal, ]
  datamart[, (5*(canal-1)+4)] <- coefMA1[canal, ]
  datamart[, (5*(canal-1)+5)] <- coefMA2[canal, ]
  datamart[, (5*(canal-1)+6)] <- coefint[canal, ]
}
dimnames(datamart) <- list(NULL, c("obs",
  "AR1c1", "AR2c1", "MA1c1", "MA2c1", "intc1",
  "AR1c2", "AR2c2", "MA1c2", "MA2c2", "intc2",
  "AR1c3", "AR2c3", "MA1c3", "MA2c3", "intc3",
  "AR1c4", "AR2c4", "MA1c4", "MA2c4", "intc4",
  "AR1c5", "AR2c5", "MA1c5", "MA2c5", "intc5",
  "AR1c6", "AR2c6", "MA1c6", "MA2c6", "intc6",
  "AR1c7", "AR2c7", "MA1c7", "MA2c7", "intc7",
  "AR1c8", "AR2c8", "MA1c8", "MA2c8", "intc8",
  "AR1c9", "AR2c9", "MA1c9", "MA2c9", "intc9",
  "AR1c10", "AR2c10", "MA1c10", "MA2c10", "intc10",
  "AR1c11", "AR2c11", "MA1c11", "MA2c11", "intc11",
  "AR1c12", "AR2c12", "MA1c12", "MA2c12", "intc12",
  "AR1c13", "AR2c13", "MA1c13", "MA2c13", "intc13",
  "AR1c14", "AR2c14", "MA1c14", "MA2c14", "intc14",
  "AR1c15", "AR2c15", "MA1c15", "MA2c15", "intc15",
  "correct"))
datamart <- datamart[, 2:(5*nc+2)]
datamart <- as.data.frame(datamart)
build_datamart <- datamart
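As a sanity check on the layout above, the five ARMA features of channel k occupy columns 5(k-1)+2 through 5(k-1)+6 of the full matrix ("obs" is column 1, the target "correct" the last column). A small sketch that rebuilds only the name vector (no engine data involved):

```r
# Rebuild the full datamart column names from the naming convention used
# in build_datamart: "obs", then AR1/AR2/MA1/MA2/int per channel, then "correct".
nc <- 15
feature_names <- paste0(rep(c("AR1", "AR2", "MA1", "MA2", "int"), nc),
                        "c", rep(1:nc, each=5))
col_names <- c("obs", feature_names, "correct")

col_names[5*(3-1)+2]  # "AR1c3": first feature of channel 3
col_names[5*(3-1)+6]  # "intc3": last feature of channel 3
length(col_names)     # 77 = 15*5 + 2 columns in total
```

This is the arithmetic the `for (canal in 1:nc)` loop relies on when filling the matrix column by column.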
Function used for implementing the cross-validation
build_svm_n <- function(datamart, g, n, cut) {
  library(e1071)
  dataset <- datamart[1:cut, ]
  selection <- sample(1:floor(0.7*cut))
  train <- dataset[selection, ]
  test <- dataset[-selection, ]
  model <- svm(x=train, y=NULL, type="one-classification",
               kernel="radial", nu=n, gamma=g)
  salida <- list(modelo=model, train=train, test=test)
  build_svm_n <- salida
}
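One detail of this split is easy to misread: `sample()` called with a single vector argument returns a random permutation of that vector, so `selection` always contains the first `floor(0.7*cut)` row indices, merely shuffled. The training set is therefore always the chronologically first 70% of the windows and the test set the last 30%, which is a reasonable choice for time-ordered data. A small sketch, with the `cut=500` of the script replaced by a hypothetical `cut = 100`:

```r
# sample(1:70) is a permutation of 1:70, not a random subset of 1:100.
cut <- 100
selection <- sample(1:floor(0.7*cut))

all(sort(selection) == 1:70)              # TRUE: train is always rows 1..70
all(setdiff(1:cut, selection) == 71:100)  # TRUE: test is always rows 71..100
```

A random 70% subset would instead be `sample(1:cut, floor(0.7*cut))`; the code as written keeps the held-out windows strictly later in time than the training ones.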
Implementation of the SVM
build_svm <- function(datamart, g, n, cut) {
  library(e1071)
  train <- datamart[1:cut, ]
  test <- datamart[(cut+1):dim(datamart)[1], ]
  model <- svm(train, y=NULL, type="one-classification",
               kernel="radial", nu=n, gamma=g)
  salida <- list(modelo=model, train=train, test=test)
  build_svm <- salida
}
Script
muestra <- read.table(ruta_sample, header=TRUE)
library(forecast)
tam <- 200   # windows' length
nc <- 15     # number of channels
# preparation of the time series
channel <- vector("list", nc)
for (i in 1:nc) channel[[i]] <- ts(muestra[, (i+1)], start=5980, freq=1)
nodes <- length(channel[[1]])
source(ruta_chunks)
ch <- chunks(data=channel, overlapped=TRUE, nodes=nodes,
             nc=nc, tam=tam)
nw <- ch$a; pieces <- ch$b
source(ruta_extract_AR)
coef_channel <- extract_feat_AR(nc=nc, nw=nw, timeseries=pieces, tam=tam)
source(ruta_datamart)
datamart <- build_datamart(nc=nc, nw=nw, coeffs=coef_channel$features)
source(ruta_svm)
cols <- c(seq(1, 71, 5), seq(3, 73, 5))
svm_out <- build_svm(datamart=datamart[, cols], g=0.03, n=0.2, cut=500)
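The `cols` vector deserves a note. After `build_datamart` drops the "obs" column, the columns run AR1c1, AR2c1, MA1c1, MA2c1, intc1, AR1c2, ..., so channel k's AR1 feature sits at column 5(k-1)+1 and its MA1 feature at 5(k-1)+3; with nc = 15 channels these are exactly `seq(1, 71, 5)` and `seq(3, 73, 5)`. A sketch that rebuilds the post-drop names locally (no engine data involved) confirms the selection:

```r
# Column names of the datamart after "obs" is removed, following the
# naming convention of build_datamart above.
nc <- 15
reduced_names <- c(paste0(rep(c("AR1", "AR2", "MA1", "MA2", "int"), nc),
                          "c", rep(1:nc, each=5)),
                   "correct")

cols <- c(seq(1, 71, 5), seq(3, 73, 5))
unique(substr(reduced_names[cols], 1, 3))  # "AR1" "MA1"
length(cols)                               # 30: two features per channel
```

Only the first AR and first MA coefficients of each channel therefore enter the one-class SVM.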
References
[1] Banerjee, Tribeni Prasad; Das, Swagatam. "Multi-sensor data fusion using
support vector machine for motor fault detection." Information Sciences 217
(2012) 96–107.
[2] Basir, Otman; Yuan, Xiaohong. "Engine fault diagnosis based on multi-sensor
information fusion using Dempster-Shafer evidence theory." Information
Fusion 8 (2007) 379–386.
[3] Boser, B.; Guyon, I.; Vapnik, V. “A training algorithm for optimal margin
classifiers”. Proc. 5th Annu. ACM Workshop Comput. Learn. Theory, (1992)
144–152.
[4] Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time series analysis - Forecasting
and control. (3rd edition) Prentice Hall, 1994.
[5] Chandola, Varun; Banerjee, Arindam; Kumar, Vipin. Anomaly Detection: A
Survey . Technical report. Department of Computer Science and Engineering,
University of Minnesota. Minneapolis, Minnesota, 2007.
[6] Chen, Wun-Hwa; Hsu, Sheng-Hsun; Shen, Hwang-Pin. "Application of SVM
and ANN for intrusion detection." Computers & Operations Research 32
(2005) 2617–2634.
[7] Chisci, Luigi; Mavino, Antonio; Perferi, Guido; Sciandrone, Marco; Anile,
Carmelo; Colicchio, Gabriella; Fuggetta, Filomena. "Real-Time Epileptic
Seizure Prediction Using AR Models and Support Vector Machines." IEEE
Transactions on Biomedical Engineering 57, 5 (2010) 1124–1132.
[8] Cowpertwait, Paul S.P. Introductory Time Series with R. Springer, 2006.
[9] Davy, Manuel; Desobry, Frédéric; Gretton, Arthur; Doncarli, Christian. "An
Online Support Vector Machine for Abnormal Events Detection." Signal
Processing (2005) 2009–2025.
[10] Dickey, D.A.; Fuller, W.A. “Likelihood Ratio Statistics for Autoregressive
Time Series with a Unit Root.” Econometrica, (1981) 49, 1057–1071.
[11] Durbin, J.; Koopman, S.J. “Time Series Analysis by State Space Methods”.
Oxford University Press, Oxford (2001).
[12] Fuchs, Erich; Gruber, Thiemo; Pree, Helmut; Sick, Bernhard. "Temporal data
mining using shape space representations of time series." Neurocomputing 74
(2010) 379–393.
[13] Fuente Fernández, Santiago de la. "Series Temporales: Modelos ARIMA".
http://www.fuenterrebollo.com/Economicas/SERIES-TEMPORALES/modelo-arima.pdf
[14] Gómez, V.; Maravall, A. "Programs TRAMO and SEATS, Instructions for
the Users". Working paper 97001. Ministerio de Economía y Hacienda,
Dirección General de Análisis y Programación Presupuestaria.
[15] Grinblat, Guillermo L.; Uzal, Lucas C.; Granitto, Pablo M. "Abrupt change
detection with One-Class Time-Adaptive Support Vector Machines." Expert
Systems with Applications 40 (2013) 7242–7249.
[16] Hannan, E.J.; Rissanen, J. “Recursive Estimation of Mixed Autoregressive-
Moving Average Order.” Biometrika, 69 (1) (1982), 81–94.
[17] Hyndman, Rob J.; Khandakar, Yeasmin. "Automatic Time Series Forecast-
ing: The forecast Package for R". Journal of Statistical Software 27, 3 (2008).
[18] Hyndman, Rob J. with contributions from Athanasopoulos, George;
Razbash, Slava; Schmidt, Drew; Zhou, Zhenyu; Khan, Yousaf; Bergmeir,
Christoph. (2013). “forecast: Forecasting functions for time series and lin-
ear models.” R package version 4.8. http://CRAN.R-project.org/package=
forecast
[19] Kohavi, R. "A study of cross-validation and bootstrap for accuracy estima-
tion and model selection". IJCAI (1995) 1137–1145.
[20] Konar, P.; Chattopadhyay, P. "Bearing fault detection of induction motor
using wavelet and Support Vector Machines (SVMs)." Applied Soft Computing
11 (2011) 4203–4211.
[21] Kwiatkowski, D.; Phillips, P.C.; Schmidt, P.; Shin, Y. “Testing the Null
Hypothesis of Stationarity Against the Alternative of a Unit Root.” Journal
of Econometrics, 54 (1992), 159–178.
[22] Liu, Song; Yamada, Makoto; Collier, Nigel; Sugiyama, Masashi. “Change-
point detection in time series data by relative density-ratio estimation”. Neu-
ral Networks 43 (2013) 72–83.
[23] Ma, Junshui; Perkins, Simon. "Time-series Novelty Detection using One-
class Support Vector Machines." IEEE 3 (2003) 1741–1745.
[24] Mentz, Raúl Pedro. "Estimación en los modelos autorregresivos y de prome-
dios móviles." Estadística española 116 (1988) 87–106.
[25] Meyer, David; Dimitriadou, Evgenia; Hornik, Kurt; Weingessel, Andreas;
Leisch, Friedrich (2012). e1071: Misc Functions of the Department of Statis-
tics (e1071), TU Wien. R package version 1.6-1. http://CRAN.R-project.
org/package=e1071
[26] Niu, Gang; Han, Tian; Yang, Bo-Suk; Tan, Andy Chit Chiow. “Multi-agent
decision fusion for motor fault diagnosis.” Mechanical Systems and Signal
Processing 21 (2007) 1285–1299.
[27] Nour, F.; Watson, J.F. "The monitoring and analysis of transient vibra-
tion signals as a means of detecting faults in the three-phase induction mo-
tor." Proceedings of the 28th Universities Power Engineering Conference Vol. 1
(September 1993) 178–181.
[28] Peña, Daniel. Análisis de series temporales. Alianza, 2005.
[29] Piciarelli, Claudio; Micheloni, Christian; Foresti, Gian Luca. "Trajectory-
Based Anomalous Event Detection". IEEE Transactions on Circuits and Sys-
tems for Video Technology 18 (2008) 1544–1554.
[30] Ripley, B.D. (2002). "Time Series in R 1.5.0." R News, 2(2), 2–7. http:
//CRAN.R-project.org/doc/Rnews/.
[31] Schölkopf, Bernhard; Smola, Alex J.; Williamson, Robert C.; Bartlett, Peter
L. "New Support Vector Algorithms." Neural Computation 12 (2000) 1207–1245.
[32] Smola, A.; Schölkopf, B. Learning with Kernels. MIT Press, Cambridge,
MA, USA (2002).
[33] Tavner, P.J.; Gaydon, B.G.; Word, D.M. “Monitoring generators and large
motors”. IEE Proceedings 133,3 (1986) 181–189 (Part B).
[34] Tavner, P.J.; Penman, J. “Condition Monitoring of Electrical Machines”.
Research Studies Press Ltd. 1987.
[35] Vapnik, V.; Lerner, A. “Pattern recognition using generalized portrait
method”. Autom. Remote Control, 24 (1963) 774–780.