Università degli Studi di Firenze
Dipartimento di Matematica e Informatica
Engine fault diagnosis based on novelty detection
Author:
Leonardo Torres Hansa
Director:
Fabio Schoen, Ph.D.
Florence, February 2014
Contents

1 Introduction 2
2 Fault prevention in industrial machinery 3
3 Novelty detection 5
  3.1 Applications 7
  3.2 Novelty detection techniques 9
4 ARMA modelling for time series 10
  4.1 Stationary stochastic processes 11
  4.2 ARMA models 13
  4.3 Automatic modelling 15
    4.3.1 Identification of the model 15
    4.3.2 Estimation of the parameters 16
5 Support Vector Machines 18
6 Resolution proposed 24
  6.1 Preparation of the data 25
  6.2 Implementation in R 26
7 Results 38
  7.1 Cross-validation 38
  7.2 Unsupervised problem 40
8 Conclusions 42
R codes 44
References 50
1 Introduction
In industrial machinery, it is essential to be able to predict, over a short time horizon, whether an engine will fail within that horizon, so that engineers can react and fix any issue that has been detected.
For decades, engineers have been concerned with automatic failure detection in engines. Vibration analysis, in particular, has played the leading role in this field [27], [34], [33]. However, a definitive procedure is not yet established [20].
For the purpose of real-time analysis of the engines, a new device has been developed. It works as follows: connected to the tank containing the oil, it extracts data from it. Specifically, we have worked with 15 channels, which provide us with the quantities of different chemical elements dissolved in the oil: iron, chromium, nickel, molybdenum, aluminium, lead, copper, magnesium, phosphorus, zinc, boron, silicon and sodium.
Every minute the device transmits all these quantities online. The goal of this project is to analyse the information provided by the device and decide whether the engine under surveillance is in good condition. If it is not, the engineers should proceed to change the oil or the engine, as they see fit.
Previous work on engine fault diagnosis shows that Support Vector Machines (SVMs) are widely used in this field; examples can be found in [1] or [20]. However, these works deal with data that is, to some degree, different from ours (e.g., vibration, sound, voltage, temperature). Moreover, from a more general point of view, this is a novelty detection problem. In that framework too, SVMs are a frequent choice for classification, as can be read in [6], [12], [35], [3], [23], [29] or [15]. The intuition for why one-class SVMs (OC-SVMs) are a good approach to anomalous trajectory detection lies in the nature of the procedure. The sequences associated with each channel are first transformed into fixed-dimension feature vectors. Then the training data are clustered using the OC-SVM. This is how a hypervolume
in the feature space containing the normal trajectories is detected. Identifying anomalous states of an engine is then just a matter of checking whether a new piece of the sequence falls outside the computed hypervolume.
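Although our implementation (section 6) is written in R, the pipeline just described can be sketched schematically. The following Python sketch is only an illustration: it replaces the OC-SVM boundary with a simple centroid-plus-radius region, and uses the window mean and standard deviation as stand-in features (the actual project uses ARMA parameters); both substitutions are assumptions of this example, not the thesis's method.

```python
import math
import random

def features(window):
    """Map a fixed-length window of a channel sequence to a feature vector.
    Mean and standard deviation are a simple stand-in for the ARMA-parameter
    features adopted later in the report."""
    n = len(window)
    mean = sum(v for v in window) / n
    var = sum((v - mean) ** 2 for v in window) / n
    return (mean, math.sqrt(var))

def fit_boundary(train_windows):
    """Stand-in for the OC-SVM training phase: enclose the training feature
    vectors in the smallest ball centred at their centroid."""
    feats = [features(w) for w in train_windows]
    m = len(feats)
    centroid = (sum(f[0] for f in feats) / m, sum(f[1] for f in feats) / m)
    radius = max(math.dist(f, centroid) for f in feats)
    return centroid, radius

def is_novel(window, centroid, radius):
    """Testing phase: flag a window whose features fall outside the region."""
    return math.dist(features(window), centroid) > radius

# Simulated "normal" windows around level 10 with small noise.
rng = random.Random(42)
normal = [[10.0 + rng.gauss(0.0, 0.5) for _ in range(20)] for _ in range(50)]
centroid, radius = fit_boundary(normal)
print(is_novel([13.0] * 20, centroid, radius))  # a clearly shifted window
```

The structure (feature extraction, one-class training phase, membership test) is the one described above; only the boundary-fitting step differs from a real OC-SVM.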
Accordingly, although our problem belongs to motor fault diagnosis, we have opted for a more general novelty detection approach. Concretely, following some of the steps in [7], the input to the SVM has been the parameters of a fitted ARMA model.
The rest of the report is structured as follows. Section 2 reviews some examples of engine diagnostics. Since we have chosen to frame the problem within novelty detection, section 3 presents some of its approaches. We then explain the particular mathematical techniques required: section 4 covers the theory behind ARMA modelling, the first tool we implemented, and section 5 focuses on the theory behind One-Class SVMs, the main technique used for our forecasting. Every detail of the procedure used during the project is explained in section 6, and its results are presented in section 7. We end the report with some conclusions, an appendix with the R code employed, and the consulted bibliography.
2 Fault prevention in industrial machinery
In [2] the authors introduce an engine diagnosis method based on Dempster-Shafer evidence theory. They treat engine diagnostics as a multi-sensor fusion problem, in which each channel is a piece of evidence, so that the main goal is learning how to use all these pieces as a whole in order to explain the condition of the engine. The authors thus aim to acquire reliable information about potential faults by incorporating complementary sensors. In general, this consists of extracting features from multiple sensors and deciding which scheme should be used to represent them. However, several complementary sensors can provide conflicting data, and the quality of the decisions can therefore decrease. The challenge is how to detect conflicts among the sensors and how to fuse their decisions into one coherent decision; evidence theory is used to address this issue.
We see the idea of multi-sensor data fusion again in [1]. This paper proposes a hybrid method combining Support Vector Machines and Short-Time Fourier Transform (STFT) techniques. In a first phase, the signal obtained from the different channels is separated by the STFT according to its frequency level and amplitude. Next, system faults are modelled as changes in the sensor gain, with magnitude given by a non-linear function of the measurable output and input signals. An adaptive time-based observer is proposed in order to monitor the system for unanticipated sensor failures. A fusion block then combines the sensor data. Finally, the information provided by the previous two steps, together with the fused data, is used to train the SVM classifier for fault-predictive system modelling.
In [26] we find another example of a data fusion strategy. The paper proposes a decision fusion system for fault diagnosis which integrates data sources from different types of sensors and the decisions of multiple classifiers. First, non-commensurate sensor data sets are combined using relativity theory. The channels they use are two types of vibration and three of current, and these data must be fused. Usually, the classes assigned to a vector differ across classifiers trained on the same data set; using relativity theory, the authors remark that the outputs can also change for different data sets classified by the same classifier. The generated decision vectors are then selected based on a correlation measure of the classifiers, in order to find an optimal sequence for classifier fusion, one that leads to the best fusion performance. As a result, an optimal team of classifiers, containing class information from both the vibration and the current signals, is formed to improve classification accuracy. Finally, a multi-agent classifier fusion algorithm is employed as the core of the whole fault diagnosis system. The efficiency of the proposed system was demonstrated through fault diagnosis of induction motors.
Vibration analysis again takes the leading role in fault diagnostics in [20]. In this paper, the authors focus on bearing faults, arguing that they are the most frequent type of fault in induction engines. The procedure followed begins by simulating faults with a Machinery Fault Simulator (MFS), a tool for simulating various types of induction motor faults, initially fitted with a healthy motor and a motor with faulted
bearings of the same specification. After the data acquisition, performed for both the healthy motor and the motor with faulted bearings under the same running conditions, they proceeded with the signal processing, carried out by the Continuous Wavelet Transform (CWT). The methods applied for the diagnosis of faults are support vector machines and artificial neural networks, whose classification results are finally compared.
3 Novelty detection
The problem of anomaly or novelty detection has been studied within diverse research areas and application domains. Many anomaly detection techniques have been developed specifically for certain application domains, while others are more generic. Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behaviour [5].
The applications of novelty detection range over music segmentation [15], speech recognition [22], intrusion detection in computer networks [6], detection of forbidden U-turn trajectories on urban roads [29], epileptic seizure prediction [7] and electrocardiogram time series [12].
The importance of novelty detection lies in the fact that it allows anomalies to be prevented in a wide variety of systems, so that proper measures can be taken soon enough.
We can define novelties as patterns in data that do not conform to a well-defined notion of normal behaviour. Let us illustrate this idea. Figure 1 shows a simple two-dimensional example. The data has two normal regions, N1 and N2, since most observations lie in these two regions. Points which are sufficiently far away from these regions, namely points o1 and o2, and the points in region O3, are anomalies.
Anomalies or novelties might appear in the data for a variety of reasons, such as malicious activity (e.g., credit card fraud, cyber-intrusion, terrorist activity) or the breakdown of a system, but all of these reasons have a common characteristic: they are interesting for the analysis.
Figure 1: A simple example of anomalies in a two-dimensional data set

In abstract terms, the main novelty detection approach consists of defining a region representing normal behaviour and declaring any observation in the data which does not belong to this normal region an anomaly. What challenges arise within such a naïve approach?
First of all, a normal region encompassing every possible normal behaviour has to be defined, a task which may not be easy. In addition, the boundary between expected and anomalous behaviour is often imprecise, so an anomalous observation lying close to the boundary can actually be normal, and vice versa. Note that, when an anomaly is the result of malicious actions, the malicious adversaries may adapt themselves to make the anomalous observations appear normal. Bear in mind also that expected behaviour may keep evolving, so the current notion of expected behaviour might not be sufficiently representative in the future. Take into account, as well, that labelled data for training and validating mathematical models are not always available, which radically complicates the evaluation of the analysis. Finally, the data may contain noise that tends to resemble the actual anomalies and is hence difficult to distinguish and remove.
Because of the difficulty of the task, most existing anomaly detection techniques solve a specific formulation of the problem. The formulation is induced by various factors, such as the nature of the data, the availability of labelled data or the type of anomalies to be detected. Researchers have adopted concepts from diverse disciplines such as statistics, machine learning, data mining, information theory and spectral theory, and have applied them to specific problem formulations.
Some of the aspects that have to be considered when dealing with novelty detection are the nature of the data, the data labels (if any), the desired format of the output and the type of anomalies being looked for. For the latter, we can give some definitions:
Point anomalies. If an individual data instance can be considered anomalous with respect to the rest of the data, the instance is termed a point anomaly.

Contextual anomalies. If a data instance is anomalous in a specific context, but not otherwise, it is termed a contextual or conditional anomaly. E.g., a temperature of 0 °C might be expected in winter in Paris but, in summer, it should be considered an anomaly.

Collective anomalies. If a collection of related data instances is anomalous with respect to the entire data set, it is termed a collective anomaly. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous.
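The contextual case can be made concrete with the temperature example above. In the following Python sketch the per-season ranges are invented purely for illustration; they are not real climatological values.

```python
# Illustrative values only: these per-season "expected" ranges are invented
# for the example, not taken from any real climatology.
EXPECTED_RANGE = {          # context -> (low, high) in degrees Celsius
    "winter": (-10.0, 12.0),
    "summer": (10.0, 38.0),
}

def is_contextual_anomaly(temp_c, season):
    """A reading is a contextual anomaly if it is unusual *for its context*,
    even though the same value may be perfectly normal in another context."""
    low, high = EXPECTED_RANGE[season]
    return not (low <= temp_c <= high)

print(is_contextual_anomaly(0.0, "winter"))  # False: 0 °C is expected in winter
print(is_contextual_anomaly(0.0, "summer"))  # True: same value, anomalous context
```

The same value is normal or anomalous depending only on its context, which is exactly what distinguishes a contextual anomaly from a point anomaly.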
3.1 Applications
A way of dividing the fields in which anomaly detection can be utilised is to consider the following properties of each particular framework:
• The notion of anomaly or novelty
• Nature of the data
• Challenges of the anomaly detection
• Existing techniques for solving the problem
Let us see some examples which will help to illustrate this.
In intrusion detection, the goal is the detection of malicious activity in a computer-related system. These malicious activities, or intrusions, are interesting from a computer security perspective. The key challenge for anomaly detection in this domain is the huge volume of data, which typically arrives as a stream and therefore requires online analysis. Another issue arising from the large input size is the false alarm rate.
Fraud detection refers to the detection of criminal activities occurring in commercial organisations such as banks, credit card companies, insurance agencies, cell phone companies or the stock market. The malicious users might be actual customers of the organisation or might be posing as customers. Fraud occurs when these users consume the resources provided by the organisation in an unauthorised way. The organisations are interested in immediate detection of such fraud to prevent economic losses; the term activity monitoring is used for this general approach to fraud detection. The typical approach of anomaly detection techniques is to maintain a usage profile for each customer and monitor the profiles to detect any deviations. Some examples of fraud detection problems are credit card fraud, mobile phone fraud, insurance claim fraud and insider trading fraud (the use of non-public inside information in stock markets).
Anomaly detection in the medical and public health domains typically works with patient records. The data can contain anomalies for several reasons, such as an abnormal patient condition, instrumentation errors or recording errors. Several techniques have also focused on detecting disease outbreaks in a specific area. Novelty detection is thus a very critical problem in this domain and requires a high degree of accuracy. The data typically consist of records with several different types of features, such as patient age, blood group or weight, and might also have temporal as well as spatial aspects. Most current novelty detection techniques in this domain aim at detecting anomalous records (point anomalies). Typically the labelled data belong to healthy patients, hence most techniques adopt a semi-supervised approach. Another form of data handled by anomaly detection techniques in this domain is time series data, such as electrocardiograms (ECG) and electroencephalograms (EEG) [7]. The
most challenging aspect of the anomaly detection problem in this domain is the cost of a classification error, which is usually very high.
Industrial units suffer damage due to continuous usage and normal wear and tear. Such damage needs to be detected early to prevent further escalation and losses. The data in this domain are usually referred to as sensor data, because they are recorded by different sensors and collected for analysis. Novelty detection techniques have been extensively applied in this domain to detect such damage. Industrial damage detection can be further classified into two areas: one dealing with defects in mechanical components, such as motors and engines, and the other dealing with defects in physical structures. The former is also referred to as system health management, and it is the field we are most interested in for this project.
3.2 Novelty detection techniques
We can consider novelty detection as a branch of classification problems (it is not the only approach, but it is the one most related to this project). Classification is used to learn a model, or classifier, from a set of data instances (the training phase). These data instances can be labelled (supervised classification) or unlabelled (unsupervised). A test instance is then classified into one of the classes using the learnt model (the testing phase). Classification-based novelty detection techniques operate with the same two-phase method. Classification problems can be grouped into two main categories: multi-class and one-class classification. For this project, we are interested in the latter.
One-class classification based novelty detection techniques assume that all training instances have the same class label; that is, instances either belong to one single class or they do not. Such techniques learn a discriminative boundary around the normal instances using a one-class classification algorithm, e.g., one-class SVMs. Any test instance that does not fall within the learnt boundary is declared anomalous (a novelty or outlier).
As examples of frequent classification-based novelty detection techniques we can mention neural networks, Bayesian networks and support vector machines.
4 ARMA modelling for time series
Definition 1. We define a time series as a sequence of N observations, taken from one or several variables, which are chronologically ordered and equidistant in time.

When dealing with univariate time series, i.e., time series built from a single variable, we denote them as sequences of the form

{x_t}_{t=1}^{N},

where N is the length (or size) of the time series and, for a given instant t (1 ≤ t ≤ N), x_t is the observation measured at t. All N observations can be written as a column vector x = (x_1, ..., x_N)^T.
On the other hand, we can also work with multivariate time series, which can be represented as

{x_t}_{t=1}^{N}.

The vector x_t = (x_{t1}, x_{t2}, ..., x_{tM})^T is the observation taken at instant t, t ∈ {1, ..., N}, and N is the length of the sequence. All N observations may be represented by an N × M matrix:

        ⎡ x_11  x_12  · · ·  x_1M ⎤
    X = ⎢ x_21  x_22  · · ·  x_2M ⎥    (4.1)
        ⎢  ⋮     ⋮     ⋱      ⋮   ⎥
        ⎣ x_N1  x_N2  · · ·  x_NM ⎦

with rows x_1^T, x_2^T, ..., x_N^T, where x_{tj} is the observation of variable j at instant t, for all t = 1, ..., N and j = 1, ..., M.
The main task we are interested in is building a mathematical model which helps explain these observations and identifies a pattern in order to forecast. The starting point when elaborating a model for a time series is to consider the sequence as a particular finite realisation of a stochastic process.
Definition 2. We define a stochastic process as a sequence of random variables, chronologically ordered and equidistant in time. A stochastic process may be univariate or multivariate.

Formally, it is a mapping

X : Ω × T −→ S
    (ω, t) ↦ X(ω, t)

S is called the space of states, with S ⊂ Z+ or S ⊂ R. T represents discrete time (T = {0, 1, 2, ...}) or continuous time (T = [0, ∞)).
Univariate stochastic processes will be represented as {X_t}, t = 0, ±1, ±2, ..., where X_t is a random variable referring to the measurable quantity observed by the stochastic process at instant t.

When dealing with multivariate processes, we denote them {X_t}, t = 0, ±1, ±2, ..., where X_t = (X_{t1}, X_{t2}, ..., X_{tM})^T (M ≥ 2) is a random vector referring to an observation of the system at t.
From now on, we focus on univariate stochastic processes, since they are the ones we have worked with.

A stochastic process is not completely described unless we know the distribution functions

F(X_{t1}, X_{t2}, ..., X_{tN})   for all t_1, t_2, ..., t_N ∈ Z, N ∈ N.

Such a goal cannot be achieved unless we assume some relaxations.
4.1 Stationary stochastic processes
Definition 3. We say that a stochastic process {X_t} is strictly stationary when, for every n ≥ 1 instants t_1 < t_2 < ... < t_n from its history, the joint probability distribution of (X_{t1}, ..., X_{tn})^T and that of (X_{t1+h}, ..., X_{tn+h})^T, for all h = ±1, ±2, ..., are the same.

A stochastic process {X_t} such that E[X_t] < ∞ for all t = 0, ±1, ±2, ... is said to be first-order weak-sense stationary when E[X_t] is constant for all t = 0, ±1, ±2, ...
A stochastic process {X_t} such that E[X_t²] < ∞ for all t = 0, ±1, ±2, ... is said to be second-order weak-sense stationary when:

• E[X_t] and Var[X_t] are constant for all t = 0, ±1, ±2, ...

• Cov[X_t, X_{t+k}] depends at most on k ∈ Z and is independent of t.
We say that a stochastic process {X_t} is Gaussian when, for every n ≥ 1 instants t_1 < t_2 < ... < t_n from its history, the joint probability distribution of (X_{t1}, X_{t2}, ..., X_{tn})^T is an n-variate Normal distribution.

Unless we state otherwise, second-order weak-sense stationarity will suffice for our work.

Recall that the mean of the process is denoted µ_X = E[X_t] and the variance σ²_X = Var[X_t] = E[(X_t − µ_X)²].
Definition 4. Given a stationary process {X_t}, its k-order autocovariance (k > 0) is

γ_k = Cov[X_t, X_{t+k}] = E[(X_t − µ_X)(X_{t+k} − µ_X)].

Remark 1. Notice that the k-order autocovariance γ_k does not depend on t.
Definition 5. Given a stationary process {X_t}, its k-order simple autocorrelation (k > 0) is defined as

ρ_k = Cov[X_t, X_{t+k}] / (√Var[X_t] √Var[X_{t+k}]) = γ_k / γ_0    (4.2)

If we consider the sequence {ρ_k : k = 1, 2, ...} as a function of k, we call it the simple autocorrelation function (ACF).
Definition 6. The k-order partial autocorrelation (k > 0) of a stationary process {X_t} is represented as φ_kk and is defined by the regression

X̃_t = φ_k1 X̃_{t−1} + φ_k2 X̃_{t−2} + ... + φ_kk X̃_{t−k} + U_t    (4.3)

where X̃_{t−i} = X_{t−i} − µ_X (i = 0, 1, ..., k) and U_t is independent of X̃_{t−i} for all i ≥ 1.
4.2 ARMA models
Definition 7. A stationary stochastic process {X_t} admits an autoregressive moving average model of order (p, q) (ARMA(p, q)) when

X_t = µ + φ_1 X_{t−1} + φ_2 X_{t−2} + ... + φ_p X_{t−p} + a_t − θ_1 a_{t−1} − θ_2 a_{t−2} − ... − θ_q a_{t−q}    (4.4)

for every t = 0, ±1, ±2, ..., where a_t ∼ IID(0, σ²_a) and µ, φ_1, φ_2, ..., φ_p, θ_1, θ_2, ..., θ_q are such that every root of

1 − φ_1 x − φ_2 x² − ... − φ_p x^p = 0    (4.5)

lies outside the unit circle. This is called the stationarity condition.
Definition 8. An ARMA(p, q) model given by (4.4) is invertible when every root of

1 − θ_1 x − θ_2 x² − ... − θ_q x^q = 0    (4.6)

lies outside the unit circle (invertibility condition).
We also define the lag operator (denoted B or L) as

B X_t = X_{t−1},   B^d X_t = X_{t−d}   (d ≥ 2 integer)    (4.7)

where X_t may be a random or real variable referring to an instant t.
We can then rewrite expression (4.4) as

φ(B) X_t = µ + θ(B) a_t,    (4.8)

where

φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p    (4.9)

is the autoregressive polynomial of the model and

θ(B) = 1 − θ_1 B − θ_2 B² − ... − θ_q B^q    (4.10)

is the moving average polynomial of the model.
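As an illustration, a realisation of the recursion (4.4) can be simulated directly by drawing Gaussian innovations. The project's own code is in R; the following is only a Python sketch, with the burn-in length and parameter values chosen arbitrarily for the example.

```python
import random

def simulate_arma(phi, theta, mu=0.0, sigma=1.0, n=500, burn=100, seed=0):
    """Generate a realisation of the ARMA recursion
        X_t = mu + sum_i phi_i X_{t-i} + a_t - sum_j theta_j a_{t-j},
    with a_t ~ N(0, sigma^2). The first `burn` values are discarded so the
    start-up transient does not contaminate the sample."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    x, a = [], []
    for _ in range(n + burn):
        a_t = rng.gauss(0.0, sigma)
        x_t = mu + a_t
        x_t += sum(phi[i] * x[-1 - i] for i in range(min(p, len(x))))
        x_t -= sum(theta[j] * a[-1 - j] for j in range(min(q, len(a))))
        a.append(a_t)
        x.append(x_t)
    return x[burn:]

# ARMA(1,1) with phi_1 = 0.6 and theta_1 = 0.3 (stationary and invertible,
# since both roots, 1/0.6 and 1/0.3, lie outside the unit circle).
series = simulate_arma([0.6], [0.3], n=1000)
print(len(series))  # 1000
```

Simulated series like this one are useful for checking an estimation procedure against known parameter values before applying it to real channel data.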
Stationarity and invertibility When a stationary process {X_t} admits an ARMA(p, q) model written as in (4.8), its unconditional expected value µ_X can be obtained as follows:

E[φ(B) X_t] = µ + E[θ(B) a_t]   (the last term is 0)
(1 − φ_1 − φ_2 − ... − φ_p) E[X_t] = µ
E[X_t] = µ / (1 − φ_1 − φ_2 − ... − φ_p)
µ_X = µ / φ(1)

where φ(1) is the value of the autoregressive polynomial at B = 1. Therefore, (4.8) can be rewritten as

φ(B)(X_t − µ_X) = θ(B) a_t,   or equivalently   φ(B) X̃_t = θ(B) a_t,    (4.11)

where X̃_t = X_t − E[X_t] = X_t − µ_X for every t = 0, ±1, ±2, ...
Theorem (Wold's Theorem). The stationarity condition (4.5) guarantees that the coefficients ψ_0, ψ_1, ψ_2, ... of the infinite-order polynomial

ψ(B) = θ(B)/φ(B) = 1 + ψ_1 B + ψ_2 B² + ... = Σ_{i=0}^{∞} ψ_i B^i   (ψ_0 = 1)    (4.12)

satisfy Σ_{i=0}^{∞} |ψ_i| < ∞, which is a sufficient condition for X̃_t = ψ(B) a_t to be a stationary process.
Theorem. The invertibility condition (4.6) guarantees that the coefficients π_1, π_2, ... of the infinite-order polynomial

π(B) = φ(B)/θ(B) = 1 − π_1 B − π_2 B² − ... = −Σ_{i=0}^{∞} π_i B^i   (π_0 = −1)    (4.13)

satisfy Σ_{i=0}^{∞} |π_i| < ∞, so that when we write (4.11) as π(B) X̃_t = a_t, X̃_t is a stationary process such that

φ_kk −→ 0 as k → ∞
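The ψ-weights of (4.12) can be computed numerically by equating coefficients in φ(B)ψ(B) = θ(B). The following Python sketch implements that recursion; it is an illustration, not part of the thesis's R code.

```python
def psi_weights(phi, theta, k):
    """First k+1 coefficients psi_0..psi_k of psi(B) = theta(B)/phi(B),
    with phi(B) = 1 - sum_i phi_i B^i and theta(B) = 1 - sum_j theta_j B^j.
    Matching coefficients of B^j in phi(B) psi(B) = theta(B) gives
        psi_j = sum_{i=1..min(j,p)} phi_i psi_{j-i} - theta_j   (j >= 1),
    with psi_0 = 1 and theta_j = 0 for j > q."""
    p, q = len(phi), len(theta)
    psi = [1.0]
    for j in range(1, k + 1):
        val = sum(phi[i - 1] * psi[j - i] for i in range(1, min(j, p) + 1))
        if j <= q:
            val -= theta[j - 1]
        psi.append(val)
    return psi

# Pure AR(1): the expansion collapses to psi_j = phi^j.
print(psi_weights([0.5], [], 3))  # [1.0, 0.5, 0.25, 0.125]
```

For a stationary model the computed weights decay, which is the absolute summability the theorem asserts.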
4.3 Automatic modelling
A common obstacle when dealing with ARIMA models is that the order selection process may be considered subjective and difficult to apply. But it does not have to be: there have been several attempts to automate ARIMA modelling over the last 25 years.
In [16], a method to identify the order of an ARMA model for stationary series is proposed. The idea is that the innovations can be obtained by fitting a long autoregressive model to the data, after which the likelihood of potential models is computed via a series of standard regressions. The asymptotic properties of the procedure were established under very general conditions. Years later, an extension of this automatic identification procedure was implemented in the software TRAMO and SEATS [14]. For a given series, the algorithm attempts to find the model with the minimum BIC.
4.3.1 Identification of the model
A non-seasonal ARIMA(p,d,q) process is given by
φ(B)Xt = µ+ θ(B)at, (4.14)
where at is a white noise process with mean zero and variance σ2, B is the lag
operator, and φ(z) and θ(z) are polynomials of order p and q respectively. To
ensure causality and invertibility, it is assumed that φ(z) and θ(z) have no roots
for |z| < 1, as seen in the previous section.
The main task in automatic ARIMA forecasting is selecting an appropriate model order, namely p, q and d. If d is known, the orders p and q can be selected via an information criterion such as the AIC:

AIC = −2 log(L) + 2(p + q + k)    (4.15)

where k = 1 if µ ≠ 0 and 0 otherwise, and L is the maximised likelihood of the model fitted to the differenced data

(1 − B)^d x_t.

The likelihood of the full model for x_t is not actually defined, so the values of the AIC for different levels of differencing d are not comparable. One solution to this
difficulty is presented in [11] and is implemented in the arima() function in R [30], which underlies the main function we have used in this project, auto.arima(). In this approach, the initial values of the time series (before the observed values) are assumed to have mean zero and a large variance. However, choosing d by minimising the AIC under this approach tends to lead to over-differencing.
An alternative approach to choosing d is the use of unit-root tests. Most unit-root tests are based on a null hypothesis that a unit root exists, which biases results towards more differences rather than fewer [17]. For example, variations of the Dickey-Fuller test [10] assume there is a unit root at lag 1. Instead, in [17] unit-root tests based on a null hypothesis of no unit root are preferred. For non-seasonal data, ARIMA(p, d, q) models are considered where d is selected via successive KPSS unit-root tests [21]. Briefly: the data are tested for a unit root; if the test result is significant, the differenced data are tested for a unit root; and so on. The procedure stops when a first insignificant result is obtained.
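The control flow of this successive-testing scheme can be sketched as follows. The stationarity test is left pluggable: in the setting above it would be a KPSS test, while the crude lag-1 autocorrelation check used in the demo below is only a stand-in invented for this Python illustration.

```python
def select_d(x, has_unit_root, d_max=2):
    """Skeleton of the successive unit-root testing scheme: difference the
    series until the test no longer indicates a unit root (or d_max is hit).
    `has_unit_root` is any predicate implementing the test; in the thesis
    setting this would be a KPSS test."""
    d = 0
    while d < d_max and has_unit_root(x):
        x = [b - a for a, b in zip(x, x[1:])]  # apply (1 - B) once
        d += 1
    return d

def crude_test(x):
    """Stand-in predicate for the demo only: treat a very strong lag-1
    sample autocorrelation as evidence of a unit root."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x) or 1.0
    r1 = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / denom
    return r1 > 0.9

trend = [0.5 * t for t in range(200)]  # linear trend: one difference removes it
print(select_d(trend, crude_test))     # 1
```

Swapping `crude_test` for a proper KPSS implementation reproduces the procedure described in the text without changing the surrounding loop.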
Once d is selected, the next step is the selection of the values of p and q by
minimizing the AIC, as mentioned above.
4.3.2 Estimation of the parameters
We proceed now with the estimation of the parameters µ, φ_i (i = 1, ..., p) and θ_i (i = 1, ..., q). Let us denote the sample autocorrelation function as

r_k = Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄) / Σ_{t=1}^{n} (x_t − x̄)²,    (4.16)

which provides us with an estimate of the true autocorrelation ρ_k.
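Formula (4.16) translates directly into code. A small Python sketch (the project itself uses R, where acf() plays this role):

```python
def sample_acf(x, k):
    """Sample autocorrelation r_k as in (4.16):
    r_k = sum_{t=1}^{n-k} (x_t - xbar)(x_{t+k} - xbar) / sum_t (x_t - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

print(sample_acf([1, 2, 3, 4, 5], 1))  # 0.4
```

Note that the denominator always sums over all n terms, so r_0 = 1 by construction.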
For estimating the mean factor (or drift) µ we can take

x̄ = (1/n) Σ_{t=1}^{n} x_t
Let us also calculate the variance of this estimator when the process is stationary:

Var(x̄) = (γ_0/n) Σ_{k=−n+1}^{n−1} (1 − |k|/n) ρ_k    (4.17)
        = (γ_0/n) [1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρ_k]    (4.18)

The variance is inversely proportional to the sample size n.
For the estimation of the rest of the parameters, we will focus on the least
squares method, which is the one implemented in the R tools that we have used
[13].
Least squares estimation The AR(1) model

x_t − µ = φ(x_{t−1} − µ) + a_t

can be viewed as a regression with independent variable x_{t−1} and dependent variable x_t. Thus, we can estimate φ by minimising the sum of squares

S*(φ, µ) = Σ_{t=2}^{n} [(x_t − µ) − φ(x_{t−1} − µ)]².    (4.19)
We will call this function S* the conditional sum of squares function. Minimising it we get

µ̂ = (Σ_{t=2}^{n} x_t − φ Σ_{t=2}^{n} x_{t−1}) / ((n − 1)(1 − φ)),    (4.20)

which for large values of the sample size n can be approximated as

µ̂ ≈ (x̄ − φ x̄) / (1 − φ) = x̄.    (4.21)

We finally substitute x̄ for µ in (4.19) and obtain, again for large n,

φ̂ = Σ_{t=2}^{n} (x_t − x̄)(x_{t−1} − x̄) / Σ_{t=2}^{n} (x_{t−1} − x̄)²    (4.22)
For AR(p) we again have

µ̂ = x̄
The estimators of φ_1, ..., φ_p are given by the Yule-Walker equations. For example, for an AR(2) model we should solve

r_1 = φ_1 + r_1 φ_2    (4.23)
r_2 = r_1 φ_1 + φ_2    (4.24)

in order to obtain φ̂_1 and φ̂_2.
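For the AR(2) case, the system (4.23)-(4.24) can be solved in closed form. A small Python sketch:

```python
def yule_walker_ar2(r1, r2):
    """Solve the AR(2) Yule-Walker system
        r1 = phi1 + r1*phi2
        r2 = r1*phi1 + phi2
    in closed form for (phi1, phi2)."""
    phi2 = (r2 - r1 ** 2) / (1 - r1 ** 2)
    phi1 = r1 * (1 - r2) / (1 - r1 ** 2)
    return phi1, phi2

phi1, phi2 = yule_walker_ar2(0.5, 0.4)
print(phi1, phi2)  # phi1 ≈ 0.4, phi2 ≈ 0.2
```

Substituting the solution back into the two equations confirms it satisfies the system exactly.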
Estimating the parameters of an MA(q) model gives more difficulty and forces us to resort to numerical techniques to estimate θ_1, θ_2, ..., θ_q.

Let us try to estimate the parameter θ_1 of an MA(1) model

x_t = a_t − θ_1 a_{t−1}

Recall that if this model is invertible, then it can be rewritten as an AR(∞) model

x_t = a_t − θ_1 x_{t−1} − θ_1² x_{t−2} − ...
We have reached a regression model again. However, in this case we are dealing with an infinite-order regression that is non-linear in θ_1. We will not be able to minimise the sum of squares function

S*(θ_1) = Σ a_t²

analytically, so we will have to use numerical methods, as already mentioned.
In general, for computing the parameters of MA(q) and ARMA(p, q) models, we have to use numerical techniques to solve the sum of squares equations.
5 Support Vector Machines
Given a set of observations drawn from a certain distribution, we may be interested in finding a region which best fits this set, so that if a new piece of data lies in this region we can say (with some probability) that it follows the same distribution as the others, and if it does not, we can say that it does not.
Support Vector Machines (SVMs) comprise several mathematical techniques whose final purpose is the classification or regression of data. They are based on statistical learning theory and risk minimisation and were initially proposed by Vapnik et al. [35] for linear problems. Kernel methods [3] were later used to extend SVM training to nonlinear problems.
We have been especially interested in One-Class SVMs. For this kind of SVM, the data used for training are considered to belong to a single probability distribution. The target is to find its support, so that outliers and anomalies will be discarded, which makes the OC-SVM a proper tool for novelty detection.
The basis of SVMs arises from statistical learning theory applied to linear classification. Let us consider two-class hyperplane classifiers in some dot product space H,

(w · x) + b = 0,  w, x ∈ H,  b ∈ R  (5.1)

with an associated decision function f : Rn → {±1} (where n is the dimension in which the data are expressed),

f(x) = sgn((w · x) + b).
This situation is represented in figure 2. We define the margin as the distance, along the direction of w, between the two hyperplanes that are parallel to the classification plane and pass through the points of each class nearest to it (figure 2). The objective is finding the optimal classifier, in the sense that it has maximum margin.

In order to express this as an optimization problem, let us denote the class of the observations which belong to the original distribution by y = 1 and the ones which do not by y = −1. Recall that we have been provided with a training data set (xi, yi) ∈ Rn × {±1}, i = 1, . . . , m, where m is the number of training data. Now we can write the optimization problem as
Figure 2: Separation of two classes by classification hyperplanes. The optimal classification function is found by maximizing the distance, in the direction of w, between the two dashed hyperplanes, which is called the margin.
min_w  (1/2)‖w‖²
subject to  yi((w · xi) + b) ≥ 1,  i = 1, . . . , m

where yi ∈ {−1, +1} is the label of each training element. Lagrange multipliers αi ≥ 0 can help solve the problem, leading to the dual optimization problem
max_α  W(α) = ∑i αi − (1/2) ∑i,j αiαjyiyj(xi · xj)
subject to  αi ≥ 0, i = 1, . . . , m,  ∑i αiyi = 0  (5.2)
which can be solved by the KKT method. Now the decision function is redefined as

f(x) = sgn(∑i yiαi(x · xi) + b)  (5.3)

where the αi are solutions of (5.2) and b can be calculated knowing that, for any xi lying on the margin borders, the following equation must be satisfied:

yi((w · xi) + b) − 1 = 0.
These points lying on the margin borders are called support vectors. The αi associated with them take nonzero values. Therefore the final solution (5.3) is defined only in terms of a small subset of the training data.
The main drawbacks of this approach are:
• It only solves linear problems.
• A dot product is required in the space where the data are defined.
In order to overcome both drawbacks, a map Φ : X → H from the nonempty set of the original input data X to a dot product space H should be defined. We will call H the feature space, and the problem will have a linear solution in it.
Let us notice that, in the dual problem (5.2) and in the decision function (5.3), the only operations performed in H are dot products. If we have a formula for dot products in H, there is no need to do any explicit computation in H when training the model. This formula for the dot product is called a kernel and is defined as

k(x, x′) = Φ(x) · Φ(x′),  x, x′ ∈ X.  (5.4)
With this notation, the decision function in (5.3) can be rewritten as

f(x) = sgn(∑i yiαik(x, xi) + b)  (5.5)
where α is a solution of the optimization problem

max_α  W(α) = ∑i αi − (1/2) ∑i,j αiαjyiyjk(xi, xj)
subject to  αi ≥ 0, i = 1, . . . , m,  ∑i αiyi = 0.  (5.6)
Recall that we are working with an unsupervised problem, i.e., unlabelled data, so before applying this approach to our case a proper transformation is necessary. Let P be an unknown probability distribution for a set of unlabelled measurements. OC-SVMs can be used to estimate an appropriate region in the space X which contains the majority of the data drawn from P, leaving outliers outside the region, if possible.
In the SVM framework, this idea works as follows. We wish to maximize the distance from the decision hyperplane in the feature space H to the origin, while a small fraction of the data, the outliers, falls between the hyperplane and the origin. In terms of a minimization problem we can write

min_{w,ξ,b}  (1/2)‖w‖² + (1/(νn)) ∑i ξi − b
subject to  (w · Φ(xi)) ≥ b − ξi,  ξi ≥ 0  (5.7)

where xi ∈ X (i = 1, . . . , n) are n training observations in the data space X, Φ : X → H is the function mapping the vectors xi into the feature space H, and (w · Φ(x)) − b = 0 is the decision hyperplane in H. Outliers are linearly penalized by the slack variables ξi, weighted by the parameter ν ∈ (0, 1]. Applying again the Lagrange multiplier method, the problem (5.7) becomes:
min_α  (1/2) ∑i,j αiαjk(xi, xj)
subject to  0 ≤ αi ≤ 1/(νn),  ∑i αi = 1.  (5.8)
Quadratic programming techniques can be used to solve this problem, thus obtaining the values αi. Following the notation from (5.7), w is given by

w = ∑i αiΦ(xi)  (5.9)

and for every Φ(xi) such that αi ≠ 0, the following equations are satisfied:

b − ξi = (w · Φ(xi)) = ∑j αjk(xi, xj)  (5.10)

with ξi > 0 for outliers and ξi = 0 for support vectors lying on the decision plane. We can finally define the decision function in the data space X as
f(x) = sgn((w · Φ(x)) − b) = sgn(∑i αik(x, xi) − b).  (5.11)
Recall that for every xi lying strictly within the support region we have αi = 0. Thus a large number of training vectors do not contribute to the definition of the decision function (5.11). For the rest of the vectors, namely the support vectors, two options may occur:
• If the vector xi lies on the decision hyperplane, αi will be such that 0 < αi < 1/(νn).
• If the vector xi is an outlier, αi = 1/(νn).
Let us now analyze, intuitively, the meaning of the parameter ν, which is going to play a main role when tuning the number of outliers we are willing to accept. From the objective function in (5.7), it can be seen that

ν → 0  ⟹  1/(νn) → ∞,

that is, the outliers' penalization factor grows to infinity: no outliers would be allowed.
On the other hand, the case ν = 1 allows a single solution. Because of the constraints in (5.8), we would have αi = 1/n for all i, its maximum value, so that all vectors would be identified as outliers.
In general, we can say that ν is:
• an upper bound on the fraction of margin errors;
• a lower bound on the fraction of support vectors relative to the total number of training examples.
A proof for these properties is shown in [31].
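These bounds can be observed numerically with the same e1071 one-class SVM that is employed later in section 6.2. The toy data below (a standard-normal cloud) are an assumption for the illustration, not the thesis data: the fraction of training points flagged as outliers grows with ν.

```r
# Effect of nu on a one-class SVM: a larger nu flags a larger
# fraction of the training data as outliers.
library(e1071)
set.seed(3)
train <- matrix(rnorm(200 * 2), ncol = 2)
frac <- sapply(c(0.05, 0.5), function(nu) {
  m <- svm(train, y = NULL, type = "one-classification",
           kernel = "radial", nu = nu)
  mean(!predict(m, train))   # fraction of training points outside the region
})
frac
```

The second fraction comes out clearly larger than the first, in line with ν acting as an upper bound on the fraction of margin errors.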
6 Resolution proposed
As we said in section 1, we have decided to implement an OC-SVM in order to decide whether a given engine is in proper condition or not. The input we have used is not the information extracted directly from the oil of the engine, but the parameters of an ARMA model fitted to the observation of several states of the engine over time. These states provide us with several time series, which have been modelled.
Specifically, we have worked with the fifteen different channels mentioned in section 1. Some of them have been plotted in figure 3.
Figure 3: Plots of the evolution of the quantities of some of the elements we
have dealt with.
These 2305-node sequences are synthetic data extrapolated from real analyses. The online analysis tool is not yet available, so this is the data set we have worked with.
Though we have dealt with an unsupervised problem, we have followed the idea developed in [7]. We have divided all the data into different time windows. All the time windows have the same length, l, and they overlap, so that when a new piece of data arrives from the device, we use it together with the l − 1 nodes before it in order to decide whether the engine is still working correctly. For each channel and each time window, an ARMA model has been fitted. Its parameters will be the input for the SVM: one time window will be one observation in the data set used to train and validate the SVM.
Let us remark that l-node time windows imply that, in order to decide whether the engine is in good or bad condition at a given moment t, we will use information from the last l nodes, namely instants. If we collect information every minute, only information from the last l minutes is used. Since this is a real-time problem, we have had to take windows short enough so that the classification can be done in a short period of time, but long enough to take advantage of as much past information as possible. Thus, we have worked with l = 200 and l = 150, since they gave the best forecasting results, as checked by cross-validating the data as explained below.
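The per-window feature extraction just described can be sketched for a single channel as follows. This is only an illustration: the series is simulated, and a shorter length and l = 100 (rather than the l = 150 or 200 used in the text) are chosen to keep the example fast; the full version, for all channels and with error diagnostics, is listed in the R codes appendix.

```r
# Fit an ARMA model on each overlapped window of one simulated channel
# and keep phi1 and theta1 (0 when the component is absent).
library(forecast)
set.seed(4)
series <- as.numeric(arima.sim(model = list(ar = 0.4, ma = 0.3), n = 130))
l <- 100
nw <- length(series) - l               # number of overlapped windows
feats <- t(sapply(1:nw, function(i) {
  fit <- auto.arima(ts(series[i:(i + l - 1)]), seasonal = FALSE)
  cf <- coef(fit)
  c(phi1   = if ("ar1" %in% names(cf)) unname(cf["ar1"]) else 0,
    theta1 = if ("ma1" %in% names(cf)) unname(cf["ma1"]) else 0)
}))
dim(feats)                             # one row per window: (phi1, theta1)
```

Each row of `feats` is one observation for the SVM, exactly as one time window is one observation in the data set described above.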
6.1 Preparation of the data
We have worked with a data matrix whose rows represent the state of the engine
in a given time window.
The input data for the SVM consist of the parameters of the ARMA models for these channels for each time window (2305 − l windows in total). Though the order of the model may reach the third level or higher, we have had to decide how many parameters will be used as input. Our first consideration was working with the parameters up to the second order, which are usually the most significant in ARMA modelling [17]. We would then have had, for each channel, a parameter for the AR(1) component, another one for the AR(2) component, and likewise for MA(1), MA(2) and the intercept; i.e., five columns per channel (5 × 15 = 75 independent variables). Since we are dealing with a linear model, whenever one of these parameters is not included, we take it as 0 in the model. However, as we explain below, we have finally decided to work only with the parameters φ1 and θ1, namely the first AR and MA components. Therefore, we have worked with 30 variables, an important reduction of the dimension of the problem which, as we will see, has still allowed a proper classification.
From the information provided by the engineers, we know that the first 700 observations, or the first 700 − l time windows, belong to correct states of the engine used to take the data, so these have been used for training the model. Since we know that there are no novelties or anomalies in this set, setting up one of the parameters of the SVM is simple, as we will explain in section 6.2.
6.2 Implementation in R
We have implemented the whole project in R. In particular, the ARMA modelling has been developed with the function auto.arima, provided by the package ‘forecast’, and the SVM-based classification with the function svm, given by the package ‘e1071’. We now give a few details about the use of these functions.
ARMA in R
The main input needed by the auto.arima function is a vector defined as a time series in R. Specifically, we have one time series per channel and per time window. All these sequences have been modelled with auto.arima, which estimates the parameters with the least squares method, minimizing the residual sum of squares, as we saw in section 4.3.2 [13]. We have specified the non-seasonal behaviour of the time series: for this project, no seasonal behaviour has been taken into account. Let us bear in mind that the data are extracted every minute, so it seems reasonable to consider no seasonality in the model.
Though no extra input has been necessary for the function, the final result, namely the object created by the R function auto.arima, is important. Not only does it give us the parameters of the ARMA model, which will be the input for the SVM, but it also allows us to calculate the residuals.
The mean of the residuals, for each channel and each window, is around 5%, so the fit that has been obtained seems acceptable. We have tested other models built from longer time windows, up to 200 nodes per time window, and the mean of the residuals is usually around 5% as well. So we conclude that taking time windows longer than the ones from our first considerations does not improve the fitted values.
The goal of the SVM has been the detection of anomalies in the parameters of these ARMA models. The idea extracted from [7] is that novelties in the state of the engines will be properly identified by means of the parameters φ1, φ2, θ1, θ2, . . . Figures 4–10 show how these parameters evolve over time.
In the first place, the time windows between, approximately, the 800th and the 1000th stand out among the rest, since the variance of the associated parameters seems to decrease in this period. This can also be observed, to a lesser extent, in the period around the 300th time window. The fact that these changes occur for several channels suggests that ARMA modelling, as a tool for representing these time series, is a consistent method. It is also observable that, for the last time windows, each parameter φ or θ seems to keep its sign, positive or negative; namely, the modelling seems to be very stable in this period.
Figure 4: Evolution of the parameters φ1 and θ1. (a) Iron; (b) Chromium.
Figure 5: Evolution of the parameters φ1 and θ1. (a) Nickel; (b) Molybdenum.
Figure 6: Evolution of the parameters φ1 and θ1. (a) Aluminium; (b) Lead.
Figure 7: Evolution of the parameters φ1 and θ1. (a) Copper; (b) Silver.
Figure 8: Evolution of the parameters φ1 and θ1. (a) Calcium; (b) Magnesium.
Figure 9: Evolution of the parameters φ1 and θ1. (a) Phosphorus; (b) Zinc.
Figure 10: Evolution of the parameters φ1 and θ1. (a) Boron; (b) Silicon.
Figure 11: Relation between φ1 and θ1. Four examples
We can also take a brief look at the relations between some of these independent variables, on which the SVM bases its decision about whether the conditions of the engine are good or bad. In figures 11–13 we can see, for a given channel, relations between some of the parameters of the ARMA models.
Let us remark on the number of cases for which the parameters φi or θj are zero, for some i and some j, an event that takes place when no AR or MA components are computed. It is also worth mentioning the relation shown in figures 11 and 13, where the compared parameters seem to follow a linear trend. Visually, it is noticeable that some points do not follow this regression, which may be related to the appearance of anomalies.
Figure 12: Relation between φ2 and θ2. Four examples
Figure 13: Relation between φ1 and θ2. Four examples
We finally present some plots which can help in understanding the distribution of the independent variables φ1, θ1, . . . Recall that an ARMA(p,q) model will have p + q + 1 parameters. If p = 0, the ARMA model will have no AR part, and it can be written as an MA(q) model. However, in our datamart this case is represented as if there were AR components φi = 0 (i = 1, . . . , p); we have an analogous case when q = 0. Therefore, for plotting the histograms (figures 14 and 15) we have eliminated the cases where φi = 0 or θj = 0, for any i ∈ {1, . . . , p} and any j ∈ {1, . . . , q}.
From the distribution of each parameter, we can emphasize the general idea that they have neither heavy tails nor a very narrow peak, features that would be related to a high kurtosis. The skewness does not seem high either, but visually we cannot say that we are dealing with symmetric distributions. In any case, we should be aware of the great number of changes in the values of the parameters, even though, according to the light tails, these changes are not large.
Figure 14: Histograms of the φ1 variable. (a) Lead; (b) Copper; (c) Silver; (d) Calcium.
Figure 15: Histograms of the θ1 variable. (a) Lead; (b) Copper; (c) Silver; (d) Calcium.
SVM in R
In general terms, we have worked with the R function svm, providing the training data set and the type of SVM we wanted to use, namely one-class classification.
For this type of SVM, the most used kernel function, defined generally in (5.4), is the radial-basis one, as the authors of [29] explain, mainly because of the role of the Euclidean distance, which gives an intuitive interpretation to the function. Recall that a radial-basis kernel function is defined as

k(x, y) = exp(−γ‖x − y‖²).  (6.1)
Since this is the kernel function we have used, we explain how to choose the parameters ν and γ. The parameter γ is defined as

γ = 1/(2σ²),

where σ represents the scale factor at which the data should be clustered [29]. Though a standard way of estimating γ consists of setting it to the inverse of the number of dimensions [25], cross-validating the data is a common method for performing the estimation [6], [9], [32], [19]. The difficulty we had to face was that the unsupervised nature of the problem would not allow us to carry out the cross-validation. To solve the issue, we decided to use those first 700 observations, which were known to be good, for the cross-validation. Thus, we were able to tune γ and finally apply the implemented model.
This cross-validation has been done as follows. From the 700 observations classified as correct, we have randomly chosen 70%. With this set, we have trained an OC-SVM model and have tested it on the other 30% of the observations. We have proceeded in this way for several candidate values of γ: 0.01, 0.02, 0.03, 0.04, 0.05, . . . , 0.1. However, we considered that the number of independent variables was too high in relation to the number of observations used for the cross-validation. Because of this, we decided to keep just the most significant variables, φ1 and θ1 for each channel, as mentioned above. Actually, including more independent variables in the data set led us to low-quality results. For this reason, we have finally decided to work only with those 30 independent variables.
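The tuning loop just described can be sketched as follows. The real input is the datamart of ARMA parameters for the first 700 known-good windows; here a random placeholder matrix of the same shape (700 rows, 30 columns) stands in for it, so the accuracies obtained are illustrative only and do not match table 1.

```r
# Gamma tuning: train a one-class SVM on a random 70% of the
# known-good observations, measure accuracy on the remaining 30%.
library(e1071)
set.seed(5)
datamart <- as.data.frame(matrix(rnorm(700 * 30), ncol = 30))  # placeholder data
sel <- sample(1:nrow(datamart), floor(0.7 * nrow(datamart)))
train <- datamart[sel, ]
valid <- datamart[-sel, ]
acc <- sapply(seq(0.01, 0.1, by = 0.01), function(g) {
  m <- svm(x = train, y = NULL, type = "one-classification",
           kernel = "radial", nu = 0.1, gamma = g)
  mean(predict(m, valid))   # fraction of validation windows accepted as good
})
round(acc, 3)
```

The γ value with the best validation accuracy (subject to the overfitting caveat discussed in section 7.1) is the one retained for the final model.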
Under this new setting, the optimal estimate for γ has been 0.03.
This cross-validation has also been used to study the optimal length of the time windows. We have been able to check that short time windows, like 50 or 100 nodes long (in time, 50 or 100 minutes), though quite useful in real-time terms, did not give good enough percentages of correct SVM classification (recall that the error of the ARMA models was not high). Therefore, these lengths have been dismissed; instead, 150-node and 200-node time windows have been considered.
The other parameter associated with the OC-SVM model is ν. We have to remember that 0 ≤ ν ≤ 1, as well as its interpretation, explained in section 5: it is both an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors relative to the total number of training examples. Recall that in the training data set, the first 700 nodes, there is no anomaly. This allows us to take ν as low as we wish; however, recall that it cannot be 0, as seen in section 5. There is also one consideration worth taking into account. It is true that, for training, we have no outliers or novelties, but this will not hold in the future, not even in the rest of the data used to check the forecasts of the model. Taking too low a value of ν may lead us to overfitting. To avoid this, we have chosen to set ν = 0.1, so that the type of error in which an engine in bad condition is classified as a good case will be penalized.
7 Results
7.1 Cross-validation
Regarding the cross-validation carried out for setting the γ parameter, we can make some comments about the results. In table 1 we present the percentages of correctly classified observations in the training and validation phases of the procedure, for the 200-node-per-window case (ν = 0.1).
γ       0.01    0.02    0.03    0.04    0.05   0.06   0.1
T (%)   89.428  89.428  88.86   89.143  82.86  78.57  61.42
V (%)   94.0    92.67   90.67   89.3    80.0   73.33  45.33

Table 1: Training and validation percentages for the cross-validation phase. 200-node time windows, ν = 0.1.

We have considered that those high percentages in the validation phase may be due to overfitting, and that is why we have preferred not to take γ < 0.03. With more comparisons, we see that γ > 0.04 does not lead to good results. Table 2 shows the 150-node-per-window, 100-node-per-window and 50-node-per-window cases, respectively (ν = 0.1).
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
150  T (%)   90.389  90.129  89.35   86.49   84.15   76.36   64.93
     V (%)   93.3    92.12   87.87   77.57   65.45   58.18   32.12
100  T (%)   90.23   89.28   89.047  89.5    87.38   82.14   72.62
     V (%)   74.4    68.3    58.89   43.3    34.4    28.33   8.89
50   T (%)   90.088  90.528  89.20   89.867  85.9    83.25   71.36
     V (%)   60.204  56.12   48.98   34.69   24.489  15.816  1.02

Table 2: Training and validation percentages for the cross-validation phase. 150-node, 100-node and 50-node time windows, ν = 0.1.
We can see that in some cases the accuracy is especially low, down to 1.02%. This means that the model classifies a great number of instances as bad cases. Let us remark that this happens not only when the length of the time windows decreases, which can be related to the use of a smaller quantity of information, but also as γ increases. According to what we explained in section 5, when γ increases, the final decision of the model, given by the function f in (5.11), will always be the same, since the kernel function (6.1) will tend to 0.
7.2 Unsupervised problem
We can finally present some of the results achieved with the final model, the one which should be put into practice.
As we have already mentioned in section 6.1, the training data for this model have been the first 700 − l time windows, where l is the window length in nodes, 150 or 200. The rest of them have been classified with the model obtained by SVM techniques. Again, the parameters were set to ν = 0.1 and γ = 0.03.
Because of the unsupervised nature of the problem, there is no way of evaluating whether the forecasts obtained with the SVM are right or wrong. What we have done is compare predictions from different SVMs (table 3), whose training data sets have been built from 150-node and 200-node time windows.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
150  T (%)   89.818  90.545  90.364  89.273  86.182  82.0    64.545
     V (%)   76.698  73.084  67.414  58.442  47.289  38.131  16.635
200  T (%)   90.2    90.0    89.6    87.8    85.8    80.0    63.0
     V (%)   78.816  76.137  71.153  61.184  50.093  39.065  14.143

Table 3: Percentages of observations classified as good for several γ values. 150-node and 200-node time window cases, ν = 0.1.
We have been able to check that our first considered model, the one associated with 150-node time windows, is the one for which more observations are classified as bad cases (if γ = 0.03). Specifically, 67.414% of the instances have been classified as good. This leads us to think that in this way we will minimize the type I error, i.e., minimize the number of false positive classifications. Namely, if the model classifies a time window as good, there will be more certainty that it is actually good than if the model had been built with longer time windows. This approach seems more reliable than the opposite one, which would allow a higher number of false positive cases, since the latter could lead to situations in which an engine is classified as correct when it is not, so that the engineers would not be prepared for the eventual failure.
Again, as explained in section 7.1, the higher γ is, the lower the studied percentage is.
On the other hand, the percentage of time windows classified as good for the 200-node-per-window case (γ = 0.03) has been 71.153%. So, as we have already mentioned, if we prefer to be prudent, in order to prevent classifying bad engines as good ones, choosing 150-node time windows will solve the issue.
As a curiosity, we present in table 4 the percentages analogous to the previous ones, now associated with the problems with 50-node and 100-node time windows.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
50   T (%)   90.0    88.93   88.15   89.076  86.307  84.615  73.846
     V (%)   67.103  61.931  56.137  47.227  37.321  29.034  10.218
100  T (%)   89.67   89.67   89.167  88.3    88.0    86.167  73.17
     V (%)   64.424  61.059  53.769  44.611  35.264  26.67   8.972

Table 4: Percentages of observations classified as good for several γ values. 50-node and 100-node time window cases, ν = 0.1.
It is true that these cases could be more convenient, since less past information is needed, but as we commented in section 7.1, the quality of the forecasting decreases dangerously.
Last but not least, another way of minimizing the type I error is redefining the parameters of the SVM-based model. Recall the interpretation of ν given in section 5: raising it forces the model to flag a larger fraction of the data as outliers. We know that we have no anomaly in the training set, but we can take a higher ν, say 0.2, in order to force the model to classify more time windows as bad ones.
When working with 150-node time windows, the percentage of time windows classified as correct is then 55.327%. For the 200-node case, the percentage is 59.813%. As expected, more instances are classified as incorrect, so this can also be a proper approach if the priority is minimizing the type I error.
γ            0.01    0.02    0.03    0.04    0.05    0.06    0.1
50   T (%)   80.0    80.39   80.46   80.92   80.77   81.08   74.62
     V (%)   58.505  55.887  51.028  43.239  36.14   28.41   10.27
100  T (%)   79.83   79.83   79.67   79.67   79.33   80.33   72.167
     V (%)   55.264  51.028  45.856  40.124  33.208  26.04   8.97
150  T (%)   80.0    79.82   80.0    78.55   77.818  78.727  68.727
     V (%)   59.127  58.131  55.327  51.339  45.109  38.006  16.63
200  T (%)   80.0    79.4    79.8    80.6    80.0    77.8    65.4
     V (%)   66.85   64.299  59.813  55.26   48.598  39.065  14.14

Table 5: Percentages of observations classified as good for several γ values. 50-node, 100-node, 150-node and 200-node time window cases, ν = 0.2.
8 Conclusions
In this project, a new approach for engine fault detection has been presented. From a kind of data not very common in this setting, we have proposed a mathematical model able to forecast the decay of the machinery, by developing a method for dealing with the data based on several novelty detection techniques, some of them applied not only in engine fault diagnosis but also in more general situations.
Let us remember that ARMA modelling has been central to the preparation of the data and has led us to good approximations of it. This methodology has provided us with the proper values to use as input for the SVM-based classification model.
Because of the unsupervised environment, namely the absence of knowledge about whether the observations used as examples were cases of correct or wrong functioning, we have not been able to assess the problem in terms of sensitivity and specificity. In order to address this limitation, we have decided to prioritize the reduction of false positives, namely the observations classified as correct when they belong to bad functioning.
If new, already classified data are obtained, so that the problem becomes supervised, the reliability of the model will increase, since we will have techniques to check more precisely the behaviour of the classifications, say with ROC curves or lift charts. Moreover, a larger data base, in number of observations, could let us work with more variables extracted from the parameters of the ARMA models, which could possibly reveal new aspects not relevant in this small data base.
Until the arrival of labelled data, the introduced model can help us handle the classification, taking advantage of knowing how to tune its parameters depending on the importance that we wish to give to the type I error.
R codes
Function used for building the time windows
chunks <- function(data, overlapped, nodes, nc, tam) {
  # chunks() divides the time horizon into different pieces of length tam.
  # overlapped = TRUE indicates that the pieces will be overlapped.
  if (overlapped) {
    nw <- nodes - tam                  # number of windows
  } else {
    nw <- floor(nodes / tam)
  }
  # each window will contain tam nodes from nc channels
  window <- array(0, dim = c(nw, nc, tam))
  for (i in 1:nw) {
    for (j in 1:nc) {
      ind <- 0
      if (overlapped) {
        for (k in i:(i + tam - 1)) {
          ind <- ind + 1
          window[i, j, ind] <- data[[j]][k]       # overlapped: index k - i + 1
        }
      } else {
        for (k in ((i - 1) * tam + 1):(i * tam)) {
          ind <- ind + 1
          window[i, j, ind] <- data[[j]][k]       # non-overlapped: index k - (i-1)*tam
        }
      }
    }  # channels loop
  }  # windows loop
  list(a = nw, b = window)
}
Function used for ARMA modelling
extract_feat_AR <- function(nc, nw, timeseries, tam) {
  # Fit an ARMA model per channel and per window, and collect the
  # coefficients together with several residual-based error measures.
  auxiliar <- vector("list", nc)
  RERCM  <- matrix(0, nrow = nc, ncol = nw)
  EAM    <- matrix(0, nrow = nc, ncol = nw)
  RECM   <- matrix(0, nrow = nc, ncol = nw)
  ERM    <- matrix(0, nrow = nc, ncol = nw)
  LogLik <- matrix(0, nrow = nc, ncol = nw)
  AIC    <- matrix(0, nrow = nc, ncol = nw)
  for (i in 1:nc) {
    for (w in 1:nw) {
      model <- auto.arima(ts(timeseries[w, i, ], start = 1, end = tam),
                          D = 0, seasonal = FALSE)
      auxiliar[[i]][[w]] <- as.list(model$coef)
      RERCM[i, w]  <- sqrt(sum((model$residuals / model$x)^2) / tam)
      EAM[i, w]    <- sum(abs(model$residuals)) / tam
      RECM[i, w]   <- sqrt(sum(model$residuals^2) / tam)
      ERM[i, w]    <- sum(abs(model$residuals) / model$x) / tam
      LogLik[i, w] <- model$loglik
      AIC[i, w]    <- model$aic
    }
  }
  list(features = auxiliar,
       RERCM = RERCM, EAM = EAM,
       RECM = RECM, ERM = ERM,
       LogLik = LogLik, AIC = AIC)
}
Function used for preparing the input for the SVM
bu i ld datamart <− f unc t i on ( nc=nc ,nw=nw, c o e f f s ) coefAR1 <− matrix (0 , nrow=nc , nco l=nw)
d i f f coe fAR1 <− matrix (0 , nrow=nc , nco l=nw−1)scoefAR1 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc )
– 45 –
R codes
f o r ( j in 1 :nw) i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ ar1 ) )
coefAR1 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ ar1
di f f coe fAR1 [ i , ] <− coefAR1 [ 1 , 2 : nw]−coefAR1 [ 1 , 1 : ( nw−1) ]scoefAR1 [ i , ] <− smoothed ( par=coefAR1 [ i , ] , j j=j j )
coefAR2 <− matrix (0 , nrow=nc , nco l=nw)
d i f f coe fAR2 <− matrix (0 , nrow=nc , nco l=nw−1)scoefAR2 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ ar2 ) ) coefAR2 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ ar2
di f f coe fAR2 [ i , ] <− coefAR2 [ 1 , 2 : nw]−coefAR2 [ 1 , 1 : ( nw−1) ]scoefAR2 [ i , ] <− smoothed ( par=coefAR2 [ i , ] , j j=j j )
coefMA1 <− matrix (0 , nrow=nc , nco l=nw)
di f fcoefMA1 <− matrix (0 , nrow=nc , nco l=nw−1)scoefMA1 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ma1) ) coefMA1 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ma1
dif fcoefMA1 [ i , ] <− coefMA1 [ 1 , 2 : nw]−coefMA1 [ 1 , 1 : ( nw−1) ]scoefMA1 [ i , ] <− smoothed ( par=coefMA1 [ i , ] , j j=j j )
coefMA2 <− matrix (0 , nrow=nc , nco l=nw)
di f fcoefMA2 <− matrix (0 , nrow=nc , nco l=nw−1)scoefMA2 <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ma2) ) coefMA2 [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ma2
– 46 –
R codes
dif fcoefMA2 [ i , ] <− coefMA2 [ 1 , 2 : nw]−coefMA2 [ 1 , 1 : ( nw−1) ]scoefMA2 [ i , ] <− smoothed ( par=coefMA2 [ i , ] , j j=j j )
c o e f i n t <− matrix (0 , nrow=nc , nco l=nw)
d i f f c o e f i n t <− matrix (0 , nrow=nc , nco l=nw−1)s c o e f i n t <− matrix (0 , nrow=nc , nco l=nw)
f o r ( i in 1 : nc ) f o r ( j in 1 :nw)
i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ i n t e r c e p t ) ) c o e f i n t [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ i n t e r c e p t
e l s e i f ( ! i s . nu l l ( c o e f f s [ [ i ] ] [ [ j ] ] $ d r i f t ) )
c o e f i n t [ i , j ] <− c o e f f s [ [ i ] ] [ [ j ] ] $ d r i f t
d i f f c o e f i n t [ i , ] <− c o e f i n t [ 1 , 2 : nw]− c o e f i n t [ 1 , 1 : ( nw−1) ]s c o e f i n t [ i , ] <− smoothed ( par=c o e f i n t [ i , ] , j j=j j )
datamart <- matrix(0, nrow=nw, ncol=nc*5+2)
datamart[, 1] <- 1:dim(datamart)[1]           # observation number
datamart[, dim(datamart)[2]] <- +1            # target
for (canal in 1:nc) {
  datamart[, (5*(canal-1)+2)] <- coefAR1[canal, ]
  datamart[, (5*(canal-1)+3)] <- coefAR2[canal, ]
  datamart[, (5*(canal-1)+4)] <- coefMA1[canal, ]
  datamart[, (5*(canal-1)+5)] <- coefMA2[canal, ]
  datamart[, (5*(canal-1)+6)] <- coefint[canal, ]
}
dimnames(datamart) <- list(NULL, c("obs",
  "AR1c1", "AR2c1", "MA1c1", "MA2c1", "intc1",
  "AR1c2", "AR2c2", "MA1c2", "MA2c2", "intc2",
  "AR1c3", "AR2c3", "MA1c3", "MA2c3", "intc3",
  "AR1c4", "AR2c4", "MA1c4", "MA2c4", "intc4",
  "AR1c5", "AR2c5", "MA1c5", "MA2c5", "intc5",
  "AR1c6", "AR2c6", "MA1c6", "MA2c6", "intc6",
  "AR1c7", "AR2c7", "MA1c7", "MA2c7", "intc7",
  "AR1c8", "AR2c8", "MA1c8", "MA2c8", "intc8",
  "AR1c9", "AR2c9", "MA1c9", "MA2c9", "intc9",
  "AR1c10", "AR2c10", "MA1c10", "MA2c10", "intc10",
  "AR1c11", "AR2c11", "MA1c11", "MA2c11", "intc11",
  "AR1c12", "AR2c12", "MA1c12", "MA2c12", "intc12",
  "AR1c13", "AR2c13", "MA1c13", "MA2c13", "intc13",
  "AR1c14", "AR2c14", "MA1c14", "MA2c14", "intc14",
  "AR1c15", "AR2c15", "MA1c15", "MA2c15", "intc15",
  "correct"))
datamart <- datamart[, 2:(5*nc+2)]
datamart <- as.data.frame(datamart)
build_datamart <- datamart
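As a sanity check on the layout above, the five ARMA features of channel k occupy columns 5(k-1)+2 through 5(k-1)+6 of the full matrix ("obs" is column 1, the target "correct" the last column). A small sketch that rebuilds only the name vector (no engine data involved):

```r
# Rebuild the full datamart column names from the naming convention used
# in build_datamart: "obs", then AR1/AR2/MA1/MA2/int per channel, then "correct".
nc <- 15
feature_names <- paste0(rep(c("AR1", "AR2", "MA1", "MA2", "int"), nc),
                        "c", rep(1:nc, each=5))
col_names <- c("obs", feature_names, "correct")

col_names[5*(3-1)+2]  # "AR1c3": first feature of channel 3
col_names[5*(3-1)+6]  # "intc3": last feature of channel 3
length(col_names)     # 77 = 15*5 + 2 columns in total
```

This is the arithmetic the `for (canal in 1:nc)` loop relies on when filling the matrix column by column.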
Function used for implementing the cross-validation
build_svm_n <- function(datamart, g, n, cut) {
  library(e1071)
  dataset <- datamart[1:cut, ]
  selection <- sample(1:floor(0.7*cut))
  train <- dataset[selection, ]
  test <- dataset[-selection, ]
  model <- svm(x=train, y=NULL, type="one-classification",
               kernel="radial", nu=n, gamma=g)
  salida <- list(modelo=model, train=train, test=test)
  build_svm_n <- salida
}
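One detail of this split is easy to misread: `sample()` called with a single vector argument returns a random permutation of that vector, so `selection` always contains the first `floor(0.7*cut)` row indices, merely shuffled. The training set is therefore always the chronologically first 70% of the windows and the test set the last 30%, which is a reasonable choice for time-ordered data. A small sketch, with the `cut=500` of the script replaced by a hypothetical `cut = 100`:

```r
# sample(1:70) is a permutation of 1:70, not a random subset of 1:100.
cut <- 100
selection <- sample(1:floor(0.7*cut))

all(sort(selection) == 1:70)              # TRUE: train is always rows 1..70
all(setdiff(1:cut, selection) == 71:100)  # TRUE: test is always rows 71..100
```

A random 70% subset would instead be `sample(1:cut, floor(0.7*cut))`; the code as written keeps the held-out windows strictly later in time than the training ones.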
Implementation of the SVM
build_svm <- function(datamart, g, n, cut) {
  library(e1071)
  train <- datamart[1:cut, ]
  test <- datamart[(cut+1):dim(datamart)[1], ]
  model <- svm(train, y=NULL, type="one-classification",
               kernel="radial", nu=n, gamma=g)
  salida <- list(modelo=model, train=train, test=test)
  build_svm <- salida
}
Script
muestra <- read.table(ruta_sample, header=TRUE)
library(forecast)
tam <- 200   # windows' length
nc <- 15     # number of channels
# preparation of the time series
channel <- vector("list", nc)
for (i in 1:nc) channel[[i]] <- ts(muestra[, (i+1)], start=5980, freq=1)
nodes <- length(channel[[1]])
source(ruta_chunks)
ch <- chunks(data=channel, overlapped=TRUE, nodes=nodes,
             nc=nc, tam=tam)
nw <- ch$a; pieces <- ch$b
source(ruta_extract_AR)
coef_channel <- extract_feat_AR(nc=nc, nw=nw, timeseries=pieces, tam=tam)
source(ruta_datamart)
datamart <- build_datamart(nc=nc, nw=nw, coeffs=coef_channel$features)
source(ruta_svm)
cols <- c(seq(1, 71, 5), seq(3, 73, 5))
svm_out <- build_svm(datamart=datamart[, cols], g=0.03, n=0.2, cut=500)
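The `cols` vector deserves a note. After `build_datamart` drops the "obs" column, the columns run AR1c1, AR2c1, MA1c1, MA2c1, intc1, AR1c2, ..., so channel k's AR1 feature sits at column 5(k-1)+1 and its MA1 feature at 5(k-1)+3; with nc = 15 channels these are exactly `seq(1, 71, 5)` and `seq(3, 73, 5)`. A sketch that rebuilds the post-drop names locally (no engine data involved) confirms the selection:

```r
# Column names of the datamart after "obs" is removed, following the
# naming convention of build_datamart above.
nc <- 15
reduced_names <- c(paste0(rep(c("AR1", "AR2", "MA1", "MA2", "int"), nc),
                          "c", rep(1:nc, each=5)),
                   "correct")

cols <- c(seq(1, 71, 5), seq(3, 73, 5))
unique(substr(reduced_names[cols], 1, 3))  # "AR1" "MA1"
length(cols)                               # 30: two features per channel
```

Only the first AR and first MA coefficients of each channel therefore enter the one-class SVM.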
References
[1] Banerjee, Tribeni Prasad; Das, Swagatam. "Multi-sensor data fusion using
support vector machine for motor fault detection." Information Sciences 217
(2012) 96–107.
[2] Basir, Otman; Yuan, Xiaohong. "Engine fault diagnosis based on multi-sensor
information fusion using Dempster-Shafer evidence theory." Information
Fusion 8 (2007) 379–386.
[3] Boser, B.; Guyon, I.; Vapnik, V. “A training algorithm for optimal margin
classifiers”. Proc. 5th Annu. ACM Workshop Comput. Learn. Theory, (1992)
144–152.
[4] Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C. Time series analysis - Forecasting
and control. (3rd edition) Prentice Hall, 1994.
[5] Chandola, Varun; Banerjee, Arindam; Kumar, Vipin. Anomaly Detection: A
Survey . Technical report. Department of Computer Science and Engineering,
University of Minnesota. Minneapolis, Minnesota, 2007.
[6] Chen, Wun-Hwa; Hsu, Sheng-Hsun; Shen, Hwang-Pin. "Application of SVM
and ANN for intrusion detection." Computers & Operations Research 32
(2005) 2617–2634.
[7] Chisci, Luigi; Mavino, Antonio; Perferi, Guido; Sciandrone, Marco; Anile,
Carmelo; Colicchio, Gabriella; Fuggetta, Filomena. "Real-Time Epileptic
Seizure Prediction Using AR Models and Support Vector Machines." IEEE
Transactions on Biomedical Engineering 57, 5 (2010) 1124–1132.
[8] Cowpertwait, Paul S.P. Introductory Time Series with R. Springer, 2006.
[9] Davy, Manuel; Desobry, Frédéric; Gretton, Arthur; Doncarli, Christian. "An
Online Support Vector Machine for Abnormal Events Detection." Signal
Processing (2005) 2009–2025.
[10] Dickey, D.A.; Fuller, W.A. “Likelihood Ratio Statistics for Autoregressive
Time Series with a Unit Root.” Econometrica, (1981) 49, 1057–1071.
[11] Durbin, J.; Koopman, S.J. “Time Series Analysis by State Space Methods”.
Oxford University Press, Oxford (2001).
[12] Fuchs, Erich; Gruber, Thiemo; Pree, Helmut; Sick, Bernhard. "Temporal data
mining using shape space representations of time series." Neurocomputing 74
(2010) 379–393.
[13] Fuente Fernández, Santiago de la. "Series Temporales: Modelos ARIMA".
http://www.fuenterrebollo.com/Economicas/SERIES-TEMPORALES/modelo-arima.pdf
[14] Gómez, V.; Maravall, A. "Programs TRAMO and SEATS, Instructions for
the Users". Working paper 97001. Ministerio de Economía y Hacienda,
Dirección General de Análisis y Programación Presupuestaria.
[15] Grinblat, Guillermo L.; Uzal, Lucas C.; Granitto, Pablo M. "Abrupt change
detection with One-Class Time-Adaptive Support Vector Machines." Expert
Systems with Applications 40 (2013) 7242–7249.
[16] Hannan, E.J.; Rissanen, J. “Recursive Estimation of Mixed Autoregressive-
Moving Average Order.” Biometrika, 69 (1) (1982), 81–94.
[17] Hyndman, Rob J.; Khandakar, Yeasmin. "Automatic Time Series Forecast-
ing: The forecast Package for R". Journal of Statistical Software 27, 3 (2008).
[18] Hyndman, Rob J. with contributions from Athanasopoulos, George;
Razbash, Slava; Schmidt, Drew; Zhou, Zhenyu; Khan, Yousaf; Bergmeir,
Christoph. (2013). “forecast: Forecasting functions for time series and lin-
ear models.” R package version 4.8. http://CRAN.R-project.org/package=
forecast
[19] Kohavi, R. "A study of cross-validation and bootstrap for accuracy estima-
tion and model selection". IJCAI (1995) 1137–1145.
[20] Konar, P.; Chattopadhyay, P. "Bearing fault detection of induction motor
using wavelet and Support Vector Machines (SVMs)." Applied Soft Computing
11 (2011) 4203–4211.
[21] Kwiatkowski, D.; Phillips, P.C.; Schmidt, P.; Shin, Y. “Testing the Null
Hypothesis of Stationarity Against the Alternative of a Unit Root.” Journal
of Econometrics, 54 (1992), 159–178.
[22] Liu, Song; Yamada, Makoto; Collier, Nigel; Sugiyama, Masashi. “Change-
point detection in time series data by relative density-ratio estimation”. Neu-
ral Networks 43 (2013) 72–83.
[23] Ma, Junshui; Perkins, Simon. "Time-series Novelty Detection using One-
class Support Vector Machines." IEEE 3 (2003) 1741–1745.
[24] Mentz, Raúl Pedro. "Estimación en los modelos autorregresivos y de prome-
dios móviles." Estadística española 116 (1988) 87–106.
[25] Meyer, David; Dimitriadou, Evgenia; Hornik, Kurt; Weingessel, Andreas;
Leisch, Friedrich (2012). e1071: Misc Functions of the Department of Statis-
tics (e1071), TU Wien. R package version 1.6-1. http://CRAN.R-project.
org/package=e1071
[26] Niu, Gang; Han, Tian; Yang, Bo-Suk; Tan, Andy Chit Chiow. “Multi-agent
decision fusion for motor fault diagnosis.” Mechanical Systems and Signal
Processing 21 (2007) 1285–1299.
[27] Nour, F.; Watson, J.F. "The monitoring and analysis of transient vibra-
tion signals as a means of detecting faults in the three-phase induction mo-
tor." Proceedings of the 28th Universities Power Engineering Conference Vol. 1
(September 1993) 178–181.
[28] Peña, Daniel. Análisis de series temporales. Alianza, 2005.
[29] Piciarelli, Claudio; Micheloni, Christian; Foresti, Gian Luca. "Trajectory-
Based Anomalous Event Detection". IEEE Transactions on Circuits and Sys-
tems for Video Technology 18 (2008) 1544–1554.
[30] Ripley, B.D. (2002). "Time Series in R 1.5.0." R News, 2(2), 2–7. http:
//CRAN.R-project.org/doc/Rnews/.
[31] Schölkopf, Bernhard; Smola, Alex J.; Williamson, Robert C.; Bartlett, Peter
L. "New Support Vector Algorithms." Neural Computation 12 (2000) 1207–1245.
[32] Smola, A.; Schölkopf, B. Learning with Kernels. MIT Press, Cambridge,
MA, USA (2002).
[33] Tavner, P.J.; Gaydon, B.G.; Word, D.M. “Monitoring generators and large
motors”. IEE Proceedings 133,3 (1986) 181–189 (Part B).
[34] Tavner, P.J.; Penman, J. “Condition Monitoring of Electrical Machines”.
Research Studies Press Ltd. 1987.
[35] Vapnik, V.; Lerner, A. “Pattern recognition using generalized portrait
method”. Autom. Remote Control, 24 (1963) 774–780.