machine learning for mechanical analysis

Malardalen UniversitySchool of Innovation Design and Engineering

Vasteras, Sweden

Thesis for the Degree of Master of Science in Engineering - Robotics30.0 credits

MACHINE LEARNING FORMECHANICAL ANALYSIS

Sebastian [email protected]

Examiner: Ning XiongMalardalen University, Vasteras, Sweden

Supervisors: Martin EkstromMalardalen University, Vasteras, Sweden

Company supervisor: Henrik Grunditz,Prevas, Vasteras, Sweden

Per Ahman,Prevas, Vasteras, Sweden

June 20, 2019

S. Bengtsson Machine learning for mechanical analysis

Abstract

It is not reliable to depend on a persons inference on dense data of high dimensionality on adaily basis. A person will grow tired or become distracted and make mistakes over time. Thereforeit is desirable to study the feasibility of replacing a persons inference with that of Machine Learn-ing in order to improve reliability. One-Class Support Vector Machines (SVM) with three differentkernels (linear, Gaussian and polynomial) are implemented and tested for Anomaly Detection.Principal Component Analysis is used for dimensionality reduction and autoencoders are used withthe intention to increase performance. Standard soft-margin SVMs were used for multi-class clas-sification by utilizing the 1vsAll and 1vs1 approaches with the same kernels as for the one-classSVMs. The results for the one-class SVMs and the multi-class SVM methods are compared againsteach other within their respective applications but also against the performance of Back-PropagationNeural Networks of varying sizes. One-Class SVMs proved very effective in detecting anomaloussamples once both Principal Component Analysis and autoencoders had been applied. StandardSVMs with Principal Component Analysis produced promising classification results. Twin SVMswere researched as an alternative to standard SVMs.

i


Table of Contents

1. Introduction 1

2. Problem Formulation 32.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3. Background 43.1 Concepts and notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.2 Back-Propagation Neural Network (BPNN) . . . . . . . . . . . . . . . . . . . . . . 53.3 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.3.1 One-class Support Vector Machines (OC-SVM) . . . . . . . . . . . . . . . . 63.3.2 SVM for multiclass classification . . . . . . . . . . . . . . . . . . . . . . . . 7

3.4 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.4.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.4.2 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.4.3 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.5 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4. Related Work 10

5. Method 135.1 Constructing a system with alternative data . . . . . . . . . . . . . . . . . . . . . . 13

5.1.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.2 Stochastic quasi-Newton method for Twin SVM . . . . . . . . . . . . . . . . . . . . 145.3 Result evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

6. Ethical and Societal Considerations 17

7. Work process 187.1 Algorithm research and preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 187.2 Data processing research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187.3 Anomaly detection implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 187.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197.5 SQN-TWSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

8. Results 208.1 Anomaly detection results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

8.1.1 Baseline performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208.1.2 Performance with PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218.1.3 Performance with autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . 238.1.4 Performance with PCA and autoencoders . . . . . . . . . . . . . . . . . . . 25

8.2 Classification results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278.2.1 Classification baseline performance . . . . . . . . . . . . . . . . . . . . . . . 278.2.2 Classification performance with PCA . . . . . . . . . . . . . . . . . . . . . . 288.2.3 Stochastic Quasi-Newton TWSVM . . . . . . . . . . . . . . . . . . . . . . . 29

9. Conclusions 309.1 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

10. Discussion 3210.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

References 36

ii


1. Introduction

Machine Learning, in all its diverse forms, has been used to solve a multitude of problems - forexample classification[1, 2], detection[3, 4], regression[5] and optimization[6, 7]. At its core Ma-chine Learning is a marriage between statistical theory and signal processing, which gave birth toSupport Vector Machines (SVM), Neural Networks (NN), Genetic algorithms and a plethora ofother algorithms with various ad-ons.

The task in this thesis is detecting flaws and shortcomings in a mechanical product based on themeasured moments from rotation of a key component of the product. The moments reveal not onlyif something is wrong within the product but also clues as to what the source of the problem mightbe. Currently these measurements are interpreted manually by employees at the production site,however it is anticipated that an autonomous solution will bring several benefits including, butnot limited to, improved anomaly detection and better reliability. In general this solution couldbe useful for a multitude of industries where anomaly detection and classification is needed.There are two main questions regarding this task: 1) How is accurate and reliable anomaly de-tection as well as classification achieved, and 2) How does it compare to a person performing thetask manually? To answer these questions research was conducted broadly to find what methodshas been used prior and what their results were. Once a promising direction was found a deeperstudy was made and extensive testing and analysis was performed. With an initially broad scopewhich narrows as the research progresses a generally favorable solution is expected to be found.

For this particular task a fully autonomous solution is not only favorable to the production butto the workers as well. Over time a worker will grow tired which increases the risk of overlookingmeasurements and misclassifying samples which will reduce the overall quality of the products -but there is also the risk of repetitive strain-injuries. Such an injury is difficult to recover from andcan hurt both the company and the individual. With an autonomous solution it is anticipated thatthese issues can be overcome and the production process might even become more time efficient.

SVMs have been used and developed over several decades and have been proven useful in binaryclassifications but also for multi-class classification through application of approaches based oncross-class comparisons and Fuzzy Logic [8, 9]. A wide selection of data types and applicationshave been delt with by SVMs by utilizing various combinations of methods for creating the sep-arating hyperplanes, estimating classification errors and adjusting the hyperplane in an efficientmanner. [10, 11, 12].

This thesis will focus mainly on various forms of SVMs for the sake of anomaly detection andclassification which will be compared to the capabilities of Back-Propagation NN. To help the al-gorithms achieve a higher accuracy some pre-processing will be applied to the data sets in advance

Figure 1: A model of the testbench. A sensor is attached to a shaft between an electric motor anda mechanism which can not be observed from outside. When the shaft is turned the sensor sendsthe moment measurements to a computer where a person will analyse the moment plot. In theillustration it is indicated that the mechanism is not properly lubricated, which should be revealedin the plot.

1


and the effects of this will be studied and analysed.

Due to the versatility of SVMs it is reasonable to believed that a SVM-based solution can be foundthat can produce a satisfactory accuracy and reliability for anomaly detection and classificationfor the mentioned task.

The remainder of the report will be structured as; The Problem Formulation will firstly detailthe problem which this thesis will attempt to solve as well as stating the hypothesis and researchquestions. The Background-section will cover the notation and basic concepts used in this reportafter which the key algorithms and methods used will be introduced more in depth than has beendone so far. Related Work will go through the development of SVMs and NNs from conceptionto the modern era and end with an account of the state of the art. All methods, algorithms andtechniques used as well as the reasoning to why they were used will be discussed in the Methodsection. The stance towards use of personal data and ethical considerations will be stated in Ethicaland Societal Considerations, followed by a section dedicated to describing the work process of thisthesis. In Results the findings will be presented in detail through tables and analysis. Conclusionswill seek to concisely summarize the findings and results of this thesis. In Discussion the resultswill be examined in regard to the research questions. The report will close with Future Work,suggesting what more might be done to continue the development of a Machine Learning solutionwith increased capabilities.

2


2. Problem Formulation

It may not be reliable to have people making inferences on complicated and information-densedata on a daily basis. People are prone to becoming tired, stressed and distracted which can leadto bad decisions. No decisive interpretation can be made with no fitness criteria to speak of andonly relying on the experience a person might have in recognizing a good sample from a bad one.Therefore it is desirable to study the feasibility of replacing a persons inference with that of amachine.

2.1 Hypothesis

The hypothesis for this thesis is stated as follows;

A Machine Learning implementation can handle anomaly detection and classification of measure-ments from a mechanism at least as well as an experienced person.

2.2 Research Questions

A set of research questions will be stated in order to test the hypothesis.

RQ1 Can Machine Learning perform anomaly detection on a comparable level as a person interms of accuracy and time on the given task?

RQ2 How does a person compare to Machine Learning in terms of anomaly detection rate andtime spent on the given task?

RQ3 How reliably can Machine Learning detect specific faults through the received data?

RQ1 is, in other words, a Yes or No question regarding the possibility of Machine Learninghaving capabilities similar to an experienced person in anomaly detection. RQ2 then asks for thedifference in performance between a person and Machine Learning in anomaly detection, which canbe answered in numerous ways depending on what differences were noted and whichever answerto RQ1 was found. Finally, RQ3 concerns the multi-class classification performance.

2.3 Limitations

This thesis have a couple of initial limitations. Firstly, the thesis is limited in time as it shall bedone in 20 weeks which shall cover initial research, implementation and writing of the report.Secondly, the hardware on which to run the implementations for the sake of testing and evaluatingis limited to a computer with Windows 7 installed, equipped with 16 gigabytes of RAM and anIntel Core i5-9600K CPU running at 3.7 Ghz.Thirdly, the initially available data is not produced by the testbench which a Machine Learningsolution is researched for, but a prototype of such a testbench. Proper data is anticipated to beproduced during the time-span of this thesis.

3


3. Background

As the name suggests, Machine Learning is the scientific study of getting machines to producesome seemingly cognitive response to external information, such as inferring that a cluster of pix-els is a specific hand-written character, based on prior examples. By training a Machine Learningalgorithm with appropriate examples (or ”training data”, ”samples”) it can be taught to classifysimilar examples in a predetermined set of classes. Depending on the specifics of the algorithmimplementation and the quality as well as the volume of the training data it is possible to makehighly accurate classifications - some of which might be difficult or even impossible for a humanto make.

For the sake of legibility and understanding of the equations and expressions which will be presentedfurther into this report, a notation convention will be determined below.

3.1 Concepts and notations

Before delving into the Machine Learning algorithms the necessary mathematics will be introduced.

Scalars Scalars will be denoted with italicized lower-case letters, for example u.

Vectors Vectors will be denoted by bold lower-case letters, for example u. All vectors will becolumn vectors unless transposed, which will be notated as uT .

Matricies Matricies will be denoted with bold upper-case letters, for example A.

Input domain (X ) The domain from which data will be drawn.

Feature map (Φ : X → H) A function which maps a space X into another space H.

Kernel (K) A kernel, be it linear, Gaussian or polynomial, performs a similarity measure overpairs of data samples often for the sake of pattern analysis. It is the same as calculating thedot product of two data samples in feature space H, ie K(x1,x2) = (Φ(x1) · Φ(x2)), but bynot calculating the coordinates for the samples in the feature space and instead calculatingthe dot products of the images of the samples in the feature space a lot of computationaleffort can be avoided.

Feature space (H) The space that an input space X is mapped into in order to enable linearseparation.

γ: Functional margin of an example with respect to a hyperplane The perpendicular dis-tance from a hyperplane to the closest member of a class element. An important value duringthe training of a SVM, this value indicates how big a difference the SVM currently has be-tween the classes to be separated. It is generally desired to achieve a large γ value throughthe training.

ζ: Slack variable In the cases where a SVM is to find a line or plane separating the trainingsamples strictly into their respective classes but is unable to achieve this it is necessary tointroduce a ”slack variable”. The slack variable turns the SVM from a ”hard-margin” into a”soft-margin” SVM where some misclassification is permissible during training. Each trainingsample is given a slack variable which will add a new term to the minimization expressionof the SVM: The sum of all slack variables multiplied by the cost parameter C. The slackvariable is commonly initialized to 1 and training is regarded as successful when every slackvariable has been minimized to zero.

L: Loss function A measurement of how badly an example was classified. L is generally positiveand can be calculated according to several expressions which affects the overall training ofthe given algorithm. A loss function can be used instead of a slack variable depending onthe formulation of the SVM.

4


Figure 2: A Neural Network with three input-nodes and an arbitrary number of nodes in an arbitrarynumber of hidden layers leading to a final output-node.

3.2 Back-Propagation Neural Network (BPNN)

A collection of nodes, grouped into layers, intended to mimic the core functionality of a biologicalbrain to some degree. It is constructed with an initial input-layer of nodes, each node correspond-ing to some data point for example an element in a vector, followed by any number of ”hidden”layers containing additional nodes which are all connected to every node in the preceding layerwith an associated weight coupled with an activation function, for example the sigmoid function.The final layer of the Neural Network is the output layer where a final output or ”decision” isproduced[13]. An illustration is shown in Figure 2.The network may be trained using the algorithm known as Back-Propagation: Whenever the net-work produces the wrong output a ”wave of correction” sweeps through the network to change theweights associated with the node connections in order to affect future decisions of the algorithm[14].How much a weight is changed during each correction is determined to the most part by the Learn-ing Ratio, which is passed as a parameter. A large Learning Ratio might seem advantageous asthe weights would take larger steps towards making better outputs but this could very well lead tothe algorithm ”stepping over” the optimal solution and never really converging. Conversely, toosmall of a learning rate would give a very slow convergence. There exists solutions to the problemof having an appropriate learning rate, such as self-adaptive learning rate proposed in 2009 wherethe learning rate is adapted with the Taylor theorem [15]. With this method the learning rate willdecrease in accordance with the rate at which the classification error of the network decreases.BPNN is quire popular, particularly for solving classification problems and has won several classi-fication competitions.

3.3 Support Vector Machines (SVM)

Often abbreviated to just ”SVM”, this is an algorithm attributed to Vapnik and Lerner who in1962 proposed a Generalized Portrait algorithm which was later developed, in large part by Vapnikbut also by others, into the algorithm that is frequently used and studied today [8, 16, 17, 18].

The core concept of a SVM is creating a hyperplane, also known as the decision boundary, in aspace which will separate two classes of points from each other. On either side of the hyperplanelies the boundaries of the margin, which is desired to be as large as possible because a wide mar-gin indicates a high separability of the classes. The training of a SVM is essentially solving fora margin of maximum width in which none of the training examples fall into. It is however notalways possible to have a large margin or to even have a proper margin at all - there might not bea feasible hyperplane which perfectly separates the desired classes in the input space X . In thiscase a certain amount of slack can be allowed, in which some misclassification is permitted duringtraining granted that the distance from the margin of the correct class to the individual point is

5


Figure 3: An illustration of a Support Vector Machine. The thick straight line represents the planewhich separates the two classes with the margin γ.

not too large. Instead of a slack variable a loss function can be used, commonly the Hinge-lossfunction but other functions exists. A SVM allowing some level of misclassification during trainingis known as a ”soft-margin SVM” and a SVM not allowing any degree of misclassification of thetraining data is known as a ”hard-margin SVM”.

The outputs of a SVM is ±1, the signage indicating the classification.

Equation 1, which is subject to Equation 2, shows the basic problem that needs to be solved in asoft-margin SVM: Equation 1 needs to be minimized by adjusting the weight vector w, the biasterm b and the slack variables ζi for the given l training samples. A training sample can be definedas xi = {x ∈ X , y ∈ [±1]} where y is the label of the data sample. The feature map Φ can be asimple linear map which just multiplies the input sample xi by 1 or it can be any other linear ornon-linear map.

minw∈H,b∈R,ζi∈R

1

2||w||2 +

1

l

l∑i=1

ζi (1)

subject to yi(wTφ(xi)− b) ≥ 1− ζi, ζi ≥ 0, i = 1, . . . , l (2)

The primal objective function for a Support Vector Machine with constraints.

During the 1990’s an addition to the SVM method was proposed [16]: The kernel trick, in which acarefully chosen kernel function K is applied to the input vectors in order to transform them intosome feature space H. This addition made it possible to separate sample vectors which were notlinearly separable otherwise.

3.3.1 One-class Support Vector Machines (OC-SVM)

In the year 2000, Scholkopf et al formulated a One-Class Support Vector Machine which centersaround the probability density of the input space through the application of a kernel[19]. Itcreates a hyperplane which encapsulates the spaces which the provided training samples are mostfrequently in. Because of this the OC-SVM is an unsupervised classifier which labels all sampleswhich fall within the encapsulated spaces as +1 and all others as -1. The quadratic problem that

6


minw∈H,ζ∈Rl,ρ∈R

1

2||w||2 +

1

v · l

l∑i=1

ζi − ρ (3)

subject to (w · Φ(xi)) ≥ ρ− ζi, ζi ≥ 0 (4)

The primal OC-SVM objective function.v is a parameter, l is the number of data samples, ρ is a bias term.

needs to be solved for a OC-SVM is given in Eq. 3 and is subject to 4. Figure 4 illustrates aOC-SVM.

Figure 4: An illustration of a One-Class Support Vector Machine. All samples which are withinthe boundaries set by the positive training samples are labeled +1 while all other samples are labeled-1.

3.3.2 SVM for multiclass classification

SVMs are normally binary classifiers, however, there exists simple methods to apply them tomulti-class problems[20]. One method is the 1vsAll, also known as 1vsRest, approach in which oneclassifier is constructed for each class - see Figure 5. Every classifier is then trained using datafor their respective classes as positive examples and all other classes as negative examples. Aslong as there is no ambiguity amongst the classes leading to an input being classified positively formultiple classes, whichever class that is assigned +1 is considered the correct classification. Thisapproach requires comparably few classifiers, however it is susceptible to being poorly balanceddue to uneven distribution of training examples.

Another method is the 1vs1 method, also known as pairwise decomposition, where k classes re-quires k k−12 classifiers - see Figure 6. Each classifier returns the most probable of two classesto a given input, with the complete number of classifiers giving an exhaustive comparison of allavailable classes to each other. Whichever class was deemed the most probable the most times isconsidered the correct classification. This approach is more robust than 1vsAll though it does leadto a quadratic increase of the number of classifiers.

3.4 Data processing

To increase the capabilities of an algorithm the data can be processed in advance. Depending onthe type, shape and volume of the data it might be appropriate to decrease the dimensionality,extract features or somehow model the data in a novel way. Some methods of processing data whichwill be used in this report will be presented below. It is important to note that the normalizationdescribed below will be used on every data set throughout all testing and evaluation before anyother kind of processing is applied.

7


k1

k2

k3

k4

c1

c2

c3

c4 Other

c1c2

c3

c4c5

c6

k1

k2

k3

k4

Figure 5: 1vsAll. Each classifier cn(represented by circles) labels an inputas either in the given class kn or as notin kn.

Figure 6: 1vs1. Each classifier cn(represented by lines between the classeskn) compares an input to everycombination of two available classes.

3.4.1 Normalization

Before any data is used in this thesis it will be normalized according to Equation 5. This is alsoknown as feature scaling.

xnew =x− xmin

xmax − xmin(5)

The equation used to normalize a sample vector x.

In Equation 5 a sample x is normalized by subtracting the minimum elemental value of x fromevery element of x and dividing this by the scalar difference between the maximum and minimumelemental value of x. This causes every element in x to have a value between 0 and 1.

3.4.2 PCA

Principal Component Analysis (PCA) is a useful tool for dimension reduction. It is commonlyused as a pre-processing step in order to improve the performance of classification algorithms byreducing the dimensions of the input data to a small number of features which still represents thedata to a large degree. Depending on the data, the dimensions of a sample can be decreased fromthousands to just a few without much loss in information. This is achieved by projecting the pointsonto the orthogonal eigenvectors of the covariance matrix. Each projection onto an eigenvectorgives one principal component. [21].

3.4.3 Autoencoder

An autoencoder is a two-step method that is built on Neural Networks. In the first step a set of”normal” data samples are fed to the auto-encoder, which will train a Neural Network that encodesthe data samples. The autoencoder will then attempt to recreate the encoded samples into theiroriginal form. Given that the autoencoder has been trained to only encode and decode the desiredor ”normal” samples, any anomalous sample is probable to be decoded with a noticeable error toits original form[22].

3.5 Cross-validation

A common method for thoroughly evaluating the performance of a classifier on a data set is calledcross-validation, illustrated in Figure 7. The process is fairly straight-forward and is used through-

8


out in this thesis.

Figure 7: 3-fold cross-validation on a set with four classes. For each of the three iterations theshaded section is excluded from the training of the classifier and used for testing.

A data set is partitioned into a number of equally large groups, five and ten groups being commonchoices. Iteratively every group except one is used for training while the remaining group is usedfor testing. When every group has been used for testing an overall performance can be calculated.It is desirable that all classes are represented in each group in relation to the proportion it has inthe set.

9


4. Related Work

In 1943 Warren McCulloch and Walter Pitts sowed the seed for what would become Artificial Neu-ral Networks[23]. They proposed a mathematical model called threshold logic which attemptedto mimic some of the functionality of the neurons in a brain. The ”perceptron” was proposedby Rosenblatt in 1958, a further mathematical and computational mimicry of biological cellularfunctionality. From this perceptron came the first Neural Networks. It was however not until 1974that the back-propagation algorithm was proposed, stemming from H. Kelleys work in 1960[24, 25].This invention revitalized the research into Neural Networks to the point that the major limitingfactor of the algorithm was the hardware. Due to the size of the network and the large amountsof neurons necessary for it to produce a useful result the amount of time to train the network wasoften in the range of months, depending on the problem and application of the network. As thespeed of processors increased and computer memory became more cheaply accessible the popular-ity of Neural Networks grew.

Wei Zhang et al constructed a multi-layered feed-forward parallel distributed processing modelin 1990 which used the Neural Network methodology[26]. This model was capable of classifyingletters even when they were tilted, shifted or distorted and would serve as the foundation for Con-volutional Neural Networks (CNN or ConvNets).

Another branch of Neural Networks is the Recurrent Neural Network (RNN), which is based onthe work of Rumelhart in 1986. RNNs incorporate its own output values as inputs to itself to givetemporal information. A kind of RNN is Long Short-Term Memory (LSTM) discovered in 1997by Hochreiter and Schmidhuber which proved excellent in speech-recognition and other context-dependent applications[27, 14].

In 1962 Vapnik and Lerner published an article (translated to English from Russian in 1963) abouttheir idea of a ”Generalized Portrait algorithm” where they also gave an axiomatic definition ofpatterns based on decomposition of images into subsets[8]. This proposed algorithm was onlyapplicable to linear sets of data and was highly susceptible to noise and outliers. T.M. Cover de-veloped the idea of hyperplanes for pattern separation in 1965, which laid the foundation for largemargin hyperplanes[17]. After some development, this algorithm was still fairly limited as it couldonly be applied to linearly separable binary classes. However, this changed with the introduction ofkernels which had previously been researched by Aiserman, Braverman and Rozonoer in 1964[28].Kernels were realized as a useful tool in SVMs in 1992 by Boser, Guyon and Vapnik and enabledclassification of non-linearly separable data by transforming the given data into a feature spacewhere linear separability was possible[16].

In 1995 Cortes and Vapnik introduced the ”soft margin” where each data sample xi is assigneda variable ζi ≥ 0[18]. During training it is attempted to find a solution where these ζi valuesare minimized as they are indicators of how ill-fitting the current iteration of the classificationhyperplane is. Up until this point SVMs utilized what is called a ”hard margin”, meaning that apoint of data is either on the right or wrong side of the classification hyperplane with no indicationof how right or wrong the samples was classified in terms of distance from the hyperplane.

There exists a myriad of adaptations of SVM for various problem solutions[8]. Xi-Zhao Wang andShu-Xia Lu incorporated Fuzzy Logic into a SVM where a fuzzy membership value was made partof the objective function as a factor to the loss values[9]. By utilizing Fuzzy Logic some of theinherent sensitivity to outliers in ordinary SVMs was overcome. The proposed Improved FuzzyMulti-category SVM (IFMSVM) achieved a slight but noticeable improvement to the classificationscores as compared to a 1vsAll, 1vs1 and Multi-category SVM on various sets of data.

Support Vector Machines continue to be developed in the 20th century. Since its conception allSVMs used a single plane with a surrounding parallel margin to perform classification. In 2006Mangasarian and Wild introduced the Generalized Eigenvalue Proximal SVM (GEPSVM), in which

10


minw(1),b(1),ζ

1

2(Aw(1) + e1b

(1))T (Aw(1) + e1b(1)) + c1e

T2 ζ (6)

subject to − (Bw(1) + e2b(1)) + ζ ≥ e2, ζ ≥ 0 (7)

minw(2),b(2),ζ

1

2(Bw(2) + e2b

(2))T (Bw(2) + e2b(2)) + c2e

T1 ζ (8)

subject to (Aw(2) + e1b(2)) + ζ ≥ e1, ζ ≥ 0 (9)

TSVM as a constrained minimization problem. A and B are matrices containing the trainingvectors, w is the weight-vector, e is a 1-vector of appropriate dimensions, b is the bias-terms, c is

a trade-off constant and ζ is the slack-vector.

a plane with a maximized margin towards the given samples are replaced by two non-parallel planesof maximum distance from each other[29]. The planes are eigen-vectors gained from finding a pairof related generalized eigen-value problems with the smallest eigen-values. Each plane attempts tobe as close as possible to one class while creating as large distance as possible to the other class,see Figure 8.

In 2007 Jayadeva et al proposed another multi-plane SVM which, just like Mangasarians and WildsGEPSVM, replaces the maximum-margin classifier with two non-parallel planes[30]. The main dif-ference between these two approaches is to do with the formulation of the planes. In GEPSVMthe planes are eigen-vectors while in Jayadevas approach, coined ”Twin Support Vector Machine”(TSVM), the planes are very much the same as in normal SVMs. Though they could not displaya significant increase in accuracy from a conventional SVM, the TSVM did have a remarkablyshorter training time since instead of a single major QP problem, as solved for in ordinary SVMs,two smaller Quadratic Programming (QP) problems are solved for the TSVM. Jayadeva et als for-mulation for this Twin SVM is given in Equations 6 - 9 as formulated in the notation conventionadopted by this thesis.

minw(1)

1

2||w(1)||2 +

v1

l2

l1∑i=1

L(x+i , yi, fi(x

+i ))

(10)

minw(2)

1

2||w(2)||2 − v2

l1

l2∑j=1

L(x−j , yj , fj(x

−j ))

(11)

TSVM as an Unconstrained Minimization Problem with a loss-function. w is the weight-vector, vis the difference of the current iteration of w and the previous iteration, l is the number of

samples, f() is the classification function.

In 2019 Sharma et al reformulated Jayadeva et als TSVM into two unconstrained minimizationproblems with a loss function. The objective functions are given in Equation 10 and 11. Intheir report Sharma et al proposed a stochastic solution to this minimization problem using aquasi-Newton method and approximations of the Hessian matrices as these are computationallyexpensive to calculate. This TSVM is denoted as SQN-PTWSVM (Stochastic Quasi-Newton Pin-ball Twin Support Vector Machine).

Sharma et al also made a comparison between the conventional Hinge loss function and the Pin-ball loss function with the conclusion that the latter had some prominent advantages over theformer[31]. Most importantly, the Pinball loss function produces stability by removing some sensi-tivity to noisy training data, thus promoting quicker convergence. If τ is set to zero the Pinball lossfunction would be the exact same as the Hinge loss function. Equation 12 showcases the Pinball

11


Lτ (x, y, f(x)) =

{−yf(x), if− yf(x) ≥ 0 (Incorrect classification)

τyf(x), if− yf(x) < 0 (Correct classification)(12)

The Pinball loss function. x is a data sample, y is the label of x, f(x) is the classification value ofx, τ is the penalty rate applied to correctly classified samples.

loss function. With extensive testing and comparisons of various SVMs it was shown that, on aver-age, the stochastic quasi-Newton optimization technique coupled with the Pinball loss function fora Twin Support Vector Machine was greater than both SVMs and conventional TSVMs in termsof both accuracy and training time.

Figure 8: An illustration of a Twin Support Vector Machine. A central separating plane is herereplaced by two non-parallel planes - one for each group of samples.

12


5. Method

The given problem which this thesis centers around is finding an algorithm (or system of algo-rithms) which reliably can detect anomalies in a univariate time-series and then go on to pin-pointwhat may be the cause of the anomalous behavior. This begs the question: Is this possible given thetime-frame of this thesis? Prior knowledge suggests that this is very much possible. The question-ing then moves to how to achieve this and also how does the solution compare to a person manuallysolving this problem. These latter questions are the basis for the research questions RQ1 and RQ2.

Research was conducted to find what solutions have been attempted for similar problems andwhat the outcomes were. It was found that Support Vector Machines (SVM) have been used foranomaly detection with good results[32, 33, 34]. A special case of SVM known as the One-ClassSVM (OC-SVM) is particularly good for anomaly detection since it does not require any balance ofthe number of samples in each class. This is important when dealing with anomaly detection sinceanomalous samples by definition are rare in comparison with the normal samples. Additionally,other variations of SVM have been used with good results for classification [1, 2, 10]. With the1vs1 and 1vsAll approaches it is possible to use SVM, by default a binary classifier, for multi-classclassification.

Since SVM shows promise for the primary problem of anomaly detection - and can indeed also beused for multi-class classification - this thesis will focus heavily on SVM which will cause the scopeof the thesis to become deep at the expense of width.

In addition to finding methods and algorithms for anomaly detection and classification it was alsonecessary to pre-process the data. The samples from the company was in the range of severalthousand points of data, which can become unwieldy. After further research two data processingsolutions with good potential were found: Principal Component Analysis (PCA) and Autoencod-ing. These methods would be used to decrease the dimensions of the data samples while stillretaining any useful information and to hopefully increase the performance of the solutions.

Neural Networks are very popular and frequently considered a go-to method for all kinds of clas-sifications. As a means of comparison a Back-Propagation Neural Network (BPNN) was used asa counter-point for the OC-SVM in the case of anomaly detection as well as for the binary SVMused for classification.

5.1 Constructing a system with alternative data

The provided data set from the company had a very high resolution but a low number of labeledsamples - particularly in the anomalous class. Therefore a handful of alternative data sets weregathered in order to more thoroughly evaluate the selected methods. The data sets were gatheredfrom the UCI Machine Learning Repository[35] and the UEA & UCR Time Series ClassificationRepository[36]. All data sets are either drawn from the real world or produced synthetically, andare put into one of two groups: ’Binary’ and ’Multiclass’.The ’Binary’ group consists of data with binary classes where one class will be regarded as the”normal” occurrences and the other class represents anomalies. The number of anomalous sampleswill be reduced to 10% of the data set to make them seemingly anomalous. The ’Multi-class’ groupof data sets will have more than two classes to enable the testing of the researched solutions interms of classification accuracy.

13


5.1.1 Data sets

The data sets used in this report are presented below. The data which was acquired from thecompany will be referred to as ”Company data”.

Binary

Hill Valley A synthetic data set where the data, when plotted along the horizontal axis, will formsteep ramps either going up then down (class +1) or down then up (class -1). There are 606samples with 100 points in each sample. Additionally, each sample is available either withor without noise.

Dodger Loop Weekend (DLW) A real-world data set gathered from a sensor measuring theamount of traffic on the 101 North Freeway in Los Angeles, close to the Dodgers Stadiumwhich is frequented during weekends. The classes are +1 (weekday) and -1 (weekend). Thereare 158 samples with 288 attributes each.

PowerCons A series of measurements of house-hold electric power consumption distributed intothe two classes +1 (warm season, April-September) and -1 (cold season, October-March).There are a total of 360 samples of 144 data points each.

Strawberry Obtained using Fourier transform infrared (FTIR) spectroscopy with attenuatedtotal reflectance (ATR) sampling, this set contains 983 samples of length 235 points eachwith the classes +1 (strawberry) and -1 (not a strawberry or adulterated strawberry).

Company data A moment-sensor has been mounted on an arm which turns a key component in amechanical apparatus. This data set is made of the measured moments and the correspondingangular degree which the moment was measured at. An approved sample is labeled +1 anda failed sample is labeled -1.

Multiclass

Waveform A synthetic set containing three classes (1-3) corresponding to different waveforms.Each waveform is a combination of two out of three base-forms. In the set are 5000 samplescomposed of 40 attributes, including noise.

Ethanol Level The data samples are spectographs of bottles containing alcoholic beverages offour different ethanol levels which correspond to the following classes: 1) E35, 2) E38, 3)E40, 4) E45. There are 1004 samples with 1751 data points in each.

UMD A synthetic set with 3 classes (1-3), 180 samples and 150 points per sample.

Melbourne Pedestrian (MP) A real-world data set gathered from a pedestrian counting sys-tem used at 10 different locations in and around Melbourne, Australia. There are 3450samples, each containing discrete measurements corresponding to 24 consecutive hours andeach class (1-10) corresponding to a specific location.

5.2 Stochastic quasi-Newton method for Twin SVM

Sharma et al used a stochastic quasi-Newton method to optimize their Twin SVM. This method isreproduced below. Equation 10 and 11 must be re-formulated as multi-variate functions in orderto apply the quasi-Newton method. The input for the functions is the corresponding weight-vectorand a sample θ which contains both the data x for the sample and the label y.

The stochastic gradients s are calculated according to Equation 13 and 14. Here c1 and c2 are pa-rameters which dictate the importance of the classification penalties, k is +1 if the current weightfor the given plane and sample incorrectly classifies the sample and −τ if it correctly classifies thesample. y is the label of the sample.

14


s1(w1,t, θ1,t) = (w1,t +v1

l2x−t −

c1l1

kyt(x+t )) (13)

s2(w2,t, θ2,t) = (w2,t −v2

l1x+t −

c2l2

kyt(x−t )) (14)

The stochastic gradients of Equations 10 and 11 when re-formulated as multi-variate functions.

The quasi-Newton method as described by Sharma et al is given in Equations 13 to 22, adjustedto fit the notation of this thesis. The process of the method can be described as such in plain text:While the weight-vectors have not stagnated to a certain tolerance and the maximum number of it-erations have not been exceeded, a sample from both classes are gathered. The stochastic gradientsare calculated according to Equation 13 and 14 which are then used to update the weight-vectorsaccording to Equation 15 and 16. The variable variations are then calculated as given in Equations17 to 20. Lastly the Hessian approximation matrices are updated as in Equation 21 and 22. Whenthe tolerance or iteration limit has been exceeded it is anticipated that the weight-vectors are fitto perform binary classification with the use of Equation 23.

w1,t+1 = w1,t − αt(B1,t + ΓI)−1s1(w1,t, θ1,t) (15)

w2,t+1 = w2,t − αt(B2,t + ΓI)−1s2(w2,t, θ2,t) (16)

Formula for updating the weights in the SQN-PTWSVM.

Equation 15 and 16 show how the weight-vectors are updated in the described quasi-Newtonmethod. The matrices B are square matrices with the same length as the weight-vectors. α is aparameter dictating the step-size of updating the weight-vectors. Γ is a regularization constantwhich prevents ill conditioning of the Hessian approximations. I is the identity matrix of equalsize to B.

v1,t = w1,t+1 −w1,t (17)

v2,t = w2,t+1 −w2,t (18)

r1,t = s1(w1,t+1, θ1,t)− s1(w1,t, θ1,t) (19)

r2,t = s2(w2,t+1, θ2,t)− s2(w2,t, θ2,t) (20)

Variable variations for the SQN-PTWSVM.

The only term yet introduced for the described method is the bias term δ which shall be greaterthan 0.

B1,t+1 = B1,t +r1,tr

T1,t

v1,tr1,t−

B1,tv1,tvT1,tB1,t

vT1,tB1,tv1,t

+ δI (21)

B2,t+1 = B2,t +r2,tr

T2,t

v2,tr2,t−

B2,tv2,tvT2,tB2,t

vT2,tB2,tv2,t

+ δI (22)

Formula for updating the B-matrices in the SQN-PTWSVM.

15


The classification for the Twin SVM is made with Equation 23. Using the two planes given by theweight-vectors w1 and w2 with their corresponding bias terms b1 and b2 the seemingly correctclass is given by whichever plane produces the smallest absolute value.

Class(i) = arg mini=1,2

|xTwi + bi|||wi||

(23)

Classification formula for the SQN-PTWSVM. Here i indicates the two planes of the TSVM.

5.3 Result evaluation

To evaluate a solution, 10-fold cross-validation will be used. The predictions from a solution willbe used to calculate the recall and specificities in the case of anomaly detection and the accuraciesin the case of classification. These values are calculated with the equations in Figure 9(b) wherethe components are given in 9(a). All tests will be performed on a computer running Windows 7with an Intel Core i5-9600K CPU running at 3.7 GHz equipped with 16 gigabytes of RAM.

TP True Positive

TN True Negative

FP False Positive

FN False Negative

(a) Abbriviation explanation

Accuracy =TP + TN

TP + TN + FP + FN

Recall =TP

TP + FN

Specificity =TN

TN + FP

(b) Evaluation equations

Figure 9: Expressions used for evaluation.

16


6. Ethical and Societal Considerations

In this thesis all data concerning personal information, company property (including data sets) andcompany procedures will be anonymized or omitted to the greatest extent possible. All descrip-tions and mentions of machines, products and procedures will be made as ambiguous as possible.

17


7. Work process

Below will be given a thorough narration of the process, progress and obstacles of the thesis.

7.1 Algorithm research and preparation

The first step to take was acquiring the data from the company. This proved more difficult thanit first seemed as the only available data was from a prototype of the testbench which was fairlylimited in resolution and quality - the final version of the hardware had yet to be installed atthis point in time. Proper data was promised to be delivered eventually, which was no pressingissue since a reasonable understanding of the data was gained through the prototype outputs. Thedata was a univariate time-series with a couple of hundred features per sample and each samplewas labeled as either ”approved” or ”not approved”. With this knowledge research into potentialsolutions was initiated.Classification and anomaly detection methods within Machine Learning was researched. SupportVector Machines (SVM) were understood to be a suitable family of classifiers for RQ1, as itwas a binary classification problem. After some more research it was learned that SVM hasbeen used with good results for multiclass classification by utilizing the 1vs1 approach (pairwisedecomposition) and the 1vsAll approach which opened up for SVM being the topic of RQ3 as well.Though Fuzzy SVM has been shown to have a slightly higher accuracy in the case of multiclassSVM the improvement was so small that the increased complexity as compared to 1vs1 and 1vsAllwas not deemed a feasible trade-off at this stage [9].As a frame of reference for the SVM implementations Back-Propagation Neural Network (BPNN)was chosen based on its current popularity as a classifier.From the multitude of SVMs found during the research three were chosen to be in the scope of thisthesis: Standard SVM, One-Class SVM (OC-SVM) and Stochastic Quasi-Newton Pinball TwinSVM (SQN-PTWSVM)[31].

7.2 Data processing research

Eventually, data from the final version of the testbench was produced and made available. Thedata was composed of several thousand features per sample which was a greater number of featuresthan anticipated. This prompted the need to decrease the size of the data samples as the rawdata was bulky and cumbersome. Further research was conducted which covered IndependentComponent Analysis (ICA), subtractive clustering, Fuzzy c-means clustering and more[37, 38, 39].Eventually it was settled for Principal Component Analysis (PCA) to decrease the dimensionality ofsamples. It was also learned about the positive effects which autoencoders can have on classificationperformance, which led to incorporating this into the thesis as well.

7.3 Anomaly detection implementation

It was thought best to tackle the research questions chronologically, meaning that the capabilitiesof OC-SVM on anomaly detection was the first to be implemented and evaluated in conjunctionwith a BPNN. For the sake of rigorous testing and demonstrating the versatility of the implemen-tations additional data sets were gathered from online sources, such as the UCI Machine LearningRepository and timeseriesclassification.com[35].Using MatLab, a BPNN was set up and its accuracy was tested on the data sets gathered for thesake of anomaly detection through 10-fold cross-validation. A OC-SVM was also set up and testedin the same fashion with three different kernels: Linear, Gaussian and Polynomial. Once the base-line performance of these two methods was established the data processing methods were appliedand the results documented. To gain a solid understanding of the effects of the data processingmethods the classifiers were initially tested with the processing methods separately before beingtested with both methods applied simultaneously. All results were recorded and can be found inthe corresponding subsection in section Results.

18


Figure 10: An illustration of the iterative method used for this thesis.

7.4 Classification

The final version of the testbench which the company had built features multiclass labeling capa-bilities. These had yet to be used for the data which had been made available for this thesis. Therewas hope that multiclass-labeled data would be produced in the time-span of this thesis, howeverthis did not happen. Instead, more data sets were gathered to test the general performance of theimplemented classifiers.The testing procedure of the classifiers were very much the same as for anomaly detection: Eachsolution was tested with 10-fold cross-validation on each data set and each SVM implementa-tion was tested with the linear, Gaussian and polynomial kernel. Once a baseline performancewas established the solutions were tested together with PCA. All the results are available in thecorresponding subsection in Results.

7.5 SQN-TWSVM

Due to its reportedly good performance, Sharma et als Stochastic Quasi-Newton Pinball TwinSVM (SQN-PTWSVM) was implemented based on their descriptions[31]. It was hoped that thisnew and promising SVM would provide even better results than the other solutions. No conclusiveresults could be made however, even after weeks of work. The notation and descriptions usedin their report was ambiguous and unclear at parts and there were no recommendations to theranges of several parameters, making it difficult to properly reproduce this algorithm. When it wasbelieved that a functional SQN-PTWSVM had been created matrix search was used to iterativelyfind functional parameter values. With extensive testing the results were still not very satisfyingand this method was eventually abandoned.

19


8. Results

In the sections below all algorithms and their variations will be discussed with respect to theirindividual results. The first subsection will cover the results of the implemented anomaly detec-tion solutions while the second subsection will cover the multiclass classification results. A finalcomparison of all the algorithms in their respective groups (anomaly detection and multiclassclassification) will be summarized in the third subsection.For every test, the random number generator of Matlab will be set to default for the sake oftest–retest reliability. 10-fold cross-validation is used exclusively on all data sets as is normaliza-tion according to Equation 5.Back-Propagation Neural Network will be abbreviated to simply ”NN” with the subscript indicat-ing the number of hidden layers.

8.1 Anomaly detection results

In the following sections the performance of BPNN and OC-SVM will be compared.The tables below will show the performance of BPNN and OC-SVM with and without data pre-processing in the form of the recall and specificity proportions as well as the training and testingtimes. As a reminder, the Recall is the proportion of correctly classified positive samples andSpecificity is the proportion of correctly classified negative samples. As this subsection concernsitself with anomaly detection the specificity is of greater interest than the recall as it is preferableto not give any sample the benefit of the doubt.For each test the anomalous samples, labeled -1, will make up 10% of the training data in order forthem to be regarded as ”rare” occurrences. For the testing data the original relation of negativesamples to positive samples determines the proportional occurrence of the classes.

8.1.1 Baseline performance

Table 1 and Table 2 shows the recall and specificity of the NN and OC-SVM solutions without anypre-processing other than normalization. It is important to look at both tables when evaluatingthese results: Only looking at Table 1 might give the false impression that all the solutions havequite good performance and that the OC-SVM with a Gaussian kernel is the best of them. Lookingat Table 2 however reveals that the Gaussian OC-SVM failed to catch a single anomalous samplein three out of six data sets. It goes without saying that this is dreadful for an anomaly detectionmethod.

As for the other solutions they all had a mixed score. For example the NN3 achieved a recall inthe range of 94-100% but a specificity of 8-97%. Both the linear and polynomial OC-SVMs scored

Table 1: Anomaly Detection baseline recall

Method

Data set HV(smooth)

HV(noisy) DLW PowerCons Strawberry Company data

NN1 1 0.98 0.99 0.98 1 0.93NN3 1 1 0.99 0.94 1 0.99

OC-SVM (Lin.) 1 1 0.99 0.97 0.98 0.99OC-SVM (Gaus.) 1 1 1 1 1 1OC-SVM (Poly.) 1 1 0.99 0.98 1 0.98

This table shows the performance of the anomaly detection methods in terms of their recall on avariety of data sets. No pre-processing has been applied to the data other than normalization.

20


a bit more balanced, having similar recalls as the NN3 but specificities in the ranges 32-100% and45-100% respectively.

Table 2: Anomaly Detection baseline specificity

Method

Data set HV(smooth)


NN1 0.93 0.19 1 0.74 0.91 0.45NN3 0.11 0.08 0.93 0.63 0.97 0.29

OC-SVM (Lin.) 1 1 0.93 0.77 0.65 0.32OC-SVM (Gaus.) 1 0.25 0 0 0.79 0OC-SVM (Poly.) 1 0.97 0.93 0.77 0.96 0.45

This table shows the performance of the anomaly detection methods in terms of their specificitieson a variety of data sets. No pre-processing has been applied to the data.

From the vast variation of these scores it is clear that there is room for improvement, particularlyon the PowerCons, Strawberry and Company data sets where not a single method achieved 100%specificity. Most of the methods did score quite well on the Dodger Loop Weekend set and thesmooth Hill-Valley set.

In Table 3 the training and testing times of all the solutions are shown. Here it is very clear thatthe Neural Networks are far slower than the OC-SVMs. However, the times are still measured inat most tens of seconds.

Table 3: Anomaly Detection baseline performance times

Method

Data set HV(smooth)


NN1 3.46 1.36 1.87 0.96 1.31 4.17NN3 3.52 1.73 3.08 1.61 3.56 56.88

OC-SVM (Lin.) 0.35 0.72 0.05 0.08 0.54 0.80OC-SVM (Gaus.) 0.17 0.55 0.06 0.07 0.10 0.58OC-SVM (Poly.) 0.29 0.22 0.05 0.06 6.59 0.70

The times, measure in seconds, for performing 10-fold cross-validation on the anomaly detectionmethods. No pre-processing has been applied to the data.

8.1.2 Performance with PCA

In an attempt to improve performance and possibly decrease training and testing time PCA wasapplied to the data sets. The evaluation procedure was performed iteratively in order to find thesmallest number of principal components which still gave a satisfying performance. The resultsare presented in Table 4 and Table 5 with the number of principal components presented in Table7. The training and testing times are presented in Table 6.

For this test no solution achieved a specificity lower than 16%, an improvement from the baselinewhere the Gaussian OC-SVM scored 0% on three sets. Regarding the specificity, the GaussianOC-SVM gained a lot from the application of PCA - its specificity on the Dodger Loop Weekenddata set rose from 0% to 86% - but it still remains the weakest classifier with a specificity as low as16% on the Company data set. Meanwhile, all classifiers except the Gaussian OC-SVM achieved aperfect specificity of 100% on the Dodger Loop Weekend set. The polynomial OC-SVM was alone

21


Table 4: Anomaly Detection recall with PCA

Method

Data set HV(smooth)


NN1 1 0.97 1 0.98 1 0.98NN3 1 0.97 1 0.97 0.99 0.99

OC-SVM (Lin.) 1 1 0.97 0.94 0.99 0.98OC-SVM (Gaus.) 1 1 1 0.99 1 0.98OC-SVM (Poly.) 1 1 1 0.99 0.99 0.96

Recall of the anomaly detection methods with PCA applied to the data sets. Each combination ofclassification method and data set uses a number of principal components according to what givesthe highest accuracy.

in managing 100% recall and specificity on more than two data sets.

At this point it seems as if all the methods perform comparatively well to each other. The biggestobstacle appears to be the Company data set where only one method managed a specificity greaterthan guessing-level.

Table 5: Anomaly Detection specificity with PCA

Method

Data set HV(smooth)


NN1 0.99 0.47 1 0.80 0.96 0.42NN3 0.71 0.33 1 0.86 0.99 0.39

OC-SVM (Lin.) 1 1 1 0.74 0.58 0.32OC-SVM (Gaus.) 1 1 0.86 0.71 0.79 0.16OC-SVM (Poly.) 1 1 1 0.80 0.90 0.61

Specificity of the anomaly detection methods with PCA applied to the data sets. Each combinationof classification method and data set uses a number of principal components according to what givesthe highest accuracy.

Though PCA seems to have given the solutions an overall boost in performance, there were somedecreases as well. Several methods lost 1-3% in recall on some sets while the linear and polynomialOC-SVMs lost 8% and 6% specificity on the Strawberry set respectively. The gains are howevergreater than the losses.

Regarding the times, almost all solutions received a positive effect from PCA with a few exceptionswhere the time was increased. For example the polynomial OC-SVMs time on the Strawberry setincreased from 6.59 seconds to 19.37 seconds and on the Company data set from 0.70 to 4.00seconds. This is however the by far greatest increase in time of all the methods, all other increaseswere but 0.01 seconds which could very well be due to rounding-off errors within Matlab.On the other hand, the NNs gained greatly from PCA. The NN3 sped up from 56.88 seconds to0.91 seconds - a decrease of 98.4%. When it comes to dimension reduction the NNs are the biggestbenefactors, mainly due to their initial time consumption as compared to the OC-SVMs.

The capabilities of PCA is shown in Table 7 as only 3 to 5 principal components on the DodgerLoop Weekend set are used out of 288 available - yet the recall and specificities in earlier tablesshow all-around good results.

22


Table 6: Anomaly Detection times with PCA

Method

Data set HV(smooth)


NN1 1.11 1.03 0.91 0.90 0.88 0.79NN3 1.66 1.31 1.28 0.95 1.00 0.91


The times, measure in seconds, for performing 10-fold cross-validation on the anomaly detectionmethods with PCA applied to the data. Each combination of classification method and data setuses a number of principal components according to what gives the highest accuracy.

Table 7: Number of principal components used for the Anomaly Detection methodswith PCA applied

Method

Data set HV(smooth)


NN1 14 9 4 10 16 10NN3 23 40 4 5 13 9

OC-SVM (Lin.) 1 1 3 7 13 11OC-SVM (Gaus.) 1 1 3 3 9 3OC-SVM (Poly.) 1 1 5 6 5 26

This table shows the number of principal components used for each combination of Anomaly De-tection method and data set for testing the impact of PCA.

8.1.3 Performance with autoencoders

Another attempted way to increase performance was by utilizing autoencoders. The test setuphere was the same as in previous experiments: 10-fold cross-validation on normalized data wherethe training data is made to consist of 10% anomalous samples while the testing data has a theoriginal proportion of anomalies to normal samples.

Table 8: Anomaly Detection recall with autoencoders

Method

Data set HV(smooth)


NN1 1 1 1 0.99 1 1NN3 1 0.99 1 1 1 1

OC-SVM (Lin.) 1 1 1 1 1 1OC-SVM (Gaus.) 1 0.95 1 0.99 1 1OC-SVM (Poly.) 1 0.91 1 0.99 1 1

Recall of the anomaly detection methods with autoencoders applied to the data sets.

In Table 8 it can be seen that the recall was in general higher than for the baseline performance

23


in Table 1. Every method managed to achieve perfect recall on at least four out of six data sets- four methods even got 100% on every data set. The overall lowest recall was 91%, done by thepolynomial OC-SVM.

Table 9: Anomaly Detection specificities with autoencoders

Method

Data set HV(smooth)


NN1 0 0 1 0.86 1 1NN3 0.97 0.97 1 0.94 1 1

OC-SVM (Lin.) 0.01 0.01 1 0.66 1 1OC-SVM (Gaus.) 0.82 0.59 1 0.69 1 0.61OC-SVM (Poly.) 0.79 1 1 0.91 1 1

Specificities of the anomaly detection methods with autoencoders applied to the data sets.

On the other side of the evaluation, the application of an autoencoder was a blessing in somecases and a curse in other. For example, the linear OC-SVM which scored 100% baseline speci-ficity on both of the Hill-Valley sets now scored just 1%. Simultaneously, the Neural Networkswere close to polar opposites to each other on this same set: The NN1 managed 0% specificityand the NN3 managed 97%. This latter NN only managed 11% and 8% in baseline specificity,highlighting the fact that an autoencoder has a large effect on the data. Whether this effect ispositive or negative is unclear at this point, as it seems to depend on both the data and the method.

Table 10: Anomaly Detection times with autoencoders

Method

Data set HV(smooth)


NN1 2.44 7.40 2.25 1.75 6.13 106.65NN3 27.49 28.53 20.97 6.93 45.80 805.98


The times, measure in seconds, of the anomaly detection methods with autoencoders applied to thedata sets.

As for the effect of autoencoders on time the results were clear. Only two data set/method com-binations had its time reduced while all others had it increased. This increase was in several casesby no small amount either. Primarily the Neural Networks suffered with the application of anautoencoder with some times increasing by a factor of 16 from 1.73 seconds to 28.53 seconds. Theby far largest time measured in this thesis can be found in Table 10 at 805 seconds.It is worth mentioning that the OC-SVMs also had their times increased a fair lot. The polynomialOC-SVM had its time on the noisy Hill-Valley set increased from 0.22 seconds to 11.52 - a factorof 52. It is also worth mentioning that autoencoders were not employed to in any way better thetimes of the methods but to boost the recall and specificity, however it can be of interest to studyits effects regarding time as well.

24


Table 11: Anomaly Detection recall with PCA and Autoencoders

Method

Data set HV(smooth)


NN1 1 1 1 0.99 1 1NN3 1 1 1 1 1 1

OC-SVM (Lin.) 1 1 1 1 1 1OC-SVM (Gaus.) 1 1 0.99 1 1 1OC-SVM (Poly.) 0.99 1 0.99 1 1 1

Recall of the anomaly detection methods with PCA and autoencoders applied to the data sets. Eachcombination of classification method and data set uses a number of principal components accordingto what gives the highest accuracy.

8.1.4 Performance with PCA and autoencoders

The final set of experiments for anomaly detection combines both of the data processing tech-niques: PCA and autoencoders. For all tests PCA will be applied on the data first, followed byautoencoders. It would have been interesting to investigate the effect of reversing this order buttime is a factor. The order was decided on the fact that PCA was implemented to decrease thedimension of the samples to hasten the trial process. Having an autoencoder, in essence a doubleNeural Network, handle samples in the range of hundreds and even thousands of points followedby dimension reduction seemed absurd.

The two methods were implemented sequentially and iteratively evaluated in order to find thebest number of principal components. Initially the dimensions of the data samples were reducedto a single principal component and fed into the autoencoder which attempted to create a modelto recreate the positive samples. Gradually the number of components were increased until theperformance stagnated.

The recalls for the solutions as presented in Table 11 shows that the vast majority of the recallscores either are perfect or close to perfect. Two solutions achieved perfect recalls on all sets: TheNN3 and the linear OC-SVM. All other solutions missed having an all-together perfect score byjust 1% on one or two sets each.

Table 12: Anomaly Detection specificity with PCA and Autoencoders

Method

Data set HV(smooth)


NN1 0.91 0.76 1 0.91 1 1NN3 1 1 1 1 1 1

OC-SVM (Lin.) 0.59 1 1 1 1 1OC-SVM (Gaus.) 0.78 1 1 1 1 1OC-SVM (Poly.) 0.81 1 1 1 1 1

Specificities of the anomaly detection methods with PCA and autoencoders applied to the data sets.Each combination of classification method and data set uses a number of principal componentsaccording to what gives the highest accuracy.

The specificities in Table 12 are in large part almost as good as the recalls in Table 11, indicatinga very good performance. The most challenging data set appeared to be the smooth Hill-Valleyset on which only one method scored a perfect 100% specificity: The NN3. The NN1 is arguably

25


the worst performing anomaly detector, scoring 76-91% on three sets while all other methods atmost failed a perfect score on one and the same set, namely the aforementioned Hill-Valley set.This could be a testament to the fact that Neural Networks achieve better performance as theygrow in size.

Table 13: Anomaly Detection times with PCA and Autoencoders

,

Method

Data set HV(smooth)


NN1 5.93 3.70 2.54 2.83 2.59 2.52NN3 9.04 8.85 3.66 6.44 8.62 5.58


The times, measure in seconds, of the anomaly detection methods with PCA and autoencodersapplied to the data sets. Each combination of classification method and data set uses a number ofprincipal components according to what gives the highest accuracy.

Table 13 clearly shows the difference in times between the Neural Nets and the OC-SVMs: Theworst times for the OC-SVMs are better than the best times for the Neural Networks.

It can be of interest to compare the number of principal components used for anomaly detectionwith only PCA applied and when both PCA and autoencoders have been applied, as seen in Tables7 and 14. It can be seen that on average the numbers seem unchanged - some are a bit higherwhile some are a bit lower. For the Company data set the number of components used changefrom the range of 3-26 to 2’s on all methods except one which use 10 components, in other wordsa large overall decrease for this particular set. At the same time, however, the smooth Hill-Valleyset seems a lot more challenging for the OC-SVMs. Initially these methods only needed a singlecomponent to manage great scores on this set, but at this point they need 4 and all the way to 40components without even performing as well as they did without autoencoders. Autoencoders areevidently not fit for every situation.

Table 14: Number of principal components with PCA and autoencoders

Method

Data set HV(smooth)


NN1 14 14 4 10 16 10NN3 5 4 2 7 3 2

OC-SVM (Lin.) 30 3 2 4 2 2OC-SVM (Gaus.) 4 3 3 4 2 2OC-SVM (Poly.) 19 3 2 4 2 2

This table shows the number of principal components used for each combination of Anomaly De-tection method and data set for evaluating the impact of both PCA and autoencoders being appliedto the data.

26


8.2 Classification results

The recorded accuracies and times for the selected classification methods will be presented in thefollowing sections. The selected methods are Neural Networks (one and three hidden layers), lin-ear/Gaussian/polynomial 1vsAll SVMs and linear/Gaussian/polynomial 1vs1 SVMs.

8.2.1 Classification baseline performance

Below all classification solutions and their baseline accuracies as well as their times are presented.

Table 15: Classification baseline accuracy

Method

Data setWaveform data Ethanol data UMD MP

NN1 0.58 0.74 0.95 0.10NN3 0.74 0.85 0.98 0.24

1vsAll SVM (Lin.) 0.78 0.23 1 0.611vsAll SVM (Gaus.) 0.84 0.03 0.78 0.801vsAll SVM (Poly.) 0.81 0.62 1 0.861vs1 SVM (Lin.) 0.85 0.62 1 0.841vs1 SVM (Gaus.) 0.85 0.34 0.94 0.891vs1 SVM (Poly.) 0.82 0.90 1 0.89

The baseline accuracies of the selected classification methods. No pre-processing has been appliedto the data.

As can be seen in Table 15 the methods perform vastly different on the data sets. For example, theGaussian 1vsAll SVM had an overall good performance on every set except the Ethanol data setwhere it scored just 3%. Meanwhile, the NN1 scored acceptably on all sets except the MelbournePedestrian data set where it only classified 10% correctly. At this initial step it can be seen thatthe polynomial 1vs1 SVM is the on average best classifier, scoring higher than any other classifieron every set except the Waveform data set where it was 3% under the linear and Gaussian 1vs1SVMs and 2% under the Gaussian 1vsAll SVM.

Table 16: Classification baseline times

Method


NN1 1.42 43.37 0.94 1.16NN3 2.07 250.37 2.01 2.05

1vsAll SVM (Lin.) 6.46 23.76 0.16 8.771vsAll SVM (Gaus.) 6.02 17.02 0.17 5.791vsAll SVM (Poly.) 200.64 251.28 1.39 55.501vs1 SVM (Lin.) 1.95 3.30 0.13 3.141vs1 SVM (Gaus.) 2.21 2.98 0.15 3.091vs1 SVM (Poly.) 44.14 54.27 1.04 9.47

The baseline times, measure in seconds, of the selected classification methods. No pre-processinghas been applied to the data.

In Table 16 it is made obvious that the 1vsAll and 1vs1 approaches are rather different to each

27


other due to the quite consistently larger time one of them requires on training and testing. How-ever, it is not the 1vs1 approach with its greater number of classifiers that takes the longest time -it is instead the 1vsAll approach. This could be due to the classifiers in the 1vs1 approach trainingon significantly smaller sets of data than the 1vsAll approach. The 1vsAll approach trains everyclassifier on all of the training data every time, while the 1vs1 only trains its classifiers on twosubsets of the training data.

The overall speed of the methods tested here is difficult to compare as some methods are fasterthan others on some data sets but slower on others. What can be said though is that the poly-nomial methods, particularly the 1vsAll SVM, take much longer time than any other method onalmost every data set.

8.2.2 Classification performance with PCA

Below are the accuracies and times for all the classification methods with PCA applied to the data.The number of principal components used for each method and data set can be found in Table 19.

Table 17: Classification accuracies with PCA

Method


NN1 0.59 0.95 1 0.17NN3 0.75 0.95 1 0.33

1vsAll SVM (Lin.) 0.78 0.23 0.99 0.611vsAll SVM (Gaus.) 0.78 0.23 0.99 0.611vsAll SVM (Poly.) 0.85 0.62 1 0.861vs1 SVM (Lin.) 0.85 0.62 1 0.841vs1 SVM (Gaus.) 0.85 0.34 1 0.891vs1 SVM (Poly.) 0.85 0.89 1 0.88

Accuracies of the classification methods. Each combination of classification method and data setuses a number of principal components according to what gives the highest accuracy.

As can be seen in Table 15 the accuracies vary quite a bit between both the data sets and themethods and the kernels. The NN1 achieved the lowest overall accuracy of 17% which was closelyfollowed by the linear and Gaussian 1vsAll SVMs with 23%. The best performing classifier at thisstage is the polynomial 1vs1 SVM, having accuracies within the range of 85-100%.

In Table 17 it is shown that the time difference between the 1vsAll and 1vs1 SVMs still hold true.There is just one minor exception to this relationship, namely the polynomial SVMs on the UMDset, where the 1vsAll approach is 0.3 seconds faster.

Regarding the effect of PCA on the times it can be seen that it is not only positive. The NN3

did get its time on the Ethanol data set diminished from 250 seconds to almost 2 seconds and thepolynomial 1vsAll SVM had its time on the Waveform data set cut from 200 seconds to 5 seconds.However, in some places the times did go up slightly. An example is the polynomial 1vsAll SVMon the Melbourne Pedestrian data set: Here the time went from 55.50 seconds to 59.04 seconds.The increases were by no means as dramatic and damning as the decreases and in this regard PCAcan still be regarded as a positive influence.

28


Table 18: Classification times with PCA

Method


NN1 1.29 1.09 0.90 1.30NN3 2.39 1.90 0.90 1.53

1vsAll SVM (Lin.) 5.34 4.18 0.13 8.961vsAll SVM (Gaus.) 5.69 7.22 0.13 9.261vsAll SVM (Poly.) 5.05 238.77 0.12 59.041vs1 SVM (Lin.) 1.31 0.91 0.13 3.061vs1 SVM (Gaus.) 1.60 0.64 0.12 3.161vs1 SVM (Poly.) 1.73 57.50 0.15 12.31

The times, measure in seconds, of the classification methods. Each combination of classificationmethod and data set uses a number of principal components according to what gives the highestaccuracy.

Table 19: Number of principal components used for classification with PCA applied

Method


NN1 3 130 15 13NN3 3 150 13 18

1vsAll SVM (Lin.) 10 56 8 241vsAll SVM (Gaus.) 10 75 8 221vsAll SVM (Poly.) 2 250 8 231vs1 SVM (Lin.) 2 50 7 151vs1 SVM (Gaus.) 2 50 4 151vs1 SVM (Poly.) 2 150 4 10

This table shows the number of principal components used for each combination of classificationmethod and data set for evaluating the impact of PCA beign applied to the data.

8.2.3 Stochastic Quasi-Newton TWSVM

Extensive testing could not get this solution to perform well at all. Usually it either labeled everysample as positive or every sample as negative. Some minor successes were recorded, such as 65%recall and 27% specificity being achieved on the Dodger Loop Weekend set, but the number ofparameters needing highly specific and precise values for every data set made this solution grosslyinconvenient and time consuming. A proper and reliable optimization algorithm might overcomethis along with a known recommended range for the parameters. The best parameter values foundfor the Dodger Loop Weekend set are presented in Table 20.

29


Parameter ValueΓ 2340α 8.16δ 3.5

Table 20: Best parameter values found for the SQN-PTWSVM on the Dodger Loop Weekend dataset.

9. Conclusions

The purpose of this thesis was two-fold: 1) Find a feasible anomaly detection method for the mo-ments as measured from the turning of a key component in a mechanical mechanism, and 2) Finda classifier which can pin-point the flaws in the mechanical mechanism which caused it to produceanomalous measurements. The results from these then has to be compared to the performance ofexperienced people on the same data.

Research led to the consideration of different kinds of Support Vector Machines (SVM) which wereinvestigated, implemented and tested against the performance of Neural Networks. Two methodswere implemented to aid these anomaly detectors and classifiers, namely Principal ComponentAnalysis (PCA) and autoencoders.One-Class Support Vector Machines (OC-SVM) with three different kernels - linear, Gaussian andpolynomial - were used for anomaly detection. PCA showed great results in decreasing the di-mensionality of samples while retaining the information within these dimensions. Autoencoders inconjunction with PCA gave an overall boost in anomaly detection performance for both the Sup-port Vector Machines and Neural Networks. For classification ordinary SVMs were used togetherwith methods to enable multiclass classification.

9.1 Anomaly Detection

With both PCA and autoencoders employed every method fared very well on almost every dataset. The smooth Hill-Valley set proved to be a challenge in terms of specificity to the OC-SVMs,but there was no issue at all with the noisy Hill-Valley set.

Looking beyond the mediocre performance on the smooth Hill-Valley set the OC-SVMs performedperfectly or very nearly perfectly in terms of both recall and specificity on every data set, as didthe Neural Network with three hidden layers. The smaller of the two Neural Networks failed toscore more than 91% on three data sets, putting this method at the bottom rank. Looking closeron the times for the methods it can be determined that the OC-SVMs are remarkably faster thanthe Neural Networks, making the former quite preferable. For example, inspecting the Strawberrydata set in Table 13 shows that the linear OC-SVM is almost 12 times faster than the smallerNeural Network.

It can with confidence be said that RQ1 can be answered ’Yes’, as not one but four solutions(counting the two Neural Networks as one solution) were demonstrated to achieve a perfect de-tection rate on the company provided data. This shows that Machine Learning can be used toperform anomaly detection to the same degree as a person. The limiting factor here is the fact thatthe labels on the data were determined by people and that no strict definitions of ’approved’ or’not approved’ exists. A sample is judged solely on the experience and disposition of the testbenchoperator and it is on this judgement which any supervised classification algorithm depends. Toenable the possibility of catching flawed or anomalous samples to a greater extent and accuracythan a person it is necessary to find or create new perspectives in which to analyse the samplesand to become self-reliant. This ties in with RQ2 as the results shows that Machine Learning is atthe very least as good as a person at recognizing flawed samples and can make such a recognitionin times that are genuinely impossible for a person to achieve.

30


9.2 Classification

As for classifying specific flaws in an anomalous sample from the mechanism no answer could begiven as no such data had yet been produced. Instead the proposed solutions, also based on SVMand compared to Neural Networks, were tested on a handful of alternative data sets. SVMs wereused for classification on these sets by implementing the 1vs1 and 1vsAll approaches with the samethree kernels as for anomaly detection: Linear, Gaussian and polynomial.Comparing the baseline accuracies in Table 15 and the accuracies where PCA has been appliedin Table 17 it is seen that PCA increases the accuracies for several of the methods and data sets,but for a few the effects were negative. For example, the Gaussian 1vsAll SVM on the MelbournePedestrian data set decreased from the baseline 80% accuracy to 61% with PCA.The resulting classifiers show mixed performance, with the overall worst accuracy being 17% andthe overall best accuracy being 100% which was achieved by every classifier except the linear andpolynomial 1vsAll SVMs on the UMD set.Given that the Melbourne Pedestrian data set has ten classes it can be considered the most difficultset for this test as blind guessing would give an accuracy circling 10%. This helps in displaying therobustness of the 1vs1 approach as all three of the kernels produced accuracies of 84-89% while allbut one of the other solutions achieved 61% or lower. In fact all three 1vs1 SVMs performed atleast as well as any of the 1vsAll SVMs on every data set and better than the Neural Networks onthree out of four data sets, making this approach highly favorable. Directly comparing the 1vs1and the 1vsAll approaches with their respective kernels shows that the 1vs1 approach consistentlyperforms equally or better than the 1vsAll approaches. On top of this the 1vs1 SVMs are a greatdeal faster than the 1vsAll SVMs as well, as seen in Table 18, thus removing any benefit to the1vsAll approach other than its slightly easier implementation. It must also be pointed out thatthe NN3 had a very clear advantage on the Ethanol data set as opposed to the 1vs1 SVMs, as theNN3 was not only faster but also had a higher accuracy for this particular set.

In short, the tests showed that a polynomial 1vs1 SVM had the overall best accuracy across alldata sets (85-100% accuracy, average 90.5%) with reasonable time efficiency (0.15-57.50 seconds,average of 17.92 seconds). It stands to be said however that a Neural Network is anticipated to beable to reach the same level of accuracy on the sets used if additional hidden layers are added. Italso stands to be said that more hidden layers equates to even longer training times.

The final research question, RQ3, could not be answered here as no data was available aboutthe faults that might manifest in the mechanism. Instead the solutions were tested for generalclassification performance on alternative data sets. The results showed great promise for the poly-nomial 1vs1 SVM which is why this method is recommended from the ones evaluated in this thesis.

The time for every proposed solution has been investigated in this thesis. The anomaly detectionperformance as well as the accuracy of the classifiers have been the primary concern, but the timeis also an important factor. With Machine Learning it is not uncommon to see algorithms takingabsurd amounts of time to converge, as can be seen in Table 10 where the NN3 took 805 secondson the Company data set. Such long times can weigh negatively for a method to such an extentthat even with otherwise good performance, the time makes it impractical. Most times recordedhere were however manageable.

In summary, a feasible system has been proposed. The hypothesis is partly proven: Anomalydetection on the provided data can very much be performed at least as well as an experiencedperson. It was however not possible to prove if Machine Learning could handle classification onthe provided data as well as an experienced person as no such data has yet been made.

31


10. Discussion

From the results of the tests performed with the selected methods a few things were made clear.Regarding the initial problem of anomaly detection all solutions had quite good performance, scor-ing perfectly, or nearly so, on close to every data set when both PCA and autoencoders had beenemployed. The Hill-Valley set was rather curious though. Looking back at Tables 2, 5 and 9 itcan be seen that the OC-SVMs perform perfectly whenever an autoencoder had not been appliedto this set. Figure 11 shows a comparison of samples from both Hill-Valley sets with and withoutautoencoders. Simultaneously the Neural Networks, which initially scored 93% and 11% specificityon this same set, scored almost perfectly after PCA and autoencoders had been applied. This phe-nomenon might be related to the shape of the Hill-Valley data though nothing conclusive could besaid.

(a) (b)

(c) (d)

Figure 11: An example comparison of samples of the Hill-Valley sets. (a) and (b) shows smoothsamples without and with an autoencoder respectively. (c) and (d) makes the same comparison fornoisy samples.

A proposed overall structure of a system which incorporates both an anomaly detector and aclassifier can be seen in Figure 12. The sensor readings on a sample are sent to the anomalydetector which either detects the sample as ’approved’ or ’not approved’ in which the readingsare forwarded to the classifier. The classifier attempts to identify the most probable cause of theanomalous readings and relays this to an operator information so that the sample can be fixed.The sample is then put back into the system and measured again, hopefully now being as it shouldbe.

As for the multiclass classifiers the results showed that the polynomial 1vs1 SVM holds the mostpromise. It was desirable to test this classifier on the mechanism data, but since no other labelsthan ’approved’ and ’not approved’ exists this was simply not an option.

10.1 Future Work

When multiclass data has been produced it could potentially be beneficial to not only construct aclassifier to determine what fault might be present such as a 1vs1 SVM, but to also implement aconfidence measure for each class. This could be useful in the situations where multiple faults have

32


Figure 12: A proposed overall structure of a system incorporating the researched solutions. Mea-surements are fed to the anomaly detection method. A measured sample is either approved orforwarded to the classifier, which provides information as to a probable cause for the sample notbeing approved by the anomaly detection method.

manifested and it is of interest to know the severity of each fault. If 1vs1 SVMs are implementedit could be possible to create a confidence score for each classification within the approach. The1vs1 approach only indicates the most probable class of the data through a form of majority vote,however it does make a comparison to all other classes as well. Some measure of confidence foreach class could be produced instead of a single class label, thus indicating to which degree anyfault might be potentially present. This could also be coupled with Fuzzy Logic as not all faultsmight be equally compromising.

In order to find a solution which might surpass human inference it would be necessary to eitherconstruct some fitness criteria or to implement an unsupervised solution, such as Support VectorClustering with a Gaussian kernel[40]. Additional data processing might also be favorable - suchas feature extraction, Independent Component Analysis (ICA) or implementation of additionalsensors to measure other aspects of the samples.

A Twin SVM in conjunction with the 1vs1 approach could potentially help bring more robustnessto the classification, due to the class-enveloping nature of the planes. Sadly the Stochastic Quasi-Newton Pinball Twin SVM attempted in this thesis could not be implemented properly. Perhapsgoing back a step in the development to Jayadeva et als Twin SVM as given in Equations 6 to9 would enable a proper evaluation of this kind of SVM. Additionally, if the SQN-PTWSVM isinvestigated further the parameters could be optimized through Differential Evolution [41].

33


References

[1] J. Xie, “Optical character recognition based on least square support vector machine,” in 2009Third International Symposium on Intelligent Information Technology Application, vol. 1, Nov2009, pp. 626–629.

[2] X. Li, X. Dong, J. Lian, Y. Zhang, and J. Yu, “Knockoff filter-based feature selection fordiscrimination of non-small cell lung cancer in ct image,” IET Image Processing, vol. 13,no. 3, pp. 543–548, 2019.

[3] T. Turki, “An empirical study of machine learning algorithms for cancer identification,” in2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC),March 2018, pp. 1–5.

[4] P. Shimpi, S. Shah, M. Shroff, and A. Godbole, “A machine learning approach for the classifi-cation of cardiac arrhythmia,” in 2017 International Conference on Computing Methodologiesand Communication (ICCMC), July 2017, pp. 603–607.

[5] A. Datta, M. Augustin, N. Gupta, S. Viswamurthy, K. M. Gaddikeri, and R. Sundaram,“Impact localization and severity estimation on composite structure using fiber bragg gratingsensors by least square-support vector regression,” IEEE Sensors Journal, 2019.

[6] V. Pliuhin, M. Pan, V. Yesina, and M. Sukhonos, “Using azure maching learning cloud technol-ogy for electric machines optimization,” in 2018 International Scientific-Practical ConferenceProblems of Infocommunications. Science and Technology (PIC S T), Oct 2018, pp. 55–58.

[7] D. Silder, J. Schrittwieser, K. Simonyan, and et al, “Mastering the game of go without humanknowledge,” Nature, vol. 550, p. 354, October 2017.

[8] V. Vapnik and A. Lerner, “Pattern recognition using generalized portraits (translated fromrussian),” Avtomatika i Telemekhanika, vol. 24, no. 6, pp. 774–780, Dec 1963.

[9] X. Wang and S. Lu, “Improved fuzzy multicategory support vector machines classifier,” in2006 International Conference on Machine Learning and Cybernetics, Aug 2006, pp. 3585–3589.

[10] G. Yan and S. Fenzhen, “Study on machine learning classifications based on oli images,” inProceedings 2013 International Conference on Mechatronic Sciences, Electric Engineering andComputer (MEC). IEEE, 2013, pp. 1472–1476.

[11] H. Liu, X. Wang, X. Bi, X. Wang, and J. Zhao, “A multi-feature svm classification of thangkaheaddress,” in 2015 8th International Symposium on Computational Intelligence and Design(ISCID), vol. 2, Dec 2015, pp. 160–163.

[12] Y. Tan and G. Zhang, “The application of machine learning algorithm in underwriting pro-cess,” in 2005 International Conference on Machine Learning and Cybernetics, vol. 6, Aug2005, pp. 3523–3527 Vol. 6.

[13] R. Rojas, Neural networks: a systematic introduction. Springer Science & Business Media,2013.

[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back prop-agating errors,” Nature, vol. 323, pp. 533–536, 10 1986.

[15] Y. Li, Y. Fu, H. Li, and S. Zhang, “The improved training algorithm of back propagation neuralnetwork with self-adaptive learning rate,” in 2009 International Conference on ComputationalIntelligence and Natural Computing, vol. 1, June 2009, pp. 73–76.

[16] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal marginclassifiers,” in Proceedings of the fifth annual workshop on Computational learning theory.ACM, 1992, pp. 144–152.

34


[17] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities withapplications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, June 1965.

[18] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp.273–297, 1995.

[19] B. Scholkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, “Supportvector method for novelty detection,” in Advances in neural information processing systems,2000, pp. 582–588.

[20] J. Weston and C. Watkins, “Multi-class support vector machines,” Citeseer, Tech. Rep., 1998.

[21] R. Bro and A. K. Smilde, “Principal component analysis,” Analytical Methods, vol. 6, no. 9,pp. 2812–2831, 2014.

[22] F. Q. Lauzon, “An introduction to deep learning,” in 2012 11th International Conference onInformation Science, Signal Processing and their Applications (ISSPA). IEEE, 2012, pp.1438–1439.

[23] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,”The bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–133, 1943.

[24] P. Werbos and P. J. (Paul John, “Beyond regression : new tools for prediction and analysisin the behavioral sciences /,” 01 1974.

[25] H. J. Kelley, “Gradient theory of optimal flight paths,” Ars Journal, vol. 30, no. 10, pp.947–954, 1960.

[26] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka, “Parallel distributed processing model withlocal space-invariant interconnections and its optical architecture,” Applied optics, vol. 29,no. 32, pp. 4790–4797, 1990.

[27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,no. 8, pp. 1735–1780, 1997.

[28] M. Aiserman, E. M. Braverman, and L. Rozonoer, “Theoretical foundations of the poten-tial function method in pattern recognition (translated from russian),” Avtomat. i Telemeh,vol. 25, pp. 917–936, 1964.

[29] O. L. Mangasarian and E. W. Wild, “Multisurface proximal support vector machine classi-fication via generalized eigenvalues,” IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 28, no. 1, pp. 69–74, Jan 2006.

[30] R. Jayadeva, Khemchandani, S. Chandra et al., “Twin support vector machines for patternclassification,” IEEE Transactions on pattern analysis and machine intelligence, vol. 29, no. 5,pp. 905–910, 2007.

[31] S. Sharma, R. Rastogi, and S. Chandra, “Large-scale twin parametric support vector machineusing pinball loss function,” IEEE Transactions on Systems, Man, and Cybernetics: Systems,pp. 1–17, 2019.

[32] W. Hu, Y. Liao, and V. R. Vemuri, “Robust anomaly detection using support vector ma-chines,” in Proceedings of the international conference on machine learning, 2003, pp. 282–289.

[33] K. Heller, K. Svore, A. D. Keromytis, and S. Stolfo, “One class support vector machines fordetecting anomalous windows registry accesses,” 2003.

[34] M. Amer, M. Goldstein, and S. Abdennadher, “Enhancing one-class support vector machinesfor unsupervised anomaly detection,” in Proceedings of the ACM SIGKDD Workshop onOutlier Detection and Description. ACM, 2013, pp. 8–15.

35


[35] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available:http://archive.ics.uci.edu/ml

[36] “Uea ucr time series classification repository,” http://www.timeseriesclassification.com/dataset.php, accessed: 2019-03-19.

[37] I. Dagher and R. Nachar, “Face recognition using ipca-ica algorithm,” IEEE transactions onpattern analysis and machine intelligence, vol. 28, no. 6, pp. 996–1000, 2006.

[38] L. Gu and X. Lu, “Semi-supervised subtractive clustering by seeding,” in 2012 9th Interna-tional Conference on Fuzzy Systems and Knowledge Discovery. IEEE, 2012, pp. 738–741.

[39] J. I. Collazo-Cuevas, M. A. Aceves-Fernandez, E. Gorrostieta-Hurtado, J. Pedraza-Ortega,A. Sotomayor-Olmedo, and M. Delgado-Rosas, “Comparison between fuzzy c-means clusteringand fuzzy clustering subtractive in urban air pollution,” in 2010 20th International Conferenceon Electronics Communications and Computers (CONIELECOMP). IEEE, 2010, pp. 174–179.

[40] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, “Support vector clustering,” Journalof machine learning research, vol. 2, no. Dec, pp. 125–137, 2001.

[41] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for globaloptimization over continuous spaces,” Journal of Global Optimization, vol. 11, no. 4, pp.341–359, Dec 1997. [Online]. Available: https://doi.org/10.1023/A:1008202821328

36

http://archive.ics.uci.edu/ml

http://www.timeseriesclassification.com/dataset.php

http://www.timeseriesclassification.com/dataset.php

https://doi.org/10.1023/A:1008202821328

machine learning for mechanical analysis

Documents