
TIME-EFFICIENT VARIANTS OF TWIN SUPPORT

VECTOR MACHINE WITH APPLICATIONS IN

IMAGE PROCESSING

by

POOJA SAIGAL

Department of Computer Science

Submitted

in fulfillment of the requirements of the degree of

Doctor of Philosophy

to the

South Asian University,

New Delhi, India

August, 2017

© South Asian University (SAU), Delhi, 2017

All Rights Reserved.

Dedicated to Amit, Akaisha

and my Family


Declaration

I hereby declare that the thesis entitled Time-efficient Variants of Twin

Support Vector Machine with Applications in Image Processing being

submitted to the South Asian University, New Delhi for the award of the degree of

Doctor of Philosophy contains the original work carried out by me under the super-

vision of Dr. Reshma Rastogi. The research work reported in this thesis is original

and has not been submitted either in part or full to any university or institution for

the award of any degree or diploma.

Pooja Saigal

Enrollment No.: SAU/CS(P)/2013/004


Certificate

This is to certify that the thesis entitled “Time-efficient Variants of Twin

Support Vector Machine with Applications in Image Processing” submit-

ted by Pooja Saigal to the South Asian University, New Delhi for the award of the

degree of Doctor of Philosophy, is a record of the bonafide research work carried out

by her under my supervision and guidance. The thesis has reached the standards

fulfilling the requirements of the regulations relating to the degree.

The results contained in this thesis have not been submitted in part or full to any

other university or institute for the award of any degree or diploma.

Dr. Reshma Rastogi

(Supervisor)

Department of Computer Science,

South Asian University,

New Delhi, India


Acknowledgments

The tenure of my Ph.D. at South Asian University has been an enriching and

fruitful experience. I am indebted to many people who made this work possible and

it is my pleasure to express my gratitude towards them.

I owe my deepest gratitude to my supervisor Dr. Reshma Rastogi. Working

with her has been a real pleasure to me, with tremendous learning and growth. She

has been a steady support throughout the duration of my Ph.D. and has oriented

me with promptness. She has always been patient and encouraging in times of new

ideas and difficulties. The discussions with her have led to key insights. Her ability

to identify and approach compelling research problems with high scientific standards

and hard work, motivated me to give my best to this research work. I also admire

her for making me feel like a friend. I could not have imagined having a better

supervisor and mentor for my research work.

I am extremely thankful to Dr. Suresh Chandra, for his valuable suggestions

and encouragement that improved the quality of my research work. It is very dif-

ficult to find a person like Dr. Chandra who is so humble and has an astounding

understanding of mathematics. I have been very privileged to get to know him and

to work with him.

I am extremely grateful to South Asian University for providing the financial

support, in the form of scholarship, to carry out this work. I also thank SAU for

providing a conducive environment and a well-equipped Machine Learning and Com-

putational Intelligence Laboratory. I am grateful to Dr. Kavita Sharma (President,

SAU). I would like to express my gratitude towards Dean, Faculty of Mathematics

and Computer Science, Dr. R.K.Mohanty and Chairperson, Department of Com-

puter Science Dr. Muhammad Abulaish, for their support and encouragement. I am

also thankful to Dr. Pranab K. Muhuri, Dr. Amit Banerjee and Dr. Danish Lohani.

I am grateful to all the members of DRC. I am thankful to my RPC members Dr.

Deepa Sinha (Department of Mathematics) and Dr. Muhammad Abulaish, for their

valuable suggestions and encouragement. I also owe my gratitude towards Dr. Ekta

Walia for her help during my initial days of Ph.D. Coursework.

The last four years have been a period of immense learning with extensive work


and I would like to thank all my colleagues for providing an excellent research en-

vironment. I appreciate their support and cooperation during my stay at SAU. I

am thankful to Aman Pal, Sweta Sharma, Pritam Anand and Yashi for being great

friends and supporting me at the time of need. The discussions with them stimu-

lated new ideas and gave different perspectives for handling a problem. I would like

to thank all my colleagues from the Department of Computer Science and Mathe-

matics.

Finally, I would like to thank the most important people in my life. This thesis

would not have been possible without their constant support and encouragement.

My husband Amit Saigal is my strength. He supported me unconditionally in every

sphere of life and has motivated me throughout my research work at SAU. I have

learnt the qualities of perseverance and dedication from him. There were multiple

times when I felt dejected and he helped me out. I will never be able to thank

him enough for his steady support at difficult times. These four years have been

a learning experience for my loving daughter Akaisha, who has learnt to be inde-

pendent as I was not always there to help her. I am grateful to my parents-in-law

who supported me with great patience and took up my responsibilities at home in my absence. I am grateful to God that I was born into a family that is so caring and supportive. I could not thank my mother Mrs. Neelam Khanna enough for

supporting me emotionally and listening to all my feelings. I am blessed to have

Nidhi and Vaibhav as my siblings, whose love and support helped me to get out

of hard times. My idol is my father Mr. Shiv Kumar Khanna, who is showering

his blessings on me from heaven. He always inspired me and had firm belief in my

capabilities.

Pooja Saigal


Preface

Human beings can display behavior that can be called intelligent, by learning from experience. Learning gives us the flexibility to adapt and adjust to new environments. The aim of learning is to generalize, which essentially means to establish

similarity between situations, so that the rules which are applicable in one situa-

tion can be applied or extended to other situations. Machine learning is a rapidly

progressing stream of artificial intelligence that enables a machine to learn from

the empirical data and build models to make reliable future predictions. Depend-

ing on the availability of output values (labels), machine learning can be broadly

categorized into two paradigms: supervised and unsupervised learning.

One of the most distinguished works in supervised learning is classification using

Support Vector Machines (SVMs). Another major breakthrough is the develop-

ment of Twin Support Vector Machine (TWSVM) which has better generalization

ability than SVM and is almost four times faster than conventional SVMs. This re-

search work is an attempt to explore the existing SVM and TWSVM based learning

algorithms and to develop new ones which could deliver better results than well-

established methodologies. Our focus is on development of time-efficient supervised

and unsupervised TWSVM-based learning algorithms, with good generalization abil-

ity, and to apply them for image processing tasks.

This thesis presents novel nonparallel hyperplane classification algorithms along

with their extension to multi-category classification and clustering approaches. Im-

provements on ν-Twin Support Vector Machine (Iν-TWSVM) is a classification

algorithm which solves a smaller-sized quadratic programming problem (QPP) and

an unconstrained minimization problem (UMP), instead of solving a pair of QPPs

as done for TWSVM, to generate two nonparallel proximal hyperplanes. The faster

version of Iν-TWSVM, termed as Iν-TWSVM (Fast), modifies the first problem of

Iν-TWSVM as minimization of a unimodal function for which line search methods

can be used; this further avoids solving the QPP in the first problem. Both these

classifiers have good generalization ability. Two more classifiers, namely Angle-based

Twin Parametric-Margin Support Vector Machine (ATP-SVM) and Angle-based

Twin Support Vector Machine (ATWSVM), have been developed which try to max-


imize the angle between the normal vectors to the two nonparallel hyperplanes so

as to generate larger separation between the two classes. ATP-SVM solves only one

modified QPP with fewer representative patterns. It avoids the explicit computation of matrix inverses in the dual problem and has an efficient learning

time. ATWSVM finds the two hyperplanes by solving a QPP and a UMP.

This work presents a multi-category classification algorithm termed as Reduced

tree for Ternary Support Vector Machine (RT-TerSVM), which organizes the clas-

sifiers in the form of a ternary tree. This algorithm uses a novel classifier Ternary

Support Vector Machine (TerSVM) to generate three nonparallel hyperplanes. An-

other novel multi-category classification algorithm termed as Ternary Decision Struc-

ture (TDS) has been developed that can extend binary classifiers to multi-category

framework. TDS is more time efficient than the classical One-Against-All (OAA)

approach. For a K-class problem, a balanced TDS requires ⌈log₃ K⌉ comparisons for

evaluating a test pattern. TDS associates ternary output labels +1, 0 or −1 with

the training patterns. Another multi-category approach Binary Tree (BT) of classi-

fiers is developed on the lines of TDS and it generates binary output at each level of

the tree. Our work compares the behavior of nonparallel hyperplanes classifiers viz.

Generalized Eigenvalue Proximal SVM (GEPSVM) and its variants, using different

multi-category approaches.

This work includes development of an unsupervised clustering algorithm termed

as Tree-based Localized Fuzzy Twin Support Vector Clustering (Tree-TWSVC),

which recursively builds a cluster model as a Binary Tree. Here, each node com-

prises a novel TWSVM-based classifier termed Localized Fuzzy TWSVM (LF-

TWSVM). Since there is uncertainty in associating cluster labels with patterns, we use fuzzy cluster membership. Tree-TWSVC has an efficient learning time, achieved due to its tree structure, and its formulation leads to solving a series of systems of linear

equations. All the above-mentioned classification and clustering algorithms have been applied to perform image processing tasks like content-based image retrieval,

image segmentation and handwritten digit recognition.


List of Publications

Papers in Journals:

1. Rastogi, R., Saigal, P. and Chandra, S., 2018: Angle-based Twin Parametric-

margin Support Vector Machine for Pattern Classification. Knowledge-

Based Systems, 139, pp. 64-77.

2. Rastogi, R., Saigal, P. and Chandra, S., 2017. Angle-based Twin Support

Vector Machine. Annals of Operations Research, DOI: 10.1007/s10479-

017-2604-2.

3. Rastogi, R. and Saigal, P., 2017. Tree-based Localized Fuzzy Twin Support

Vector Clustering with Square Loss Function. Applied Intelligence, 47 (1),

pp. 96-113.

4. Khemchandani, R., Saigal, P. and Chandra, S., 2016. Improvements on ν-

Twin Support Vector Machine. Neural Networks, 79, pp. 97-107.

5. Khemchandani, R. and Saigal, P., 2015. Color Image Classification and Re-

trieval Through Ternary Decision Structure Based Multi-category TWSVM.

Neurocomputing, 165, pp. 444-455.

Conference proceedings:

1. Saigal, P. and Khemchandani, R., 2015, December. Nonparallel Hyperplane

Classifiers for Multi-category Classification. 2015 IEEE Workshop on Compu-

tational Intelligence: Theories, Applications and Future Directions (WCI), pp.

1-6.

Communicated Papers:

1. Saigal, P., Rastogi, R. and Chandra, S.: Ternary Support Vector Machine

with Extension for Multi-category Classification.


Table of Contents

Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i

Certificate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

List of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1. Classification Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1. Twin Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.2. Least Square Twin Support Vector Machine . . . . . . . . . . . . . . . 6

1.1.3. Twin Bounded Support Vector Machine . . . . . . . . . . . . . . . . . 7

1.1.4. Twin Parametric-Margin Support Vector Machine . . . . . . . . . . . . 8

1.1.5. ν-Twin Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . 9

1.1.6. Nonparallel Support Vector Machine with One Optimization Problem . 10

1.2. Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.1. K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2.2. Maximum-Margin Clustering . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2.3. Twin Support Vector Machine for Clustering . . . . . . . . . . . . . . 12

1.3. Multi-Category Extension of Binary Classifiers . . . . . . . . . . . . . . . . . 14

1.3.1. One-Against-One Twin Support Vector Machine . . . . . . . . . . . . . 15

1.3.2. Twin-KSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4. Brief Introduction to Image Processing . . . . . . . . . . . . . . . . . . . . . . 16

1.4.1. Content-based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . 17

1.4.2. Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.4.3. Image Segmentation through Pixel Classification . . . . . . . . . . . . . 17

1.5. Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2. Improvements on ν-Twin Support Vector Machine . . . . . . . . . . . . . . . . . . 23

2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2. Improvements on ν-Twin Support Vector Machine . . . . . . . . . . . . . . . 24

2.2.1. Iν-TWSVM (Linear classifier) . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.2. Iν-TWSVM (Kernel classifier) . . . . . . . . . . . . . . . . . . . . . . . 29

2.3. Improvements on ν-Twin Support Vector Machine (Fast) . . . . . . . . . . . 31

2.4. Multi-category Extensions of Iν-TWSVM . . . . . . . . . . . . . . . . . . . . 32

2.4.1. One-Against-All Iν-TWSVM . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.2. Binary Tree of Iν-TWSVM . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.5. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.6. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.6.1. Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.6.2. Binary Classification Results: UCI and Exp-NDC datasets . . . . . . . 36

2.6.3. Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.6.4. Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.6.5. Multi-category Classification Results: UCI Datasets . . . . . . . . . . . 44

2.7. Application: Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.8. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3. Angle-based Nonparallel Hyperplanes Classifiers . . . . . . . . . . . . . . . . . . . 47

3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.2. Angle-based Twin Parametric-Margin Support Vector Machine . . . . . . . . 49

3.2.1. Selection of Representative Points . . . . . . . . . . . . . . . . . . . . . 49

3.2.2. ATP-SVM (Linear version) . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2.3. ATP-SVM (Kernel version) . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.3. Angle-based Twin Support Vector Machine . . . . . . . . . . . . . . . . . . . 55

3.3.1. ATWSVM (Linear version) . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.3.2. ATWSVM (Kernel version) . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.4. Other Versions of ATWSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.5. Multi-category Extension of ATP-SVM and ATWSVM . . . . . . . . . . . . . 63

3.6. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.7. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.7.1. Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.7.2. Binary Classification Results: UCI and NDC Datasets . . . . . . . . . 70

3.7.3. Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3.7.4. Multi-category Classification Results: UCI Datasets . . . . . . . . . . . 78

3.8. Application: Segmentation through Pixel Classification of Color Images . . . 80


3.9. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4. Ternary Support Vector Machine with Extension for Multi-category Classification 87

4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.2. Ternary Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.2.1. TerSVM (Linear version) . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.2.2. TerSVM (Kernel version) . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.2.3. TerSVM as Binary Classifier . . . . . . . . . . . . . . . . . . . . . . . . 96

4.3. Multi-category Classification Algorithm: Reduced Tree for TerSVM . . . . . . 96

4.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.5. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.5.1. Synthetic Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.5.2. Multi-category Classification Results: UCI Datasets . . . . . . . . . . . 105

4.6. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.6.1. Hand-written Digits Recognition: USPS Dataset . . . . . . . . . . . . . 110

4.6.2. Color Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5. Multi-category Classification Approaches for Nonparallel Hyperplanes Classifiers . 115

5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.2. Ternary Decision Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5.2.1. Binary Tree Multi-category Approach . . . . . . . . . . . . . . . . . . . 119

5.2.2. Content-based Image Classification using TDS-TWSVM . . . . . . . . 120

5.2.3. Content-based Image Retrieval using TDS-TWSVM . . . . . . . . . . . 121

5.2.4. Comparison of TDS-TWSVM with Other Multi-Category Approaches . 122

5.3. Eigenvalue Problem Based Classifiers . . . . . . . . . . . . . . . . . . . . . . . 122

5.3.1. Generalized Eigenvalue Proximal Support Vector Machine . . . . . . . 123

5.3.2. Regularized GEPSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.3.3. Improved GEPSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.4. Extension of NHCAs for Multi-category Classification . . . . . . . . . . . . . 126

5.5. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.5.1. Multi-category Classification Results: UCI Datasets . . . . . . . . . . . 127

5.6. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.6.1. Color Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.6.2. Content-based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . 130

5.7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134


6. Tree-Based Localized Fuzzy Twin Support Vector Clustering with Square Loss

Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.2. Tree-based Localized Fuzzy Twin Support Vector Clustering . . . . . . . . . . 139

6.2.1. Localized Fuzzy TWSVM Classifier (Linear version) . . . . . . . . . . . 141

6.2.2. LF-TWSVM (Kernel version) . . . . . . . . . . . . . . . . . . . . . . . 144

6.2.3. Clustering Algorithms: BTree-TWSVC and OAA-Tree-TWSVC . . . . 145

6.3. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.4. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.4.1. Clustering Results: UCI Datasets . . . . . . . . . . . . . . . . . . . . . 156

6.4.2. Clustering Results: Large Sized Datasets . . . . . . . . . . . . . . . . . 159

6.5. Application: Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.6. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

7. Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.1. Advantages of our Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.2. Utility and Comparative Analysis of Algorithms . . . . . . . . . . . . . . . . . 167

7.3. Pitfalls to be Avoided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7.4. The Road-map Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

A. Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

B. Loss Function of TWSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

C. UCI Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

D. Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

E. Image Features and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

E.1. Image Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

List of Figures

2. Improvements on ν-Twin Support Vector Machine . . . . . . . . . . . . . . . . . .

2.1. Two moons dataset: Classification result with Iν-TWSVM . . . . . . . . . . 35

2.2. The hyperplanes obtained for cross-planes dataset . . . . . . . . . . . . . . . 36

2.3. Two-dimensional projections of 21 test data points of Thyroid dataset . . . . 43

2.4. Two-dimensional projections of 70 test data points of WPBC dataset . . . . 43

3. Angle-based Nonparallel Hyperplanes Classifiers . . . . . . . . . . . . . . . . . . .

3.1. Geometrical illustration of angle between normal vectors to ATP-SVM hy-

perplanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2. Geometrical illustration of angle between normal vectors to ATWSVM hy-

perplanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.3. Classifiers obtained for synthetic dataset (Syn1). a. ATWSVM b. TBSVM . 61

3.4. Three-class classification with (a.) OAA-NHC (b.) BT-NHC . . . . . . . . . 64

3.5. Geometric interpretation of ATP-SVM, NSVMOOP and TPMSVM . . . . . 65

3.6. Influence of parameters on the performance of ATWSVM classifier. The

parameters c1 and c5 are assigned same value, c3=0.1 and c2 + c4 = 1 . . . . 68

3.7. Hyperplanes obtained by ATP-SVM and NSVMOOP for cross-planes dataset 69

3.8. Complex XOR dataset and the hyperplanes obtained by classifiers . . . . . . 69

3.9. Results on Ripley’s dataset with linear classifiers a. ATP-SVM b. NSV-

MOOP c. TPMSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4. Ternary Support Vector Machine with Extension for Multi-category Classification

4.1. Geometrical illustration of angle between normal vectors to the hyperplanes . 93

4.2. RT-TerSVM for dataset with 5 classes . . . . . . . . . . . . . . . . . . . . . . 97

4.3. Synthetic dataset with 300 data points. Hyperplanes obtained by a. TerSVM;

b. Twin-KSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.4. Linear TerSVM classifier with three classes . . . . . . . . . . . . . . . . . . . 105

4.5. Learning time of classifiers for UCI datasets (linear) . . . . . . . . . . . . . . 108

4.6. Learning time of classifiers for large-sized UCI datasets (linear) . . . . . . . . 108


4.7. Learning time of classifiers for UCI datasets (non-linear) . . . . . . . . . . . . 109

4.8. Learning time of classifiers for large-sized UCI datasets (non-linear) . . . . . 110

5. Multi-category Classification Approaches for Nonparallel Hyperplanes Classifiers .

5.1. Ternary Decision Structure of classifiers with 10 classes . . . . . . . . . . . . 118

5.2. Illustration of TDS-TWSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.3. Three-class problem classified by OAA and TDS . . . . . . . . . . . . . . . . 127

5.4. Image Retrieval Result for a Sample Query Image from Wang’s Dataset (a.)

Query Image (b.) 20 Images retrieved by TDS-TWSVM . . . . . . . . . . . . 132

5.5. Time Complexity Comparison of TDS-TWSVM and OAA-TWSVM . . . . . 134

6. Tree-Based Localized Fuzzy Twin Support Vector Clustering with Square Loss

Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.1. Illustration of tree of classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.2. Learning time (Linear) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.3. Learning time (Non-linear) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.4. Segmentation results on BSD images (a.) Original image (b.) MSS-KSC (c.)

TWSVC (d.) BTree-TWSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

B. Loss Function of TWSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

B.1. Flipping of labels. a. Hinge loss function; b. Square loss function . . . . . . . 190

D. Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

D.1. Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

D.2. Two moons dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

E. Image Features and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

E.1. Sample Wang’s Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

E.2. Sample COREL 5K Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

E.3. Sample MIT VisTex Sub-images . . . . . . . . . . . . . . . . . . . . . . . . . 205

E.4. Sample OT-scene Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

E.5. Sample USPS digits (0-9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

List of Tables

2. Improvements on ν-Twin Support Vector Machine . . . . . . . . . . . . . . . . . .

2.1. Classification accuracy for synthetic datasets . . . . . . . . . . . . . . . . . . 36

2.2. Classification results with linear classifier on UCI datasets . . . . . . . . . . . 38

2.3. Classification results with linear classifier on Exp-NDC datasets . . . . . . . . 39

2.4. Classification results with non-linear classifier on UCI datasets . . . . . . . . 40

2.5. Classification result with non-linear classifier on Exp-NDC datasets . . . . . . 41

2.6. Friedman test and p-values with linear classifiers for UCI datasets . . . . . . 41

2.7. Friedman test and p-values with non-linear classifiers for UCI datasets . . . . 42

2.8. Classification results with linear multi-category classifiers for UCI datasets . . 44

2.9. Pixel Classification of color images from BSD image dataset. . . . . . . . . . 45

3. Angle-based Nonparallel Hyperplanes Classifiers . . . . . . . . . . . . . . . . . . .

3.1. Classification results with linear classifiers on binary UCI datasets . . . . . . 72

3.2. Variation in classification accuracy based on selection of classes . . . . . . . . 73

3.3. Classification results with non-linear classifier on binary UCI datasets . . . . 74

3.4. Classification results with linear classifiers on NDC datasets . . . . . . . . . . 76

3.5. Classification result with non-linear classifiers on NDC datasets . . . . . . . . 77

3.6. Friedman test ranks with linear classifiers for UCI datasets . . . . . . . . . . 78

3.7. Classification results with non-linear classifier on multi-category UCI datasets 79

3.8. Friedman test and p-values with multi-category classifiers for UCI datasets . 80

3.9. Segmentation results for BSD color images . . . . . . . . . . . . . . . . . . . . 81

3.10. Segmentation results (binary) on color images from BSD image dataset . . . 83

3.11. Segmentation results (binary) on color images from BSD image dataset . . . 84

3.12. Segmentation result for BSD color images . . . . . . . . . . . . . . . . . . . . 85

3.13. Segmentation results (multi-region) with normalized cut, K-Means and ATP-

SVM on color images of BSD dataset . . . . . . . . . . . . . . . . . . . . . . 86

4. Ternary Support Vector Machine with Extension for Multi-category Classification

4.1. Classification results with linear classifier on multi-category UCI datasets . . 106


4.2. Classification results with linear classifier on large-sized multi-category UCI

datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.3. Classification results with non-linear classifier on multi-category UCI datasets 107

4.4. Classification results with non-linear classifier on large-sized multi-category

UCI datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.5. Classification accuracy with linear classifier on three-class datasets created

from USPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.6. USPS Error Rate with different approaches . . . . . . . . . . . . . . . . . . . 111

4.7. Classification accuracy for image datasets . . . . . . . . . . . . . . . . . . . . 112

5. Multi-category Classification Approaches for Nonparallel Hyperplanes Classifiers .

5.1. Comparison of NHCAs with linear classifiers . . . . . . . . . . . . . . . . . . 128

5.2. Comparison of NHCAs with nonlinear classifiers . . . . . . . . . . . . . . . . 129

5.3. Classification accuracy on different image datasets . . . . . . . . . . . . . . . 130

5.4. Average Retrieval Rate (%) for Wang’s Color Dataset . . . . . . . . . . . . . 131

5.5. Average Retrieval Rate (%) for COREL 5K Dataset . . . . . . . . . . . . . . 131

5.6. Average Retrieval Rate (%) for MIT VisTex Dataset . . . . . . . . . . . . . . 133

5.7. Average Retrieval Rate(ARR) (%) for OT-Scene Dataset . . . . . . . . . . . 133

5.8. Average Time (sec) required to build the classifier . . . . . . . . . . . . . . . 134

6. Tree-Based Localized Fuzzy Twin Support Vector Clustering with Square Loss

Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.1. Clustering with TWSVC and Tree-TWSVC for four clusters . . . . . . . . . . 151

6.2. Clustering accuracy for UCI datasets (Linear version) . . . . . . . . . . . . . 157

6.3. OoS Clustering accuracy for UCI datasets (Linear version) . . . . . . . . . . . 157

6.4. Clustering accuracy for UCI datasets (Non-linear version) . . . . . . . . . . . 158

6.5. OoS Clustering accuracy for UCI datasets (Non-linear version) . . . . . . . . 159

6.6. Segmentation result for BSD color images . . . . . . . . . . . . . . . . . . . . 163

List of Symbols

α, β, γ Lagrange multiplier vectors

ηi Projection vector for ith class patterns

‖.‖2 L2-norm

Rn n-dimensional real space

ν-SVM ν-Support Vector Machine

ν-TWSVM ν-Twin Support Vector Machine

νi User defined weight associated with ρi

φ Mapping induced by the kernel function

ρi Minimum separating distance between the patterns of ith class and hyper-

plane of other class

|.| Absolute distance

ξi Slack variable or error vector for ith class

ξi^{j+1} Slack variable for ith class in (j + 1)th iteration

A Data matrix for positive class

Ai Row vector representing ith pattern in n-dimensional real space

Am() Angular component of ART

B Data matrix for negative class

C Augmented matrix [A ; B]T

ci User defined parameter that assigns a weight to the associated term


CR-LBP-Co Texture Features: Complete Robust Local Binary Pattern with Co-occurrence Matrix

diag() Diagonal matrix

ei Column vector of 1’s of appropriate dimension

f(r, θ) Image intensity function in polar coordinates

Fnm ART coefficients

G,H Augmented data matrices

J(V ) Squared error function in K-Means Clustering

K Number of clusters in K-Means Clustering

L Lagrangian function

MA Mean of class A

mi Number of samples in ith class

n Feature dimension

P,Q Augmented data matrices

Rn() Radial component of ART

T (·) First-order Taylor’s series expansion

ui, bi Parameters of ith hyperplane in kernel version

vi Cluster center in K-Means Clustering

V*n,m(r, θ) Complex conjugate of the ART basis function Vn,m(r, θ)

wi, bi Parameters of ith hyperplane

wi^{j+1}, bi^{j+1} Parameters of ith hyperplane in (j + 1)th iteration

x Column vector representing a pattern in Rn

Xi, X̄i Data matrix for ith class and data matrix for patterns other than those in

ith class respectively


yi Label of ith pattern, yi ∈ {+1, −1}

zi Augmented vector for (wi, bi) hyperplane

(R,G,B) RGB codes for color images

ART Angular Radial Transform

ATP-SVM Angle-based Twin Parametric-Margin Support Vector Machine

ATWSVM Angle-based Twin Support Vector Machine

BT Binary tree

GEPSVM Generalized Eigenvalue Proximal SVM

Iν-TWSVM Improvements on ν-Twin Support Vector Machine

IGEPSVM Improved GEPSVM

Ker Kernel

LF-TWSVM Localized Fuzzy TWSVM

LS-TWSVM Least-squares Twin Support Vector Machine

NHCAs Nonparallel Hyperplanes Classification Algorithms

OAA One-Against-All

OAO One-Against-One

QPP Quadratic Programming Problem

RegGEPSVM Regularized GEPSVM

RT-TerSVM Reduced Tree for Ternary Support Vector Machine

SDP Semi-definite Program

SVM Support Vector Machine

TBSVM Twin Bounded Support Vector Machine

TDS Ternary Decision Structure


TerSVM Ternary Support Vector Machine

Tree-TWSVC Tree-based Localized Fuzzy Twin Support Vector Clustering

Twin-KSVC Twin Multi-class Support Vector Classification

TWSVM Twin Support Vector Machine

UMP Unconstrained Minimization Problem

Chapter 1

Introduction

Machine learning is a branch of artificial intelligence which deals with design and

development of computer programs that learn and build decision models from the

empirical data. These models can be used to predict outputs, as a human expert would, and can modify themselves when exposed to a new set of data. The focus is

on automatic learning and recognition of complex patterns in the data. A learning

algorithm should be able to progress from already seen patterns to broader general-

izations. This is referred to as inductive inference. Machine learning can be categorized

as supervised, unsupervised and semi-supervised learning, based on the availability

of data labels or output. Supervised learning uses labeled training patterns; unsuper-

vised learning is ‘learning without label information’ and semi-supervised learning

requires a few labeled patterns along with a large amount of unlabeled data. Most of the

classification and regression problems fall under the category of supervised learning,

whereas clustering is an unsupervised learning technique. Semi-supervised learning

lies between the other two approaches. When the cost of generating the labels is

very high, the classification problem can be handled as a semi-supervised problem.

The popular supervised learning approaches include Artificial Neural Networks

(ANN), Logistic Regression, Naive Bayes, Decision Trees, k-Nearest Neighbor (kNN)

and Support Vector Machine (SVM). ANNs are black box heuristic algorithms that

are computationally intensive to train and therefore hard to debug. Naive Bayes

classifiers make a very strong assumption about data distribution i.e. any two at-

tributes are independent given the output class; if this is not the case, it results

in a bad “naive” classifier. Decision trees suffer from over-fitting, and constructing an optimal decision tree is an NP-complete problem. The computation cost of kNN is very high as it computes the distance between every pair of training patterns. Although none of these algorithms proves to be the best for all types of problems, each has application areas where it does well.

Support Vector Machine (SVM) has proved to be an effective classification tool

[1, 2] in the field of machine learning. SVM has its foundation in statistical learning

theory and its formulation is based on structural risk minimization (SRM) princi-

ple [3, 4]. The optimization task for SVM involves the minimization of a convex

quadratic function subject to linear inequality constraints. Since SVM solves a convex optimization problem, it guarantees an optimal solution. SVM was initially

proposed for classification problems, but later it was extended to regression. SVM

has good generalization ability and with an appropriate kernel, it can handle linearly

inseparable data. It is also fairly robust against over-fitting and is popularly used

for high dimensional data. Over the past few decades, various amendments to SVM

have been suggested, such as Lagrangian Support Vector Machine (LSVM) [5], a

Smooth Support Vector Machine (SSVM) for classification [6], Least Squares Sup-

port Vector Machine (LS-SVM) [7] and Proximal Support Vector Machine (PSVM)

[8]. Contrary to parallel hyperplane classifiers like SVM, Mangasarian and Wild

proposed Generalized Eigenvalue Proximal SVM (GEPSVM) [9] which is a nonpar-

allel hyperplanes classifier (NHC) and generates two hyperplanes instead of one.

Twin Support Vector Machine (TWSVM) [10, 11] is another binary classifier that

is motivated by GEPSVM and is almost four times faster than SVM.

The motivation behind this research work is to explore existing machine learn-

ing algorithms based on SVM and TWSVM, and to develop new ones which could

deliver better results than well-established methodologies. This research work in-

cludes study of convex optimization problems and introduces new classification and

clustering tools that have good generalization ability and are, at the same time, time-efficient. Since these classification algorithms cater to problems with two classes only, we have tried to develop effective algorithms which could extend existing binary classifiers to the multi-category scenario. Taking motivation from SVM and TWSVM, we have explored the option of developing a classifier for three classes and its extension to the multi-category scenario. Our work includes the development of clustering algorithms that use supervised tools in an iterative framework and deliver better results than state-

of-the-art clustering methods. Machine learning has been used for various real world

applications like face detection, malicious software detection, weather forecasting,

web page classification, genetics and numerous other problems. This motivated us

to apply machine learning tools to real-world problems and, for this work, we

have focused on image processing tasks like image classification, retrieval and seg-

mentation. The following sections explore the existing classification and clustering

techniques.

1.1 Classification Techniques

The pattern classification problem deals with the generation of a classifier function

which can separate the data belonging to two or more classes. It learns from the

training data and should generalize well, i.e., it should be able to classify unseen test data with satisfactory accuracy. The classifier is trained with ‘training data’, parameters are tuned with ‘validation data’ and the performance of the classifier is evaluated using unseen ‘test data’.

For a binary classification problem, let the patterns belonging to positive and

negative classes be represented by matrices A and B respectively and the number of

patterns in these classes be given by m1 and m2 (m = m1 + m2); therefore, the orders of matrices A and B are (m1 × n) and (m2 × n) respectively. Here, n is the dimension of the feature space and Ai (i = 1, 2, ..., m1) is a row vector in the n-dimensional real space Rn that represents the feature vector of a data sample. The labels yi ∈ {+1, −1} for positive and negative class patterns are +1 and −1 respectively. In this thesis, ‘positive class’ and ‘Class +1’ are used interchangeably; similarly, ‘negative class’ and ‘Class −1’ refer to the same set of patterns.

1.1.1 Twin Support Vector Machine

SVM is a parallel planes classifier, which separates the data using two hyperplanes

that are parallel to each other. Recently, Jayadeva et al. [10] proposed Twin Sup-

port Vector Machine (TWSVM) as a nonparallel hyperplanes classifier. Our research

work is motivated by TWSVM and is mainly concentrated on nonparallel hyper-

planes classifiers. In the following section, we present a brief review of TWSVM and


some of its variants.

TWSVM [12] is a supervised learning tool that classifies data by generating

two nonparallel hyperplanes which are proximal to their respective classes and at

least unit distance away from the patterns of other class. TWSVM solves a pair

of quadratic programming problems (QPPs) and is based on empirical risk mini-

mization (ERM) principle. The binary classifier TWSVM [10, 11] determines two

nonparallel hyperplanes by solving two related SVM-type problems, each of which

has fewer constraints than those in a conventional SVM. The hyperplanes are given

by

$$x^T w_1 + b_1 = 0 \quad \text{and} \quad x^T w_2 + b_2 = 0, \tag{1.1}$$

where w1, w2 are the normal vectors and b1, b2 the bias terms of the two hyperplanes, which are referred to as the positive and negative hyperplanes respectively. The proximal hyperplanes are obtained by

solving the following pair of QPPs.

TWSVM1:
$$\min_{w_1, b_1, \xi_2} \quad \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + c_1 e_2^T \xi_2$$
$$\text{subject to} \quad -(Bw_1 + e_2 b_1) + \xi_2 \geq e_2, \quad \xi_2 \geq 0. \tag{1.2}$$

TWSVM2:
$$\min_{w_2, b_2, \xi_1} \quad \frac{1}{2}\|Bw_2 + e_2 b_2\|_2^2 + c_2 e_1^T \xi_1$$
$$\text{subject to} \quad (Aw_2 + e_1 b_2) + \xi_1 \geq e_1, \quad \xi_1 \geq 0. \tag{1.3}$$

Here, c1 (or c2) > 0 is a trade-off factor between error vector ξ2 (or ξ1) due to

misclassified negative (or positive) class patterns and distance of hyperplane from

positive (or negative) class; e1, e2 are vectors of ones of appropriate dimensions

and ‖.‖2 represents L2 norm. The first term in the objective function of (1.2) or

(1.3) is the sum of squared distances of the hyperplane to the data patterns of its

own class. Thus, minimizing this term tends to keep the hyperplane closer to the

patterns of one class and the constraints require the hyperplane to be at least unit

distance away from the patterns of the other class. Since this constraint of unit-distance separability cannot always be satisfied, TWSVM is formulated as a soft-margin

classifier and a certain amount of error is allowed. If the hyperplane is less than

unit distance away from data patterns of other class, then the error variables ξ1 and

ξ2 measure the amount of violation. The objective function minimizes L1-norm of

error variables to reduce misclassification. The solution of the problems (1.2) and

(1.3) can be obtained indirectly by solving their Lagrangian functions and using

Karush-Kuhn-Tucker (KKT) conditions [13]. The Wolfe duals of (TWSVM1) and (TWSVM2) are as follows:

DTWSVM1:
$$\max_{\alpha} \quad e_2^T \alpha - \frac{1}{2}\alpha^T G (H^T H)^{-1} G^T \alpha$$
$$\text{subject to} \quad 0 \leq \alpha \leq c_1, \tag{1.4}$$

DTWSVM2:
$$\max_{\beta} \quad e_1^T \beta - \frac{1}{2}\beta^T P (Q^T Q)^{-1} P^T \beta$$
$$\text{subject to} \quad 0 \leq \beta \leq c_2. \tag{1.5}$$

Here, H = [A e1], G = [B e2], P = [A e1], Q = [B e2] are augmented matrices

of the respective classes. The augmented vectors z1 = [w1^T, b1]^T and z2 = [w2^T, b2]^T are given by

$$z_1 = -(H^T H)^{-1} G^T \alpha, \tag{1.6}$$
$$z_2 = (Q^T Q)^{-1} P^T \beta, \tag{1.7}$$

where α = (α1, α2, ..., αm2)^T and β = (β1, β2, ..., βm1)^T are the Lagrange multipliers.

As we obtain the solutions (w1, b1) and (w2, b2) of the problems (1.2) and (1.3)

respectively, a new data sample x ∈ Rn is assigned to class r (r = 1, 2), depending

on which of the two planes given by (1.1) it lies closer to, i.e.,

$$r = \arg\min_{l=1,2} \ \frac{|x^T w_l + b_l|}{\|w_l\|_2}, \tag{1.8}$$

where |·| denotes the perpendicular distance of the point x from the plane x^T w_l + b_l = 0, l = 1, 2. The label assigned to the test data is given as

$$y = \begin{cases} +1 & (r = 1), \\ -1 & (r = 2). \end{cases}$$

The complexity of the SVM problem is of the order m^3, where m is the total number of patterns appearing in the constraints, whereas TWSVM solves two problems (1.2) and (1.3), each of which has approximately m/2 constraints. Therefore, the ratio of the learning time of SVM to that of TWSVM is approximately m^3 / (2 × (m/2)^3) = 4 : 1; this makes TWSVM almost four times faster than SVM [10].
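To make the preceding formulation concrete, the following Python sketch solves the box-constrained duals (1.4)–(1.5) with a generic bound-constrained optimizer and then applies the decision rule (1.8). This is only an illustrative reconstruction, not the implementation used in this thesis; in particular, the small ridge term eps added before inverting H^T H and Q^T Q is an assumption made to keep the inverses well conditioned.

```python
import numpy as np
from scipy.optimize import minimize

def twsvm_train_linear(A, B, c1=1.0, c2=1.0, eps=1e-6):
    """Linear TWSVM via the duals (1.4)-(1.5); returns (w1, b1), (w2, b2)."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])        # H = [A e1]
    G = np.hstack([B, np.ones((m2, 1))])        # G = [B e2]
    HtH_inv = np.linalg.inv(H.T @ H + eps * np.eye(H.shape[1]))  # ridge eps is an assumption
    GtG_inv = np.linalg.inv(G.T @ G + eps * np.eye(G.shape[1]))

    def solve_box_qp(M, ub, m):
        # minimize 1/2 a^T M a - e^T a  subject to  0 <= a <= ub
        fun = lambda a: 0.5 * a @ M @ a - a.sum()
        jac = lambda a: M @ a - np.ones(m)
        return minimize(fun, np.zeros(m), jac=jac, bounds=[(0, ub)] * m,
                        method='L-BFGS-B').x

    alpha = solve_box_qp(G @ HtH_inv @ G.T, c1, m2)    # dual (1.4)
    beta = solve_box_qp(H @ GtG_inv @ H.T, c2, m1)     # dual (1.5)
    z1 = -HtH_inv @ G.T @ alpha                        # eq. (1.6)
    z2 = GtG_inv @ H.T @ beta                          # eq. (1.7)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def twsvm_predict(x, planes):
    """Assign x to the class whose hyperplane is nearer, eq. (1.8)."""
    (w1, b1), (w2, b2) = planes
    d1 = abs(x @ w1 + b1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 + b2) / np.linalg.norm(w2)
    return +1 if d1 <= d2 else -1
```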

TWSVM has been extended to handle linearly inseparable data by considering

two kernel generated surfaces, given as:

$$Ker(x^T, C^T) u_1 + b_1 = 0, \tag{1.9}$$
$$Ker(x^T, C^T) u_2 + b_2 = 0, \tag{1.10}$$

where CT = [A ; B]T is the augmented data matrix and Ker is an appropriately

chosen kernel. The primal QPP of non-linear TWSVM corresponding to the surface

(1.9) is given by

K-TWSVM1:
$$\min_{u_1, b_1, \xi_2} \quad \frac{1}{2}\|Ker(A, C^T) u_1 + e_1 b_1\|_2^2 + c_1 e_2^T \xi_2$$
$$\text{subject to} \quad -(Ker(B, C^T) u_1 + e_2 b_1) + \xi_2 \geq e_2, \quad \xi_2 \geq 0. \tag{1.11}$$

The second problem of non-linear TWSVM can be defined in a similar manner as (1.11), and the solutions are obtained from the dual problems, as done for the linear case [10].
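The kernel version only requires the kernel blocks Ker(A, C^T) and Ker(B, C^T) appearing in (1.11). As a small illustration (the Gaussian kernel and the width parameter gamma are assumptions of this sketch; the thesis only requires an appropriately chosen kernel), these blocks can be computed as follows.

```python
import numpy as np

def rbf_kernel(X, C, gamma=0.5):
    """Gaussian kernel block: entry (i, j) = exp(-gamma * ||X_i - C_j||^2)."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                - 2.0 * X @ C.T
                + np.sum(C ** 2, axis=1)[None, :])
    return np.exp(-gamma * sq_dists)

# With the augmented data matrix C = [A; B] stacked row-wise:
# KA = rbf_kernel(A, np.vstack([A, B]))   # Ker(A, C^T), of size m1 x m
# KB = rbf_kernel(B, np.vstack([A, B]))   # Ker(B, C^T), of size m2 x m
```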

In the last decade, TWSVM has attracted many researchers and a lot of work

has been done based on TWSVM. It is beyond the scope of this thesis to discuss all

of them, but a few variants of TWSVM are briefly discussed in the following sections, which give a better understanding of our research work.

1.1.2 Least Square Twin Support Vector Machine

Least Square Twin Support Vector Machine (LS-TWSVM) [14] is motivated by

TWSVM and solves a pair of QPPs on the lines of LS-SVM [7]. LS-TWSVM modifies

the primal problems of TWSVM and solves them directly instead of finding the


dual problems. Further, the solution of primal problems is reduced to solving two

systems of linear equations instead of solving two QPPs along with two systems of

linear equations, as required in TWSVM. The primal problems of LS-TWSVM deal

with equality constraints and are given as follows:

LS-TWSVM1:
$$\min_{w_1, b_1, \xi_2} \quad \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + \frac{c_1}{2}\xi_2^T \xi_2$$
$$\text{subject to} \quad -(Bw_1 + e_2 b_1) + \xi_2 = e_2. \tag{1.12}$$

LS-TWSVM2:
$$\min_{w_2, b_2, \xi_1} \quad \frac{1}{2}\|Bw_2 + e_2 b_2\|_2^2 + \frac{c_2}{2}\xi_1^T \xi_1$$
$$\text{subject to} \quad (Aw_2 + e_1 b_2) + \xi_1 = e_1. \tag{1.13}$$

The QPPs (1.12) and (1.13) use the L2-norm of the error variables ξ1, ξ2 with weights c1, c2, whereas TWSVM uses the L1-norm of the error variables. This makes the constraints ξ2 ≥ 0 and ξ1 ≥ 0 of (1.2) and (1.3), respectively, redundant.

Linear LS-TWSVM obtains the classifier with two matrix inverse operations, each of order (n + 1) × (n + 1), where n ≪ m. LS-TWSVM has been extended to the non-linear kernel case by considering kernel-generated surfaces [14].
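As an illustration of why LS-TWSVM is fast, the following sketch obtains both planes in closed form. The two linear systems below are my reconstruction obtained by substituting the equality constraints of (1.12)–(1.13) into the objectives and setting the gradients to zero; they are not quoted from the thesis.

```python
import numpy as np

def lstwsvm_train_linear(A, B, c1=1.0, c2=1.0):
    """Closed-form linear LS-TWSVM: each plane comes from one (n+1)x(n+1) linear system."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # [B e2]
    e1 = np.ones(A.shape[0])
    e2 = np.ones(B.shape[0])
    # Stationarity of the substituted objectives gives the two systems below.
    z1 = np.linalg.solve(G.T @ G + (1.0 / c1) * H.T @ H, -G.T @ e2)   # [w1; b1]
    z2 = np.linalg.solve(H.T @ H + (1.0 / c2) * G.T @ G, H.T @ e1)    # [w2; b2]
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```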

1.1.3 Twin Bounded Support Vector Machine

Similar to TWSVM, Twin Bounded Support Vector Machine (TBSVM) [15] also

constructs two nonparallel hyperplanes, as given in (1.1), by solving two QPPs. How-

ever, TBSVM distinguishes itself from TWSVM by adding a regularization term,

in the primal problems of TWSVM, with the idea of maximizing the margin [15].

TWSVM takes care of the empirical risk whereas TBSVM minimizes both the em-

pirical as well as structural risk. TBSVM considers the following primal problems:

TBSVM1:
$$\min_{w_1, b_1, \xi_2} \quad \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + c_1 e_2^T \xi_2 + \frac{c_3}{2}(\|w_1\|_2^2 + b_1^2)$$
$$\text{subject to} \quad -(Bw_1 + e_2 b_1) + \xi_2 \geq e_2, \quad \xi_2 \geq 0. \tag{1.14}$$

TBSVM2:
$$\min_{w_2, b_2, \xi_1} \quad \frac{1}{2}\|Bw_2 + e_2 b_2\|_2^2 + c_2 e_1^T \xi_1 + \frac{c_4}{2}(\|w_2\|_2^2 + b_2^2)$$
$$\text{subject to} \quad (Aw_2 + e_1 b_2) + \xi_1 \geq e_1, \quad \xi_1 \geq 0. \tag{1.15}$$

The constants c1, c2, c3 and c4 are positive parameters which associate weights

with the corresponding terms. The TBSVM QPPs are solved in a similar manner as TWSVM and can be extended to a non-linear kernel version.
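In relation to the earlier TWSVM sketch, the practical effect of the extra regularization terms in (1.14)–(1.15) is that the matrices inverted while solving the duals acquire a ridge, so no artificial regularizer is needed. The snippet below is a reconstruction from the primal problems above (in the spirit of [15]), not a quotation of the thesis.

```python
import numpy as np

def tbsvm_dual_matrices(A, B, c3, c4):
    """Regularized matrices used in place of H^T H and Q^T Q when solving the TBSVM duals."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # [B e2]
    k = H.shape[1]
    HtH_inv = np.linalg.inv(H.T @ H + c3 * np.eye(k))
    GtG_inv = np.linalg.inv(G.T @ G + c4 * np.eye(k))
    return HtH_inv, GtG_inv
```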

1.1.4 Twin Parametric-Margin Support Vector Machine

Twin Parametric-Margin Support Vector Machine (TPMSVM) is a binary classifier

that determines two nonparallel parametric-margin hyperplanes by solving two re-

lated SVM-type problems [16], each of which is smaller than a conventional SVM

[2] or Parametric ν-Support Vector Machine (par-ν-SVM) [17] problem. TPMSVM

separates the data of the two classes if and only if

$$A_i w_1 + b_1 \geq 0 \quad \text{for } A_i \in A,$$
$$B_i w_2 + b_2 \leq 0 \quad \text{for } B_i \in B, \tag{1.16}$$

where Ai and Bi represent the ith data sample of their respective classes. The primal

formulation for the pair of QPPs in TPMSVM is given as follows:

TPMSVM1:
$$\min_{w_1, b_1, \xi_1} \quad \frac{1}{2}\|w_1\|_2^2 + \frac{c_1}{m_2} e_2^T (Bw_1 + e_2 b_1) + \frac{c_2}{m_1} e_1^T \xi_1$$
$$\text{subject to} \quad Aw_1 + e_1 b_1 \geq 0 - \xi_1, \quad \xi_1 \geq 0, \tag{1.17}$$

TPMSVM2:
$$\min_{w_2, b_2, \xi_2} \quad \frac{1}{2}\|w_2\|_2^2 - \frac{c_3}{m_1} e_1^T (Aw_2 + e_1 b_2) + \frac{c_4}{m_2} e_2^T \xi_2$$
$$\text{subject to} \quad Bw_2 + e_2 b_2 \leq 0 + \xi_2, \quad \xi_2 \geq 0. \tag{1.18}$$

The constants c1, c2, c3, c4 > 0 are trade-off factors; e1, e2 are vectors of ones, in real space, of appropriate dimensions and ‖.‖2 represents the L2-norm. The first term of the objective function of (1.17) and (1.18) controls the complexity of the model. The second term of (1.17) minimizes the sum of projection values of negative

class training patterns on the hyperplane of positive class, with parameter c1. The

objective function also minimizes the sum of errors, which occur due to data patterns lying on the wrong sides of the hyperplanes. The constraints of (1.17) require

that the projection values of positive training patterns on the positive hyperplane

should be at least zero. A slack vector ξ1 measures the amount of error due to positive

training points. The optimization problem of (1.18) can be defined analogously.

1.1.5 ν-Twin Support Vector Machine

X. Peng [18] proposed a modification to TWSVM, termed ν-Twin Support Vector Machine (ν-TWSVM), which introduces two parameters ν1 and ν2 instead of the trade-off parameters c1 and c2 of TWSVM. The parameters ν1, ν2 in ν-TWSVM control the bounds on the number of support vectors and the margin errors. The primal

optimization problems of ν-TWSVM are as follows:

ν-TWSVM1:
$$\min_{w_1, b_1, \rho_1, \xi_2} \quad \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 - \nu_1 \rho_1 + \frac{1}{m_2} e_2^T \xi_2$$
$$\text{subject to} \quad -(Bw_1 + e_2 b_1) + \xi_2 \geq e_2 \rho_1, \quad \xi_2 \geq 0, \ \rho_1 \geq 0. \tag{1.19}$$

ν-TWSVM2:
$$\min_{w_2, b_2, \rho_2, \xi_1} \quad \frac{1}{2}\|Bw_2 + e_2 b_2\|_2^2 - \nu_2 \rho_2 + \frac{1}{m_1} e_1^T \xi_1$$
$$\text{subject to} \quad (Aw_2 + e_1 b_2) + \xi_1 \geq e_1 \rho_2, \quad \xi_1 \geq 0, \ \rho_2 \geq 0. \tag{1.20}$$

Here, ρi (i = 1, 2) measures the minimum separating distance between the patterns of one class and the hyperplane of the other class and is optimized in (1.19) and (1.20). Both optimization problems try to maximize this distance. The role of ρi is to separate the data patterns of one class from the hyperplane of the other class by a margin of ρi/(wi^T wi), where i = 1, 2 [18]. The parameter ν2 (or ν1) determines an

upper bound on the fraction of positive class (or negative class) margin errors and a

lower bound on the fraction of positive class (or negative class) support vectors [18].


1.1.6 Nonparallel Support Vector Machine with One Optimization

Problem

Tian and Ju [19] proposed a binary classifier Nonparallel Support Vector Machine

with One Optimization Problem (NSVMOOP), which determines the two nonparallel

proximal hyperplanes by solving a single optimization problem. NSVMOOP aims

at maximizing the angle between the normal vectors of the two hyperplanes. NSV-

MOOP combines the two QPPs of TWSVM together and formulates a single QPP

which is given as

NSVMOOP:
$$\min_{w_1, b_1, \eta_1, \xi_1, w_2, b_2, \eta_2, \xi_2} \quad \frac{1}{2}(\|w_1\|_2^2 + \|w_2\|_2^2) + c_1(\eta_1^T \eta_1 + \eta_2^T \eta_2 + e_1^T \xi_1 + e_2^T \xi_2) + c_2 (w_1 \cdot w_2)$$
$$\text{subject to} \quad Aw_1 + e_1 b_1 = \eta_1, \quad Bw_2 + e_2 b_2 = \eta_2,$$
$$-(Bw_1 + e_2 b_1) + \xi_2 \geq e_2, \quad \xi_2 \geq 0,$$
$$(Aw_2 + e_1 b_2) + \xi_1 \geq e_1, \quad \xi_1 \geq 0, \tag{1.21}$$

where c1 and c2 are positive trade-off parameters. The first set of terms in the

objective function of (1.21) are the regularization terms. The second set of terms

consist of two types of errors. The error terms η1^T η1 and η2^T η2 are the sums of the squared distances of data patterns from their own hyperplanes, and hence their minimization keeps the respective hyperplanes proximal to the patterns of their own class. The other error terms e1^T ξ1 and e2^T ξ2 are the sums of errors contributed due

to violation of corresponding constraints. The term w1.w2 in the objective function

is the inner product of normal vectors to the hyperplanes and its minimization

essentially maximizes the separation between the two classes.

1.2 Clustering Techniques

Clustering is an unsupervised learning task, which aims at partitioning data into a

number of clusters [20, 21]. Patterns that belong to the same cluster should have

affinity with each other and must be distinct from the patterns in other clusters.


Clustering has its application in various domains of data analysis which include

medical science, finance, pattern recognition and image analysis [22, 23, 24].

For a K-cluster problem, let there be m data patterns X = (x1, x2, ..., xm)^T, where xi ∈ Rn, with their corresponding labels in {1, 2, ..., K}; X is an m × n matrix.

A few widely accepted clustering algorithms include K-Means clustering [25], Fuzzy

c-means clustering [26], Hierarchical clustering [21] etc. All these are unsupervised

learning algorithms, but recently supervised learning approaches have been used to

solve clustering problems such as Maximum-Margin Clustering (MMC) [27], Twin

Support Vector Machine for Clustering (TWSVC) [28] etc. Some of the clustering

approaches, which directly influence our research work, are briefly explained in the

following section.

1.2.1 K-Means Clustering

K-Means clustering [25] is a popular unsupervised learning algorithm that identifies

a given number of clusters (K) in a dataset. The idea is to initially define K cluster centers and improve them iteratively. This algorithm aims at minimizing the squared

error function given by:

$$J(V) = \sum_{i=1}^{K} \sum_{j=1}^{c_i} \|x_j - v_i\|_2^2, \tag{1.22}$$

where ‖xj − vi‖2^2 is the squared Euclidean distance between the data pattern xj and the cluster center vi; ci is the number of data points in the ith cluster and K is the number of clusters. In each iteration, new cluster centers are calculated until the termination criterion is reached.
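A minimal sketch of the procedure just described (random initialization of the K centers is one common choice, assumed here):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest center and recomputing
    each center as the mean of its assigned points, which decreases (1.22)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                         # assignment step
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                 # termination criterion
            break
        centers = new_centers
    return labels, centers
```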

1.2.2 Maximum-Margin Clustering

Motivated by the success of maximum margin methods in supervised learning, Xu

et al. proposed Maximum-Margin Clustering (MMC) [27] that aims at extending

maximum margin methods to unsupervised learning. Since its optimization problem is non-convex, MMC relaxes it into a semidefinite program (SDP).

For the training set {xi}, i = 1, ..., m, where xi is the input in n-dimensional space and y = (y1, ..., ym) are the unknown cluster labels, the primal problem for MMC is given as

$$\min_{y} \ \min_{w, b, \xi} \quad \|w\|_2^2 + 2C \xi^T e$$
$$\text{subject to} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad y_i \in \{+1, -1\}, \ i = 1, \ldots, m,$$
$$-l \leq e^T y \leq l, \tag{1.23}$$

where φ is the mapping induced by the kernel function and ξ is a vector of error variables; ‖·‖2^2 denotes the squared L2-norm. e is a vector of ones of appropriate dimension and (w, b) are the parameters of the hyperplane that separates the two clusters. The parameter C is a trade-off factor and l ≥ 0 is a user-defined constant that controls the class-imbalance condition. Since the constraint yi ∈ {+1, −1} ⇔ yi^2 − 1 = 0 is non-convex, (1.23) is a non-convex optimization problem. As discussed in [27], MMC relaxes the non-convex optimization problem and solves it as an SDP. The SDP is convex but computationally very expensive and can handle only small datasets.

Zhang et al. proposed an iterative SVM approach to solve the MMC problem (1.23)

based on alternating optimization [29].

1.2.3 Twin Support Vector Machine for Clustering

Twin Support Vector Machine for Clustering (TWSVC) [28] is a plane-based clustering method which uses the TWSVM classifier and follows the One-Against-All (OAA) approach to determine K cluster center planes for a K-cluster problem. Since TWSVC considers all the data patterns (in an OAA manner) for finding the cluster planes, it requires that each plane be close to its own cluster and away from the other clusters' data points on both sides. Let the data for the ith cluster be represented by X_i and the data points of all the other clusters by \bar{X}_i. For a K-cluster problem, TWSVC seeks K cluster center planes, given as

x^T w_i + b_i = 0, \quad i = 1, 2, ..., K.   (1.24)

The planes are proximal to the data points of their own cluster. TWSVC uses an initialization algorithm to get the initial cluster labels for the data points and determines the initial cluster planes. The algorithm alternately updates the labels of the data points and the cluster center planes until the termination condition is satisfied [28]. The cluster planes are obtained by considering the following set of problems, with initial cluster plane parameters [w_i^0, b_i^0]:

TWSVC:

\min_{w_i^{j+1}, b_i^{j+1}, \xi_i^{j+1}} \; \frac{1}{2}\|X_i w_i^{j+1} + e b_i^{j+1}\|_2^2 + c e^T \xi_i^{j+1}
subject to \; T(|\bar{X}_i w_i^{j+1} + e b_i^{j+1}|) + \xi_i^{j+1} \geq e, \quad \xi_i^{j+1} \geq 0,   (1.25)

where i = 1, 2, ..., K is the index of the clusters and j = 0, 1, 2, ... is the index of the successive problems. T(·) denotes the first-order Taylor series expansion and the parameter c is the weight associated with the error vector. The optimization problem in (1.25) determines the ith cluster center plane, which is required to be as close as possible to the ith cluster X_i and far away from the other clusters' data points \bar{X}_i on both sides. The problem also minimizes the error vector ξ_i, which measures the error due to wrong assignment of cluster labels. By introducing the sub-gradient of |\bar{X}_i w_i^{j+1} + e b_i^{j+1}|, (1.25) becomes

\min_{w_i^{j+1}, b_i^{j+1}, \xi_i^{j+1}} \; \frac{1}{2}\|X_i w_i^{j+1} + e b_i^{j+1}\|_2^2 + c e^T \xi_i^{j+1}
subject to \; diag(sign(\bar{X}_i w_i^{j} + e b_i^{j}))(\bar{X}_i w_i^{j+1} + e b_i^{j+1}) \geq e - \xi_i^{j+1}, \quad \xi_i^{j+1} \geq 0.   (1.26)

The solution of the above problem can be obtained by solving its dual problem [13], which is given by

\max_{\alpha} \; e^T\alpha - \frac{1}{2}\alpha^T G (H^T H)^{-1} G^T\alpha
subject to \; 0 \leq \alpha \leq c e,   (1.27)

where G = diag(sign(\bar{X}_i w_i^{j} + e b_i^{j}))[\bar{X}_i \; e], H = [X_i \; e], and \alpha \in R^{m - m_i} is

the vector of Lagrange multipliers. The problem in (1.27) is solved iteratively by the concave-convex procedure (CCCP) [30], until the change in successive iterations becomes insignificant. TWSVC is extended to manifold clustering [28] by using a kernel [31]. It uses an initialization procedure [28] based on the nearest neighbor graph (NNG), which provides more stability to the algorithm.
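The overall alternation of TWSVC can be sketched as follows; for brevity, the inner plane update below uses a simple regularized proximal-plane fit (smallest eigenvector) as a stand-in for the full CCCP solution of (1.26), and a random labelling in place of the NNG initialization, so the sketch only illustrates the label/plane alternation and is not the exact algorithm of [28]:

    import numpy as np

    def twsvc_sketch(X, K, n_iter=20, reg=1e-3, seed=0):
        # Alternate between fitting one proximal plane per cluster and relabelling
        # every pattern by its nearest plane (a simplified stand-in for TWSVC).
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, K, len(X))              # stand-in for the NNG initialization
        Xa = np.hstack([X, np.ones((len(X), 1))])        # augmented patterns [x, 1]
        planes = rng.normal(size=(K, X.shape[1] + 1))
        for _ in range(n_iter):
            for i in range(K):
                Xi = Xa[labels == i]
                if len(Xi) == 0:
                    continue
                # proximal fit: the smallest eigenvector of Xi^T Xi + reg*I keeps the
                # plane close to its own cluster (stand-in for the CCCP solution of (1.26))
                M = Xi.T @ Xi + reg * np.eye(Xi.shape[1])
                w = np.linalg.eigh(M)[1][:, 0]
                nrm = np.linalg.norm(w[:-1])
                planes[i] = w / (nrm if nrm > 1e-12 else 1.0)
            # relabel every pattern by its nearest cluster plane
            new_labels = np.abs(Xa @ planes.T).argmin(axis=1)
            if np.array_equal(new_labels, labels):       # termination: labels unchanged
                break
            labels = new_labels
        return labels, planes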

1.3 Multi-Category Extension of Binary Classifiers

SVM and TWSVM have been widely studied as binary classifiers and researchers

have been trying to extend them to multi-category classification problems. There

are two approaches to handle multi-category data. One option is to construct and

combine several binary classifiers which consider a part of data. The other option is

to formulate a single optimization problem which uses the entire data [32]. A single problem generally involves a large number of variables, is computationally more expensive than the first approach, and has applicability limited to smaller datasets only.

Multi-category SVMs have been implemented by constructing several binary

classifiers and integrating their results; two such approaches are One-Against-All

(OAA) and One-Against-One (OAO) Support Vector Machines [32]. OAA-SVM

implements a series of binary classifiers where each classifier separates one class from the rest of the classes. However, this approach leads to unbalanced classification due to the huge difference in the number of patterns in the two groups. For a K-class classification problem, OAA-SVM builds K binary classifiers and requires a similar number of binary SVM comparisons for each test sample. In the case of OAO-SVM, the binary SVM classifiers are determined

using a pair of classes at a time. So, it formulates up to K(K−1)/2 binary SVM classifiers, which increases the computational complexity. Also, Directed Acyclic Graph SVMs (DAG-SVMs) are proposed in [33], in which the training phase is the same as for OAO-SVMs, i.e. it generates K(K−1)/2 binary SVMs; however, its testing phase is different. During testing, it uses a rooted binary directed acyclic graph which has K(K−1)/2 internal nodes and K leaves. Jayadeva et al. proposed

fuzzy linear proximal Support Vector Machines for multi-category data classification

[34]. Lei et al. proposed Half-Against-Half (HAH) multiclass-SVM [35]. HAH is

built via recursively dividing the training dataset of K classes into two subsets of

classes. It constructs a decision tree where each node is a binary SVM classifier.

Shao et al. proposed a Decision Tree Twin Support Vector Machine (DTTSVM) for multi-category classification [36], by constructing a binary tree based on the best separating principle. The multi-category approaches which researchers have originally proposed for SVM are also applicable to TWSVM. Xie et al. extended TWSVM for multi-category classification [37] using the OAA approach.

In this section, we briefly discuss two approaches to extend nonparallel hyperplanes classifiers to the multi-category framework: One-Against-All TWSVM (OAA-TWSVM) and Twin-KSVC.

1.3.1 One-Against-All Twin Support Vector Machine

Let a K-class dataset consist of m patterns, represented by X ∈ R^{m×n}, where each pattern is associated with a label y ∈ {1, 2, ..., K}. We define i ∈ {1, 2, ..., K} and m = m_1 + m_2 + ... + m_K. One-Against-All TWSVM (OAA-TWSVM) [37] solves K binary TWSVM problems, where the ith problem takes the ith class as the positive class and all the remaining patterns as the negative class. Let the data for the ith class be represented by X_i and the remaining data points by \bar{X}_i, where X_i ∈ R^{m_i×n}. Here, m_i represents the number of patterns in the ith class. OAA-TWSVM formulates K binary TWSVM problems to obtain K positive hyperplanes, given by

x^T w_i + b_i = 0, \quad i = 1, 2, ..., K.   (1.28)

For each class i (i = 1, 2, ..., K), the positive hyperplane is generated by solving Eq. (1.2), with A = X_i and B = \bar{X}_i. The ith hyperplane thus obtained is proximal to the data points of class i. The constraints require that the hyperplane be at least unit distance away from the patterns of the other (K − 1) classes. The class imbalance problem is taken care of by choosing a proper penalty parameter c_i for the ith class. A test pattern x is assigned the label r (r = 1, 2, ..., K) based on the minimum distance from the hyperplanes given by Eq. (1.28), i.e.

r = \arg\min_{l = 1, ..., K} \; \frac{|x^T w^{(l)} + b^{(l)}|}{\|w^{(l)}\|_2},   (1.29)

where |·| denotes the absolute distance of the point x from the lth hyperplane. OAA-TWSVM is computationally very expensive as it solves K QPPs, each of order O\left(\left(\frac{K-1}{K}m\right)^3\right).
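A minimal sketch of the OAA decision rule (1.29), assuming the K hyperplane parameters are collected in the arrays W and b (the Python/NumPy form is illustrative only), is:

    import numpy as np

    def oaa_predict(x, W, b):
        # Assign x to the class whose hyperplane is nearest in normalized distance;
        # W is a (K, n) array of normals and b a (K,) array of biases, as in (1.29).
        dist = np.abs(W @ x + b) / np.linalg.norm(W, axis=1)
        return int(np.argmin(dist)) + 1                  # class labels 1..K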


1.3.2 Twin-KSVC

Angulo et al. [38] proposed a multi-category classification algorithm, called support

vector classification regression machine for K-class classification (K-SVCR) which

evaluates all the training points into ‘one-versus-one-versus-rest’ structure. Working

on the lines of K-SVCR, Xu et al. [39] proposed Twin-KSVC for multi-category classification. Twin-KSVC presents a TWSVM-like binary classifier, which is extended using a One-Against-One (OAO) multi-category approach. Twin-KSVC selects two

focused classes (A ∈ Rm1×n, B ∈ Rm2×n) from K classes and constructs two non-

parallel hyperplanes. The patterns of the remaining (K − 2) classes (represented

by C ∈ R(m−m1−m2)×n) are mapped into a region between these two hyperplanes.

Here, m, m1 and m2 represent the total number of patterns in the dataset, number

of patterns in positive and negative focused classes respectively. The positive hyper-

plane (w1, b1), as given in Eq.(1.1), is obtained by solving the following problem:

Twin-KSVC:

\min_{w_1, b_1, \xi_1, \eta_1} \; \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + c_1 e_2^T\xi_1 + c_2 e_3^T\eta_1
subject to \; -(Bw_1 + e_2 b_1) + \xi_1 \geq e_2,
\quad\quad\quad\;\; -(Cw_1 + e_3 b_1) + \eta_1 \geq e_3(1 - \varepsilon),
\quad\quad\quad\;\; \xi_1 \geq 0, \; \eta_1 \geq 0.   (1.30)

The constraints require that the positive hyperplane be at least unit distance away from the patterns of the negative class (represented by B) and at least (1 − ε) distance away from the rest of the patterns (represented by C). Here, ε is a very small user-defined value. The second and third terms of the objective function try to minimize the errors due to misclassification of patterns belonging to B and C, represented by ξ_1 and η_1 respectively. The other problem of Twin-KSVC is defined analogously.

1.4 Brief Introduction to Image Processing

Machine learning algorithms have applications in diverse fields like medical diagnosis, weather forecasting, pattern recognition, etc. To demonstrate the practical applicability of our work, we have extended it to perform various image processing tasks like image classification, content-based image retrieval, image segmentation, hand-written digit recognition, etc.

1.4.1 Content-based Image Retrieval

With the increase in the number of digital images, content-based image retrieval (CBIR) has become an active area of research. Due to the large number of digital images available on the Internet, efficient indexing and searching become essential.

essential. The task of finding pertinent images is a challenge posed to the researchers

in various domains like medical imaging, remote sensing, crime prevention, publish-

ing, architecture, etc. CBIR uses low-level features like color, texture, shape, spatial

layout etc. along with semantic features for indexing of images. Texture is an ef-

fective visual feature that captures the intrinsic surface characteristics of an image

and states its relationship with the surrounding environment. It can also describe the structural arrangement of a region in the object. Among all visual features, shape-based features are the most important, as they correspond to the human perception

of an object. Objects can be recognized solely from their shapes.

1.4.2 Image Classification

Image classification is a multi-category classification problem. The classifier model

is trained using a set of images and it can predict the class label for an unseen image.

Similar to CBIR, image classification algorithms also use low-level image features like color, texture, shape or spatial location.

1.4.3 Image Segmentation through Pixel Classification

Pixel classification is the task of identifying regions in the image and associating

each image pixel with one of those regions. It can be regarded as a segmentation

problem since the image is partitioned into non-overlapping regions that share cer-

tain homogeneous features. The image features used for this work are discussed in the Appendix.

1.5 Contribution of the Thesis

This research work is presented in the form of chapters, and their major contributions are summarized below.


Chapter 2

Peng et al. proposed ν-TWSVM [18], which is developed on the lines of ν-SVM

[40, 41]. The parameter ν in ν-TWSVM controls the bounds on the number of

support vectors. It requires that the patterns of a class be at least ρ distance away from the hyperplane of the other class, where ρ is optimized in the primal problem

involved therein. Taking motivation from ν-TWSVM, Improvements on ν-Twin

Support Vector Machine (Iν-TWSVM) has been developed which solves a smaller-

sized QPP and an unconstrained minimization problem (UMP), instead of solving

a pair of QPPs as done by ν-TWSVM and various other TWSVM-based classifiers.

The contribution of this work is to improve the time-complexity of TWSVM-based

classifiers, while achieving comparable classification accuracy. For the linear case, the hyperplane for one of the twin problems of Iν-TWSVM is obtained by solving a UMP in the feature dimension, while ν-TWSVM solves a QPP with constraints defined by the number of data points in the other class. Hence, Iν-TWSVM solves a simpler optimization problem and has a more efficient learning time than ν-TWSVM. The second

version of the classifier, termed as Iν-TWSVM (Fast), modifies the first problem of

Iν-TWSVM as minimization of a unimodal function, for which line search methods

can be used; this further avoids solving the QPP. Hence, Iν-TWSVM (Fast) is a

faster version of our classifier. This chapter also presents a multi-category extension of Iν-TWSVM and its application to image segmentation.

Chapter 3

This chapter presents two novel TWSVM-based classifiers termed as Angle-based

Twin Parametric-Margin Support Vector Machine (ATP-SVM) and Angle-based

Twin Support Vector Machine (ATWSVM). Both of these classifiers make use of

angle between normal vectors to the hyperplanes to maximize the separation be-

tween the two classes.

ATP-SVM determines two nonparallel parametric-margin hyperplanes, such that

the angle between their normals is maximized. Unlike most TWSVM-based classi-

fiers, ATP-SVM solves only one modified QPP with a smaller number of representative patterns. Further, it avoids the explicit computation of matrix inverses in the dual and has an efficient learning time. Although only one QPP is solved in ATP-SVM, it still manages to attain speed comparable to that of TWSVM. ATP-SVM results in faster execution than other classifiers based on a single optimization problem and can efficiently handle heteroscedastic noise.

ATWSVM presents a generic classification model, where the first problem can

be formulated using any TWSVM-based classifier and the second problem is an

unconstrained minimization problem (UMP) which is reduced to solving a system

of linear equations. The second hyperplane is determined so that it is proximal to its

own class and the angle between the normals to the two hyperplanes is maximized.

The notion of angle has been introduced to have maximum separation between the

two hyperplanes. In this thesis, we have presented two versions of ATWSVM: one that solves a QPP and a UMP, and a second which formulates both problems as UMPs.

Chapter 4

This chapter presents a classifier termed Ternary Support Vector Machine (TerSVM) and its tree-based multi-category classification approach termed Reduced Tree for

Ternary Support Vector Machine (RT-TerSVM). The novel classifier is motivated by

Twin Multi-class Support Vector Classification (Twin-KSVC) and can handle three-

class classification problems by determining three proximal nonparallel hyperplanes.

The data patterns are evaluated for ternary outputs (+1,−1, 0). The optimiza-

tion problems of TerSVM are formulated as unconstrained minimization problems (UMPs), which lead to solving systems of linear equations.

Our multi-category classification algorithm (i.e. RT-TerSVM) presents a novel approach to extend the ternary classifier TerSVM into a multi-category framework. For a K-class problem, RT-TerSVM constructs the classifier model in the form of a ternary tree of height ⌊K/2⌋, where the data is partitioned into three groups

at each level. Our algorithm is termed as reduced because it uses a novel proce-

dure to identify a reduced training set which further improves the learning time.

Numerical experiments performed on synthetic and benchmark datasets indicate

that RT-TerSVM outperforms other classical multi-category approaches like One-

Against-All (OAA) and Twin-KSVC, in terms of generalization ability. This chapter

also presents the application of RT-TerSVM for handwritten digit recognition and

color image classification.


Chapter 5

This chapter discusses multi-category classification approaches for nonparallel hyper-

planes classifiers. We have developed a multi-category approach termed as Ternary

Decision Structure (TDS) which is a generic algorithm and can be applied to any

binary classifier, in order to extend it to multi-category framework. For this the-

sis, we have extended TWSVM classifier using TDS. The TDS-TWSVM classifica-

tion algorithm is more efficient than classical multi-category algorithms, in terms of

learning time of classifiers and evaluation time. For a K-class problem, a balanced ternary decision structure requires ⌈log_3 K⌉ comparisons to evaluate a test sample. The experimental results show that TDS-TWSVM outperforms One-Against-All TWSVM (OAA-TWSVM) and Binary Tree-based TWSVM (BT-TWSVM) in terms of classification accuracy. We have shown the efficacy of our algorithm via image classification and, further, image retrieval. Experiments are performed on

a varied range of benchmark image databases with 5-fold cross validation.

Mangasarian et al. proposed Generalized Eigenvalue Proximal SVM (GEPSVM)

[9] which generates two nonparallel hyperplanes. Some variations for GEPSVM have

been proposed like Regularized GEPSVM (RegGEPSVM) [42], Improved GEPSVM

(IGEPSVM) [43]. All these classifiers have been proposed for binary classification

problems. In this chapter, we present a comparative study of four Nonparallel Hyper-

planes Classification Algorithms (NHCAs) - TWSVM, GEPSVM, RegGEPSVM and

IGEPSVM for multi-category classification. The multi-category approaches used for

this thesis are One-Against-All (OAA), Binary Tree-based (BT) and Ternary Deci-

sion Structure (TDS). The experiments are performed on benchmark UCI datasets.

Chapter 6

Motivated by the success of TWSVM as a classifier, we developed a clustering algo-

rithm based on TWSVM, termed as Tree-based Localized Fuzzy Twin Support Vector

Clustering (Tree-TWSVC). Tree-TWSVC is a novel clustering algorithm that builds

the cluster model as a Binary Tree, where each node comprises a novel TWSVM-based classifier, termed Localized Fuzzy TWSVM (LF-TWSVM). Tree-TWSVC has an efficient learning time, achieved due to the tree structure and a formulation that leads to solving a series of systems of linear equations. Tree-TWSVC achieves good clustering accuracy because of the square loss function and the use of a nearest-neighbour-graph-based initialization method. The novel algorithm restricts the cluster

hyperplane from extending indefinitely by using cluster prototype, which further

improves its accuracy. It can efficiently handle large datasets and outperforms other

TWSVM-based clustering methods. In this thesis, we present two implementa-

tions of Tree-TWSVC: Binary Tree-TWSVC and One-Against-One Tree-TWSVC.

To prove the efficacy of our method, experiments are performed on a number of

benchmark UCI datasets. We have also given the application of Tree-TWSVC as an

image segmentation tool.

Chapter 7

This chapter concludes our research work and discusses the advantages of this work. It also includes the future scope of our work and pitfalls to be avoided.


Chapter 2

Improvements on ν-Twin Support Vector Machine

2.1 Introduction

ν-Support Vector Machine (ν-SVM) was proposed by Scholkopf et al. for classifi-

cation and regression problems [40, 41]. This classifier is a modification of Support

Vector Machine (SVM) and is particularly useful when the noise is heteroscedastic,

i.e. the noise strongly depends on the input feature vector. ν-SVM introduced an a priori chosen parameter ν that determines an upper bound on the training error and a lower bound on the number of support vectors (SVs). The concept of ν-SVM was extended to the framework of Twin Support Vector Machine (TWSVM) by Peng, who proposed ν-TWSVM [18]. In TWSVM, the patterns of one class are at least unit distance away from the hyperplane of the other class; this might increase the number of SVs, which leads to poor generalization ability. The parameters ν1, ν2 in ν-TWSVM control the bounds on the number of SVs, similar to ν-SVM, and further the unit distance of TWSVM is modified to a variable ρ, which is optimized in the primal problem involved therein.

In this chapter, we introduce two binary classifiers collectively termed as Im-

provements on ν-Twin Support Vector Machine (Iν-TWSVM). This work is an at-

tempt to improve the time complexity of TWSVM-based classifiers, specifically ν-

TWSVM, while achieving comparable classification accuracy. The remainder of the chapter is organized as follows: Section 2.2 presents “Improvements on ν-Twin Support Vector Machine” and is followed by a discussion of its faster version in Section 2.3.


The multi-category extension of Iν-TWSVM is presented in Section 2.4 and its com-

plexity analysis is given in Section 2.5. The experimental results are discussed in

Section 2.6. The application of Iν-TWSVM for pixel classification is investigated in

Section 2.7 and the chapter is concluded in Section 2.8.

2.2 Improvements on ν-Twin Support Vector Machine

This section introduces two binary classifiers termed as Improvements on ν-Twin

Support Vector Machine (Iν-TWSVM), which solve a smaller-sized QPP or a uni-

modal function as the first problem and a UMP as the second problem. This is in

contrast to ν-TWSVM or any other TWSVM-based classifier, which solve a pair of

identical QPPs. The novelty of this work is that Iν-TWSVM formulates a pair of

asymmetric optimization problems.

Iν-TWSVM has a more efficient learning time as compared to ν-TWSVM. For the linear case, the hyperplane for one of the twin problems of Iν-TWSVM is obtained by

solving a UMP in the feature dimension. Iν-TWSVM minimizes the empirical risk

and tries to generate the hyperplanes that are proximal to the data points of their

respective classes. Our classifier uses a single parameter ν to control the bounds on

the training error and number of support vectors, whereas ν-TWSVM uses two such

parameters - ν1 and ν2.

Structural risk minimization (SRM) principle is a significant property of SVM-

based classifiers [2]. However, ν-TWSVM considers only the empirical risk in its primal problems, and the dual QPPs of ν-TWSVM involve inverses of the matrices (G^T G) and (H^T H), where G = [B \; e_2] and H = [A \; e_1]. In order to obtain the solution of the dual problems, ν-TWSVM must assume that the inverses of the aforementioned matrices always exist and that the matrices are positive semidefinite. Taking motivation from TBSVM [15], we modify the first primal problem of ν-TWSVM by adding a regularization term with the idea of minimizing the structural risk.

Further, on the lines of ν-TWSVM, Iν-TWSVM also deals with ρ distance separa-

bility of classes rather than unit distance as in TWSVM. Therefore, the variable ρ is

optimized by Iν-TWSVM and the corresponding parameter ν bounds the training

error and number of support vectors.


In the following sections, we present two novel classifiers, “Improvements on ν-Twin Support Vector Machine”, namely Iν-TWSVM and Iν-TWSVM (Fast), developed on the lines of TWSVM [10] and further based on ν-TWSVM [18]. (From this point onwards, we refer to the first implementation as Iν-TWSVM and the second as Iν-TWSVM (Fast).) The novelty of this work is the formulation of the second problem as a UMP, which makes Iν-TWSVM more efficient than ν-TWSVM in terms of the learning time of the classifiers.

2.2.1 Iν-TWSVM (Linear classifier)

Working on the lines of ν-TWSVM [18], the first problem of Iν-TWSVM is formu-

lated as a QPP and is given by:

Iν-TWSVM1:

\min_{w_1, b_1, \rho, \xi} \; \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + \frac{c_1}{2}(w_1^T w_1 + b_1^2) + c_2 e_2^T\xi - \nu\rho
subject to \; -(Bw_1 + e_2 b_1) \geq \rho e_2 - \xi,
\quad\quad\quad\;\; \xi \geq 0, \; \rho \geq 0,   (2.1)

where \|\cdot\|_2 denotes the L2-norm.

The QPP in (2.1) determines the hyperplane which is closer to the data points of the positive class (represented by A) and at least ρ distance away from the data points of the negative class (represented by B). The first term in the objective function is similar to that of TWSVM and ν-TWSVM, and thus Iν-TWSVM follows the Empirical Risk Minimization (ERM) principle. Further, Iν-TWSVM also takes into consideration the principle of SRM [44] to improve the generalization ability, by introducing the term (w_1^T w_1 + b_1^2) in the objective function. This regularization term maximizes the margin between the two classes with respect to the plane w_1^T x + b_1 = 0. Here, the margin between the two classes can be expressed as the distance between the plane proximal to the positive class (w_1^T x + b_1 = 0) and the bounding plane (w_1^T x + b_1 = -\rho). This distance is \rho/\|w_1\|_2 and is the margin between the two classes with respect to the plane w_1^T x + b_1 = 0. The extra term b_1^2 is motivated by the Proximal Support Vector Machine [8]. Let X = [x^T, 1]^T and W_1 = [w_1, b_1]^T; then, in R^{n+1}, the bounding plane is X^T W_1 = -\rho and the margin is \rho/\|W_1\|_2, i.e. \rho/\sqrt{\|w_1\|_2^2 + b_1^2}. Thus, the distance between the two classes is maximized with respect to both the orientation (w_1) and the relative location of the plane (b_1) from the origin.

The constraints require that the hyperplane be at least ρ distance away from the data points of the negative class. Iν-TWSVM is defined as a soft-margin classifier; thus we use error variables ξ, which measure the amount of violation of the first set of constraints. Our formulation tries to minimize the sum of the error variables and maximizes the distance ρ, so that the hyperplane is as far as possible from the data points of the negative class. The positive-valued constants c_1 and ν are the weights given to the regularization term and the distance ρ respectively; c_2 is the penalty weight for the error variable ξ.

The Lagrangian corresponding to Iν-TWSVM1 (2.1) is given by

L(w_1, b_1, \rho, \xi, \alpha, \beta, \gamma) = \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + \frac{c_1}{2}(w_1^T w_1 + b_1^2) + c_2 e_2^T\xi - \nu\rho
\quad\quad - \alpha^T(-(Bw_1 + e_2 b_1) - \rho e_2 + \xi) - \beta^T\xi - \gamma\rho, \quad \alpha, \beta, \gamma \geq 0,   (2.2)

where α = (α_1, α_2, ..., α_{m_2})^T, β = (β_1, β_2, ..., β_{m_2})^T and γ are Lagrange multipliers of dimensions (m_2 × 1), (m_2 × 1) and (1 × 1) respectively. The Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions [13] for (Iν-TWSVM1) are given by

\frac{\partial L}{\partial w_1} = 0 \Rightarrow A^T(Aw_1 + e_1 b_1) + c_1 w_1 + B^T\alpha = 0,   (2.3)
\frac{\partial L}{\partial b_1} = 0 \Rightarrow e_1^T(Aw_1 + e_1 b_1) + c_1 b_1 + e_2^T\alpha = 0,   (2.4)
\frac{\partial L}{\partial \rho} = 0 \Rightarrow e_2^T\alpha - \nu - \gamma = 0,   (2.5)
\frac{\partial L}{\partial \xi} = 0 \Rightarrow c_2 - \alpha - \beta = 0,   (2.6)
-(Bw_1 + e_2 b_1) - \rho e_2 + \xi \geq 0,   (2.7)
\xi \geq 0, \; \rho \geq 0, \; \alpha \geq 0, \; \beta \geq 0, \; \gamma \geq 0,   (2.8)
\alpha^T(-(Bw_1 + e_2 b_1) - \rho e_2 + \xi) = 0,   (2.9)
\beta^T\xi = 0,   (2.10)
\gamma\rho = 0.   (2.11)

Since γ ≥ 0, from (2.5) we have
e_2^T\alpha \geq \nu,   (2.12)
and since β ≥ 0, from (2.6) we have
0 \leq \alpha \leq c_2.   (2.13)

We define the augmented matrices H = [A \; e_1], G = [B \; e_2] and z_1 = [w_1, b_1]^T; by combining (2.3) and (2.4), we get
H^T H z_1 + c_1 z_1 + G^T\alpha = 0,   (2.14)
which leads to
z_1 = -(H^T H + c_1 I)^{-1} G^T\alpha.   (2.15)

In some situations, the inverse of the matrix H^T H may not exist; the parameter c_1 can then be tuned to take care of the problems that arise due to ill-conditioning of H^T H. Here, I is an identity matrix of appropriate dimensions. The value of α can be determined from the Wolfe dual [13] of (Iν-TWSVM1), which is given by

\max_{\alpha} \; -\frac{1}{2}\alpha^T G (H^T H + c_1 I)^{-1} G^T\alpha
subject to \; e_2^T\alpha \geq \nu, \quad 0 \leq \alpha \leq c_2.   (2.16)
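As an illustration, the dual (2.16) can be handed to any off-the-shelf QP routine; the sketch below uses SciPy's SLSQP solver (an assumption made for this sketch, not the MATLAB implementation used for the experiments in this thesis) and then recovers z_1 from (2.15):

    import numpy as np
    from scipy.optimize import minimize

    def solve_inu_twsvm1(A, B, c1, c2, nu):
        # Solve the dual (2.16) for alpha with a generic QP routine (SLSQP) and
        # recover z1 = [w1; b1] via (2.15).  A and B hold the two classes row-wise.
        m1, m2 = len(A), len(B)
        H = np.hstack([A, np.ones((m1, 1))])             # H = [A e1]
        G = np.hstack([B, np.ones((m2, 1))])             # G = [B e2]
        Q = G @ np.linalg.solve(H.T @ H + c1 * np.eye(H.shape[1]), G.T)
        obj = lambda a: 0.5 * a @ Q @ a                  # maximizing -1/2 a^T Q a
        cons = [{'type': 'ineq', 'fun': lambda a: a.sum() - nu}]   # e2^T alpha >= nu
        res = minimize(obj, x0=np.full(m2, nu / m2), method='SLSQP',
                       bounds=[(0.0, c2)] * m2, constraints=cons)
        alpha = res.x
        z1 = -np.linalg.solve(H.T @ H + c1 * np.eye(H.shape[1]), G.T @ alpha)
        return z1[:-1], z1[-1], alpha                    # w1, b1, alpha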

The significant contribution of this work is the formulation of the second problem, which determines the hyperplane corresponding to the negative class. The second hyperplane of Iν-TWSVM is obtained by solving a UMP, whose formulation is given by

Iν-TWSVM2:

\min_{w_2, b_2} \; \frac{1}{2}\|Bw_2 + e_2 b_2\|_2^2 + \frac{c_3}{2}(w_2^T w_2 + b_2^2) - \frac{c_4}{2}\|[M_A \; 1][w_2^T, b_2]^T - \rho\|_2^2.   (2.17)

In (2.17), we find the hyperplane which is proximal to the data points of the negative class while, at the same time, being at least ρ distance away from the representative (i.e. mean) of the positive class. Instead of maximizing the distance of all the data points of the positive class from the negative-class hyperplane (as considered in TWSVM2), we maximize its distance from the mean of the positive class. Here, M_A is the mean of the matrix A, with dimension (1 × n), and is regarded as the representative of the positive class. Hence, the size of the problem is reduced, as we are dealing with an unconstrained optimization problem. The positive constants c_3 and c_4 associate weights with the corresponding terms.

To find the value of ρ, we use (2.9), (2.10) and (2.13); i.e. we get all the indices i of the negative class B where ε < α_i < c_2. Here, ε is selected to be a very small value, of order not more than 10^{-5}. All such data points correspond to support vectors of the positive hyperplane of Iν-TWSVM. For these support vectors, β_i ≠ 0 and hence, by using (2.10), ξ_i = 0. These patterns lie on the bounding plane (w_1^T x + b_1 = -\rho). For all such indices i (1 ≤ i ≤ m_2),

-(B_i w_1 + b_1) = \rho.   (2.18)

Further, we take the mean of -(B_i w_1 + b_1) over these indices to get the value of ρ. We define P = [M_A \; 1], z_2 = [w_2, b_2]^T, and (2.17) can be written as

\min_{z_2} \; L = \frac{1}{2} z_2^T G^T G z_2 + \frac{c_3}{2} z_2^T z_2 - \frac{c_4}{2}\|P z_2 - \rho\|_2^2.   (2.19)

The above function is minimized by differentiating with respect to z_2 and equating to zero, i.e.

\nabla_{z_2} L = 0, \quad \text{or} \quad G^T G z_2 - c_4 P^T(P z_2 - \rho) + c_3 z_2 = 0.   (2.20)

From (2.20), we obtain

z_2 = c_4\left(-(G^T G + c_3 I) + c_4 P^T P\right)^{-1} P^T\rho.   (2.21)
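Since (2.21) is a closed-form expression, the second hyperplane reduces to a single linear solve; a minimal NumPy sketch (illustrative only, with ρ obtained as described above) is:

    import numpy as np

    def solve_inu_twsvm2(B, mean_A, rho, c3, c4):
        # Compute z2 = [w2; b2] from the closed form (2.21).
        G = np.hstack([B, np.ones((len(B), 1))])         # G = [B e2]
        P = np.append(mean_A, 1.0).reshape(1, -1)        # P = [M_A 1]
        M = -(G.T @ G + c3 * np.eye(G.shape[1])) + c4 * (P.T @ P)
        z2 = c4 * np.linalg.solve(M, P.ravel() * rho)    # M z2 = c4 * P^T rho
        return z2[:-1], z2[-1]                           # w2, b2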

The augmented vectors z_1 and z_2 can be obtained from (2.15) and (2.21) respectively, and are used to generate the hyperplanes given in (1.1). A new data sample x ∈ R^n is assigned to class r (r = 1, 2) depending on which of the two hyperplanes given by (1.1) it lies closer to, i.e.

r = \arg\min_{l = 1, 2} \; \frac{|x^T w^{(l)} + b^{(l)}|}{\|w^{(l)}\|_2},   (2.22)

where |·| denotes the absolute distance of the point x from the plane x^T w^{(l)} + b^{(l)} = 0. The label assigned to the test data is y = +1 if r = 1 and y = -1 if r = 2.

2.2.2 Iν-TWSVM (Kernel classifier)

In order to extend the results to non-linear classifiers, the kernel-generated surfaces

are considered instead of hyperplanes, as discussed for TWSVM and are given by

(1.9-1.10). The primal QPP of the non-linear Iν-TWSVM corresponding to these

surfaces is given by


KIν-TWSVM1:

\min_{u_1, b_1, \rho, \xi} \; \frac{1}{2}\|Ker(A, C^T)u_1 + e_1 b_1\|_2^2 + \frac{c_1}{2}(u_1^T u_1 + b_1^2) + c_2 e_2^T\xi - \nu\rho
subject to \; -(Ker(B, C^T)u_1 + e_2 b_1) \geq \rho e_2 - \xi,
\quad\quad\quad\;\; \xi \geq 0, \; \rho \geq 0.   (2.23)

The Wolfe dual of (KIν-TWSVM1) is given by

\max_{\alpha} \; -\frac{1}{2}\alpha^T R (S^T S + c_1 I)^{-1} R^T\alpha
subject to \; e_2^T\alpha \geq \nu, \quad 0 \leq \alpha \leq c_2,   (2.24)

where S = [Ker(A, C^T) \; e_1] and R = [Ker(B, C^T) \; e_2]. The augmented vector v_1 = [u_1, b_1]^T is determined as

v_1 = -(S^T S + c_1 I)^{-1} R^T\alpha.   (2.25)

The second problem of Iν-TWSVM, corresponding to the surface (1.10), is defined as

KIν-TWSVM2:

\min_{u_2, b_2} \; \frac{1}{2}\|Ker(B, C^T)u_2 + e_2 b_2\|_2^2 + \frac{c_3}{2}(u_2^T u_2 + b_2^2) - \frac{c_4}{2}\|[M_{KerA} \; 1][u_2^T, b_2]^T - \rho\|_2^2.   (2.26)

Here, M_{KerA} is the mean of the matrix Ker(A, C^T). By differentiating (2.26) with respect to [u_2, b_2]^T and equating to zero, we get

v_2 = c_4\left(-(R^T R + c_3 I) + c_4 P_k^T P_k\right)^{-1} P_k^T\rho,   (2.27)

where P_k = [M_{KerA} \; 1] and v_2 is the augmented vector v_2 = [u_2, b_2]^T. Once (KIν-TWSVM1) and (KIν-TWSVM2) are solved to obtain the kernel-generated surfaces, a new test pattern x ∈ R^n is assigned to class 1 or -1 in a manner similar to the linear case.


2.3 Improvements on ν-Twin Support Vector Machine (Fast)

The second variant of our classifier, i.e. Iν-TWSVM (Fast), modifies the first problem of Iν-TWSVM as the minimization of a unimodal function for which line search methods

can be used; this further avoids solving the QPP in the first problem. The other

problem is formulated as a UMP, similar to Iν-TWSVM. Hence, Iν-TWSVM (Fast)

is a faster version of our work. It is experimentally proved to be more time-efficient

than ν-TWSVM and Iν-TWSVM.

The first hyperplane of Iν-TWSVM (Fast) is obtained by solving the following

primal problem.

Iν-TWSVM1 (Fast):

\min_{w_1, b_1, \rho, \xi} \; \frac{1}{2}\|Aw_1 + e_1 b_1\|_2^2 + \frac{c_1}{2}(w_1^T w_1 + b_1^2) + c_2\xi - \nu\rho
subject to \; -(M_B w_1 + b_1) \geq \rho - \xi,
\quad\quad\quad\;\; \xi \geq 0, \; \rho \geq 0,   (2.28)

where \|\cdot\|_2 denotes the L2-norm.

The QPP in (2.28) is similar to the QPP in (2.1), but instead of considering all the data points of the negative class (represented by B), it takes a representative of the negative class, M_B. Since the number of constraints in the QPP depends on the number of data points in the other class, we can significantly reduce the size of the QPP by taking a representative of the class, thereby lessening the number of constraints. This class representative could be some statistical measure of the data points, like the mean, depending on the data. For our implementation, we considered the 'mean of the data points' as the class representative. In (2.28), ξ has dimension (1 × 1) and M_B is (1 × n) for the linear case. All the symbols have the same meaning as defined at the beginning of this section.

section. The Wolfe dual of (2.28) is given by

minν≤α≤c2

1

2αTG(HTH + c1I)−1GTα, (2.29)


where α is a real-valued decision variable. The augmented matrices are given as H = [A \; e_1] and G = [M_B \; 1]. The dual problem (2.29) is a convex optimization problem of the form

\min_{a \leq x \leq b} \; f(x),   (2.30)

where f : [a, b] → R is a unimodal function [45, 46]. Such one-dimensional optimization problems can be solved efficiently by line search methods such as the golden section rule or the Fibonacci search method. Therefore, in Iν-TWSVM (Fast), we can avoid solving the QPP, which results in a very efficient learning time.
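A minimal sketch of a golden-section search that could be applied to the scalar dual (2.29) is given below; the Python form, the bracketing interval and the tolerance are illustrative assumptions, not the thesis implementation:

    import numpy as np

    def golden_section_min(f, a, b, tol=1e-6):
        # Golden-section search for a unimodal function f on the interval [a, b].
        invphi = (np.sqrt(5.0) - 1.0) / 2.0              # 1/phi, about 0.618
        x1, x2 = b - invphi * (b - a), a + invphi * (b - a)
        f1, f2 = f(x1), f(x2)
        while (b - a) > tol:
            if f1 < f2:                                  # minimum lies in [a, x2]
                b, x2, f2 = x2, x1, f1
                x1 = b - invphi * (b - a)
                f1 = f(x1)
            else:                                        # minimum lies in [x1, b]
                a, x1, f1 = x1, x2, f2
                x2 = a + invphi * (b - a)
                f2 = f(x2)
        return 0.5 * (a + b)

    # Example: since G has a single row in (2.29), the dual objective reduces to the
    # scalar function 0.5 * q * alpha**2 with q = G (H^T H + c1 I)^{-1} G^T, and
    # alpha_opt = golden_section_min(lambda a: 0.5 * q * a * a, nu, c2).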

The hyperplane corresponding to the negative class B can be obtained as given by Iν-TWSVM2. Iν-TWSVM1 (Fast) can be extended to the non-linear case by defining kernels as given in (1.9) and (1.10). Using the 'mean of the data' as the class representative gives our algorithm the flexibility to be extended to the non-linear version: it is easy to find the mean in kernel space, whereas other statistical representations of the data, such as the median, are difficult to compute in a high-dimensional space.

2.4 Multi-category Extensions of Iν-TWSVM

Most of the SVM-based classifiers are formulated for binary classification problems; to use these classifiers for real-world problems, they must be extended to a multi-category framework. In this work, the classifier Iν-TWSVM is extended using two multi-category approaches: One-Against-All and a Binary Tree of classifiers.

2.4.1 One-Against-All Iν-TWSVM

For a K-class classification problem, the OAA multi-category approach constructs K binary Iν-TWSVM classifiers. Here, each classifier consists of a pair of nonparallel hyperplanes. The ith classifier (i = 1 to K) is obtained by considering all patterns of the ith class as the positive class, while the rest of the patterns constitute the negative class. With m data patterns ((x_j, y_j), j = 1 to m), the matrices A = \{x_p : y_p = i\} and B = \{x_q : y_q \neq i\} are taken for the ith problem. The patterns of A and B are assigned labels +1 and -1 respectively. These data are used as input for Iν-TWSVM in (2.1) or (2.23) to get linear or non-linear classifiers respectively. A new pattern x ∈ R^n is tested on the lines of OAA-TWSVM (please refer to Section 1.3.1).


2.4.2 Binary Tree of Iν-TWSVM

Binary Tree (BT) of classifiers is a multi-category classification approach which is

motivated by the half-against-half (HAH) multi-class SVM [35] (for details on the Binary Tree algorithm, please refer to Section 5.2.1). HAH randomly partitions the classes into two groups and constructs an SVM classifier that separates the groups. Unlike HAH,

BT identifies two groups of classes by K-Means clustering [25] (with k = 2) and

generates Iν-TWSVM hyperplanes. The multi-category classifier model is built by

recursively partitioning the training data. At each level of the Binary Tree, training

data is partitioned into two groups by applying K-Means (k=2) clustering [47] and

the hyperplanes are determined for the two groups using Iν-TWSVM classifiers;

use (2.1),(2.17) or (2.23),(2.26) to get linear or non-linear classifiers respectively.

This process is repeated until no further partitioning is possible. The BT classifier model thus obtained can be used to assign labels to the test patterns. At each level of the classifier tree, the distance of the new pattern is calculated from both hyperplanes and it is associated with the nearest hyperplane, as given in (2.22). This process is repeated until a leaf node is reached, and the label of the leaf node is assigned to the test pattern. BT determines (K−1) classifiers for a K-class problem, but the size of the problem diminishes as we traverse down the Binary Tree. For testing, BT requires at most ⌈log_2 K⌉ binary evaluations. BT has better generalization ability than the OAA multi-category algorithm.
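A sketch of the recursive Binary Tree construction is given below; train_binary_classifier stands for the binary Iν-TWSVM training of (2.1) and (2.17), k_means is the routine sketched in Section 1.2.1, and, for simplicity of the sketch, the two groups are formed by clustering the class means rather than the raw training data:

    import numpy as np

    def build_bt(X, y, train_binary_classifier, k_means):
        # Recursively build a Binary Tree of binary classifiers over the class set.
        classes = np.unique(y)
        if len(classes) == 1:
            return {'leaf': int(classes[0])}             # a single class: leaf node
        # group the classes into two clusters by applying K-Means (k = 2) to the class means
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        grp, _ = k_means(means, 2)
        left, right = classes[grp == 0], classes[grp == 1]
        if len(left) == 0 or len(right) == 0:            # guard against an empty group
            left, right = classes[:1], classes[1:]
        mask = np.isin(y, left)
        node = {'clf': train_binary_classifier(X[mask], X[~mask]),
                'left_classes': left, 'right_classes': right}
        node['left'] = build_bt(X[mask], y[mask], train_binary_classifier, k_means)
        node['right'] = build_bt(X[~mask], y[~mask], train_binary_classifier, k_means)
        return node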

2.5 Discussion

Complexity Analysis: The main contribution of this work is to reduce the com-

plexity of ν-TWSVM-based classifiers. Iν-TWSVM determines two nonparallel classifiers based on a QPP and a UMP, with the smaller-sized QPP solved first. This classifier is symmetric in the sense that whichever class has the larger number of patterns is the one for which the QPP is solved (as the size of the QPP is determined by the number of patterns appearing in the constraints, i.e. the number of patterns in the other class), and for the other class the UMP is solved. By doing so, we formulate a QPP with a smaller number of constraints, because the order of the QPP depends on the number of patterns in the other class. Thus, the user is given the flexibility of selecting the problem to be solved first.

Let the number of patterns in the positive and negative classes be m_1 and m_2 respectively (without loss of generality, m_1 > m_2). Then the complexity of finding the two ν-TWSVM classifiers, by solving a pair of QPPs, is of the order of (m_1^3 + m_2^3) for the linear version. For Iν-TWSVM, the hyperplane corresponding to the positive class is determined first by solving a QPP, whose complexity is of the order of m_2^3. The second problem is a UMP and its complexity is no more than n^3 (for the linear classifier), since we solve the problem in the primal space using a least squares formulation. Hence, the complexity of Iν-TWSVM is of the order of (m_2^3 + n^3), which suggests that Iν-TWSVM requires less learning time than ν-TWSVM.
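For a rough illustration with hypothetical class sizes (not values taken from the experiments): if m_1 = 700, m_2 = 300 and n = 20, the pair of QPPs of ν-TWSVM costs on the order of m_1^3 + m_2^3 ≈ 3.7 × 10^8 operations, whereas the QPP-plus-UMP formulation of Iν-TWSVM costs on the order of m_2^3 + n^3 ≈ 2.7 × 10^7, roughly an order of magnitude fewer.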

Limitation of Iν-TWSVM: For the linear version, if the mean is a good representative of the class data, then the performance of Iν-TWSVM or Iν-TWSVM (Fast) is in line with that of other TWSVM-based classifiers. Further, the presence of outliers could make the mean a poor choice of data representative. Various methods for outlier detection are discussed by Hodge et al. [48]; if one is convinced that there are few outliers in the dataset, then the mean can be used as the representative. Therefore, in order to explore the usefulness of Iν-TWSVM, we need to first discard outliers from the data. The results for the non-linear case would depend on how well the data is represented by the mean of the kernel matrix of the class patterns. However, the experimental results discussed in Section 2.6 establish the competence of Iν-TWSVM with the 'mean of the data' as the class representative.

2.6 Experimental Results

To evaluate the performance of Iν-TWSVM and Iν-TWSVM (Fast), they are com-

pared with TBSVM [15] and ν-TWSVM [18] regarding classification accuracy and

computational efficiency. In order to control the bias and over-fitting, the experi-

ments are performed using 10-fold cross validation [49]. The average classification accuracy over the 10 folds is calculated for each dataset. All the experiments presented in this thesis are performed in MATLAB version 8.0 under a Microsoft Windows environment on a machine with a 3.40 GHz CPU and 16 GB RAM. (For details regarding the datasets, please refer to Appendices C and D.)


Parameter Setting

For conducting the experiments, an important preliminary task is to obtain the optimal values of the parameters. For all the experimental results presented in this thesis, we have applied the grid search method [50] to tune the parameters. For each dataset, a validation set comprising 10% randomly selected samples from the dataset is used. For this work, we have selected the values of c_1 and c_3 from \{10^i : i = -5, ..., -1\}. The parameters ν, c_2 and c_4 are tuned in the range 0.1 to 1.
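A minimal sketch of this grid search is given below; train_and_score is a placeholder that trains the classifier on the training split with the given parameters and returns the validation accuracy (the Python form is an assumption of the sketch, not the thesis code):

    import itertools
    import numpy as np

    def grid_search(train_and_score):
        # Exhaustive grid search over the parameter ranges used in this thesis.
        c_reg = [10.0 ** i for i in range(-5, 0)]        # c1, c3 in {1e-5, ..., 1e-1}
        weights = np.arange(0.1, 1.01, 0.1)              # nu, c2, c4 in [0.1, 1]
        best, best_acc = None, -np.inf
        for c1, c3 in itertools.product(c_reg, c_reg):
            for nu, c2, c4 in itertools.product(weights, weights, weights):
                acc = train_and_score(c1=c1, c2=c2, c3=c3, c4=c4, nu=nu)
                if acc > best_acc:
                    best, best_acc = (c1, c2, c3, c4, nu), acc
        return best, best_acc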

2.6.1 Synthetic Datasets

The efficiency of Iν-TWSVM has been tested on the cross-planes dataset [8] and the linearly inseparable two moons dataset.

Figure 2.1: Two moons dataset: Classification result with Iν-TWSVM

Two Moons Dataset

The two moons dataset [51], as shown in Fig. 2.1, consists of 200 data points in R^2; to rigorously test the four classifiers, i.e. TBSVM, ν-TWSVM, Iν-TWSVM and Iν-TWSVM (Fast), we added Gaussian noise with an SNR of 15. The classification results for all four approaches are given in Table 2.1. The Iν-TWSVM classifier (using a non-linear Gaussian kernel) is able to achieve 98.5% classification accuracy and the resulting decision boundary seems intuitively satisfying.
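Noisy two moons data can be generated along the following lines; this is a NumPy sketch in which the noise standard deviation is derived from the target SNR, and it is not claimed to reproduce the exact generator of [51]:

    import numpy as np

    def two_moons(m=200, snr_db=15, seed=0):
        # Two interleaving half-circles in R^2 with additive Gaussian noise.
        rng = np.random.default_rng(seed)
        t = rng.uniform(0, np.pi, m // 2)
        upper = np.column_stack([np.cos(t), np.sin(t)])               # class +1
        lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])     # class -1
        X = np.vstack([upper, lower])
        y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])
        signal_power = np.mean(X ** 2)
        noise_std = np.sqrt(signal_power / (10 ** (snr_db / 10)))     # SNR taken in dB
        return X + rng.normal(0, noise_std, X.shape), y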


Table 2.1: Classification accuracy for synthetic datasets

Dataset       | TBSVM Acc (%) | ν-TWSVM Acc (%) | Iν-TWSVM Acc (%) | Iν-TWSVM (Fast) Acc (%)
Two moons     | 96            | 97.5            | 98.5             | 96.5
Cross-planes  | 91.5          | 84.5            | 93.5             | 92.5

Figure 2.2: The hyperplanes obtained for the cross-planes dataset: (a) Iν-TWSVM, (b) Iν-TWSVM (Fast), (c) ν-TWSVM, (d) TBSVM.

Cross-planes Dataset

To evaluate the effectiveness of the Iν-TWSVM, we tested its ability to learn the

cross-planes dataset. The cross-planes dataset is generated by perturbing the points

lying on two intersecting lines in R2. Figure 2.2 graphically illustrates the simu-

lation results, with linear kernel, for Iν-TWSVM, Iν-TWSVM (Fast), ν-TWSVM

and TBSVM with 200 training points. The red and blue dots represent the data

points of two classes. The figure demonstrates that the classifier Iν-TWSVM can

effectively generate the hyperplanes for the two classes. The classification results on

synthetic datasets are given in Table 2.1.

2.6.2 Binary Classification Results: UCI and Exp-NDC datasets

To prove the competence of our work, we performed classification experiments on

a variety of benchmark datasets. These include datasets from University of Cali-


fornia, Irvine Machine Learning Repository (UCI) and Exp-Normally Distributed

Clusters (Exp-NDC) datasets which are commonly used in testing machine-learning

algorithms. The samples are normalized before learning such that the features are

located in the range [0,1]. In our simulations, we performed experiments with linear

and Gaussian kernel to obtain the classifiers. We have selected twelve imbalanced

UCI binary datasets [52] for the experiments, as listed in Table 2.2. The table also

shows the number of samples in positive and negative classes as (NP : NN ).

It is interesting to study the learning time of the algorithms as the number of data points increases. This is done by performing experiments on the Exp-NDC datasets, and the results are discussed in the following section.

Classification Results for UCI Datasets (Linear)

The classification results using the linear classifier for Iν-TWSVM, Iν-TWSVM (Fast), ν-TWSVM and TBSVM are reported in Table 2.2. The mean accuracy (in %) across the 10 folds is reported along with the standard deviation. The table demonstrates

that Iν-TWSVM outperforms the other algorithms in terms of classification accu-

racy. The mean accuracy of Iν-TWSVM, for all the datasets, is 83.91% as compared

to 83.57% and 83.01% for TBSVM and ν-TWSVM respectively. Iν-TWSVM is able

to achieve the maximum classification accuracy for most of the UCI datasets through

linear classifier.

This work presents an improved version of ν-TWSVM, with the intention of developing a classifier with an efficient learning time. Iν-TWSVM solves a smaller-sized QPP and a UMP; thus, we can intuitively suggest that Iν-TWSVM is more efficient in terms of learning time as compared to ν-TWSVM and TBSVM. To validate this point, we simulated the experiments on the UCI datasets, and Table 2.2 presents the learning time (in 10^{-3} sec) of all four methods, i.e. TBSVM, ν-TWSVM, Iν-TWSVM and Iν-TWSVM (Fast). The learning time is recorded as the average CPU time with 10-fold cross validation. It is observed that Iν-TWSVM (Fast) is the most time-efficient of all the approaches. Iν-TWSVM (Fast) can be used with data of enormous size, though the accuracy achieved may be slightly lower than that of Iν-TWSVM.


Table 2.2: Classification results with linear classifier on UCI datasets
(for each dataset, the first row gives the mean accuracy (%) ± SD and the second row the learning time × 10^{-3} sec)

Dataset (size) NP:NN               | TBSVM        | ν-TWSVM      | Iν-TWSVM      | Iν-TWSVM (Fast)
Heart-Statlog (270 × 13) 150:120   | 85.56 ± 2.88 | 85.19 ± 2.68 | 85.93* ± 2.90 | 85.79 ± 3.72
                                   | 3.98         | 5.28         | 3.46          | 0.88
WPBC (698 × 34) 458:240            | 97.28 ± 1.42 | 96.56 ± 2.35 | 96.28 ± 2.87  | 95.99 ± 3.28
                                   | 24.54        | 27.15        | 11.10         | 1.25
PIMA-Indians (768 × 8) 500:268     | 76.31 ± 3.69 | 74.61 ± 4.83 | 76.43 ± 3.64  | 76.68 ± 3.55
                                   | 11.30        | 11.71        | 5.79          | 1.30
CMC (1473 × 9) 844:629             | 67.55 ± 4.05 | 66.26 ± 3.56 | 68.06 ± 3.04  | 67.96 ± 4.34
                                   | 28.47        | 30.40        | 13.30         | 2.09
ACA (690 × 14) 383:307             | 83.27 ± 3.15 | 83.77 ± 4.47 | 87.25 ± 2.03  | 86.67 ± 2.25
                                   | 13.30        | 16.09        | 7.56          | 1.71
Heart-Cleveland (303 × 14) 164:139 | 83.84 ± 2.83 | 82.81 ± 4.76 | 85.14 ± 3.93  | 84.47 ± 3.56
                                   | 4.83         | 5.73         | 2.93          | 0.80
Votes (435 × 16) 267:168           | 95.66 ± 3.77 | 95.21 ± 4.34 | 95.66 ± 3.77  | 95.66 ± 3.77
                                   | 6.89         | 7.66         | 3.70          | 0.76
Sonar (208 × 60) 111:97            | 75.95 ± 5.33 | 76.50 ± 6.74 | 76.38 ± 3.67  | 76.08 ± 3.91
                                   | 4.55         | 6.26         | 2.72          | 1.02
Ionosphere (351 × 34) 225:126      | 86.07 ± 3.34 | 86.63 ± 4.26 | 87.06 ± 2.78  | 86.83 ± 2.45
                                   | 8.03         | 8.17         | 4.36          | 1.87
Two-norm (400 × 20) 351:49         | 98.75 ± 1.32 | 98.50 ± 1.29 | 98.00 ± 1.58  | 98.00 ± 1.58
                                   | 7.84         | 9.36         | 3.32          | 1.49
German (1000 × 20) 700:300         | 71.10 ± 4.38 | 68.60 ± 4.65 | 69.30 ± 4.72  | 71.00 ± 4.32
                                   | 18.74        | 21.57        | 7.85          | 2.18
Thyroid (215 × 5) 150:65           | 81.47 ± 4.49 | 81.47 ± 3.70 | 81.47 ± 3.70  | 81.47 ± 3.70
                                   | 3.98         | 5.13         | 2.03          | 1.11
Avg. acc                           | 83.57 ± 3.39 | 83.01 ± 3.97 | 83.91 ± 3.22  | 83.88 ± 3.37
Avg. time                          | 11.37        | 12.88        | 5.68          | 1.37

* For all the numerical experiment results given in this thesis, the bold values indicate the best result.

Classification Results for Exp-NDC Datasets (Linear)

To study the effect of the number of data points on the learning time of the Iν-TWSVM classifier, we have conducted experiments on large datasets generated using David Musicant's NDC Data Generator [53]. Table 2.3 gives the experimental results of

TBSVM, ν-TWSVM, Iν-TWSVM and Iν-TWSVM (Fast) with linear classifier on

Exp-NDC datasets. The number of training and test patterns are shown for each

dataset as “Train-Test” and (NP : NN ) shows the distribution of positive and nega-

tive samples. All Exp-NDC datasets are imbalanced. The classification accuracy is

reported in percent (%) and the results indicate that Iν-TWSVM outperforms the

other methods for most of the datasets.

Table 2.3 also presents the learning time (in seconds) of linear TWSVM-based

classifiers and it is observed that the learning time of Iν-TWSVM (Fast) is less as

compared to other three methods and the rate of growth of its learning time is much


Table 2.3: Classification results with linear classifier on Exp-NDC datasets
(for each dataset, the first row gives the mean accuracy (%) ± SD and the second row the learning time × 10^{-2} sec)

Dataset (Train-Test) NP:NN           | TBSVM        | ν-TWSVM      | Iν-TWSVM     | Iν-TWSVM (Fast)
Exp-NDC-500 (500-50) 330:170         | 78.00 ± 5.81 | 77.20 ± 5.67 | 78.60 ± 4.12 | 78.60 ± 4.12
                                     | 1.34         | 1.72         | 0.54         | 0.12
Exp-NDC-700 (700-70) 447:253         | 76.57 ± 3.90 | 76.14 ± 3.73 | 77.00 ± 3.46 | 77.00 ± 3.46
                                     | 2.70         | 2.82         | 0.94         | 0.20
Exp-NDC-900 (900-90) 571:329         | 77.22 ± 3.94 | 75.78 ± 4.54 | 78.11 ± 3.63 | 78.11 ± 3.63
                                     | 3.52         | 4.08         | 1.43         | 0.22
Exp-NDC-1K (1K-100) 627:373          | 76.40 ± 4.00 | 74.90 ± 3.35 | 77.10 ± 3.76 | 77.10 ± 3.76
                                     | 4.24         | 5.11         | 1.84         | 0.29
Exp-NDC-2K (2K-200) 1246:754         | 77.90 ± 1.97 | 78.45 ± 1.92 | 78.00 ± 2.12 | 78.00 ± 2.12
                                     | 4.79         | 5.45         | 2.79         | 0.60
Exp-NDC-3K (3K-300) 1860:1140        | 78.23 ± 2.03 | 78.10 ± 2.88 | 78.30 ± 2.02 | 78.10 ± 2.16
                                     | 8.36         | 9.69         | 6.70         | 0.65
Exp-NDC-4K (4K-400) 2474:1526        | 78.25 ± 1.64 | 78.18 ± 1.55 | 78.35 ± 1.52 | 78.23 ± 1.53
                                     | 14.34        | 16.82        | 12.79        | 0.72
Exp-NDC-5K (5K-500) 3086:1914        | 76.78 ± 2.01 | 75.42 ± 2.41 | 76.92 ± 1.99 | 76.92 ± 1.99
                                     | 22.67        | 25.02        | 19.19        | 0.78
Exp-NDC-10K (10K-1K) 6138:3862       | 85.02 ± 0.90 | 84.65 ± 0.93 | 85.04 ± 0.90 | 85.03 ± 0.89
                                     | 116.98       | 124.62       | 95.63        | 1.50
Exp-NDC-50K (50K-5K) 30783:19217     | *            | *            | *            | 77.52 ± 0.55
                                     |              |              |              | 18.27
Exp-NDC-100K (100K-10K) 61648:38352  | *            | *            | *            | 84.20 ± 0.81
                                     |              |              |              | 87.31

* Experiments terminated as the system was out of memory

less than the rate of growth of data size. Thus, Iν-TWSVM (Fast) can be used for

experiments with very large-sized datasets, where other TWSVM-based classifiers

may fail to give results due to memory constraints or very high execution time. The

results could not be obtained for TBSVM, ν-TWSVM and Iν-TWSVM for Exp-NDC-50K and Exp-NDC-100K due to the memory restriction of the Windows environment with 16 GB RAM. However, Iν-TWSVM (Fast) successfully generated the results and can be used on low-configuration systems.

Classification Results for UCI and Exp-NDC Datasets (Non-linear)

Our classifier is extended to non-linear classifiers, and the classification accuracy of all four algorithms on the UCI datasets is reported in Table 2.4. For all the methods, the RBF kernel K(x, x') = \exp(-\sigma\|x - x'\|_2^2) is used. The classification results illustrate that Iν-TWSVM performs best among all the classifiers and that Iν-TWSVM (Fast) takes the minimum CPU time for building the classifier.
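For reference, RBF Gram matrices such as Ker(A, C^T) and Ker(B, C^T) can be computed as in the following NumPy sketch (illustrative only; the thesis experiments were run in MATLAB):

    import numpy as np

    def rbf_kernel(X, C, sigma):
        # K(x, c) = exp(-sigma * ||x - c||^2) for every row of X against every row of C.
        sq = (X ** 2).sum(axis=1)[:, None] + (C ** 2).sum(axis=1)[None, :] - 2 * X @ C.T
        return np.exp(-sigma * np.maximum(sq, 0.0))      # clip tiny negatives from round-off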

The classification accuracy for Exp-NDC datasets is reported in Table 2.5 for


Table 2.4: Classification results with non-linear classifier on UCI datasets
(for each dataset, the first row gives the mean accuracy (%) ± SD and the second row the learning time × 10^{-3} sec)

Dataset          | TBSVM        | ν-TWSVM      | Iν-TWSVM     | Iν-TWSVM (Fast)
Heart-Statlog    | 85.73 ± 1.88 | 85.13 ± 4.57 | 86.30 ± 3.39 | 85.88 ± 3.51
                 | 7.24         | 8.48         | 4.68         | 1.35
WPBC             | 96.56 ± 1.38 | 96.14 ± 2.02 | 96.71 ± 1.66 | 97.14 ± 1.50
                 | 32.42        | 35.48        | 23.24        | 14.35
PIMA-Indians     | 76.45 ± 3.63 | 76.82 ± 4.54 | 76.83 ± 1.96 | 76.66 ± 3.09
                 | 15.34        | 19.10        | 12.46        | 6.60
CMC              | 70.06 ± 2.48 | 71.56 ± 4.28 | 71.83 ± 2.22 | 70.81 ± 3.99
                 | 58.65        | 61.59        | 38.55        | 13.72
ACA              | 86.52 ± 2.65 | 87.39 ± 4.10 | 87.39 ± 2.17 | 87.10 ± 1.99
                 | 21.15        | 28.46        | 19.52        | 9.36
Heart-Cleveland  | 83.16 ± 2.50 | 83.27 ± 3.53 | 83.83 ± 1.71 | 82.20 ± 4.87
                 | 8.75         | 9.86         | 6.36         | 2.60
Votes            | 95.15 ± 2.49 | 96.27 ± 1.63 | 96.54 ± 2.95 | 94.75 ± 4.28
                 | 10.22        | 13.22        | 7.17         | 1.47
Sonar            | 89.45 ± 5.47 | 90.33 ± 7.35 | 89.45 ± 5.47 | 86.55 ± 5.84
                 | 14.11        | 16.53        | 12.78        | 2.28
Ionosphere       | 94.60 ± 3.65 | 93.75 ± 4.78 | 94.60 ± 3.40 | 94.29 ± 3.81
                 | 13.27        | 14.26        | 10.19        | 3.29
Twonorm          | 98.50 ± 1.29 | 98.75 ± 1.32 | 99.00 ± 1.29 | 99.25 ± 1.21
                 | 9.30         | 12.61        | 9.59         | 4.09
German           | 73.80 ± 5.01 | 74.40 ± 3.53 | 77.40 ± 4.97 | 76.30 ± 5.01
                 | 37.90        | 35.44        | 18.79        | 13.72
Thyroid          | 95.91 ± 2.42 | 97.71 ± 3.88 | 98.16 ± 2.37 | 96.23 ± 4.38
                 | 4.72         | 7.53         | 5.32         | 2.46
Accuracy (Mean)  | 87.24 ± 2.90 | 87.67 ± 3.79 | 88.17 ± 2.80 | 87.26 ± 3.62
Time (Mean)      | 19.42        | 21.88        | 14.06        | 6.27

all four methods with the non-linear kernel. The maximum mean accuracy over all Exp-NDC datasets is 83.99%, reported for Iν-TWSVM. Iν-TWSVM (Fast) requires the minimum learning time among all four classifiers. It is observed that, for most of the datasets, the classification results are better with the non-linear classifier than with the linear one.

2.6.3 Statistical Tests

Statistical tests are required for comparing the performance of all four TWSVM-based classifiers, i.e. TBSVM, ν-TWSVM, Iν-TWSVM and Iν-TWSVM (Fast), on multiple datasets. Two such statistical tests are the Friedman test [54] and the Holm-Bonferroni test [55] (please refer to Appendix A).


Table 2.5: Classification results with non-linear classifier on Exp-NDC datasets
(for each dataset, the first row gives the mean accuracy (%) ± SD and the second row the learning time × 10^{-2} sec)

Dataset          | TBSVM        | ν-TWSVM      | Iν-TWSVM     | Iν-TWSVM (Fast)
Exp-NDC-500      | 77.80 ± 6.63 | 78.60 ± 6.04 | 82.00 ± 4.42 | 79.40 ± 4.90
                 | 2.69         | 3.24         | 1.26         | 0.79
Exp-NDC-700      | 78.29 ± 5.79 | 79.86 ± 4.76 | 81.86 ± 4.52 | 82.29 ± 4.52
                 | 4.49         | 4.52         | 2.40         | 1.43
Exp-NDC-900      | 79.00 ± 5.70 | 78.89 ± 5.54 | 83.78 ± 3.71 | 83.78 ± 3.71
                 | 7.86         | 8.58         | 4.67         | 2.35
Exp-NDC-1K       | 83.60 ± 3.53 | 82.00 ± 3.13 | 84.10 ± 3.18 | 84.00 ± 3.80
                 | 9.04         | 9.32         | 6.56         | 3.66
Exp-NDC-2K       | 84.35 ± 2.25 | 88.15 ± 2.42 | 88.20 ± 1.70 | 88.80 ± 2.36
                 | 9.30         | 10.37        | 8.41         | 4.70
Accuracy (Mean)  | 80.61 ± 4.78 | 81.50 ± 4.38 | 83.99 ± 3.51 | 83.65 ± 3.73
Time (Mean)      | 6.68         | 7.21         | 4.66         | 2.59

Friedman Test

The Friedman test on the classification accuracy of all four classifiers on the UCI datasets is given in Table 2.6, where Iν-TWSVM achieves the best (lowest) average rank among the four approaches. The non-linear versions of the four TWSVM-based classifiers are compared in Table 2.7; Iν-TWSVM again achieves the best average rank among the four approaches, which shows that it outperforms the other three methods.

Table 2.6: Friedman test and p-values with linear classifiers for UCI datasets

Dataset | TBSVM Rank (p-value) | ν-TWSVM Rank (p-value) | Iν-TWSVM Rank | Iν-TWSVM (Fast) Rank (p-value)

Heart-Statlog 3 (0.778) 4 (0.560) 1 2 (0.926)

WPBC 1 (0.339) 2 (0.315) 3 4 (0.835)

PIMA-Indians 3 (0.745) 4 (0.354) 2 1 (0.882)

CMC 3 (0.755) 4 (0.240) 1 2 (0.953)

ACA 4 (0.004) 3 (0.038) 1 2 (0.552)

Heart-Cleveland 3 (0.407) 4 (0.247) 1 2 (0.696)

Votes 2 (1.000) 4 (0.805) 2 2 (1.000)

Sonar 4 (0.836) 1 (0.961) 2 3 (0.861)

Ionosphere 4 (0.483) 3 (0.792) 1 2 (0.847)

Two-norm 1 (0.264) 2 (0.449) 3.5 3.5 (1.000)

German 1 (0.388) 3 (0.742) 4 2 (0.412)

Thyroid 2.5 (0.593) 2.5 (1.000) 2.5 2.5 (1.000)

Average rank 2.63 3.04 2.00 2.33


Holm-Bonferroni Test

Table 2.6 also presents the p-values at the 5% significance level for the 10-fold accuracy results on the UCI datasets. For the dataset 'Heart-Statlog', (P_(1) = 0.560) < (P_(2) = 0.778) < (P_(3) = 0.926) and P_(1) > 0.05/3. Further testing for this dataset is therefore stopped and it is concluded that H_(1), H_(2), H_(3) are not rejected. This essentially means that Iν-TWSVM is statistically similar to TBSVM, ν-TWSVM and Iν-TWSVM (Fast). Similar tests are repeated for the other datasets. The numerical experiments also show that our classifier achieves accuracy comparable to TBSVM and ν-TWSVM, which is verified by the Holm-Bonferroni test. Thus, Iν-TWSVM attains accuracy comparable to the other two algorithms, but in less time.
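The step-down rule applied above can be summarized by the following generic sketch of the Holm-Bonferroni procedure (a Python illustration, not the thesis code):

    def holm_bonferroni(p_values, alpha=0.05):
        # Holm-Bonferroni step-down rule: returns which hypotheses are rejected.
        order = sorted(range(len(p_values)), key=lambda i: p_values[i])
        rejected = [False] * len(p_values)
        k = len(p_values)
        for step, idx in enumerate(order):
            if p_values[idx] > alpha / (k - step):       # compare P_(j) with alpha/(k - j + 1)
                break                                    # stop: this and all larger p-values are kept
            rejected[idx] = True
        return rejected

    # Heart-Statlog example from Table 2.6: P_(1) = 0.560 > 0.05/3, so nothing is rejected.
    print(holm_bonferroni([0.560, 0.778, 0.926]))        # -> [False, False, False]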

The Holm-Bonferroni test is also applied to the four TWSVM-based non-linear classifiers, i.e. TBSVM, ν-TWSVM, Iν-TWSVM and Iν-TWSVM (Fast). The p-values for the non-linear algorithms are listed in Table 2.7, and the Holm-Bonferroni test applied to these p-values suggests that Iν-TWSVM is similar in performance to TBSVM and ν-TWSVM.

Table 2.7: Friedman test and p-values with non-linear classifiers for UCI datasets

Dataset | TBSVM Rank (p-value) | ν-TWSVM Rank (p-value) | Iν-TWSVM Rank | Iν-TWSVM (Fast) Rank (p-value)

Heart-Statlog 3 (0.651) 4 (0.524) 1 2 (0.792)

WPBC 3 (0.831) 4 (0.498) 2 1 (0.550)

PIMA-Indians 4 (0.773) 2 (0.997) 1 3 (0.889)

CMC 4 (0.152) 2 (0.862) 1 3 (0.490)

ACA 4 (0.432) 1.5 (1.000) 1.5 3 (0.759)

Heart-Cleveland 3 (0.639) 2 (0.639) 1 4 (0.332)

Votes 3 (0.757) 2 (0.098) 1 4 (0.291)

Sonar 2.5 (1.000) 1 (0.764) 2.5 4 (0.266)

Ionosphere 1.5 (1.000) 3 (0.653) 1.5 4 (0.850)

Two-norm 4 (0.398) 3 (0.673) 2 1 (0.660)

German 4 (0.124) 3 (0.137) 1 2 (0.628)

Thyroid 4 (0.051) 2 (0.755) 1 3 (0.236)

Average rank 3.33 2.46 1.38 2.83

2.6.4 Scatter Plots

Figure 2.3: Two-dimensional projections of 21 test data points of the Thyroid dataset: (a) Iν-TWSVM, (b) ν-TWSVM.

To further compare the performance of Iν-TWSVM with that of ν-TWSVM, Figure 2.3 shows two-dimensional scatter plots of 21 test points of the Thyroid dataset with both classifiers, as also presented in [10]. Here, a star represents a point of the positive class and a diamond represents a point of the negative class. The

points appearing as clusters near the axes indicate how well the classifier is able to

discriminate between the two classes. It is observed that for both the classifiers,

the majority of test samples are clustered near their corresponding hyperplanes.

However, the projections of the two classes are well separated with Iν-TWSVM but not with ν-TWSVM. Figure 2.4 shows the two-dimensional scatter plots of the test data points (comprising 10% of the data points) for the WPBC dataset using the Iν-TWSVM and ν-TWSVM classifiers. It is clearly noticeable that Iν-TWSVM manages to achieve better separation of the data points than ν-TWSVM. Thus, Iν-TWSVM obtains better-clustered points and better-separated classes than ν-TWSVM.

Figure 2.4: Two-dimensional projections of 70 test data points of the WPBC dataset: (a) Iν-TWSVM, (b) ν-TWSVM.


Table 2.8: Classification results with linear multi-category classifiers for UCI datasets (mean accuracy (%) ± SD)

Dataset  | TWSVM (OAA)  | TWSVM (BT)   | Iν-TWSVM (OAA) | Iν-TWSVM (BT) | Iν-TWSVM (Fast) (OAA) | Iν-TWSVM (Fast) (BT)
Derma.   | 94.82 ± 1.57 | 92.38 ± 4.57 | 96.97 ± 1.60   | 97.74 ± 1.95  | 96.20 ± 3.10          | 96.23 ± 4.65
Ecoli    | 82.02 ± 3.51 | 82.88 ± 1.91 | 85.91 ± 5.66   | 86.22 ± 7.22  | 84.07 ± 5.44          | 84.70 ± 6.26
Iris     | 95.33 ± 3.80 | 97.33 ± 1.49 | 95.33 ± 3.44   | 98.00 ± 3.22  | 94.00 ± 4.33          | 96.00 ± 4.66
Seeds    | 93.81 ± 2.13 | 92.80 ± 3.61 | 95.75 ± 3.51   | 95.24 ± 3.17  | 95.71 ± 5.24          | 95.24 ± 3.88
Segment  | 88.65 ± 8.21 | 88.74 ± 8.17 | 90.00 ± 7.92   | 87.14 ± 8.99  | 90.00 ± 7.92          | 86.19 ± 9.64
Wine     | 96.43 ± 4.07 | 94.96 ± 3.09 | 99.44 ± 1.76   | 98.89 ± 3.51  | 95.49 ± 5.75          | 96.67 ± 5.36
Zoo      | 93.14 ± 6.41 | 94.05 ± 6.52 | 96.00 ± 5.16   | 98.00 ± 4.21  | 93.00 ± 9.49          | 97.00 ± 4.83

2.6.5 Multi-category Classification Results: UCI Datasets

The classification accuracy of both versions of Iν-TWSVM on multi-category UCI

datasets, with the two approaches OAA and BT, is given in Table 2.8. The performance

of Iν-TWSVM is compared with TWSVM. The experiments are conducted with

linear classifiers and the results show that Iν-TWSVM can be successfully used as

a multi-category classifier. It is also observed that the Binary Tree (BT) based

approach is better than OAA in terms of classification accuracy.

2.7 Application: Image Segmentation

In this chapter, we explore the application of Iν-TWSVM as a classifier
for the color pixel classification problem. The image is converted to the HSV color
space and quantized to two levels using the minimum variance color quantization
technique. The color-quantized image is used to create the training set for the Iν-TWSVM

classifier. The image is first partitioned into non-overlapping square windows of

size p× p and the windows are identified as homogeneous or not based on the pixel

values assigned by color quantization. For the experiments, we set p = 3. We
randomly select 1% of the homogeneous-window pixels as the training set, and all the
pixels of heterogeneous windows are test pixels. We extract Gabor features [56]
(please refer to Appendix E) with 4-orientation (0, 45, 90, 135) and 3-scale (0.5, 1.0, 2.0)
sub-bands, and the maximum of the 12 coefficients determines the orientation at a

given pixel location. This pixel classification algorithm takes full advantage of the

local information of color image and uses the ability of Iν-TWSVM classifier to

distinguish the object pixels from the background. Experimental evidence shows that
our method generates very effective results and is able to extract the object

from the background. We have also implemented pixel classification through K-

Means clustering [25] and it is observed that Iν-TWSVM is able to achieve better

classification results than K-Means clustering.
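The construction of the training and test sets from the quantized image can be summarized as follows. The sketch below is an illustrative NumPy version of the windowing step only (the HSV conversion, minimum variance quantization and Gabor feature extraction are assumed to be done separately); the function name and defaults are hypothetical, not part of the thesis implementation.

import numpy as np

def build_training_set(quantized, p=3, train_frac=0.01, seed=0):
    # quantized: 2-D array of two-level labels (0/1) produced by color quantization.
    # Returns indices (into the flattened image) of training pixels, their labels,
    # and the indices of the test pixels (pixels of heterogeneous windows).
    rng = np.random.default_rng(seed)
    h, w = quantized.shape
    homogeneous = np.zeros((h, w), dtype=bool)
    for r in range(0, h - p + 1, p):
        for c in range(0, w - p + 1, p):
            window = quantized[r:r + p, c:c + p]
            if window.min() == window.max():            # all pixels share one label
                homogeneous[r:r + p, c:c + p] = True
    homo_idx = np.flatnonzero(homogeneous)
    test_idx = np.flatnonzero(~homogeneous)
    n_train = max(1, int(train_frac * homo_idx.size))   # 1% of homogeneous-window pixels
    train_idx = rng.choice(homo_idx, size=n_train, replace=False)
    labels = quantized.reshape(-1)[train_idx]
    return train_idx, labels, test_idx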

Table 2.9: Pixel classification of color images from the BSD image dataset. (a) Original image; (b) segmentation by K-Means clustering; (c) segmentation by Iν-TWSVM.

We have performed pixel classifications experiments on color images taken from

Berkeley Segmentation Database (BSD) [57] and the results are displayed in Table

2.9. In this table, the first column shows the original RGB images taken from

BSD dataset. The second and third columns demonstrate pixel classification results

obtained by K-Means clustering and our method respectively. Each image pixel is


labeled with the color that shows its belongingness to object or background regions.

It is observed that the classification results obtained through Iν-TWSVM are better

than those obtained using K-Means. Our method is able to distinguish well between

the object pixels and the background pixels.

2.8 Conclusions

In this chapter, we have presented two novel classifiers, namely “Improvements on

ν-Twin Support Vector Machine: Iν-TWSVM and Iν-TWSVM (Fast)”, which im-

prove the learning time of TWSVM based classifiers, specifically ν-TWSVM. In Iν-

TWSVM, a smaller sized quadratic programming problem (QPP) and unconstrained

minimization problem (UMP) are solved, whereas TWSVM based classifiers solve a

pair of QPPs. Hence, Iν-TWSVM is computationally more efficient than TBSVM

and ν-TWSVM, with comparable generalization ability. The formulation of
Iν-TWSVM is attractive for handling unbalanced datasets. Iν-TWSVM (Fast) is
even faster than Iν-TWSVM, as it further reduces the size of the QPP and leads to
solving just a system of equations and a UMP. However, Iν-TWSVM
(Fast) is applicable whenever the mean is a good representative of the data. Under
these circumstances, the use of Iν-TWSVM (Fast) is strongly recommended as it
is extremely fast and has generalization capability comparable to Iν-TWSVM. Our

work has its application in pixel classification and Iν-TWSVM is able to distinguish

the object pixels from the background.

Chapter 3

Angle-based Nonparallel Hyperplanes

Classifiers

3.1 Introduction

In this chapter, we present two TWSVM based nonparallel hyperplanes classifiers

(NHCs): Angle-based Twin Parametric-Margin Support Vector Machine (ATP-

SVM) and Angle-based Twin Support Vector Machine (ATWSVM).

Most of the NHCs solve two optimization problems independently in the train-

ing phase and then their solutions are used collectively to predict the labels in the

testing phase. The predicted label of a test pattern depends on its distance from

the two hyperplanes, whereas these two distances do not appear simultaneously in

any of the two optimization problems. Hence, the training and testing phases of

such classifiers are not consistent. To deal with this condition, Shao et al. proposed

Nonparallel Hyperplanes Support Vector Machine (NHSVM) [58] which determines

two nonparallel proximal hyperplanes simultaneously i.e. by solving only one opti-

mization problem. NHSVM is considered to be logically consistent in its predicting

and training processes and has improved classification accuracy. Similar to NHSVM,

Tian and Ju proposed Nonparallel SVM based on One Optimization Problem (NSV-

MOOP) [19] which aims at separating the two classes with the largest possible angle

between their decision hyperplanes. However, NSVMOOP formulation considers

the distance of all the training points from both the hyperplanes simultaneously and

results in a QPP which is twice the size of an SVM problem.

In this chapter, a novel NHC termed as Angle-based Twin Parametric-Margin


Support Vector Machine (ATP-SVM) with single optimization problem is presented,

which is motivated by Twin Parametric-margin SVM (TPMSVM) and is formulated

on the lines of NSVMOOP. Most of the NHCs assume that the noise is uniform in

the training data or that its functional dependency is known beforehand; however, this
assumption does not always hold true and could lead to poor results. Also, the train-
ing and testing phases are not consistent due to the mismatch between the problem formulation
(the optimization problem which determines the hyperplanes) and the decision rules. The

binary classifier ATP-SVM can overcome both the above mentioned limitations.

ATP-SVM combines the merits of TPMSVM and NSVMOOP and hence the re-

sulting classifier is efficient in handling the data with unknown noise and generates

consistent results.

The idea of ATP-SVM is to solve a single optimization problem so as to generate

the two parametric-margin nonparallel hyperplanes, which bound the data so that

the respective class patterns lie on either side of corresponding hyperplanes. In order

to increase the separation between the two classes, the angle between the normal

vectors to the two hyperplanes is maximized. Unlike TWSVM, ATP-SVM avoids
computing the inverse of matrices in the dual, which is otherwise a computationally expen-

sive task. In this chapter, a training data selection procedure is introduced which

identifies the ‘representative patterns’ from the two classes to further improve the

training speed of the novel classifier. The classifier is shown to be more robust with
good generalization ability, and its efficacy is established by conducting numerical
experiments on a large number of benchmark UCI datasets.

We also present extension of ATP-SVM in multi-category environment using

One-Against-All (OAA) [32] and tree-based approach [59] like Binary Tree (BT).

This work includes application of ATP-SVM for color image segmentation into two

or more regions. When extended to multi-category scenario, ATP-SVM can be used

to identify multiple non-overlapping regions in the image. In this thesis, we have

used color images from the Berkeley Segmentation Dataset (BSD) [57].

This chapter presents another binary classifier termed as “Angle-based Twin

Support Vector Machine” (ATWSVM) which generates two nonparallel hyperplanes

by solving a pair of optimization problems. The first problem is formulated on the

lines of TWSVM and the other problem is a UMP that uses solution of the first


problem and determines the hyperplane such that angle between the normal vectors

to the two hyperplanes is maximized. The novel classifier has a generic model where

the first problem can be solved using any TWSVM-based classifier like TBSVM,

ITWSVM, twin parametric-margin SVM (TPMSVM) [16] etc. For this work, we

have used TBSVM as the first problem.

The remaining chapter is organized as follows: Section 3.2 introduces “Angle-

based Twin Parametric-Margin Support Vector Machine”. Section 3.3 presents

“Angle-based Twin Support Vector Machine” and another version of ATWSVM

is discussed in Section 3.4. The extension of classifiers in multi-category framework

is presented in Section 3.5. The complexity analysis of our classifiers is discussed

in Section 3.6. The numerical results on benchmark binary and multi-category UCI

and image datasets are given in Section 3.7. The application of classifiers for im-

age segmentation is discussed in Section 3.8. The concluding remarks are given in

Section 3.9.

3.2 Angle-based Twin Parametric-Margin Support Vector Machine

In this section, a novel binary classifier “Angle-based Twin Parametric-Margin Sup-

port Vector Machine” is presented. The classifier aims to determine two nonparallel

parametric-margin hyperplanes such that the angle θ between their normal vectors

w1 and w2 is maximized, as shown in Fig.3.1. This further results in larger separation

between the classes. Since

cos θ = (w1 · w2) / (‖w1‖₂ ‖w2‖₂),

minimizing the cosine of the angle θ achieves this objective [19]. The two

parametric-margin hyperplanes bound the respective class data on one side of them
only, and the final classifying hyperplane is obtained as shown in Fig. 3.1.
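For reference, the quantity being minimized can be computed directly from the two normal vectors; a one-function NumPy sketch (illustrative only) is:

import numpy as np

def cosine_between_normals(w1, w2):
    # cos(theta) between the normal vectors of the two hyperplanes;
    # ATP-SVM drives this value down to push the hyperplanes apart.
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))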

3.2.1 Selection of Representative Points

Similar to NSVMOOP, ATP-SVM also solves a single optimization problem. In order

to reduce the complexity of the problem, we present a procedure to identify

Figure 3.1: Geometrical illustration of the angle between normal vectors to the ATP-SVM hyperplanes

representative patterns from both the classes (Algorithm 1). Since ATP-SVM generates

parametric-margin hyperplanes which lie on the boundary of the classes, therefore,

the data points that lie on or near the periphery of a class have a prominent role in

determining the hyperplanes. Our selection procedure identifies the representative
patterns and trains the classifier with those points only. Because of this selection
procedure, the number of constraints in the QPP of ATP-SVM is reduced, and it

results in faster learning of the classifier. These selected patterns can effectively

represent the entire dataset and are used to train the classifier. In the algorithm, if

P = 50%, then it results in an optimization problem of size comparable to that of

TWSVM.
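A minimal NumPy sketch of this selection procedure (summarized as Algorithm 1 below) is given here for illustration; the function name and the default P = 50 are only assumptions.

import numpy as np

def select_representatives(A, B, P=50):
    # Keep the P% points of each class that are farthest from their class mean.
    def farthest(X):
        dist = np.linalg.norm(X - X.mean(axis=0), axis=1)   # distance to own class mean
        keep = max(1, int(np.ceil(P / 100.0 * X.shape[0])))
        return X[np.argsort(dist)[::-1][:keep]]             # most distant P% patterns
    return farthest(A), farthest(B)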

3.2.2 ATP-SVM (Linear version)

The primal problem of linear ATP-SVM is given as:

ATP-SVM (Primal):

min_{z1, z2, ξ1, ξ2}   (1/2)(‖z1‖₂² + ‖z2‖₂²) + c1(e1^T ξ1 + e2^T ξ2) + c2(e2^T G z1 − e1^T H z2) + c3(z1 · z2)

subject to   H z1 ≥ 0 − ξ1,   ξ1 ≥ 0,
             G z2 ≤ 0 + ξ2,   ξ2 ≥ 0.   (3.1)

Here, H = [A^f e1], G = [B^f e2], z1 = [w1^T b1]^T and z2 = [w2^T b2]^T. The


Algorithm 1: Selection of representative points
Input: Training data X = {A, B}, percentage P
Output: Representative patterns X^f = {A^f, B^f}
Process:
1. Find the mean of both classes: mean1, mean2.
2. For each training point i, find its Euclidean distance from the mean of its own class: dist_i = ‖X_i − mean_j‖₂, where j ∈ {1, 2}.
3. Select the most distant P% patterns from both classes, based on distance from the respective means, as A^f and B^f.

matrices H and G are augmented matrices of representative patterns for positive

and negative classes respectively. The normal vectors to the hyperplanes are repre-

sented by the augmented vectors z1 and z2; e1 and e2 are vectors of ones of appropriate
dimensions. Our formulation follows the SRM principle [44] due to the regularization
term (1/2)(‖z1‖₂² + ‖z2‖₂²) in the objective function and has good generalization ability.

Since, ATP-SVM is formulated as a soft margin classifier, it permits violation of con-

straints. The error due to infringement is measured in slack variables represented

by ξ1 and ξ2. The objective function minimizes this error with a positive penalty

parameter c1, for both the classes. The third term of the objective function aims at

maximizing the projection of data points on the hyperplane of other class i.e. it tries

to drive the points of one class away from the hyperplane of other class. The term

(z1.z2) is motivated by the formulation of NSVMOOP, and tries to maximize the

angle between the augmented normal vectors z1 and z2. The positive parameters

c2, c3 are the associated weights. ATP-SVM takes into consideration the princi-

ple of empirical risk minimization (ERM). The constraints of (3.1) require that the

samples of positive class must lie on that side of positive hyperplane which is away

from the negative class and vice-versa. The Lagrangian function [13] for the primal

problem of ATP-SVM (3.1) is given by

L(z1, z2, ξ1, ξ2) = (1/2)(‖z1‖₂² + ‖z2‖₂²) + c1(e1^T ξ1 + e2^T ξ2) + c2(e2^T G z1 − e1^T H z2)
+ (c3/2)(z1 · z2 + z2 · z1) − α1^T (H z1 + ξ1) + α2^T (G z2 − ξ2) − β1^T ξ1 − β2^T ξ2,   (3.2)

where α1 = (α1^1, α1^2, ..., α1^{m1})^T, α2 = (α2^1, α2^2, ..., α2^{m2})^T, β1 = (β1^1, β1^2, ..., β1^{m1})^T and
β2 = (β2^1, β2^2, ..., β2^{m2})^T are Lagrange multipliers of dimensions (m1 × 1), (m2 × 1),
(m1 × 1), and (m2 × 1), respectively. The Karush-Kuhn-Tucker (KKT) necessary

and sufficient optimality conditions [13] are given by

∂L/∂z1 = 0 ⇒ z1 + c2 G^T e2 + c3 z2 − H^T α1 = 0,   (3.3)
∂L/∂z2 = 0 ⇒ z2 − c2 H^T e1 + c3 z1 + G^T α2 = 0,   (3.4)
∂L/∂ξ1 = 0 ⇒ c1 e1 − α1 − β1 = 0,   (3.5)
∂L/∂ξ2 = 0 ⇒ c1 e2 − α2 − β2 = 0,   (3.6)
−H z1 − ξ1 ≤ 0,   (3.7)
G z2 − ξ2 ≤ 0,   (3.8)
ξ1, ξ2 ≥ 0,   (3.9)
α1, α2, β1, β2 ≥ 0,   (3.10)
α1^T (H z1 + ξ1) = 0,   (3.11)
α2^T (G z2 − ξ2) = 0,   (3.12)
β1^T ξ1 = 0,   (3.13)
β2^T ξ2 = 0.   (3.14)

Since β1, β2 ≥ 0, from (3.5) and (3.6)

0 ≤ α1 ≤ c1e1 and 0 ≤ α2 ≤ c1e2. (3.15)

From (3.3) and (3.4), we get

z1 = (1/(1 − c3²)) (H^T α1 − c2 c3 H^T e1 + c3 G^T α2 − c2 G^T e2),   (3.16)
z2 = (1/(1 − c3²)) (−G^T α2 + c2 c3 G^T e2 − c3 H^T α1 + c2 H^T e1).   (3.17)

By substituting z1 and z2 from (3.16), (3.17) into the Lagrangian L (3.2) and using

KKT optimality conditions, we obtain the dual of ATP-SVM as,

ATP-SVM (Dual):

max_α   (1/2) α^T λ α + f^T α
subject to   lb ≤ α ≤ ub,   (3.18)

where α is the augmented vector given by α = [α1^T, α2^T]^T,

λ = (−1/(1 − c3²)) [ H H^T , c3 H G^T ; c3 G H^T , G G^T ],   (3.19)

f = (1/(1 − c3²)) [ c2 c3 (H H^T) e1 + c2 (H G^T) e2 ; c2 c3 (G G^T) e2 + c2 (G H^T) e1 ],   (3.20)

lb = [0 ; 0],   ub = [c1 e1 ; c1 e2].   (3.21)

The dual problem of (3.18) can be solved by standard MATLAB functions like

quadprog(). The solution thus obtained by solving (3.18) is used to find z1 and

z2 through (3.16), (3.17). These augmented vectors z1 and z2 give the hyperplane

parameters, i.e. w1*, w2*, b1* and b2*, and generate the parametric-margin hyperplanes
as given in (1.1). Once the hyperplanes h1(x) and h2(x) are obtained, the final
classifying hyperplane is given by

h(x) = x^T (w1/‖w1‖₂ + w2/‖w2‖₂) + (b1/‖w1‖₂ + b2/‖w2‖₂) = 0.   (3.22)

A new data sample x ∈ R^n is assigned to class r (r = +1, −1) based on its relative
position to h(x), and the class label is given by

y = sign( x^T (w1/‖w1‖₂ + w2/‖w2‖₂) + (b1/‖w1‖₂ + b2/‖w2‖₂) ).   (3.23)
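For illustration, the sketch below assembles λ and f as in (3.19)-(3.20), solves the box-constrained dual (3.18) with a generic SciPy solver (the thesis uses MATLAB's quadprog()), recovers z1 and z2 via (3.16)-(3.17), and applies the decision rule (3.23). All function names and default parameter values here are assumptions made for the example, not the thesis implementation.

import numpy as np
from scipy.optimize import minimize

def train_atp_svm_linear(Af, Bf, c1=1.0, c2=0.5, c3=0.5):
    # Af, Bf: representative patterns of the positive and negative classes.
    m1, m2 = Af.shape[0], Bf.shape[0]
    e1, e2 = np.ones(m1), np.ones(m2)
    H = np.hstack([Af, e1[:, None]])                     # H = [Af  e1]
    G = np.hstack([Bf, e2[:, None]])                     # G = [Bf  e2]
    d = 1.0 - c3 ** 2
    lam = (-1.0 / d) * np.block([[H @ H.T, c3 * (H @ G.T)],
                                 [c3 * (G @ H.T), G @ G.T]])
    f = (1.0 / d) * np.concatenate([c2 * c3 * (H @ H.T) @ e1 + c2 * (H @ G.T) @ e2,
                                    c2 * c3 * (G @ G.T) @ e2 + c2 * (G @ H.T) @ e1])
    # maximize (1/2) a^T lam a + f^T a  <=>  minimize its negative over 0 <= a <= c1
    obj = lambda a: -(0.5 * a @ lam @ a + f @ a)
    grad = lambda a: -(lam @ a + f)
    res = minimize(obj, np.zeros(m1 + m2), jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, c1)] * (m1 + m2))
    a1, a2 = res.x[:m1], res.x[m1:]
    z1 = (H.T @ a1 - c2 * c3 * (H.T @ e1) + c3 * (G.T @ a2) - c2 * (G.T @ e2)) / d   # (3.16)
    z2 = (-(G.T @ a2) + c2 * c3 * (G.T @ e2) - c3 * (H.T @ a1) + c2 * (H.T @ e1)) / d  # (3.17)
    return z1, z2

def atp_svm_predict(z1, z2, X):
    # Decision rule (3.23).
    w1, b1, w2, b2 = z1[:-1], z1[-1], z2[:-1], z2[-1]
    n1, n2 = np.linalg.norm(w1), np.linalg.norm(w2)
    return np.sign(X @ (w1 / n1 + w2 / n2) + (b1 / n1 + b2 / n2))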

3.2.3 ATP-SVM (Kernel version)

By considering the kernel-generated surfaces instead of hyperplanes, the classifier

ATP-SVM can be extended to a non-linear version. The surfaces are given as

Ker(x^T, (X^f)^T) v1 + b1 = 0,   Ker(x^T, (X^f)^T) v2 + b2 = 0,   (3.24)


where (X^f)^T = [A^f B^f]^T and Ker is an appropriately chosen kernel. The primal

QPP of the non-linear ATP-SVM is given by

KATP-SVM (Primal):

min_{u1, u2, η1, η2}   (1/2)(‖u1‖₂² + ‖u2‖₂²) + c1(e1^T η1 + e2^T η2) + c2(e2^T N u1 − e1^T M u2) + c3(u1 · u2)

subject to   M u1 ≥ 0 − η1,   η1 ≥ 0,
             N u2 ≤ 0 + η2,   η2 ≥ 0,   (3.25)

where u1 = [v1 b1], u2 = [v2 b2], M = [Ker(A^f, (X^f)^T) e1] and N = [Ker(B^f, (X^f)^T) e2].

For representative samples, the selection procedure is applied as discussed in Section

3.2.1. Here, Af , Bf and X f refer to representative points of positive, negative and

both the classes respectively. The Wolfe dual of (KATP-SVM) is given by

KATP-SVM (Dual):

max_γ   (1/2) γ^T κ γ + g^T γ
subject to   lb_k ≤ γ ≤ ub_k,   (3.26)

where γ is the augmented vector given by γ = [γ1^T, γ2^T]^T,

κ = (−1/(1 − c3²)) [ M M^T , c3 M N^T ; c3 N M^T , N N^T ],   (3.27)

g = (1/(1 − c3²)) [ c2 c3 (M M^T) e1 + c2 (M N^T) e2 ; c2 c3 (N N^T) e2 + c2 (N M^T) e1 ],   (3.28)

lb_k = [0 ; 0],   ub_k = [c1 e1 ; c1 e2].   (3.29)

The solution obtained by solving (3.26) is used to find u1* and u2* through the following equations:

u1 = (1/(1 − c3²)) (M^T γ1 − c2 c3 M^T e1 + c3 N^T γ2 − c2 N^T e2),   (3.30)
u2 = (1/(1 − c3²)) (−N^T γ2 + c2 c3 N^T e2 − c3 M^T γ1 + c2 M^T e1).   (3.31)

The parameters of the kernel-generated surfaces, i.e. (v1, b1) and (v2, b2) as given in (3.24),
can be obtained from the augmented vectors u1 and u2 respectively. A new pattern
x ∈ R^n is assigned to class +1 or class −1 in a manner similar to the linear case.

3.3 Angle-based Twin Support Vector Machine

We present another novel binary classifier “Angle-based Twin Support Vector Ma-

chine” (ATWSVM) which solves a pair of optimization problems to determine two

nonparallel hyperplanes. ATWSVM aims at developing a classifier model that re-

duces the time complexity of nonparallel hyperplanes classifiers. The two nonparallel

hyperplanes of ATWSVM are generated by solving a pair of optimization prob-

lems where the first problem is formulated on the lines of TWSVM. The second

problem is a UMP that uses the solution of the first problem and determines the

hyperplane such that angle between the normal vectors to the two hyperplanes is

maximized, as shown in Fig.3.2. For this work, we have used TBSVM as the first

problem. ATWSVM has efficient learning time with good generalization ability

when compared with TWSVM based classifiers. The second optimization problem

of ATWSVM avoids solving QPP as solved by TWSVM or TBSVM, and is there-

fore more efficient. ATWSVM implements SRM as well as ERM principle and has

comparable testing accuracy as TWSVM and TBSVM. The efficacy of our classifier

is established by conducting experiments on synthetic as well as benchmark UCI

and NDC datasets. This chapter includes the application of ATWSVM for image

segmentation.

This thesis presents one more version of ATWSVM termed as “Least Squares

Angle-based Twin Support Vector Machine” (LS-ATWSVM). From this point on-

wards, the first version will be referred as ATWSVM and second as LS-ATWSVM.

For LS-ATWSVM, the first optimization problem is formulated on the lines of LS-

TWSVM. The second hyperplane is determined so that it is proximal to one class

and the angle θ between the normal vectors to the two hyperplanes is maximized.

Therefore, LS-ATWSVM formulates both the problems as UMPs; it is discussed

in Section 3.4. The following section presents the linear and non-linear versions of

ATWSVM.


3.3.1 ATWSVM (Linear version)

ATWSVM is developed on the lines of TWSVM, where the first hyperplane of

ATWSVM can be determined using any TWSVM-based formulation; the second

hyperplane is determined so that it is proximal to one class and the angle θ between

the normal vectors to the two hyperplanes is maximized, as shown in Fig.3.2. This

leads to finding larger separation between the two classes. For this work, TBSVM

has been used as the first problem.

ATWSVM: First problem

The formulation of first problem of ATWSVM is similar to that of TBSVM [15] and

is given by:

ATWSVM1:

min_{w1, b1, ξ}   (c1/2)(‖w1‖₂² + b1²) + (1/2)‖A w1 + e1 b1‖₂² + c3 e2^T ξ

subject to   −(B w1 + e2 b1) + ξ ≥ e2,   ξ ≥ 0.   (3.32)

The parameters c1 and c3 in (3.32) are the weights associated with structural risk

and empirical risk respectively. The regularization term (c1/2)(‖w1‖₂² + b1²) widens the
margin between the two classes with respect to the plane w1^T x + b1 = 0 [60]. The solution for

(3.32) is obtained by solving its Lagrangian function and using Karush-Kuhn-Tucker

conditions [13]. The Wolfe dual of (ATWSVM1) is given by [15]:

max_α   e2^T α − (1/2) α^T G (H^T H + c1 I)⁻¹ G^T α
subject to   0 ≤ α ≤ c3,   (3.33)

where, H = [A e1], G = [B e2] are augmented matrices of respective classes and

α = (α1, α2, ..., αm2)T are Lagrange multipliers. The regularization term takes care

of the possible ill-conditioning of (HTH + c1I) term of ATWSVM, where I is the

identity matrix of appropriate size. The augmented vector u1 = [w1, b1]T is given

by

u1 = −(H^T H + c1 I)⁻¹ G^T α.   (3.34)


ATWSVM: Second problem

The major contribution of this work is the formulation of second problem of ATWSVM

as an unconstrained minimization problem. The problem is given as:

ATWSVM2:

min_{w2, b2}   P2 = c2 ‖B w2 + e2 b2‖₂² + c4 (w1^T w2 + b1 b2) + (c5/2)(‖w2‖₂² + b2²),   (3.35)

where c2, c4 and c5 > 0 are the weights associated with the corresponding terms.

There exists a trade-off between the first and the second terms of (3.35), which is
reflected in the choice of c2 and c4 such that c2 + c4 = 1. In order to give more weight
to the angle term (i.e. (w1^T w2 + b1 b2)), we select c4 to be greater than 0.5 and

the value of c2 is adjusted accordingly. The first term minimizes the sum of squared

distances of the negative hyperplane from the data points of class B and keeps the

hyperplane proximal to the negative class. By keeping the hyperplane close to its

corresponding class, ATWSVM follows the empirical risk minimization principle.

ATWSVM also takes into consideration the principle of SRM by minimizing the

regularization term ‖w2‖₂² + b2². The above problem does not require data points of

class A, but it uses the optimal hyperplane of positive class.

In (3.35), w1 and b1 represent the optimal hyperplane parameters obtained by
solving (3.34). The term w1^T w2 + b1 b2 is added with the idea of maximizing the
angle between the normal vectors w1 and w2. The two proximal hyperplanes and

the angle between their normals are shown in Fig.3.2. ATWSVM2 determines a

hyperplane which is proximal to the patterns of class B and at maximum angle from

the positive hyperplane (w1, b1).

Setting the gradient of P2 with respect to w2 and b2 equal to zero gives

∂P2/∂w2 = 0 ⇒ 2c2 B^T (B w2 + e2 b2) + c4 w1 + c5 w2 = 0,   (3.36)
∂P2/∂b2 = 0 ⇒ 2c2 e2^T (B w2 + e2 b2) + c4 b1 + c5 b2 = 0.   (3.37)


Figure 3.2: Geometrical illustration of the angle between normal vectors to the ATWSVM hyperplanes

By combining (3.36) and (3.37), we get

[ B^T B + (c5/(2c2)) I_n , B^T e2 ; e2^T B , e2^T e2 + c5/(2c2) ] [w2; b2] = [ −(c4/(2c2)) w1 ; −(c4/(2c2)) b1 ].   (3.38)

Here, I_n is the identity matrix of order n × n. By using the augmented matrix G = [B e2] and
the vector u2 = [w2; b2] in (3.38),

( G^T G + [ (c5/(2c2)) I_n , 0 ; 0 , c5/(2c2) ] ) u2 = [ −(c4/(2c2)) w1 ; −(c4/(2c2)) b1 ].   (3.39)

This further implies that

( G^T G + [ (c5/(2c2)) I_n , 0 ; 0 , c5/(2c2) ] ) u2 = −(c4/(2c2)) [w1; b1],   (3.40)

⇒ ( G^T G + (c5/(2c2)) I_{n+1} ) u2 = −(c4/(2c2)) [w1; b1],   (3.41)

⇒ u2 = −(c4/(2c2)) ( G^T G + (c5/(2c2)) I_{n+1} )⁻¹ [w1; b1],   (3.42)

which involves a matrix inverse operation of order (n+ 1)× (n+ 1). The augmented


vectors u1 and u2 can be obtained from (3.34) and (3.42) respectively and are used

to generate the hyperplanes given by (1.1). Testing a new pattern is done on the

lines of TWSVM.
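Since the second problem of ATWSVM has the closed form (3.42), it reduces to a single linear solve. The NumPy sketch below illustrates this step, assuming the first hyperplane (w1, b1) has already been obtained from a TBSVM-style first problem such as (3.32)-(3.34); the function name and parameter defaults are illustrative only.

import numpy as np

def atwsvm_second_hyperplane(B, w1, b1, c2=0.4, c4=0.6, c5=1e-3):
    # Closed-form solution (3.42): u2 = -(c4/2c2) (G^T G + (c5/2c2) I)^{-1} [w1; b1].
    m2, n = B.shape
    G = np.hstack([B, np.ones((m2, 1))])                 # G = [B  e2]
    rhs = -(c4 / (2.0 * c2)) * np.append(w1, b1)
    u2 = np.linalg.solve(G.T @ G + (c5 / (2.0 * c2)) * np.eye(n + 1), rhs)
    return u2[:-1], u2[-1]                               # w2, b2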

3.3.2 ATWSVM (Kernel version)

The classifier ATWSVM is extended to non-linear version by considering kernel-

generated surfaces (1.9-1.10). The first primal problem of the non-linear ATWSVM

is given by

K-ATWSVM1:

min_{z1, b1, ξ}   (c1/2)(‖z1‖₂² + b1²) + (1/2)‖Ker(A, C^T) z1 + e1 b1‖₂² + c3 e2^T ξ

subject to   −(Ker(B, C^T) z1 + e2 b1) + ξ ≥ e2,   ξ ≥ 0.   (3.43)

The solution of (3.43) is obtained in a similar manner as in the linear case, and its Wolfe dual
is given by:

max_β   e2^T β − (1/2) β^T S (R^T R + c1 I)⁻¹ S^T β
subject to   0 ≤ β ≤ c3,   (3.44)

where, R = [Ker(A,CT ) e1], S = [Ker(B,CT ) e2] are augmented matrices of

respective classes and β = (β1, β2, ..., βm2)T are Lagrange multipliers. The regu-

larization term takes care of the possible ill-conditioning of (RTR + c1I) term of

ATWSVM. The augmented vector v1 = [z1, b1]T is given by

v1 = −(R^T R + c1 I)⁻¹ S^T β.   (3.45)

The kernel version of second problem of ATWSVM is given as:

K-ATWSVM2:

min_{z2, b2}   c2 ‖Ker(B, C^T) z2 + e2 b2‖₂² + c4 (z1^T z2 + γ1 b2) + (c5/2)(‖z2‖₂² + b2²),   (3.46)

where c2, c4 and c5 are positive weights associated with the corresponding terms. By

setting the gradient of (3.46) with respect to z2 and b2 equal to zero and rearranging


the equations, we get

v2 = −(c4/(2c2)) ( S^T S + (c5/(2c2)) I_{m+1} )⁻¹ [z1; γ1],   (3.47)

where v2 = [z2, b2]T is the augmented vector.

Geometric Interpretation

The novel classifier is developed on the lines of TWSVM or TBSVM, but the geo-

metric interpretation of ATWSVM is quite different from that of TBSVM. TBSVM

determines the two nonparallel hyperplanes such that they are proximal to their

corresponding class and unit distance away from the other class. ATWSVM gen-

erates the first (positive) hyperplane in similar manner as TBSVM but the second

(negative) hyperplane is obtained by maximizing the angle between normal vectors

to the hyperplanes and simultaneously minimizing the distance of negative hyper-

plane from the negative class. Fig. 3.3a and 3.3b show the classifiers obtained by

ATWSVM and TBSVM respectively. It is observed that ATWSVM generates planes

which are separated by a larger angle than those of TBSVM.

3.4 Other Versions of ATWSVM

The idea of ATWSVM is to obtain the first hyperplane by solving one TWSVM-

based problem and the other hyperplane is obtained by solving an angle-based

unconstrained minimization problem. Therefore, the learning time complexity of

ATWSVM is almost half of TWSVM. The first problem of ATWSVM classifier can

be formulated using any variant of TWSVM like ITWSVM [61], TPMSVM [16],

LS-TWSVM [14], etc. The second problem remains the same as discussed
above for ATWSVM. The Least Squares version of ATWSVM (LS-ATWSVM) is de-

scribed below as an illustration.

Least Squares ATWSVM (LS-ATWSVM)

The first problem of LS-ATWSVM is motivated by LS-TWSVM [14] and the second

problem is the same as for ATWSVM. However, LS-TWSVM minimizes only the


Figure 3.3: Classifiers obtained for the synthetic dataset (Syn1): (a) ATWSVM, (b) TBSVM

empirical risk in the primal problems and deals with the inverses of the matrices (H^T H)
and (G^T G), where H = [A e1] and G = [B e2]. To get the solution for the dual

problems, LS-TWSVM assumes that the inverses of these matrices always exist and

the matrices are always positive semidefinite. Taking motivation from TBSVM [15],

we have modified the first primal problem of LS-TWSVM by adding a regularization

term, which minimizes the structural risk and takes care of possible ill-conditioning

of the matrices before inversion. The first optimization problem of LS-ATWSVM is given

by:

min_{w1, b1, ξ}   P1 = (c1/2)(‖w1‖₂² + b1²) + (1/2)‖A w1 + e1 b1‖₂² + (c3/2)‖ξ‖₂²

subject to   −(B w1 + e2 b1) + ξ = e2.   (3.48)


The objective function of (3.48) is similar to that of LS-TWSVM, with an added
term (c1/2)(‖w1‖₂² + b1²) which widens the margin between the two classes with respect to
the plane w1^T x + b1 = 0 [60]. The regularization term also takes care of the possible ill-
conditioning of the (G^T G + (1/c1) H^T H) term of LS-TWSVM. On substituting the error variable
ξ in the objective function of (3.48), the problem is formulated as a UMP, given as

min_{w1, b1}   P1 = (c1/2)(‖w1‖₂² + b1²) + (1/2)‖A w1 + e1 b1‖₂² + (c3/2)‖B w1 + e2 b1 + e2‖₂².   (3.49)

Setting the gradient of P1 with respect to w1 and b1 equal to zero, we get:

∂P1/∂w1 = 0 ⇒ c1 w1 + A^T (A w1 + e1 b1) + c3 B^T (B w1 + e2 b1 + e2) = 0,   (3.50)
∂P1/∂b1 = 0 ⇒ c1 b1 + e1^T (A w1 + e1 b1) + c3 e2^T (B w1 + e2 b1 + e2) = 0.   (3.51)

Rearranging equations (3.50) and (3.51) gives

c1 [w1; b1] + [A e1]^T [A e1] [w1; b1] + c3 [B e2]^T [B e2] [w1; b1] + c3 [B e2]^T e2 = [0; 0].

Let H = [A e1], G = [B e2] and the augmented vector u1 = [w1, b1]T , then

c1 u1 + H^T H u1 + c3 G^T G u1 = −c3 G^T e2,   (3.52)

which further implies that

u1 = −c3 (c1 I_{n+1} + H^T H + c3 G^T G)⁻¹ G^T e2.   (3.53)

Here, I_{n+1} is the identity matrix of order (n + 1) × (n + 1). In (3.53), the term
c1 I_{n+1} takes care of the possible ill-conditioning problem. So, the first hyperplane

is obtained by solving (3.53), which requires a matrix inverse operation of order

(n+ 1)× (n+ 1) for linear case. The second hyperplane is obtained as discussed for

ATWSVM.
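A minimal NumPy sketch of this first LS-ATWSVM step, i.e. the closed form (3.53), is shown below for illustration; the function name and parameter values are placeholders, not the thesis code.

import numpy as np

def ls_atwsvm_first_hyperplane(A, B, c1=1e-3, c3=0.1):
    # Closed form (3.53): u1 = -c3 (c1 I + H^T H + c3 G^T G)^{-1} G^T e2.
    H = np.hstack([A, np.ones((A.shape[0], 1))])         # H = [A  e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])         # G = [B  e2]
    e2 = np.ones(B.shape[0])
    k = H.shape[1]                                       # n + 1
    u1 = -c3 * np.linalg.solve(c1 * np.eye(k) + H.T @ H + c3 * (G.T @ G), G.T @ e2)
    return u1[:-1], u1[-1]                               # w1, b1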


3.5 Multi-category Extension of ATP-SVM and ATWSVM

In this chapter, we extend ATP-SVM, ATWSVM, NHSVM and NSVMOOP to

multi-category scenario using well established OAA and BT approaches. The fol-

lowing two subsections explain the extension of ATP-SVM and similar procedures

are repeated for ATWSVM, NHSVM and NSVMOOP to perform the numerical

experiments.

One-Against-All

To solve a K-class classification problem using the One-Against-All (OAA) multi-category
approach, K binary ATP-SVM classifiers are built. The training data is created for
the K binary problems in a similar manner as explained for OAA-TWSVM in Section
1.3.1. This data is used as input for ATP-SVM in (3.1) or (3.25) to generate a pair
of hyperplanes for the i-th classifier, where i = 1 to K. Therefore, K pairs of hyperplanes
are obtained. Testing is done on the lines of OAA-TWSVM.
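An illustrative sketch of the OAA training loop is given below; train_binary stands for any binary trainer (for example, the ATP-SVM sketch above) and is a hypothetical name, not a routine from the thesis.

import numpy as np

def train_oaa(X, y, train_binary):
    # One classifier per class: class k is the positive class, all others negative.
    models = {}
    for k in np.unique(y):
        A, B = X[y == k], X[y != k]
        models[k] = train_binary(A, B)
    return models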

Binary Tree

The Binary Tree based multi-category approach builds the classifier model by recur-

sively dividing the training data into two groups and finds the hyperplanes for the

groups thus obtained [59]. For extending ATP-SVM, use (3.1) or (3.25) to obtain

the hyperplanes. The data is partitioned by applying K-Means (k=2) clustering

[47, 25]. This process is repeated until further partitioning is not possible. The pro-

cedure for Binary Tree-based multi-category approach is discussed in Section 5.2.1.

The hyperplanes obtained by OAA and BT for a 3-class problem are shown in Fig.
3.4. Here, OAA is not able to perform well due to confused/ambiguous patterns,
whereas BT can easily handle this condition and gives better results.

3.6 Discussion

In this section, a comparison of ATP-SVM with NSVMOOP and TPMSVM is pre-

sented.


Figure 3.4: Three-class classification with (a) OAA-NHC and (b) BT-NHC

ATP-SVM vs. NSVMOOP

The classifiers ATP-SVM and NSVMOOP generate a pair of hyperplanes by solving

one QPP, as shown in Fig.3.5. It is observed that the geometric interpretation of

both the classifiers is quite different. The hyperplanes generated by NSVMOOP are

proximal to their respective classes and hence, the data points of a class lie on both

sides of its hyperplane. Whereas for ATP-SVM, the hyperplanes are the bounding

planes and lie along the class-boundary. Therefore, the data points of a class lie on

one side of its hyperplane. Due to the formulation of ATP-SVM, it can efficiently

handle heteroscedastic noise.

The complexity of a QPP is of the order O(m³), where m is the number of con-

straints. NSVMOOP considers the distance of all the training points from both the

hyperplanes in the constraints and leads to a QPP which is twice the size of an SVM

problem. Since there are 2m constraints in NSVMOOP optimization problem (refer

(1.21)), where m is the number of data points in both the classes, its complexity
is O((2m)³). Therefore, the use of NSVMOOP is restricted to small datasets only

or it requires efficient solvers like Sequential Minimal Optimization (SMO) [62] to

make it feasible for large datasets. ATP-SVM formulation considers the distance

of training points from their corresponding class hyperplane in the constraints and

the objective function takes into consideration the projection of points of one class


Figure 3.5: Geometric interpretation of (a) ATP-SVM, (b) NSVMOOP and (c) TPMSVM

on other hyperplane. Hence, it constructs a QPP half the size of NSVMOOP. The

representative samples can effectively reduce the size of QPP for ATP-SVM. If the

size of the representative set is even half the size of training set, then the QPP for

ATP-SVM would have m/2 constraints. Therefore, it results in a primal problem

of size comparable to that of TWSVM i.e. O((m/2)3). For certain datasets, the

boundary points may not represent the entire class and the representative set would

consist of all data points i.e. m patterns. Under such circumstances, the complexity

of ATP-SVM would be similar to that of SVM.

ATP-SVM vs. TPMSVM

The major difference between ATP-SVM and TPMSVM is the formulation of their

primal problems where TPMSVM solves a pair of QPPs and ATP-SVM solves a

single QPP. The geometric interpretation of both the classifiers is similar and they

can handle heteroscedastic noise. Due to single optimization problem, the testing

and training phases of ATP-SVM are more consistent and hence it has got bet-

ter generalization ability. With the use of representative samples for training, the

complexity of ATP-SVM is comparable to that of TPMSVM.


ATWSVM vs. TBSVM

Assuming that the dataset consists of two classes of almost comparable size (ap-

proximately m/2 samples in each class), then the learning time of linear ATWSVM

is almost half the learning time of TBSVM. It is because ATWSVM solves a system

of linear equations and a QPP instead of a pair of QPPs, as solved by TBSVM. For

linear ATWSVM, the classification problem

1. solves a QPP of order (m/2)³ and

2. performs a matrix inverse of a smaller-dimension matrix, of order (n+1) × (n+1), where n ≪ m,

whereas TBSVM solves two QPPs of order (m/2)³.

The significant contribution of our algorithm is that it improves the complexity

of the TWSVM-based classifier by more than a factor of two. ATWSVM determines
the two hyperplanes by solving a QPP and a UMP, with the QPP of smaller size solved
first. This essentially means that whichever class has the larger number of data points
would be the one for which the QPP is solved (since the size of the QPP is determined by
the number of constraints, which is equal to the number of patterns in the other
class), and for the other class the UMP is solved. Thus, the QPP is formulated with
a smaller number of constraints and the user has flexibility in selecting the class for

which QPP is solved.

Most of the real world datasets are imbalanced, including UCI datasets. If the

number of patterns in the positive and negative classes are m1 and m2 respec-

tively (without loss of generality m1 > m2), then the positive hyperplane would
be obtained by solving a QPP of order (m2)³ with linear ATWSVM; and the sec-
ond hyperplane is obtained by solving a UMP whose complexity is no more than
n³ (linear). Consider the case when m1 ≈ 2·m2; then the complexity of ATWSVM
would be of order ((m2)³ + n³) ≈ (m2)³, whereas the complexity of TBSVM is
((m1)³ + (m2)³) ≈ ((2m2)³ + (m2)³) = 9(m2)³. This is proved experimentally with

NDC datasets in Section 3.7.2. For non-linear ATWSVM, rectangular RBF kernel

is used which makes ATWSVM more time-efficient than TWSVM and TBSVM.


3.7 Experimental Results

In order to evaluate the performance of both angle-based classifiers i.e. ATP-SVM

and ATWSVM, extensive experimentation has been performed on synthetic and

benchmark UCI [52] datasets. The performance of these algorithms is measured in

terms of classification accuracy and computational efficiency. For ATP-SVM, all the

experiments are performed with representative samples which are half the size of

training set.

Parameter Settings

For ATP-SVM, we have selected values of c1 in the range 0.1 to 0.9. The parameter

c2 is selected so that the ratio of c2 and c1 is in the range 0.1 to 1. There is a trade-

off between the values of c1 and c3, such that c1 is always more than c3. In order to

control the bias and over-fitting, the experiments are performed using 10-fold cross

validation [49].

The formulation of ATWSVM involves parameters ci (i = 1 to 5) and kernel pa-

rameter σ for the nonlinear classifier. The parameters c1 and c5 associate weights with
the regularization terms and are assigned values of order 10⁻⁶ to 1; c3 is associated

with the error term of ATWSVM1 and is given a value in the range (0,1]. There

exists a trade-off between c2 and c4, such that c2 + c4 = 1. To study the influence of

parameters on the performance of our classifier, experiments have been performed

by varying the values of parameters. Fig. 3.6a shows the accuracy achieved for

WPBC dataset with different values of c1 and c2. For this experiment, c1 and c5
are assigned the same value from the set {10^i : i = −6, ..., 0}. The parameter c3 is

fixed to 0.1. The parameter c2 varies between (0,1] and the value of c4 is adjusted

according to the assumption c2 + c4 = 1. It is observed that ATWSVM achieves

good classification accuracy for very small values of c1 for WPBC dataset. Fig. 3.6b

shows the accuracy achieved by ATWSVM for Thyroid dataset.
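The parameter search described above can be expressed as a small grid; the sketch below only enumerates the combinations under the stated constraints (c1 = c5, c3 fixed at 0.1, and c2 + c4 = 1 with c4 greater than 0.5) and is illustrative of the setup, each setting then being scored by 10-fold cross-validation.

from itertools import product

# c1 = c5 from {10^-6, ..., 10^0}, c3 fixed at 0.1, and c2 + c4 = 1 with c4 > 0.5.
c1_values = [10.0 ** i for i in range(-6, 1)]
c2_values = [round(0.1 * j, 1) for j in range(1, 5)]     # 0.1 ... 0.4, so c4 = 0.9 ... 0.6
grid = [dict(c1=c1, c5=c1, c3=0.1, c2=c2, c4=round(1.0 - c2, 1))
        for c1, c2 in product(c1_values, c2_values)]
# Each setting is then scored with 10-fold cross-validation on the training data.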

3.7.1 Synthetic Datasets

In this thesis, the efficiency of ATP-SVM and ATWSVM is tested for synthetic

datasets. (Please refer Appendix D for more details on synthetic datasets.)


Figure 3.6: Influence of parameters on the performance of the ATWSVM classifier for (a) WPBC and (b) Thyroid. The parameters c1 and c5 are assigned the same value, c3 = 0.1 and c2 + c4 = 1.

Dataset 1: Cross planes

Fig.3.7 illustrates the simulation results, with linear version of ATP-SVM and NSV-

MOOP for cross planes data with 200 training points. The red ‘dots’ and blue

‘plus’ represent the data points of two classes. It is observed that both the classi-

fiers are able to generate hyperplanes close to their respective class and away from

the other class. ATP-SVM achieves training accuracy of 99%, whereas NSVMOOP

achieves 98% accuracy. Parallel hyperplanes classifiers like SVM and PSVM could

not generate good results with cross-planes data.


Figure 3.7: Hyperplanes obtained by (a) ATP-SVM and (b) NSVMOOP for the cross-planes dataset

Dataset 2: Complex XOR

Fig.3.8 shows the nonparallel planes obtained with linear version of ATWSVM and

TBSVM. The data consists of 120 patterns in R² where red ‘dots’ (80) and blue

‘stars’ (40) represent the data points of positive and negative classes respectively.

It is observed that both the classifiers i.e. ATWSVM and TBSVM are able to gen-

erate proximal hyperplanes and achieve testing accuracy of 96% and 95.5%. Single

hyperplane classifiers like SVM and PSVM fail to give good classification results

for cross-planes or complex XOR data.

(a) ATWSVM (b) TBSVM

Figure 3.8: Complex XOR dataset and the hyperplanes obtained by classifiers

Dataset 3: Syn1

The performance of classifiers is compared using Syn1 data with 100 data points for

both the classes. The hyperplanes obtained by ATP-SVM, NHSVM and TPMSVM

are shown in Fig. 3.5 and they achieve classification accuracy of 100%, 99.6% and

99.2% respectively. The hyperplanes obtained by ATWSVM and TBSVM for Syn1

dataset are shown in Fig. 3.3. The classification accuracy achieved by ATWSVM,


TBSVM and TWSVM are 98.54%, 94.46% and 94.46% respectively.

Dataset 4: Ripley’s

Ripley’s dataset is an artificially generated binary dataset [63] which includes 250

training points and 1000 test points, as shown in Fig.3.9. The figure shows the linear

classifiers obtained with ATP-SVM, NSVMOOP and TPMSVM. It is observed that

ATP-SVM obtains comparable results as other classifiers and achieves test accuracy

of 89.7% against 89.6% and 89.4% for NSVMOOP and TPMSVM respectively.

Figure 3.9: Results on Ripley’s dataset with linear classifiers: (a) ATP-SVM, (b) NSVMOOP, (c) TPMSVM

3.7.2 Binary Classification Results: UCI and NDC Datasets

The classification experiments have been performed on a variety of benchmark UCI

datasets [52]. For training, the dataset is standardized with zero mean and unit

standard deviation. Results are reported for both linear as well as Gaussian kernel.

We have selected ten imbalanced UCI binary datasets for the experiments with

binary classifiers, as listed in Table 3.1.


Classification Results for Binary UCI datasets

The efficiency of ATP-SVM and ATWSVM is compared with NHSVM [58], NSV-

MOOP [19], TWSVM [10], LS-TWSVM [14] and TPMSVM [16].

Linear Case

Table 3.1 shows the classification results using linear classifier. The table demon-

strates that ATP-SVM outperforms the other algorithms in terms of classification

accuracy. The mean accuracy of ATP-SVM is 87.12% as compared to 85.34%,

85.27% and 85.90% for TPMSVM, NSVMOOP and NHSVM respectively. The table

also shows the learning time of all these classifiers. It is also observed that ATP-

SVM and TPMSVM have comparable learning time, whereas NSVMOOP takes the

maximum time for building the classifier. We have not reported learning time for

NHSVM [58] as its dual formulation is incorrect and the classification results are

obtained by solving its primal problem. The average classification accuracy achieved

by ATWSVM is 86.97% as compared to 86.19% for TWSVM. LS-ATWSVM achieves

average classification accuracy of 86.97%, whereas LS-TWSVM achieves average ac-

curacy of 86.60%. As indicated by the results in the table, the Least Squares version

of ATWSVM performs better than its TWSVM counterpart, in terms of learning

time. Therefore, LS-ATWSVM is most time-efficient among all the above mentioned

classifiers. It is attributed to the fact that LS-TWSVM solves two UMPs and thus

avoid solving QPPs. It is also observed that NSVMOOP is computationally most ex-

pensive among all these binary classifiers. Although the learning time of ATWSVM

is more than least square versions of classifiers i.e. LS-ATWSVM and LS-TWSVM,

but it is still more time-efficient than TWSVM, TBSVM and NSVMOOP. Taking

into consideration both classification accuracy and learning time, it can be said that

ATWSVM is the best choice for binary classification problems.


Table 3.1: Classification results with linear classifiers on binary UCI datasets (mean accuracy (%) ± SD and learning time in seconds for TWSVM, TPMSVM, NHSVM, NSVMOOP, LS-TWSVM, LS-ATWSVM, ATWSVM and ATP-SVM on the ACA, BUPA-Liver, Heart-C, Heart-S, Ionosphere, PIMA-Indians, Thyroid, Two-norm, Votes and WPBC datasets)

Table 3.2: Variation in classification accuracy based on selection of classes

Dataset        NP : NN     LS-ATWSVM Acc. (%)   LS-ATWSVM (c1, c2, c3)   LS-ATWSVM time (s)   ATWSVM Acc. (%)   ATWSVM (c1, c2, c3)   ATWSVM time (s)
WPBC           458 : 240   95.86                10^−4, 0.1, 10^−5        0.0043               95.86             10^−4, 0.1, 10^−5      0.2291
WPBC           240 : 458   94.43                10^−5, 0.2, 0.0002       0.0051               94.99             10^−1, 0.1, 10^−5      0.5204
PIMA-Indians   500 : 268   75.36                10^−5, 0.1, 2            0.0045               76.62             10^−4, 0.1, 0.9        0.2384
PIMA-Indians   268 : 500   75.36                10^−5, 0.1, 0.2          0.0061               76.62             10^−5, 0.1, 0.1        1.0128
ACA            383 : 307   87.83                10^−5, 0.2, 0.2          0.0046               87.83             0.5, 0.1, 0.2          0.2361
ACA            307 : 383   87.92                0.1, 0.1, 0.9            0.0049               87.25             0.1, 0.2, 10^−5        0.3182

Influence of Class Selection on Classification Accuracy

ATWSVM (or LS-ATWSVM) is an asymmetric binary classifier which solves a pair

of optimization problems. Unlike TWSVM, the formulation of the two problems

of ATWSVM are not identical. Here, the user is given the flexibility to choose the

class for which QPP is to be solved and for the other class, angle-based UMP will

be solved. For all these experiments, we have chosen the class with the larger number of
data points as ‘Class A’ (i.e. the positive class) and the other as ‘Class B’ (i.e. the negative

class). This results in solving a QPP of smaller-order and makes the algorithm

efficient in terms of learning time.

To study the effect of choice of classes on the classification accuracy, experiments

have been performed by interchanging the classes. The results are given in Table

3.2 and it demonstrates that comparable classification accuracy can be achieved by

interchanging the classes of any dataset. The choice of parameters would depend on

the data. It is observed from the table that there is a difference in the learning time of

the classifiers when the positive and negative classes are interchanged. For WPBC,

the classification accuracy is 95.86% (for NP = 458, NN = 240) and 94.99% (for

NP = 240, NN = 458) with ATWSVM, but there is difference in learning time for

these two cases. The difference in learning time is due to difference in order of QPPs

formulated for two cases.


Table 3.3: Classification results with non-linear classifiers on binary UCI datasets (mean accuracy (%) ± SD for TWSVM, TPMSVM, NHSVM, NSVMOOP, LS-TWSVM, LS-ATWSVM, ATWSVM and ATP-SVM on the ACA, BUPA-Liver, Heart-C, Heart-S, Ionosphere, PIMA-Indians, Thyroid, Two-norm, Votes and WPBC datasets)

Non-linear Case

For non-linear classifier i.e. RBF kernel Ker(x, x′) = exp(−σ‖x − x′‖22), Table 3.3

presents the classification accuracy for all the above mentioned algorithms on UCI

datasets. The results in the table illustrate that ATP-SVM performs best among

other classifiers and achieves mean accuracy 89.56% over all the 10 UCI datasets. It

is also observed that accuracy obtained with nonlinear kernel is better than the linear

version for corresponding datasets. Taking motivation from Reduced SVM (RSVM)

[64], the experiments have been conducted using ATWSVM with rectangular RBF

kernel Ker(A,A′). A rectangular kernel greatly reduces the size of the problem and

simplifies the generation of non-linear separating surface. For the experiments, we

have used a rectangular kernel [64] created using 50% of the data points, which are randomly
selected from the dataset and referred to as A.
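For concreteness, a NumPy sketch of the Gaussian kernel used here and of a rectangular (reduced) kernel built from a random 50% subset is given below; the function names are illustrative, not part of the thesis code.

import numpy as np

def rbf_kernel(X, Y, sigma):
    # Gaussian kernel Ker(x, x') = exp(-sigma * ||x - x'||^2) between rows of X and Y.
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sigma * sq)

def rectangular_rbf(X, sigma, frac=0.5, seed=0):
    # Rectangular kernel in the spirit of RSVM: columns restricted to a random subset.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)
    return rbf_kernel(X, X[idx], sigma), idx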

Classification Results for NDC Datasets

Linear case

In order to study the effect of size of data on the learning time of classifiers, experi-

ments have been conducted on large datasets, generated using David Musicant’s NDC

Data Generator [53]. Table 3.4 gives the experimental results of ATWSVM, LS-

ATWSVM, LS-TWSVM, TWSVM and ATP-SVM using linear classifiers on NDC

datasets. The distribution of training and test patterns is shown as “Train-Test”

and (NP : NN ) shows the ratio of positive and negative data points. The classifi-

cation accuracy is reported in percent (%) and the results indicate that ATP-SVM

outperforms the other methods for most of the NDC datasets. However, ATWSVM,
TWSVM and ATP-SVM fail to generate results for datasets of 50,000 instances
or more, due to memory constraints of our system, whereas LS-ATWSVM and

LS-TWSVM perform well for large datasets. This shows that LS-ATWSVM can be

successfully used with low-configuration systems.

Table 3.4 shows the learning time (in seconds) of linear classifiers for NDC

datasets. It is observed that the learning time of ATWSVM is much less as com-

pared to other two classifiers i.e. TWSVM and ATP-SVM, which have comparable

learning time. The experimental results comply with the complexity analysis of


Table 3.4: Classification results with linear classifiers on NDC datasets (mean accuracy in %; learning time in seconds on the second line of each entry)

Dataset (NP : NN)           Train-Test   TWSVM      ATP-SVM    LS-TWSVM   LS-ATWSVM   ATWSVM
NDC-500 (330 : 170)         500-50       84.00      85.00      84.00      85.00       85.00
                            time (s)     0.3172     0.3653     0.0006     0.0005      0.2188
NDC-700 (447 : 253)         700-70       81.92      85.88      84.29      85.71       85.26
                            time (s)     0.4170     0.4230     0.0007     0.0007      0.2475
NDC-900 (571 : 329)         900-90       83.45      86.17      85.56      85.56       85.98
                            time (s)     0.5986     0.6160     0.0008     0.0008      0.2859
NDC-1K (627 : 373)          1K-100       84.00      85.44      84.00      84.00       85.29
                            time (s)     0.4537     0.4443     0.0008     0.0008      0.2564
NDC-2K (1246 : 754)         2K-200       84.50      84.75      84.50      86.00       85.00
                            time (s)     2.1480     2.4649     0.0014     0.0010      0.3805
NDC-3K (1860 : 1140)        3K-300       75.96      78.53      79.40      79.67       79.67
                            time (s)     5.3255     5.6775     0.0017     0.0013      0.6161
NDC-4K (2474 : 1526)        4K-400       73.75      75.75      75.03      75.05       75.03
                            time (s)     10.5517    10.1186    0.0020     0.0015      0.94
NDC-5K (3086 : 1914)        5K-500       79.78      79.26      78.20      78.56       78.68
                            time (s)     10.1965    11.5732    0.0031     0.0020      1.5494
NDC-10K (6138 : 3862)       10K-1K       86.26      86.26      84.50      85.90       85.90
                            time (s)     113.8356   102.1345   0.0048     0.0041      6.2377
NDC-50K (30783 : 19217)     50K-5K       *          *          78.80      79.09       *
                            time (s)     *          *          0.0274     0.02        *
NDC-100K (61648 : 38352)    100K-10K     *          *          85.76      86.11       *
                            time (s)     *          *          0.0439     0.0371      *
NDC-200K (122401 : 77599)   200K-20K     *          *          73.17      74.37       *
                            time (s)     *          *          0.0877     0.0774      *
* Experiments terminated due to “out of memory”

these classifiers. Since, ATWSVM solves a QPP and a UMP, the UMP is reduced to

solving a system of linear equations of order (n+ 1) ∗ (n+ 1). Therefore, ATWSVM

requires lesser learning time than TWSVM. For imbalanced datasets, ATWSVM

takes advantage of the distribution of data in the two classes. The hyperplane for

the class with more data points is obtained by solving the QPP, whereas the sec-

ond hyperplane is obtained by solving UMP. Hence, ATWSVM solves the QPP of

smaller order and a UMP whereas TWSVM solves a pair of QPPs. This further

reduces the learning time of ATWSVM. The learning time for LS-ATWSVM and

LS-TWSVM is comparable, but they are more efficient than ATWSVM, TWSVM

and ATP-SVM. This happens because LS-ATWSVM and LS-TWSVM solve systems

of linear equations in feature space and thereby avoids solving QPPs. The compu-

tational efficiency of LS-ATWSVM improves further with the size of dataset and

therefore LS-ATWSVM can be used for experiments with very large-sized datasets,

where TWSVM or TBSVM may fail to give results due to memory constraints or

very high execution time.


Non-linear Case

The classification accuracy for NDC datasets is reported in Table 3.5 for all clas-

sifiers with Gaussian kernel. The results show that non-linear classifiers achieve

better accuracy than linear classifiers. The table also presents a comparative view

of learning time of all non-linear classifiers for NDC datasets. It is noticed that

LS-ATWSVM is most efficient among all these classifiers.

Table 3.5: Classification results with non-linear classifiers on NDC datasets (mean accuracy in %; learning time in seconds on the second line of each entry)

Dataset     TWSVM    ATP-SVM   LS-TWSVM   LS-ATWSVM   ATWSVM
NDC-500     84.00    88.00     84.00      88.00       88.00
 time (s)   0.3018   0.3094    0.0327     0.0306      0.0872
NDC-700     90.00    91.60     90.00      91.43       91.57
 time (s)   0.3074   0.3086    0.0689     0.0644      0.0951
NDC-900     88.89    89.89     90.00      90.00       90.00
 time (s)   0.5274   0.5389    0.1378     0.1290      0.1694
NDC-1K      92.00    93.00     92.00      92.50       92.50
 time (s)   0.6247   0.7381    0.1683     0.1682      0.1928
NDC-2K      91.00    93.00     93.00      91.50       91.50
 time (s)   3.3214   3.4697    1.0329     0.9142      1.3304

3.7.3 Statistical Tests

To compare the efficacy of multiple classifiers on different datasets, some statis-

tical tools are needed. Since the experiments are conducted with eight classi-

fiers i.e. TWSVM, TPMSVM, NHSVM, NSVMOOP, LS-TWSVM, LS-ATWSVM,

ATWSVM and ATP-SVM, on various UCI datasets, these are compared using the
Friedman test [54].

Friedman Test

The result of the Friedman test on the binary classifiers for UCI datasets, in terms of classi-
fication accuracy, is given in Table 3.6, and the classifiers ATWSVM and ATP-SVM
achieve the top two ranks among all the approaches. The Friedman test on all eight
algorithms establishes that angle-based classifiers with a single optimization problem
achieve better results.
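For reference, the Friedman statistic over several classifiers can be computed with SciPy as sketched below; the accuracy values in the example are placeholders, not the results of Table 3.6.

from scipy.stats import friedmanchisquare

# Each argument lists the accuracies of one classifier over the same datasets
# (placeholder numbers, not the thesis results).
stat, p = friedmanchisquare([86.2, 84.1, 95.7], [85.3, 84.0, 92.6], [87.1, 86.3, 93.8])
print(stat, p)   # the null of equal performance is rejected when p < 0.05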


Table 3.6: Friedman test ranks with linear classifiers for UCI datasets

Dataset TWSVM TPMSVM NHSVM NSVMOOP LS-TWSVM LS-ATWSVM ATWSVM ATP-SVM

Thyroid 3 7 6 8 4 2 1 5

Heart Statlog 6.5 6.5 6.5 3.5 6.5 3.5 2 1

Heart Cleveland 8 2 6 3 7 5 4 1

Bupa Liver 8 3 6 2 7 4.5 4.5 1

Ionosphere 6 4 3 2 8 7 5 1

Two-norm 5.5 8 5.5 7 1 3.5 3.5 5

Votes 3 8 4 7 5 2 1 6

ACA 5 7 6 8 3.5 1.5 1.5 3.5

WPBC 6 6 2 6 8 3.5 3.5 1

Pima Indians 2 7 8 5 3 4 1 6

Average Rank 5.30 5.85 5.30 5.15 5.3 3.65 2.7 3.05

3.7.4 Multi-category Classification Results: UCI Datasets

This chapter includes the extension of the NHCs obtained with a single optimization prob-
lem (i.e. ATP-SVM, NSVMOOP and NHSVM) and of ATWSVM to the multi-category
framework. The extension is done using the One-Against-All and Binary Tree based
approaches.

The classification accuracy of NSVMOOP, NHSVM, ATWSVM and ATP-SVM

on multi-category UCI datasets with two approaches- OAA and binary-tree based,

is given in Table 3.7. For each dataset, the number of patterns, features and classes

are shown in the table as m, n and K respectively. The experiments are conducted

with the RBF kernel. The classification results show that ATP-SVM and ATWSVM
can be successfully used as multi-category classifiers and have comparable accuracy

as the other two classifiers i.e. NHSVM and NSVMOOP. It is also observed that

the tree-based approach is better than OAA in terms of classification accuracy.


Table 3.7: Classification results with non-linear classifiers on multi-category UCI datasets (mean accuracy (%) ± SD for NHSVM, NSVMOOP, ATWSVM and ATP-SVM with the OAA and BT approaches on the Dermatology, Ecoli, Glass, Iris, Seeds, Segment, Wine and Zoo datasets)

Table 3.8: Friedman test and p-values with multi-category classifiers for UCI datasets

            NHSVM               NSVMOOP             ATP-SVM
Dataset     OAA        BT       OAA        BT       OAA        BT
(Entries are the Friedman rank with, in parentheses, the p-value of the pairwise t-test against tree-based ATP-SVM; the tree-based ATP-SVM column reports only the rank.)

Derm. 5 (0.1202) 2 (0.1762) 4 (0.1806) 3 (0.4246) 6 (0.4660) 1

Ecoli 6 (0.4680) 4 (0.7072) 5 (0.7255) 2 (0.6541) 3 (0.7722) 1

Glass 5 (0.3568) 4 (0.3906) 6 (0.1916) 1 (0.9707) 3 (0.9081) 2

Iris 6 (0.7169) 3.5 (0.2446) 5 (0.2463) 2 (0.5914) 3.5 (0.1674) 1

Segment 4 (0.4649) 1.5 (0.4855) 5.5 (0.4647) 5.5 (0.4649) 3 (0.9831) 1.5

Seeds 5 (0.4429) 2 (0.7911) 4 (0.5783) 3 (0.6504) 6 (0.1214) 1

Wine 5.5 (0.3458) 3.5 (0.4746) 2 (0.7177) 3.5 (0.5700) 5.5 (0.2838) 1

Zoo 5.5 (0.3823) 2 (1.0000) 2 (1.0000) 4 (0.4655) 5.5 (0.3823) 2

Avg.rank 4.30 2.40 3.45 2.45 3.15 1.05

The multi-category versions of the above-mentioned classifiers are compared by the Friedman test in Table 3.8. ATP-SVM achieves the highest rank among all the approaches. It is also seen that the tree-based approach achieves a better rank than the OAA multi-category approach. The p-values given in Table 3.8 are tested at a significance level α = 0.05 and are calculated by a pairwise t-test between tree-based ATP-SVM and the other algorithms. The Holm-Bonferroni test applied to the p-values suggests that the ATP-SVM classifier is similar in performance to TPMSVM, NSVMOOP and NHSVM.

3.8 Application: Segmentation through Pixel Classifi-

cation of Color Images

The problem of image segmentation can be regarded as a pixel classification problem that identifies regions in an image by associating a label with each image pixel. In this section, we present the application of binary and multi-category ATP-SVM to color image segmentation. For this work, Gabor texture features [65] are determined for

each pixel of the image.

Segmentation: Object and Background

In this thesis, we have explored the application of angle-based classifiers to the color pixel classification problem, which partitions the image into two regions corresponding to the object and the background. Since ATP-SVM and ATWSVM are binary classifiers, they can be trained to associate image pixels with a homogeneous region that belongs either to the object or to the background. The RGB image is first quantized with minimum variance color quantization for two levels. The training set consists of 1% randomly selected pixels, and the pixel labels are given by the corresponding quantized color value. Gabor features [56] are computed over 4-orientation (0, 45, 90, 135) and 3-scale (0.5, 1.0, 2.0) sub-bands, and the maximum of the resulting 12 coefficients determines the dominant orientation at a given pixel location. These features are used as input to the classifiers, which determine labels for all pixels. This approach takes advantage of the local information in the color image and uses the ability of our classifier to distinguish object pixels from the background.
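A minimal sketch of this per-pixel Gabor feature extraction is given below, assuming scikit-image is available. The mapping of the 0.5/1.0/2.0 scale sub-bands to Gabor filter frequencies and the function name gabor_pixel_features are illustrative assumptions, not the exact implementation used in the thesis.

import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_pixel_features(rgb_image):
    """Compute 12 Gabor responses per pixel (4 orientations x 3 scales)."""
    gray = rgb2gray(rgb_image)
    responses = []
    for frequency in (0.5, 1.0, 2.0):                 # assumed scale-to-frequency mapping
        for theta in np.deg2rad([0, 45, 90, 135]):    # the four orientations
            real, imag = gabor(gray, frequency=frequency, theta=theta)
            responses.append(np.sqrt(real ** 2 + imag ** 2))
    stack = np.stack(responses, axis=-1)              # H x W x 12 response magnitudes
    # the maximum over the 12 coefficients indicates the dominant response at each pixel
    feats = np.concatenate([stack, stack.max(axis=-1, keepdims=True)], axis=-1)
    return feats.reshape(-1, feats.shape[-1])         # one feature row per pixel

The 1% of pixels used for training would then be sampled from these rows, with labels taken from the two-level color quantization described above.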

Table 3.9: Segmentation results for BSD color images

Segmentation (2-regions)

Image L F-measure Error rate

K-Means ATP-SVM K-Means ATP-SVM

42049 2 0.84 0.91 0.0346 0.0267

35049 2 0.57 0.59 0.0479 0.0395

296059 2 0.67 0.77 0.0408 0.0304

181021 2 0.59 0.63 0.0406 0.0385

196027 2 0.51 0.58 0.0874 0.0640

Segmentation (multi-region)

118035 4 0.63 0.75 0.0411 0.0232

100007 3 0.58 0.63 0.0445 0.0323

163014 4 0.44 0.60 0.0833 0.0598

124084 4 0.60 0.61 0.0536 0.0473

196027 4 0.45 0.56 0.0751 0.0344

This section presents the segmentation results for color images selected from the Berkeley Segmentation Dataset (BSD) [57]. For BSD images, the ground-truth segmentations are known, and the images segmented by ATP-SVM are compared with the ground truth. To evaluate the segmentation algorithms statistically, two criteria are used: F-measure (FM) and error rate (ER). (Please refer to Appendix A for details.)

Experiments show that the classifier ATP-SVM can generate effective image seg-

mentation results and is able to distinguish the object from the background. To

compare the segmentation results, we have implemented pixel classification through

K-Means clustering [25]. It is observed from Table 3.9 that ATP-SVM achieves bet-

ter F-measure and error rate values. This is validated by the image segmentation

results given in Table 3.10. With K-Means, many object pixels are misclassified

as background pixels. The segmentation results are visually more appealing for


ATP-SVM.

It is observed from Table 3.11 that ATWSVM is able to achieve better results than segmentation with K-Means clustering. The segmentation results for ATWSVM are visibly more satisfactory compared to the other method, and the same is confirmed by the F-measure and error rate values given in Table 3.12.

Segmentation: Multiple Regions

ATP-SVM can be used to identify multiple non-overlapping regions in an image in the multi-category framework. A dynamic method is used to determine the number of regions in each image by generating its histogram and identifying the prominent peaks. The number of prominent peaks determines the number of regions (L) in the image. The image is then color quantized using minimum variance quantization for L levels, and each pixel gets associated with some color value. The training and test data are created in a similar manner as for binary segmentation, and the multi-category classifier model is built using the training data. For this chapter, the binary-tree based multi-category approach is used. The multi-region segmentation achieved

by ATP-SVM is compared with segmentation obtained by K-Means clustering [25]

and Normalized cut (Ncut) [66]. The results are presented in Table 3.9 (Multi-

region results) and Table 3.13 and it is observed that Ncut is not able to produce

satisfactory output.

3.9 Conclusions

In this chapter, two novel classifiers are presented. The first classifier is Angle-based

Twin Parametric-Margin Support Vector Machine (ATP-SVM) which solves only

one quadratic programming problem for simultaneously determining two parametric-

margin hyperplanes and is particularly useful when the data has heteroscedastic

noise. It tries to maximize the angle between the two hyperplanes and avoids the

explicit computation of inverse of matrices in the dual. Also, a novel procedure for

selecting the ‘data representative points’ is presented, which can effectively reduce

the size of training set and hence reduces the complexity of the problem. Thus,


Table 3.10: Segmentation results (binary) on color images from the BSD image dataset. For each image (Nos. 42049, 35049, 296059, 181021, 196027), the table shows (a) the original image, (b) the image segmented with K-Means and (c) the image segmented with ATP-SVM.


Table 3.11: Segmentation results (binary) on color images from the BSD image dataset. For each image (Nos. 118035, 42049, 124084, 135069, 299091), the table shows (a) the original image, (b) the image segmented with K-Means and (c) the image segmented with ATWSVM.

our classifier is able to attain speed comparable to that of TWSVM and results in faster execution than any other single-optimization based classifier. Further, multi-category ATP-SVM using the One-Against-All (OAA) and Binary Tree (BT) approaches is presented, along with the application for segmentation of color images into two or more regions.

The second classifier is Angle-based Twin Support Vector Machine (ATWSVM)

which determines the two hyperplanes by solving a quadratic programming problem

and an unconstrained minimization problem. The first problem of ATWSVM can

be formulated as any TWSVM-based optimization problem and obtains positive

hyperplane. The second problem obtains the hyperplane so that it is proximal to the

data points of its own class and is at maximum angle from the positive hyperplane.

The ATWSVM classifier has more efficient time-complexity than TWSVM and TBSVM, since

Table 3.12: Segmentation result for BSD color images

Image ATWSVM K-Means

F-measure Error rate F-measure Error rate

299091 0.77 0.0083 0.42 0.0323

135069 0.94 0.0169 0.60 0.0500

118035 0.71 0.0382 0.70 0.0416

42049 0.91 0.0298 0.82 0.0344

124084 0.61 0.0722 0.42 0.1245

the second problem of ATWSVM avoids solving a QPP. This research work presents a generic model for binary classifiers, where the first problem can be solved using any TWSVM-based classifier like ITWSVM [61], TPMSVM [16] or LS-TWSVM [14] and the second problem remains the same. The efficacy of ATWSVM has been proved by performing experiments with synthetic and real-world datasets, and ATWSVM is further applied for color image segmentation.


Table 3.13: Segmentation results (multi-region) with normalized cut, K-Means and ATP-SVM on color images of the BSD dataset. For each image (BSD Nos. 118035, 100007, 163014, 124084, 124084), the table shows (a) the original image and the images segmented with (b) Ncut*, (c) K-Means and (d) ATP-SVM.

* Code source: https://in.mathworks.com/matlabcentral/fileexchange/52698-K-Means–mean-shift-and-normalized-cut-segmentation?requestedDomain=www.mathworks.com

Chapter 4

Ternary Support Vector Machine with

Extension for Multi-category Classification

4.1 Introduction

In this chapter, we present a ternary (i.e. three class) classifier termed as Ternary

Support Vector Machine (TerSVM) and its extension for multi-category classifica-

tion. The motivation behind this chapter is to develop an algorithm which can

efficiently handle multi-class data. The existing multi-category classification ap-

proaches like One-Against-All (OAA) and One-Against-One (OAO) are not efficient

in terms of learning time. Their performance deteriorates with increase in number

of classes. So, we present a tree-based multi-category algorithm which is robust

enough to deal with large number of classes and is termed as Reduced Tree for

Ternary Support Vector Machine (RT-TerSVM).

TerSVM is developed on the lines of Twin-KSVC [39] and evaluates the training

data into ‘one-versus-one-versus-rest’ structure. But the problem formulation of

TerSVM is quite different from that of Twin-KSVC. The significant features of our

classifier TerSVM are listed below:

1. The supervised classification algorithm i.e. TerSVM, is a ternary classifier

which deals with three classes: positive, negative and rest, associated with

labels +1, −1 and 0 respectively. If required, TerSVM can be used as binary

classifier also.

2. TerSVM formulates three unconstrained minimization problems (UMPs) to

determine proximal hyperplanes for three classes. These optimization prob-


lems are solved as systems of linear equations, whereas Twin-KSVC solves

quadratic programming problems (QPPs).

3. Our classifier optimizes the distance (i.e. ρ1, ρ2) between the hyperplane of one

class and patterns of the other class, whereas Twin-KSVC tries to maintain a

separation of unit distance.

4. TerSVM first solves the optimization problems for positive and negative classes.

Then the hyperplane for the rest class is determined by using the solution of

the other two problems.

This chapter also presents a novel multi-category classification approach and its

characteristic features are listed below.

1. The novel approach i.e. RT-TerSVM, is a tree-based multi-category classifi-

cation algorithm that evaluates the training data into ‘one-versus-one-versus-

rest’ structure. RT-TerSVM identifies the two most distant classes as positive

and negative classes. The remaining classes are collectively referred as rest

class and their patterns are mapped in the region between positive and nega-

tive classes. RT-TerSVM recursively divides the data of ‘rest’ class further into

three classes in the similar manner until all the classes are uniquely represented

in the ternary tree.

2. This work presents an effective procedure to identify positive, negative and

rest classes from a given set of K classes, such that the rest class is mapped

between positive and negative classes.

3. At each level, RT-TerSVM determines three nonparallel hyperplanes using

TerSVM classifier and evaluates the test data based on minimum distance

from these hyperplanes.

4. Each leaf node of the RT-TerSVM is associated with a unique class and the

internal nodes are employed to distinguish between these classes.

5. RT-TerSVM develops a ternary tree of height ⌊K/2⌋, where K is the number

of classes in the dataset. The size of the problem reduces as we traverse down

the ternary tree and this results in efficient learning time.


6. To improve the learning time complexity of our algorithm, we present a novel

procedure to generate a reduced training set which can effectively represent

the entire training set.

The remainder of the chapter is organized as follows: Sections 4.2 and 4.3 present the novel classifier Ternary Support Vector Machine (TerSVM) and the multi-category extension approach. Section 4.4 compares TerSVM with other multi-category algorithms. The experimental results on synthetic and benchmark datasets are given in Section 4.5. This chapter also presents the application of RT-TerSVM for handwritten digit recognition and color image classification in Section 4.6. The concluding remarks are given in Section 4.7.

4.2 Ternary Support Vector Machine

In this section, we introduce a TWSVM-based classifier which can handle three

classes and is therefore termed as Ternary Support Vector Machine (TerSVM).

TWSVM and most of its variants solve QPPs to obtain the nonparallel hyperplanes. Kumar et al. proposed a faster version of TWSVM, Least Squares TWSVM (LS-TWSVM) [14], which solves systems of linear equations to generate the proximal hyperplanes. Taking motivation from LS-TWSVM, TerSVM obtains classifiers by solving systems of linear equations. Let the three classes be represented by matrices A (m1 × n), B (m2 × n) and C (m3 × n), which are referred to as the positive (+1), negative (−1) and rest (0) classes. The dataset X = [A; B; C] has m = m1 + m2 + m3 patterns. This classifier determines three proximal nonparallel hyperplanes by solving

unconstrained minimization problems (UMPs). The three hyperplanes are given by

x^T w_1 + b_1 = 0,
x^T w_2 + b_2 = 0,
x^T w_3 + b_3 = 0,        (4.1)

where (w_1, b_1), (w_2, b_2) and (w_3, b_3) are the parameters of the normal vectors to the positive, negative and rest hyperplanes.


4.2.1 TerSVM (Linear version)

The TerSVM hyperplanes for three classes (i.e. A, B and C) are obtained by solving

the following three optimization problems:

TerSVM1:

min_{w_1, b_1, ρ_1, ξ_1, η_1}  P_A = (1/2)||A w_1 + e_1 b_1||_2^2 + (c_1/2) ξ_1^T ξ_1 + (c_2/2) η_1^T η_1 + (c_3/2) ρ_1^T ρ_1
subject to  −(B w_1 + e_2 b_1) + ξ_1 = e_2 (1 − ρ_1),
            −(C w_1 + e_3 b_1) + η_1 = e_3 (1 − ε − ρ_1).        (4.2)

TerSVM2:

min_{w_2, b_2, ρ_2, ξ_2, η_2}  P_B = (1/2)||B w_2 + e_2 b_2||_2^2 + (c_1/2) ξ_2^T ξ_2 + (c_2/2) η_2^T η_2 + (c_3/2) ρ_2^T ρ_2
subject to  (A w_2 + e_1 b_2) + ξ_2 = e_1 (1 − ρ_2),
            (C w_2 + e_3 b_2) + η_2 = e_3 (1 − ε − ρ_2).        (4.3)

TerSVM3:

min_{w_3, b_3, ξ_3, η_3}  P_C = (1/2)||C w_3 + e_3 b_3||_2^2 + (c_1/2) ξ_3^T ξ_3 + (c_1/2) η_3^T η_3 + (c_4/2)(w_1^T w_3 + w_2^T w_3 + b_1 b_3 + b_2 b_3)
subject to  (A w_3 + e_1 b_3) + ξ_3 = e_1 (1 − ε − ρ_1),
            −(B w_3 + e_2 b_3) + η_3 = e_2 (1 − ε − ρ_2).        (4.4)

TerSVM1: The TerSVM classifier assumes that the rest class is placed between

positive and negative classes. The optimization problem given in (4.2) determines

the parameters of positive hyperplane (w1, b1), which is proximal to the positive

class (represented by A). The term (1/2)||A w_1 + e_1 b_1||_2^2 in the objective function of (4.2)

is the sum of squared distances of the positive hyperplane to the patterns of its own

class. Thus, minimizing this term tends to keep the hyperplane closer to the positive

class. The equality constraints require that the negative class patterns (represented

by B) should be exactly (1− ρ1) distance away from the positive hyperplane. Since

TerSVM is a soft-margin classifier, it allows some error in classification. The amount of violation of the constraints is measured by ξ_1, and its L2-norm is minimized in the objective function. This is in contrast to the Twin-KSVC problem, which minimizes the L1-norm of the error vector and has inequality constraints. Similarly, the other set of constraints requires the patterns of the rest class to be (1 − ε − ρ_1) distance away from the positive hyperplane. The error due to misclassification of patterns of the rest class is measured by η_1 and is minimized in (4.2). To have maximum separation between the patterns of one class and the hyperplane of the other class, the distance ρ_1 is minimized in our problem, and ε is a user-defined parameter. The positive parameters c_i, i = 1, ..., 4, associate weights with the corresponding terms.

The optimization problem (4.2) is converted into an unconstrained minimization

problem (UMP) by substituting ξ1, η1 and we get

min_{w_1, b_1, ρ_1}  P_A = (1/2)||A w_1 + e_1 b_1||_2^2 + (c_1/2)||B w_1 + e_2 b_1 + e_2 (1 − ρ_1)||_2^2 + (c_2/2)||C w_1 + e_3 b_1 + e_3 (1 − ε − ρ_1)||_2^2 + (c_3/2) ρ_1^T ρ_1.        (4.5)

The second and third terms of the objective function of (4.5) try to minimize the

L2-norm of the error due to misclassified patterns of negative and rest classes re-

spectively, with parameters c1 and c2. Taking sub-gradient of PA with respect to

w1, b1, ρ1 and equating them to zero, we get

∂P_A/∂w_1 = 0 ⇒ A^T (A w_1 + e_1 b_1) + c_1 B^T (B w_1 + e_2 b_1 + e_2 (1 − ρ_1)) + c_2 C^T (C w_1 + e_3 b_1 + e_3 (1 − ε − ρ_1)) = 0,        (4.6)

∂P_A/∂b_1 = 0 ⇒ e_1^T (A w_1 + e_1 b_1) + c_1 e_2^T (B w_1 + e_2 b_1 + e_2 (1 − ρ_1)) + c_2 e_3^T (C w_1 + e_3 b_1 + e_3 (1 − ε − ρ_1)) = 0,        (4.7)

∂P_A/∂ρ_1 = 0 ⇒ −c_1 e_2^T (B w_1 + e_2 b_1 + e_2 (1 − ρ_1)) − c_2 e_3^T (C w_1 + e_3 b_1 + e_3 (1 − ε − ρ_1)) + c_3 ρ_1 = 0.        (4.8)

Let H = [A  e_1], G = [B  e_2], J = [C  e_3] be the augmented matrices for the positive, negative and rest classes respectively. The normal vector to the hyperplane is represented by the augmented vector z_1 = [w_1^T  b_1]^T, which includes the bias term b_1; e_1, e_2 and e_3 are vectors of ones of appropriate dimensions. The above equations (4.6)-(4.8) can be rewritten as


H^T H z_1 + c_1 G^T (G z_1 + e_2 (1 − ρ_1)) + c_2 J^T (J z_1 + e_3 (1 − ε − ρ_1)) = 0,        (4.9)

−c_1 e_2^T (G z_1 + e_2 (1 − ρ_1)) − c_2 e_3^T (J z_1 + e_3 (1 − ε − ρ_1)) + c_3 ρ_1 = 0.        (4.10)

Arranging Eq. (4.9) and (4.10) in matrix form, we get

[ H^T H + c_1 G^T G + c_2 J^T J    −c_1 G^T e_2 − c_2 J^T e_3          ] [ z_1 ]   [ −c_1 G^T e_2 − c_2 J^T e_3 (1 − ε)    ]
[ −c_1 e_2^T G − c_2 e_3^T J        c_1 e_2^T e_2 + c_2 e_3^T e_3 + c_3 ] [ ρ_1 ] = [ c_1 e_2^T e_2 + c_2 e_3^T e_3 (1 − ε) ]        (4.11)

The augmented vector z_1 and the distance ρ_1 are obtained as

[ z_1 ]   [ H^T H + c_1 G^T G + c_2 J^T J    −c_1 G^T e_2 − c_2 J^T e_3          ]^{-1} [ −c_1 G^T e_2 − c_2 J^T e_3 (1 − ε)    ]
[ ρ_1 ] = [ −c_1 e_2^T G − c_2 e_3^T J        c_1 e_2^T e_2 + c_2 e_3^T e_3 + c_3 ]      [ c_1 e_2^T e_2 + c_2 e_3^T e_3 (1 − ε) ]        (4.12)

TerSVM2: The optimization problem (4.3) for determining the negative hyperplane is analogous to the first QPP and is converted into the UMP given by

min_{w_2, b_2, ρ_2}  P_B = (1/2)||B w_2 + e_2 b_2||_2^2 + (c_1/2)||A w_2 + e_1 b_2 − e_1 (1 − ρ_2)||_2^2 + (c_2/2)||C w_2 + e_3 b_2 − e_3 (1 − ε − ρ_2)||_2^2 + (c_3/2) ρ_2^T ρ_2.        (4.13)

The solution of the above equation is obtained as

[ z_2 ]   [ G^T G + c_1 H^T H + c_2 J^T J    c_1 H^T e_1 + c_2 J^T e_3           ]^{-1} [ c_1 H^T e_1 + c_2 J^T e_3 (1 − ε)     ]
[ ρ_2 ] = [ c_1 e_1^T H + c_2 e_3^T J         c_1 e_1^T e_1 + c_2 e_3^T e_3 + c_3 ]      [ c_1 e_1^T e_1 + c_2 e_3^T e_3 (1 − ε) ]        (4.14)

where z_2 = [w_2^T  b_2]^T is the augmented vector for the negative hyperplane.

Figure 4.1: Geometrical illustration of the angle between the normal vectors to the hyperplanes

TerSVM3: The problem (4.4) is different from the first two problems and uses

the solution of TerSVM1 and TerSVM2. It requires the optimal values of positive

(w1, b1) and negative (w2, b2) hyperplanes along with optimal values of ρ1 and

ρ2. The proximal hyperplane is determined so that it is (1− ε− ρ1) distance away

from the patterns of positive class and (1− ε− ρ2) distance away from the patterns

of negative class. The objective function of TerSVM3 tries to maximize the angle

between the normal vector to rest hyperplane and the normal vectors to the other

two hyperplanes, as shown in Figure 4.1. Since,

cos θ_1 = (w_3 · w_1) / (||w_3||_2 ||w_1||_2)   and   cos θ_2 = (w_3 · w_2) / (||w_3||_2 ||w_2||_2),

the cosines of the angles θ_1 and θ_2 can be minimized to achieve the maximum separation between the hyperplanes. The optimization problem (4.4) is converted into the UMP given as

min_{w_3, b_3}  P_C = (1/2)||C w_3 + e_3 b_3||_2^2 + (c_1/2)||A w_3 + e_1 b_3 − e_1 (1 − ε − ρ_1)||_2^2 + (c_1/2)||B w_3 + e_2 b_3 + e_2 (1 − ε − ρ_2)||_2^2 + c_4 (w_1^T w_3 + w_2^T w_3 + b_1 b_3 + b_2 b_3).        (4.15)


The last term of (4.15) tries to simultaneously maximize the angle between the optimal positive hyperplane w_1 and the rest hyperplane w_3 (represented by w_1^T w_3), and also between the negative hyperplane w_2 and the rest hyperplane w_3 (represented by w_2^T w_3). To obtain the hyperplane corresponding to the rest class, the optimal solutions (i.e. z_1, z_2, ρ_1 and ρ_2) of TerSVM1 and TerSVM2 are used. The solution for the rest hyperplane is obtained by solving

z_3 = [ J^T J + c_1 H^T H + c_2 G^T G ]^{-1} [ c_1 (H^T e_1 (1 − ε − ρ_1) − G^T e_2 (1 − ε − ρ_2)) − c_4 (z_1 + z_2) ],        (4.16)

where z_3 = [w_3^T  b_3]^T is the augmented vector for the rest hyperplane. A new test pattern x ∈ R^n is assigned to class r (r = +1, −1, 0) based on the minimum distance from the three hyperplanes. The class label is given by

y = arg min_{l = 1, 2, 3}  |x^T w_l + b_l| / ||w_l||_2,        (4.17)

where |.| is the absolute distance of the point x from the plane x^T w_l + b_l = 0 and ||.||_2 represents the L2-norm. The label assigned to the test data is given as

y = +1 if r = 1,   y = −1 if r = 2,   y = 0 if r = 3.
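A minimal numerical sketch of the linear TerSVM training and prediction steps, following Eqs. (4.12), (4.14), (4.16) and the decision rule (4.17), is given below. It is written with numpy only; the function names and the default parameter values are illustrative assumptions, not part of the formulation.

import numpy as np

def tersvm_train(A, B, C, c1=0.1, c2=0.1, c3=1e-3, c4=1e-3, eps=1e-5):
    """Solve the three linear systems of linear TerSVM and return z1, z2, z3."""
    e1 = np.ones((A.shape[0], 1)); e2 = np.ones((B.shape[0], 1)); e3 = np.ones((C.shape[0], 1))
    H = np.hstack([A, e1]); G = np.hstack([B, e2]); J = np.hstack([C, e3])

    # Eq. (4.12): positive hyperplane z1 and distance rho1
    M = np.block([[H.T @ H + c1 * G.T @ G + c2 * J.T @ J, -c1 * G.T @ e2 - c2 * J.T @ e3],
                  [(-c1 * G.T @ e2 - c2 * J.T @ e3).T,    c1 * e2.T @ e2 + c2 * e3.T @ e3 + c3]])
    rhs = np.vstack([-c1 * G.T @ e2 - c2 * J.T @ e3 * (1 - eps),
                     c1 * e2.T @ e2 + c2 * e3.T @ e3 * (1 - eps)])
    sol = np.linalg.solve(M, rhs)
    z1, rho1 = sol[:-1], float(sol[-1])

    # Eq. (4.14): negative hyperplane z2 and distance rho2
    M = np.block([[G.T @ G + c1 * H.T @ H + c2 * J.T @ J, c1 * H.T @ e1 + c2 * J.T @ e3],
                  [(c1 * H.T @ e1 + c2 * J.T @ e3).T,     c1 * e1.T @ e1 + c2 * e3.T @ e3 + c3]])
    rhs = np.vstack([c1 * H.T @ e1 + c2 * J.T @ e3 * (1 - eps),
                     c1 * e1.T @ e1 + c2 * e3.T @ e3 * (1 - eps)])
    sol = np.linalg.solve(M, rhs)
    z2, rho2 = sol[:-1], float(sol[-1])

    # Eq. (4.16): rest hyperplane z3, built from the first two solutions
    M3 = J.T @ J + c1 * H.T @ H + c2 * G.T @ G
    rhs3 = c1 * (H.T @ e1 * (1 - eps - rho1) - G.T @ e2 * (1 - eps - rho2)) - c4 * (z1 + z2)
    z3 = np.linalg.solve(M3, rhs3)
    return z1, z2, z3

def tersvm_predict(x, z1, z2, z3):
    """Assign +1, -1 or 0 by the minimum-distance rule of Eq. (4.17)."""
    x_aug = np.append(np.asarray(x, dtype=float), 1.0)
    labels = (+1, -1, 0)
    dists = [abs(x_aug @ z.ravel()) / np.linalg.norm(z[:-1]) for z in (z1, z2, z3)]
    return labels[int(np.argmin(dists))]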

4.2.2 TerSVM (Kernel version)

This classifier can be extended to non-linear version by considering the kernel-

generated surfaces instead of hyperplanes. The surfaces are given as

Ker(x^T, X^T) u_1 + b_1 = 0,        (4.18)
Ker(x^T, X^T) u_2 + b_2 = 0,        (4.19)
Ker(x^T, X^T) u_3 + b_3 = 0,        (4.20)


where X = [A; B; C] represents the entire dataset and Ker is an appropriately

chosen kernel. The formulation of kernel TerSVM is given as

min_{u_1, b_1, ρ_1}  (1/2)||Ker(A, X^T) u_1 + e_1 b_1||_2^2 + (c_1/2)||Ker(B, X^T) u_1 + e_2 b_1 + e_2 (1 − ρ_1)||_2^2 + (c_2/2)||Ker(C, X^T) u_1 + e_3 b_1 + e_3 (1 − ε − ρ_1)||_2^2 + (c_3/2) ρ_1^T ρ_1.        (4.21)

min_{u_2, b_2, ρ_2}  (1/2)||Ker(B, X^T) u_2 + e_2 b_2||_2^2 + (c_1/2)||Ker(A, X^T) u_2 + e_1 b_2 − e_1 (1 − ρ_2)||_2^2 + (c_2/2)||Ker(C, X^T) u_2 + e_3 b_2 − e_3 (1 − ε − ρ_2)||_2^2 + (c_3/2) ρ_2^T ρ_2.        (4.22)

min_{u_3, b_3}  (1/2)||Ker(C, X^T) u_3 + e_3 b_3||_2^2 + (c_1/2)||Ker(A, X^T) u_3 + e_1 b_3 − e_1 (1 − ε − ρ_1)||_2^2 + (c_1/2)||Ker(B, X^T) u_3 + e_2 b_3 + e_2 (1 − ε − ρ_2)||_2^2 + c_4 (u_1^T u_3 + u_2^T u_3 + b_1 b_3 + b_2 b_3),        (4.23)

where Ker(A,X T ), Ker(B,X T ), Ker(C,X T ) represent kernel matrices for classes

A, B and C respectively. The solutions of these problems are obtained in a similar manner as for the linear case. The augmented vector for the positive hyperplane, r_1 = [u_1^T  b_1]^T, is given as

[ r_1 ]   [ H_K^T H_K + c_1 G_K^T G_K + c_2 J_K^T J_K    −c_1 G_K^T e_2 − c_2 J_K^T e_3      ]^{-1} [ −c_1 G_K^T e_2 − c_2 J_K^T e_3 (1 − ε)    ]
[ ρ_1 ] = [ −c_1 e_2^T G_K − c_2 e_3^T J_K                c_1 e_2^T e_2 + c_2 e_3^T e_3 + c_3 ]      [ c_1 e_2^T e_2 + c_2 e_3^T e_3 (1 − ε)     ]        (4.24)

Here, H_K = [Ker(A, X^T)  e_1], G_K = [Ker(B, X^T)  e_2], J_K = [Ker(C, X^T)  e_3]; e_1, e_2 and e_3 are vectors of ones of appropriate dimensions. The solution for the negative hyperplane is given as

[ r_2 ]   [ G_K^T G_K + c_1 H_K^T H_K + c_2 J_K^T J_K    c_1 H_K^T e_1 + c_2 J_K^T e_3       ]^{-1} [ c_1 H_K^T e_1 + c_2 J_K^T e_3 (1 − ε)     ]
[ ρ_2 ] = [ c_1 e_1^T H_K + c_2 e_3^T J_K                 c_1 e_1^T e_1 + c_2 e_3^T e_3 + c_3 ]      [ c_1 e_1^T e_1 + c_2 e_3^T e_3 (1 − ε)     ]        (4.25)

where r_2 = [u_2^T  b_2]^T is the augmented vector for the negative hyperplane. The solution

for the rest hyperplane, r_3 = [u_3^T  b_3]^T, is obtained by solving

r_3 = [ J_K^T J_K + c_1 H_K^T H_K + c_2 G_K^T G_K ]^{-1} [ c_1 (H_K^T e_1 (1 − ε − ρ_1) − G_K^T e_2 (1 − ε − ρ_2)) − c_4 (r_1 + r_2) ].        (4.26)

Once we obtain the surfaces (4.18)-(4.20), a new pattern x ∈ R^n is assigned to class +1, −1 or 0 in a manner similar to the linear case.

4.2.3 TerSVM as Binary Classifier

The ternary classifier can also be used as a binary classifier. For a two-class classifi-

cation problem, there would be no rest class and only two nonparallel hyperplanes

are determined corresponding to positive and negative classes. The optimization

problems for binary TerSVM are:

min_{w_1, b_1, ρ_1}  P_A = (1/2)||A w_1 + e_1 b_1||_2^2 + (c_1/2)||B w_1 + e_2 b_1 + e_2 (1 − ρ_1)||_2^2 + (c_3/2) ρ_1^T ρ_1,        (4.27)

min_{w_2, b_2, ρ_2}  P_B = (1/2)||B w_2 + e_2 b_2||_2^2 + (c_1/2)||A w_2 + e_1 b_2 − e_1 (1 − ρ_2)||_2^2 + (c_3/2) ρ_2^T ρ_2.        (4.28)

Equations (4.27) and (4.28) are UMPs, which can be solved in a similar manner as (4.5).

4.3 Multi-category Classification Algorithm: Reduced

Tree for TerSVM

In this section, we present a novel multi-category classification approach which can

efficiently handle large amounts of data from more than two classes. The multi-

category classification approach is motivated by Twin-KSVC [39] and implements

a ternary tree (i.e. a tree where each node can have at most three child nodes)

of classifiers to organize multiple classes. This algorithm recursively partitions the


Figure 4.2: RT-TerSVM for dataset with 5 classes

set of K classes into one positive class, one negative class and remaining (or rest)

classes denoted by +1, −1 and 0 respectively. The remaining (K − 2) classes are

recursively partitioned until all the classes are uniquely represented by a leaf node of

the ternary tree, as demonstrated in Figure 4.2. Once the hyperplanes are obtained

for the positive and negative classes, their patterns are removed from the training

set. Therefore, to determine the hyperplanes for remaining (K − 2) classes, these

2 classes (i.e. positive and negative classes in higher level of the tree) are not

considered. This reduces the number of training samples and hence, the learning

time of the algorithm. Our algorithm achieves good classification accuracy, which is

proved experimentally in Section 4.5. To identify the positive and negative classes

at each level of the RT-TerSVM, a novel class selection approach is presented in


Algorithm 2.

Input : Training data X = {X_1, X_2, ..., X_m} with labels y ∈ {1, 2, ..., K}.

Output : Positive, negative and rest classes.

Process:

1. For each class Ci, i = 1, ...,K, identify

a. class mean Mi.

b. span of each class (Span_i), where the span is given as

Span_i = (1/m_i) Σ_{j=1}^{m_i} ||X_j − M_i||_2,        (4.29)

where mi is the number of patterns in ith class and ‖Xj −Mi‖2 is the

Euclidean distance between data pattern Xj and class mean Mi.

2. For p=1 to K

For q=p+1 to K

Find the separation between classes C_p and C_q, given by

D(C_p, C_q) = ||M_p − M_q||_2 − Span_p − Span_q.        (4.30)

End.

End.

3. Select the classes i, j as positive and negative classes respectively, such

that the distance D(Ci, Cj) is maximum.

4. The remaining classes of the dataset, i.e. C_r, r ∈ {1, ..., K} \ {i, j}, are treated as the rest class and would lie in the region between the positive and negative classes.

Algorithm 2: Selection of positive, negative and rest classes
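A minimal sketch of this class-selection step is given below; the function name select_classes and the use of numpy are illustrative assumptions.

import numpy as np

def select_classes(X, y):
    """Pick the two most separated classes as positive/negative; the rest form class 0."""
    labels = list(np.unique(y))
    means = {c: X[y == c].mean(axis=0) for c in labels}
    spans = {c: np.linalg.norm(X[y == c] - means[c], axis=1).mean() for c in labels}  # Eq. (4.29)
    best, pos, neg = -np.inf, None, None
    for i, p in enumerate(labels):
        for q in labels[i + 1:]:
            d = np.linalg.norm(means[p] - means[q]) - spans[p] - spans[q]             # Eq. (4.30)
            if d > best:
                best, pos, neg = d, p, q
    rest = [c for c in labels if c not in (pos, neg)]
    return pos, neg, rest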

Reduced training set

To improve the learning time of our algorithm, we present a novel procedure that selects a subset of training patterns, which is referred to as the reduced training set. For our multi-category classification algorithm, a ternary tree is built by first selecting the positive, negative and rest classes. Since the number of patterns in the rest class would be larger than that in the other two classes, a selection procedure is presented in Algorithm 3 which can identify a few representative patterns from the rest class. The reduced patterns can effectively represent the entire training set, and it is experimentally observed that we can achieve good classification accuracy with the reduced training set. It significantly improves the learning time of our algorithm.

Input : Training data X = {X_1, X_2, ..., X_m} with labels y ∈ {+1, −1, 0}.

Output : Reduced training set.

Process:

1. Calculate the class mean M_0 for the rest class.

2. For each pattern in the rest class, find its distance from M_0:

dist_i = ||X_i − M_0||_2,        (4.31)

where ||X_i − M_0||_2 is the Euclidean distance.

3. Select the P% of patterns which are most distant from the class mean M_0.

Algorithm 3: Procedure to create reduced training set
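A minimal sketch of Algorithm 3 is shown below. The percentage P is a tunable value that the text does not fix, so the default used here is an assumption, as is the function name reduce_rest.

import numpy as np

def reduce_rest(X_rest, P=50.0):
    """Keep the P% of rest-class patterns that are farthest from the class mean (Eq. 4.31)."""
    m0 = X_rest.mean(axis=0)
    dist = np.linalg.norm(X_rest - m0, axis=1)
    keep = max(1, int(np.ceil(len(X_rest) * P / 100.0)))
    idx = np.argsort(dist)[::-1][:keep]     # most distant patterns first
    return X_rest[idx], idx                 # idx lets the caller filter labels as well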

Reduced tree for Ternary Support Vector Machine (RT-TerSVM)

RT-TerSVM is a supervised learning approach that builds a classifier model for K

classes by using m training points. The procedure for generating RT-TerSVM is

explained in Algorithm 4. The inputs to the algorithm are X ∈ Rm×n and K, where

X represents m data points in n-dimensional feature space and K is the number of

classes. The multi-category classification approach i.e. RT-TerSVM identifies two

classes as positive and negative and remaining all training patterns as taken as rest

class. For a dataset with K classes, the +1 and −1 classes are identified based on

maximum distance between them, using Algorithm 2. The remaining K − 2 classes

are treated as rest class. The patterns in the rest class are recursively partitioned

until each class is uniquely represented in the classifier model. RT-TerSVM develops a ternary tree of height ⌊K/2⌋, where K is the number of classes in the dataset. At

each level, RT-TerSVM obtains three nonparallel hyperplanes and evaluates the test

data based on minimum distance from these hyperplanes. Each leaf node of the

RT-TerSVM is associated with a unique class and the internal nodes are employed

to distinguish between these classes. The size of the problem reduces as we traverse

down the ternary tree and this results in efficient learning time. This approach


performs better than the classical One-Against-All (OAA), Twin-KSVC and other

tree-based multi-category approaches, in terms of generalization ability and learning

time. A test pattern xi ∈ Rn is evaluated using the trained RT-TerSVM model and

the procedure is explained in Algorithm 5.

4.4 Discussion

The strength of our multi-category approach RT-TerSVM is its efficient training and

testing time. RT-TerSVM evolves as a recursive classifier model with better time

complexity as compared to classical One-Against-All approach. The size of data

diminishes as the model is progressively obtained. This characteristic favors the use

of non-linear (kernel) classifiers where the learning time depends on the size of data.

Input : Labeled dataset with m patterns from K different classes.

Output : The hyperplane parameters for all internal and leaf nodes that

build up the RT-TerSVM.

Process: RT-TerSVM(X ,K)

1. Use Algorithm 2 to identify positive, negative and rest classes with labels

‘+1’, ‘-1’ and ‘0’ respectively. The two most distant classes are referred to as A and B, and the remaining classes as C.

2. Take training patterns of ‘+1’, ‘-1’ and ‘0’ classes and find three

hyperplanes [w1, b1], [w2, b2] and [w3, b3] by solving (4.12, 4.14, 4.16) or

using the kernel based versions of TerSVM classifier (4.21-4.23).

3. If the number of classes in C is more than one, then recursively call this algorithm: RT-TerSVM(X_new, K − 2), where X_new = X \ {A, B}.

Algorithm 4: Reduced Tree for Ternary Support Vector Machine
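A compact sketch of the recursive construction in Algorithm 4 is given below. It reuses the select_classes, reduce_rest and tersvm_train sketches shown earlier; the dictionary-based tree representation is an illustrative choice, not the thesis implementation.

import numpy as np

def rt_tersvm_train(X, y):
    """Recursively build the reduced ternary tree of TerSVM classifiers."""
    labels = np.unique(y)
    if len(labels) <= 2:
        # with one or two classes left, the binary TerSVM of Section 4.2.3 would be used;
        # this sketch simply records the remaining labels at the leaf
        return {"leaf": list(labels)}
    pos, neg, rest = select_classes(X, y)               # Algorithm 2
    A, B = X[y == pos], X[y == neg]
    rest_mask = np.isin(y, rest)
    C_reduced, _ = reduce_rest(X[rest_mask])             # Algorithm 3 on the rest class
    node = {"pos": pos, "neg": neg,
            "planes": tersvm_train(A, B, C_reduced)}     # Eqs. (4.12), (4.14), (4.16)
    node["child"] = rt_tersvm_train(X[rest_mask], y[rest_mask])
    return node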

Twin-KSVC: For a K-class problem, Twin-KSVC formulates K(K − 1)/2 problems. Each problem determines two nonparallel hyperplanes and uses the entire dataset. Assume that all classes have approximately an equal number of patterns, i.e. m/K, where m is the total number of data points. The learning time of one Twin-KSVC problem is of the order

2 × ((K − 1)/K × m)^3.        (4.32)

Since Twin-KSVC solves K(K − 1)/2 problems, its total learning time is of the order

2 × ((K − 1)/K × m)^3 × K(K − 1)/2 = ((K − 1)^4 / K^2) m^3        (4.33)
≈ K^2 × m^3.        (4.34)

Therefore, the learning time of Twin-KSVC increases with the number of classes.

OAA-TWSVM: When the TWSVM binary classifier is extended to the multi-category framework using the OAA approach, the algorithm solves K QPPs, each of size ((K − 1)/K) × m. Hence, the learning time of OAA-TWSVM is given by

T_OAA = K × ((K − 1)/K × m)^3 ≈ K × m^3.        (4.35)

TerSVM-Tree: Our multi-category algorithm uses the ternary classifier TerSVM, which solves the optimization problems as systems of linear equations. For the linear case, TerSVM solves three problems, where two of them find the inverse of a matrix of order (n + 2) × (n + 2) and one problem finds the inverse of a matrix of order (n + 1) × (n + 1). Here, n is the number of features. The data from the rest class is further partitioned, and the time complexity of RT-TerSVM is given as

T(m) = [2(n + 2)^3 + (n + 1)^3] + [2(n + 2)^3 + (n + 1)^3] + ...   (⌊K/2⌋ terms)
     = ⌊K/2⌋ × [2(n + 2)^3 + (n + 1)^3]
     ≈ K × n^3.        (4.36)

The height of the ternary tree is ⌊K/2⌋. Since n ≪ m, the learning time of linear RT-TerSVM is much less than that of Twin-KSVC or OAA-TWSVM.

For the kernel version, RT-TerSVM determines inverses of matrices of order (m + 2) × (m + 2) and avoids solving QPPs, as required by the other two approaches. Our algorithm uses a reduced training set, which further improves its learning time. Therefore,

TerSVM outperforms Twin-KSVC and OAA-TWSVM in terms of learning time.
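As a rough illustration of these orders (with values chosen only for the sake of the comparison, not taken from the experiments), consider K = 10 classes, m = 10,000 training points and n = 16 features:

Twin-KSVC:        K^2 m^3 = 10^2 × (10^4)^3 = 10^14,
OAA-TWSVM:        K m^3   = 10  × (10^4)^3 = 10^13,
linear RT-TerSVM: K n^3   = 10  × 16^3     ≈ 4.1 × 10^4,

so the linear systems solved by RT-TerSVM are smaller by many orders of magnitude, consistent with Eq. (4.36).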


Input : The hyperplane parameters for all internal and leaf nodes that build up the RT-TerSVM; test pattern x_i.
Output : Label for the test pattern.
Process:
1. The test pattern x_i is evaluated at the root node, where its distance from the three hyperplanes (i.e. +1, −1 and 0) is determined using Eq. (4.17).
2. Repeat
   a. The pattern x_i is associated with the class whose hyperplane is at the minimum distance from the pattern.
   b. If the associated class is +1 or −1,
      i. then assign the actual label of that class to x_i.
      ii. break.
   else if the number of classes in the rest class is one,
      i. then assign the actual label of that class.
      ii. break.
   else
      i. determine the distance of pattern x_i from the three hyperplanes at the next level of the RT-TerSVM.
3. End.

Algorithm 5: Testing a data pattern using RT-TerSVM.

RT-TerSVM vs. Twin-KSVC

Both RT-TerSVM and Twin-KSVC extend nonparallel hyperplanes classifiers into

multi-category framework by evaluating training points into ‘one-versus-one-versus-

rest’ structure. For a K-class classification problem, Twin-KSVC constructs K(K−

1)/2 classifiers, where each classifier solves two QPPs. Whereas for RT-TerSVM,

bK/2c classifiers are constructed and each TerSVM classifier solves three UMPs.

The testing time for Twin-KSVC is more than RT-TerSVM, due to large number

of classifiers and the final label depends on the voting decision rules. But for RT-

TerSVM, pattern testing is based on minimum distance from the three TerSVM

hyperplanes. Therefore, maximum testing time depends on the height of the tree,

which is much less than the number of classifiers obtained for Twin-KSVC.

RT-TerSVM is more time-efficient than Twin-KSVC because it formulates the optimization problems as UMPs rather than QPPs. The solution is obtained, for

TerSVM, by solving a system of linear equations. This makes our algorithm feasible

for real life large-sized problems. Twin-KSVC considers unit distance separability

between positive-hyperplane and patterns of other classes and vice-versa. Whereas

RT-TerSVM optimizes the separating distance (ρ1 and ρ2) between patterns of one

class and hyperplane of other class. Figure 4.3 graphically illustrates the hyper-


planes obtained by TerSVM and Twin-KSVC classifiers. For a three-class problem,

TerSVM solves three UMPs to obtain proximal hyperplanes; whereas, Twin-KSVC

builds classifiers as 1-vs-2-vs-rest, 2-vs-3-vs-rest, 3-vs-1-vs-rest.

(a) TerSVM: Three hyperplanes obtained by TerSVM classifier

(b) Twin-KSVC: Hyperplanes are obtained as 1-vs-2-vs-rest, 2-vs-3-vs-rest, 3-vs-1-vs-rest

Figure 4.3: Synthetic dataset with 300 data points. Hyperplanes obtained by a.TerSVM; b. Twin-KSVC

RT-TerSVM vs. TDS-TWSVM

Ternary Decision Structure (TDS) is a multi-category classification algorithm that

generates the classifier model in a hierarchical manner. It partitions the training data

into at most three groups and is discussed in more detail in Section 5.2. TDS evalu-


ates the training patterns using ‘i-versus-j-versus-k’ structure whereas RT-TerSVM

uses ‘one-versus-one-versus-rest’ structure. TDS extends an existing binary classifier

like TWSVM into multi-category framework. It uses K-Means clustering to iden-

tify three groups of classes and generates their proximal hyperplanes using the OAA method for three classes. RT-TerSVM and TDS-TWSVM are multi-category classi-

fication algorithms for nonparallel hyperplanes classifiers; but their approaches are

totally different.

In contrast, RT-TerSVM is built upon a novel ternary classifier i.e TerSVM. For

a K-class problem, RT-TerSVM first identifies two most distant classes and maps

rest of the samples in the space between them. The three hyperplanes are obtained

by solving three optimization problems of TerSVM. Hence, RT-TerSVM does not

require OAA to obtain the hyperplanes, as required by TDS. The patterns belonging

to remaining K − 2 classes are recursively partitioned using similar approach. RT-

TerSVM does not require K-Means clustering to partition the data, as required by

TDS.

Our algorithm has a slightly better learning time than TDS-TWSVM. Although TDS generates a decision structure of height ⌈log_3 K⌉ (best case), it uses the OAA approach and further processes all three child nodes, whereas in RT-TerSVM only

one child node (corresponding to class C) is processed further. Also RT-TerSVM

uses reduced training set, which further improves its learning time. Therefore, RT-

TerSVM has better learning time than TDS-TWSVM.

4.5 Experimental Results

In this section, we compare the performance of our multi-category classification

algorithm with other well established methodologies. The experiments have been

performed on benchmark UCI [52] as well as synthetic datasets. In all experiments,

the focus is on the comparison of RT-TerSVM with Twin-KSVC, OAA-TWSVM

and TDS-TWSVM. The parameters c_1 and c_2 are selected in the range 0.01 to 1; c_3, c_4 ∈ {10^{-i}, i = 1, ..., 5}; and ε is set to a very small value of the order of 10^{-5}. For the non-linear version, a Gaussian kernel is used and the kernel parameter is tuned in the range 0.1 to 1.


4.5.1 Synthetic Dataset

The efficiency of TerSVM is evaluated on synthetic dataset with three classes. Each

class has 100 patterns in R2. The positive (+1), negative (−1) and rest (0) class

patterns are shown as ‘+’, ‘*’ and ‘o’ respectively. Figure 4.4 shows the proximal

hyperplanes corresponding to three classes which are obtained by solving (4.12),

(4.14) and (4.16). TerSVM achieves testing accuracy of 99.33% for this dataset.

Figure 4.4: Linear TerSVM classifier with three classes

4.5.2 Multi-category Classification Results: UCI Datasets

For training, the dataset is normalized so that each feature is in the range zero to

one. For this work, experiments have been performed with linear and non-linear

classifiers using 15 multi-category UCI datasets.

Classification results for UCI datasets: Linear classifiers

The classification accuracy1 achieved by OAA-TWSVM, TDS-TWSVM, Twin-KSVC

and RT-TerSVM for multi-category UCI datasets is reported in Table 4.1. For each

dataset, the number of patterns, features and classes are shown in the table as

m, n and K respectively. It is observed that the classification accuracy achieved by RT-TerSVM is the best among all the above-mentioned algorithms.

To prove the efficacy of our algorithm, experiments have been performed on

1The bold figures indicate best value for the given dataset.


Table 4.1: Classification results with linear classifier on multi-category UCI datasets

OAA-TWSVM TDS-TWSVM Twin-KSVC RT-TerSVM

Dataset m× n (K) Mean Accuracy (%) ± SD

Balance 625 × 4 (3 ) 86.08 ± 3.78 88.01 ± 3.30 85.91 ± 4.50 87.85 ± 3.92

Dermatology 366 × 34 (6 ) 94.82 ± 1.57 95.08 ± 3.90 88.63 ± 6.15 98.11 ± 2.65

Ecoli 336 × 7 (8 ) 82.02 ± 3.51 84.42 ± 3.67 85.03 ± 6.12 86.23 ± 3.19

Glass 214 × 9 (6 ) 57.48 ± 5.04 57.83 ± 3.40 57.97 ± 7.08 61.67 ± 3.95

Iris 150 × 4 (3) 95.33 ± 3.80 97.33 ± 1.49 98.00 ± 3.22 98.00 ± 4.50

Seeds 210 × 7 (3 ) 93.81 ± 2.13 92.38 ± 4.57 96.67 ± 2.30 96.67 ± 3.24

Segment 210 × 19 (7 ) 88.09 ± 4.45 90.00 ± 6.90 86.19 ± 7.60 91.43 ± 3.01

Soybean 47 × 35 (4 ) 97.50 ± 2.63 100.00 ± 0.00 96.00 ± 8.43 99.00 ± 2.32

Wine 178 × 13 (3 ) 96.43 ± 4.07 97.17 ± 2.02 94.42 ± 5.24 99.44 ± 1.76

Zoo 101 × 16 (7 ) 93.14 ± 6.41 93.04 ± 5.72 94.00 ± 5.16 98.18 ± 5.75

Mean Accuracy 88.47 ± 3.73 89.52 ± 3.49 88.28 ± 5.58 91.66 ± 3.43

Table 4.2: Classification results with linear classifier on large-sized multi-categoryUCI datasets

OAA-LS-TWSVM TDS-LS-TWSVM RT-TerSVM

Dataset m× n (K) Mean Accuracy (%) ± SD

Multiple Features 2, 000× 648 (10) 97.60 ± 0.87 96.25 ± 2.45 97.95 ± 1.32

Optical Digits 5, 620× 64 (9) 88.25 ± 1.03 90.64 ± 1.70 92.15 ± 1.47

Page Blocks 5, 473× 10 (5) 90.55 ± 2.89 93.13 ± 1.40 94.35 ± 0.68

Pendigits 10, 992× 16 (9) 71.85 ± 1.27 87.74 ± 2.48 87.42 ± 2.09

Satimage 6, 435× 36 (7) 77.37 ± 1.39 84.13 ± 2.83 85.16 ± 1.42

Mean Accuracy 85.12 ± 1.49 90.37 ± 2.17 91.41 ± 1.40

large-sized UCI datasets. The learning time of OAA-TWSVM and TDS-TWSVM is much more than that of RT-TerSVM, as discussed in Section 4.4. Hence, these algorithms (i.e. OAA-TWSVM and TDS-TWSVM) are not feasible for large-sized datasets. Therefore, the classification accuracy achieved by RT-TerSVM is compared with two multi-category extensions of the Least Squares version of TWSVM [14], which we have termed OAA-LS-TWSVM and TDS-LS-TWSVM. The results are presented in Table 4.2 and show that RT-TerSVM outperforms the other two approaches in terms of classification accuracy.

Classification results for UCI datasets: Non-linear classifiers

The experiments have been performed with the RBF kernel Ker(x, x′) = exp(−σ||x − x′||_2^2), and the classification results show that our algorithm RT-TerSVM can be

effectively used as a multi-category classifier. Our algorithm outperforms the other

three approaches and the results are demonstrated in Table 4.3 and 4.4.


Table 4.3: Classification results with non-linear classifier on multi-category UCIdatasets

OAA-TWSVM TDS-TWSVM Twin-KSVC RT-TerSVM

Dataset Mean Accuracy (%) ± SD

Balance 92.17 ± 4.82 93.30 ± 4.48 90.33 ± 5.48 97.76 ± 1.71

Dermatology 92.80 ± 4.67 96.59 ± 2.45 92.38 ± 3.12 98.48 ± 2.68

Ecoli 76.76 ± 3.99 87.47 ± 3.19 86.36 ± 4.51 88.11 ± 5.92

Glass 62.68 ± 3.67 69.17 ± 4.95 71.47 ± 11.10 72.34 ± 9.02

Iris 94.00 ± 4.34 97.33 ± 1.49 98.00 ± 4.49 98.00 ± 3.22

Seeds 93.33 ± 3.10 93.80 ± 4.32 94.29 ± 3.76 96.19 ± 4.38

Segment 91.85 ± 3.70 89.05 ± 5.52 86.19 ± 8.53 91.91 ± 6.37

Soybean 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00 100.00 ± 0.00

Wine 98.32 ± 2.49 99.44 ± 1.27 97.71 ± 4.05 99.44 ± 1.76

Zoo 96.04 ± 4.17 97.04 ± 2.69 86.09 ± 10.79 98.09 ± 4.03

Mean Accuracy 89.79 ± 3.49 92.31 ± 3.03 90.28 ± 5.58 94.03 ± 3.91

Table 4.4: Classification results with non-linear classifier on large-sized multi-category UCI datasets

OAA-LS-TWSVM TDS-LS-TWSVM RT-TerSVM

Dataset Mean Accuracy (%) ± SD

Multiple Features 98.20 ± 0.54 96.75 ± 0.85 97.95 ± 1.04

Optical Digits 92.94 ± 1.33 98.68 ± 0.64 99.13 ± 0.48

Page Blocks 94.66 ± 0.54 92.85 ± 0.33 93.24 ± 1.14

Pendigits 73.81 ± 6.02 98.52 ± 1.24 99.49 ± 0.18

Satimage 80.25 ± 3.49 87.18 ± 2.70 89.96 ± 0.83

Mean Accuracy 91.87 ± 3.78 94.39 ± 2.99 95.95 ± 0.73


Learning Time Comparison

Figure 4.5: Learning time of classifiers for UCI datasets (linear)

Figure 4.6: Learning time of classifiers for large-sized UCI datasets (linear)

This chapter presents a time-efficient multi-category classification approach. The learning time of RT-TerSVM is low due to the use of the novel classifier TerSVM, which

avoids solving expensive QPPs. The tree structure further reduces the complexity of

the problem. To prove the efficacy of our work and to compare the learning time of


the algorithms mentioned before, we performed experiments on UCI datasets. Figure

4.5 presents the learning time (in seconds) of all four methods i.e. OAA-TWSVM,

TDS-TWSVM, Twin-KSVC and RT-TerSVM. Here, ‘Derm’ refers to ‘Dermatology’,

‘Segm’ for ‘Segment’, ‘Soy’ for ‘Soybean’ and ‘Bal’ for ‘Balance’. The learning time

is recorded as the average CPU time for 10-fold cross validation. It is observed that

RT-TerSVM is the most time-efficient algorithm of all the approaches. The learning

time of RT-TerSVM and TDS-TWSVM is much less than that of OAA-TWSVM and

Twin-KSVC. The vertical axis of the graph is shown using logarithmic scale to clearly

demonstrate the learning time of RT-TerSVM and TDS-TWSVM. In order to study

the behavior of our algorithm towards large-sized datasets, we performed numeri-

cal experiments with UCI datasets which have large number of instances (ranging

from 2,000 to 10,000) or large number of features (e.g. ‘Multiple Features’ dataset

has 648 features.) The learning time of OAA-LS-TWSVM, TDS-LS-TWSVM and

RT-TerSVM is recorded for these datasets and presented in Figure 4.6. Here, ‘Mult

Feat.’ refers to ‘Multiple features’ dataset.

Figure 4.7: Learning time of classifiers for UCI datasets (non-linear)


Figure 4.8: Learning time of classifiers for large-sized UCI datasets (non-linear)

Figure 4.7 shows the learning time for UCI datasets with non-linear classifiers. It shows a similar trend to the linear version and establishes that the learning

time of RT-TerSVM is least among all the four classifiers. Figure 4.8 compares the

learning time of non-linear OAA-LS-TWSVM, TDS-LS-TWSVM and RT-TerSVM

for large-sized UCI datasets. It is observed that RT-TerSVM outperforms the other

two approaches. For the ‘Pendigits’ dataset, RT-TerSVM learns the classification model in 1.18 seconds, compared to OAA-LS-TWSVM and TDS-LS-TWSVM, which required 1062.51 and 396.73 seconds respectively.

4.6 Applications

We present the application of our multi-category classification algorithm for hand-

written digit recognition and color image classification.

4.6.1 Hand-written Digits Recognition: USPS Dataset

To test the efficacy of our ternary classifier, we performed experiments with three

classes selected out of 10 classes of USPS, as shown in Table 4.5. The selection of

these classes is done on the basis of digits which are more likely to be misinterpreted

e.g. 1 as 7, 0 as 6 etc. All the experiments are performed with 10-fold cross-validation


Table 4.5: Classification accuracy with linear classifier on three-class datasets cre-ated from USPS

3-Class Dataset OAA-LS-TWSVM TDS-LS-TWSVM RT-TerSVM

Mean Accuracy (%) ± SD

0 vs. 6 vs. 8 99.18 ± 0.62 99.16 ± 0.86 99.21 ± 0.48

1 vs. 4 vs. 7 99.36 ± 0.28 99.48 ± 0.55 99.49 ± 0.43

0 vs. 1 vs. 2 99.39 ± 0.43 99.09 ± 0.40 99.40 ± 0.40

1 vs. 5 vs. 9 99.79 ± 0.29 99.21 ± 0.52 99.82 ± 0.29

0 vs. 6 vs. 9 99.64 ± 0.34 99.42 ± 0.41 99.64 ± 0.34

8 vs. 9 vs. 0 98.09 ± 0.94 98.67 ± 0.56 98.82 ± 0.76

3 vs. 4 vs. 5 99.33 ± 0.59 99.06 ± 0.83 99.33 ± 0.58

Average 99.26 ± 0.50 99.16 ± 0.59 99.39 ± 0.47

Table 4.6: USPS Error Rate with different approaches

Method Description Error rate

Human Human classification 2.50

K-NN K-nearest neighbor 5.70

TD Tangent Distance 2.50

Lenet1 Simple Neural Network 4.20

OMC Optimal Margin Classifier 4.30

Boosting Boosting 2.60

SVM Support Vector Machine 4.00

SVM+TD Support Vector Machine + Tangent Distance 3.65

Pre-pro+SVM Preprocessing +Support Vector Machine 2.50

RT-TerSVM Our Algorithm 1.84

and the average accuracy is reported in the table. The algorithm RT-TerSVM is

compared with OAA-LS-TWSVM and TDS-LS-TWSVM for classification accuracy.

Some digits of this dataset were mis-segmented by the postal department, which

made it extremely difficult to correctly recognize and classify digits. The human

error-rate reported by Simard et al. [67] was around 2.5%. The results from other

popular approaches (Human, K-nearest neighbor, Tangent distance, Lenet1- simple

neural network, Optimal Margin Classifier (OMC), Boosting [68]), (SVM, SVM +

tangent distance, preprocessing + SVM [69]) and our algorithm are listed in Table

4.6. For our work, the error rate is reported as the average over 10-fold cross-validation.

4.6.2 Color Image Classification

There has been a rapid growth in the number of digital images that are available

on-line. To carry out efficient image retrieval task, there is a need to classify these

images into different categories. Classifying and searching large-sized image datasets

require efficient algorithms. In this chapter, we present application of RT-TerSVM


Table 4.7: Classification accuracy for image datasets

Dataset OAA-LS-TWSVM TDS-LS-TWSVM RT-TerSVM

VisTex 97.34 ± 2.09 97.66 ± 1.33 98.28 ± 1.15

Wang’s 84.70 ± 3.40 85.90 ± 3.87 86.40 ± 3.27

for color image classification. Our algorithm extracts low-level features i.e. texture

and shape, from the image. Texture captures the intrinsic surface characteristics of

an image and describes the pixel’s relationship with its neighborhood. Shape based

features correspond to the human perception of an object. Therefore, in our work,

we use fusion of texture and shape features which are CR-LBP with co-occurrence

matrix and ART descriptors [59].

Depending on the visual content, multi-category image classification algorithm

associates a unique class label with every image in the dataset. The classification

model is trained and its performance is evaluated on unseen test data. Our algorithm is based on the assumption that the images belonging to one class have

uniformity in terms of one or more features. The experiments are performed using

10-fold cross-validation. The test patterns are evaluated based on minimum distance

from the three hyperplanes at each level of the ternary tree, until it reaches a leaf

node. To prove the efficacy of our multi-category algorithm, experiments have been

conducted on benchmark image datasets like the Wang's Color and MIT VisTex texture

datasets.

Image classification results

Table 4.7 shows the classification accuracy achieved by OAA-LS-TWSVM, TDS-LS-

TWSVM and RT-TerSVM for two image datasets. The experimental results prove

that RT-TerSVM outperforms the other two approaches.

4.7 Conclusions

In this chapter, a three-class classifier termed as ‘Ternary Support Vector Ma-

chine’ (TerSVM) and a tree based multi-category classification algorithm termed

as ‘Reduced tree for TerSVM’ (RT-TerSVM) are presented. Taking motivation from

Twin-KSVC, our classifier can handle three classes by determining nonparallel prox-


imal hyperplanes and evaluates the data patterns for ternary outputs (+1, −1, 0). TerSVM formulates the optimization problems as unconstrained minimization problems (UMPs) and is therefore more time-efficient than Twin-KSVC. The multi-category extension of TerSVM, i.e. RT-TerSVM, is a tree-based approach that builds a tree of height ⌊K/2⌋ for a K-class problem. Numerical experiments prove that RT-TerSVM outperforms other multi-category classification approaches. The application of RT-TerSVM is shown for handwritten digit recognition and image

classification.


Chapter 5

Multi-category Classification Approaches for

Nonparallel Hyperplanes Classifiers

5.1 Introduction

The speed while learning a model is a major challenge for multi-category classifica-

tion problems in Support Vector Machines (SVMs). Twin Support Vector Machines

(TWSVM) is approximately four times faster than SVM, as it solves two smaller

QPPs. Since TWSVM solves convex optimization problems, it guarantees an optimal solution. Further, TWSVM overcomes the imbalance problem between the two classes by

choosing two different penalty variables for different classes. After exploring the

strengths of TWSVM, we intend to study the behavior of this classifier in multi-

category scenario. Taking motivation from Twin Multiclass Classification Support

Vector Machine (Twin-KSVC) [39], we present Ternary Decision Structure based

Multi-category Twin Support Vector Machine (TDS-TWSVM) classifier. TDS-

TWSVM determines a decision structure of TWSVM classifiers which evaluates the

training patterns using ‘i-versus-j-versus-k’ structure. Each decision node of TDS is

split into three decision nodes labeled as (+1, 0,−1), where +1, and −1 represent

focused groups of classes and 0 represents ambiguous group of classes. Ambiguous

and focused groups consist of training patterns with low and high confidence re-

spectively. At each level of the decision structure, we partition K-class problem into

three problems, until all patterns of that node belong to only one class (in the best case, each group has approximately K/3 classes). TDS-TWSVM requires ⌈log_3 K⌉ tests, on average, for the evaluation of a test pattern. The strength of this method is its


divide-and-conquer approach. This formulation reduces testing time by decreasing

the number of evaluations required to derive the conclusion. In order to check the

efficacy of the TDS algorithm, we have given comparison result with One-Against-

All TWSVM (OAA-TWSVM) [37] and Binary Tree based-TWSVM (BT-TWSVM).

The performance of TDS-TWSVM is evaluated using out-of-sample data evaluation.

The application of our method is investigated for color image classification and re-

trieval.

TDS-TWSVM is a generic model for multi-category extension of binary classi-

fiers and to prove its efficacy, experiments have been performed with benchmark

multi-category UCI datasets. In this chapter, we present TDS-TWSVM for multi-

category image classification and retrieval. In our chapter, we use a combination of

Complete Robust - Local Binary Pattern with co-occurrence matrix (CR-LBP-Co)

and Angular Radial Transform (ART) descriptors which can efficiently capture the

texture and shape information of an image. The experimental results show that ac-

curacy of TDS-TWSVM based image retrieval exceeds many state-of-the-art image

retrieval methods.

This chapter presents another work which extends four Nonparallel Hyperplanes

Classification Algorithms (NHCAs) into different multi-category scenarios. Man-

gasarian and Wild [9] proposed nonparallel hyperplanes classifier, termed as Gener-

alized Eigenvalue Proximal Support Vector Machine (GEPSVM), which generates a

pair of nonparallel proximal hyperplanes. In the past few years, various modifications

to GEPSVM have been proposed like Regularized GEPSVM (RegGEPSVM) [42]

and Improved GEPSVM (IGEPSVM) [43]. We present a comparative study of four

nonparallel hyperplanes classification algorithms (NHCAs): GEPSVM, RegGEPSVM,

IGEPSVM and TWSVM, in multi-category frameworks. To the best of our knowl-

edge, GEPSVM based classifiers have not been extended to multi-category frame-

work using a tree structure. We explore three approaches for multi-category exten-

sion, namely OAA, BT and TDS. It is observed that tree-based approaches (BT

and TDS) are computationally more efficient than OAA, in learning the classifier.

The experiments show that TDS approach outperforms the other two multi-category

approaches regarding classification accuracy.

This chapter is organized as follows: The multi-category approach i.e. Ternary


Decision Structure is presented in Section 5.2. Section 5.3 gives a brief introduction

of GEPSVM and its variants. Section 5.4 presents the extension of Nonparallel

Hyperplanes Classifiers in multi-category framework. The experimental results for

TDS and the comparative study of NHCAs are presented in Section 5.6. Finally, the

chapter is concluded in Section 5.7.

5.2 Ternary Decision Structure

In this section, we present an algorithm termed as Ternary Decision Structure (TDS),

which can extend any nonparallel hyperplanes binary classifier to multi-category

framework. For this thesis, TDS extends TWSVM and is applied for Content-based

Image Retrieval (CBIR). Otherwise, TDS is a generic algorithm and can be used for

multiple applications.

CBIR is an automated process of converting the pixel intensities of an image

into mathematical quantities that are used for image classification and retrieval.

Most of the common CBIR approaches determine the low-level features from the

image and then they use similarity or distance measures to compare images [70].

Our work involves the use of machine learning techniques to improve the accuracy

of a CBIR system. In [39], a K-class classification algorithm, Twin-KSVC, is developed, which selects two focused sets of patterns from the K classes and then constructs two

nonparallel hyperplanes for them. The remaining patterns are mapped into a region

between the two nonparallel hyperplanes. Taking motivation from Twin-KSVC,

Ternary Decision Structure (TDS) is presented as a multi-category classification

algorithm. TDS is a generic approach to extend any classifier to multi-category

scenario. In this thesis, TWSVM is extended using TDS and is termed TDS-TWSVM. It organizes all the training points into an ‘i-versus-j-versus-k’ structure.

This section discusses the application of TDS-TWSVM for CBIR.

During the training phase, TDS-TWSVM recursively divides the training data

into three groups by applying K-Means (K=2) clustering [25] and creates a Ternary

Decision Structure of TWSVM classifiers, as shown in Fig.5.1. The training set is

first partitioned into two clusters which leads to identification of two focused groups

of classes and an ambiguous group of classes. The focused class is one where most of

the patterns belong to a single cluster, whereas the patterns of an ambiguous group are almost equally distributed in both the clusters.


Figure 5.1: Ternary Decision Structure of classifiers with 10 classes

Therefore, our algorithm has

ternary outputs (+1, 0, −1), where the focused class patterns are labeled +1 or −1 and the ambiguous patterns are labeled ‘0’. TDS-TWSVM partitions each node of the decision

structure into at most three groups, as shown in Fig.5.2. The group labels +1, 0,−1

are assigned to the training data and three TWSVM hyperplanes are determined

using the One-Against-All approach. This in turn creates a decision structure of height ⌈log₃ K⌉. Thus, TDS-TWSVM is an improvement over the OAA-TWSVM approach in terms of classifier learning time and retrieval accuracy.

The training data is partitioned into three groups and these groups are repre-

sented by nodes of the Ternary Decision Structure. The K non-divisible nodes of

TDS represent K-classes. This dynamic arrangement of classifiers significantly re-

duces the number of tests required in the testing phase. The label for a test pattern is obtained by finding the hyperplane at minimum distance and assigning its class label to the test pattern. With a balanced ternary structure, a K-class problem requires only ⌈log₃ K⌉ tests; for example, with K = 10 classes only three evaluations are needed, compared with ten for OAA. Also, at each level, the number of patterns used

by TDS-TWSVM diminishes with the expansion of decision structure. Hence, the

order of QPP reduces as we traverse down the structure. The TDS-TWSVM al-

gorithm determines the classifier model, which is efficient in terms of accuracy and

requires fewer tests for a K-class classification problem. The process of finding TDS-

TWSVM classifier is explained in Algorithm 6.


Figure 5.2: Illustration of TDS-TWSVM

5.2.1 Binary Tree Multi-category Approach

The Binary Tree (BT) based multi-category approach builds the classifier model by

recursively dividing the training data into two groups and finding the hyperplanes

for the groups thus obtained. The data is partitioned by applying K-Means (K=2)

clustering [47, 25]. This process is repeated until further partitioning is not possible

i.e. each node represents a unique class. The procedure for Binary Tree-based

multi-category approach is discussed in Algorithm 7.

BT determines (K−1) classifiers for a K-class problem. For testing, BT-TWSVM

requires at most ⌈log₂ K⌉ binary TWSVM evaluations. The test pattern x ∈ Rn is

evaluated using the Binary Tree based classifier model. The testing starts at the

root of the Binary Tree. The test pattern is associated with one of the two child

nodes, based on minimum distance from the two hyperplanes, given as

   r = arg min_{l = 1, 2}  ( |xᵀw(l) + b(l)| / ‖w(l)‖₂ ),    (5.1)

where |·| gives the absolute distance of the point x from the plane xᵀw(l) + b(l) = 0. The label assigned to the test data is

   y = +1 if r = 1,  and  y = −1 if r = 2.

This evaluation is continued until the test pattern reaches a leaf node and the label

of the leaf node is assigned to it.
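As an illustration only, the traversal described above can be sketched as follows. This is not the thesis implementation; `node` is a hypothetical object assumed to store the two hyperplanes used in Eq. (5.1) and references to its child nodes.

```python
import numpy as np

def plane_distance(x, w, b):
    # Distance of pattern x from the plane x^T w + b = 0, as in Eq. (5.1)
    return abs(x @ w + b) / np.linalg.norm(w)

def bt_twsvm_predict(x, node):
    # Walk the Binary Tree until a leaf (a unique class) is reached.
    # 'node' is assumed to expose: is_leaf, label, w_left, b_left, left,
    # w_right, b_right, right.
    while not node.is_leaf:
        d_left = plane_distance(x, node.w_left, node.b_left)
        d_right = plane_distance(x, node.w_right, node.b_right)
        node = node.left if d_left <= d_right else node.right
    return node.label
```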

We also implemented a variation of BT-TWSVM as Ternary Tree-based TWSVM


(TT-TWSVM) where each node of the tree is recursively divided into three nodes.

The partitioning is done by K-Means clustering, with K=3 and the classifier is built

on the lines of BT-TWSVM.

Input   : A labeled image dataset with N images from K different classes. Pre-compute the CR-LBP-Co and ART features for all images in the dataset as discussed in Section E.1. Create a descriptor F by concatenating both features. F is a matrix of size N × n, where n is the length of the feature vector. Here, n = 172 and the feature vector for an image is given as fv = [ft1, ft2, ..., ft136, fs1, fs2, ..., fs36], where fti (i = 1, 2, ..., 136) are the texture features and fsj (j = 1, 2, ..., 36) are the shape features.
Output  : Labels for all test patterns.
Process : (This structure can be applied to any type of dataset; in our experiments it is shown in the context of image classification.)
1. Select the parameters: penalty parameter Ci, kernel type and kernel parameter.
2. Repeat the following steps until K leaf nodes, each representing a unique class, are obtained:
   a. Use K-Means clustering to partition the training data into two sets. Identify two focused groups of classes with labels ‘+1’ and ‘-1’ respectively, and one ambiguous group of classes with label ‘0’. Here, K = 2 and we get at most three groups.
   b. Take the training patterns of the ‘+1’, ‘-1’ and ‘0’ groups as class representatives and find three hyperplanes [w1, b1], [w2, b2] and [w3, b3] by applying the One-Against-All approach and solving the TWSVM equations given in (1.4), or using the non-linear TWSVM classifier.
   c. Partition the training set into at most three groups A1, A2 and A3, based on the minimum distance of the patterns from the three positive hyperplanes.
   d. If any group Ai (i = 1, 2, 3) contains patterns from more than one class, go to Step 2a with the new set of inputs Ai.
3. Evaluate the test patterns with the decision-structure-based classifier model and assign the label of the non-divisible node, based on the minimum-distance criterion.

Algorithm 6: Ternary Decision Structure for Multi-category TWSVM
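For concreteness, the recursion of Algorithm 6 can be sketched as below. This is only an illustrative outline under simplifying assumptions: `fit_ova_twsvm` is a hypothetical helper that fits the three one-against-all TWSVM hyperplanes for the given group labels, and the children are formed directly from the group labels rather than by re-assigning every pattern to its nearest hyperplane as in Step 2c.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_tds(X, y, fit_ova_twsvm, ambiguity=0.4):
    """Recursive sketch of the TDS training phase (Algorithm 6)."""
    classes = np.unique(y)
    if classes.size == 1:                              # non-divisible node
        return {"leaf": True, "label": classes[0]}

    cluster = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    group = np.empty_like(y)
    for c in classes:                                  # Step 2a: label each class
        frac = np.mean(cluster[y == c] == 0)           # fraction of class c in cluster 0
        if frac > 1 - ambiguity:
            group[y == c] = +1                         # focused class (+1)
        elif frac < ambiguity:
            group[y == c] = -1                         # focused class (-1)
        else:
            group[y == c] = 0                          # ambiguous class

    planes = fit_ova_twsvm(X, group)                   # Step 2b: three TWSVM hyperplanes
    node = {"leaf": False, "planes": planes, "children": {}}
    for g in (+1, 0, -1):                              # Steps 2c-2d: recurse on each group
        mask = group == g
        if mask.any():
            node["children"][g] = build_tds(X[mask], y[mask], fit_ova_twsvm, ambiguity)
    return node
```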

5.2.2 Content-based Image Classification using TDS-TWSVM

Multi-category image classification is an automated technique of associating a class

label with an image, based on its visual content. The multi-category classification

task includes training a classifier model for all the image classes and evaluating the

performance of the classifier by computing its accuracy on unseen (out-of-sample)

data. Classification includes a broad range of decision-based approaches for the identification of images.


Input   : Training data X = {X1, X2, ..., Xm} with labels.
Output  : The hyperplanes corresponding to the internal nodes and leaf nodes of the Binary Tree.
Process :
1. Use K-Means clustering (with K = 2) to partition the training data into two sets A, B. Assign the labels ‘+1’ and ‘-1’ to the sets.
2. Find the hyperplane parameters (wA, bA), (wB, bB) by solving the TWSVM equations given in (1.2) and (1.3), using the sets A, B as the two classes.
3. Recursively partition the data sets A, B if they contain patterns from more than one class, obtaining TWSVM classifiers until further partitioning is not possible.

Algorithm 7: Binary tree of TWSVM for multi-category classification

These algorithms are based on the assumption that the images

possess one or more features and these features are associated with one of several

distinct and exclusive classes.

For TDS-TWSVM based image classification, we divide the image dataset into

training and test data. To avoid overfitting, we use 5-fold cross-validation. We

randomly partition the dataset into five equal-sized sub-samples. Of these five sub-

samples, one is retained as the evaluation set for testing the model, and the remain-

ing four sub-samples are used as training data. The TDS algorithm works on image

datasets with multiple classes. The training data is used to determine a classifier

model and each test pattern is evaluated using this model based on the minimum Euclidean distance from the three hyperplanes at each level of the classifier structure,

until it reaches a non-divisible node. We assign the label of non-divisible node to

this test pattern. The accuracy of the model is calculated by taking the average

accuracy over all the folds with standard deviation. An important application of

image classification is image retrieval i.e. searching through an image dataset to

retrieve best matches for a given query image, using their visual content.
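The 5-fold protocol described above is standard; a minimal sketch is given below, where `train_fn` and `predict_fn` are hypothetical placeholders standing in for any trainer/predictor pair (e.g. a TDS-TWSVM model builder and its decision-structure evaluator).

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, train_fn, predict_fn, n_splits=5, seed=0):
    """5-fold cross-validation: returns mean accuracy and standard deviation."""
    accs = []
    for tr_idx, te_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = train_fn(X[tr_idx], y[tr_idx])      # fit on four folds
        y_pred = predict_fn(model, X[te_idx])       # evaluate on the held-out fold
        accs.append(np.mean(y_pred == y[te_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```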

5.2.3 Content-based Image Retrieval using TDS-TWSVM

Content-based image retrieval (CBIR) makes use of image features to determine the

similarity or distance between two images. For retrieval, CBIR fetches most similar

images to the given query image. We suggest the use of TDS-TWSVM for image

retrieval. We first find the class label of query image as explained in Section 4.1 and


then find the similar images, from the classified training set, based on chi-square

distance measure. A distinguishing feature of TDS-TWSVM is that it is evaluated using out-of-sample data. Most CBIR approaches take the query image from the same dataset that is used to determine the model. Unlike such approaches, TDS-TWSVM reserves a separate part of the dataset for evaluation. Thus it provides a way to test the model on data that played no part in building it, so the classifier model is not influenced in any way by the out-of-sample data.
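A minimal sketch of this retrieval step is given below. It assumes the class label of the query has already been predicted by TDS-TWSVM; the chi-square distance and the `top_k` parameter follow common CBIR conventions rather than any specific code from this thesis.

```python
import numpy as np

def chi_square_distance(f1, f2, eps=1e-10):
    # Chi-square distance between two feature histograms
    return 0.5 * np.sum((f1 - f2) ** 2 / (f1 + f2 + eps))

def retrieve(query_feat, query_label, train_feats, train_labels, top_k=20):
    """Return indices of the top_k training images most similar to the query,
    restricted to the class predicted for the query."""
    candidates = np.where(train_labels == query_label)[0]
    dists = np.array([chi_square_distance(query_feat, train_feats[i]) for i in candidates])
    return candidates[np.argsort(dists)[:top_k]]
```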

5.2.4 Comparison of TDS-TWSVM with Other Multi-Category Ap-

proaches

The strength of our algorithm lies in the fact that it requires fewer TWSVM comparisons for evaluating a test pattern than other state-of-the-art multi-

category approaches like OAA-SVM and OAO-SVM. The accuracy of TDS algorithm

is compared with OAA-TWSVM, TT-TWSVM and BT-TWSVM. The experimental

results show that TDS-TWSVM outperforms all other approaches. TDS-TWSVM is

more efficient than OAA-TWSVM considering the time required to build the multi-

category classifier. Also, a new test pattern can be evaluated with ⌈log₃ K⌉ comparisons of

TDS-TWSVM, which is more efficient than OAA-TWSVM and BT-TWSVM. For

a balanced decision structure, the order of QPP reduces to one-third of parent QPP

with each level, because the parent node is divided into three groups. Experimental

results show that TDS-TWSVM has advantages over OAA-TWSVM, TT-TWSVM

and BT-TWSVM in terms of learning time. At the same time, TDS-TWSVM

outperforms other approaches in multi-category image classification and retrieval.

5.3 Eigenvalue Problem Based Classifiers

In this section, we briefly discuss Eigenvalue Problem Based Nonparallel Hyper-

planes Classification Algorithms (NHCAs) i.e. Generalized Eigenvalue Proximal

Support Vector Machine (GEPSVM) and its two variants. Given a binary classifica-

tion problem, these NHCAs determine two hyperplanes such that each hyperplane

is in close proximity to one class and at maximum distance from the other class.

These classifiers are extended using three multi-category approaches to compare


their performance with TWSVM.

5.3.1 Generalized Eigenvalue Proximal Support Vector Machine

GEPSVM [9] generates two nonparallel hyperplanes by solving two Generalized

Eigenvalue Problems (GEPs) of the form Pz = µQz, where P and Q are symmet-

ric positive semidefinite matrices. The eigenvector corresponding to the smallest

eigenvalue of each GEP determines the hyperplane. The data points of two classes

(referred as positive and negative classes) are given by matrices A and B respectively,

with m1 and m2 data points. Therefore, matrices A and B have dimensions (m1×n)

and (m2 × n). The GEPSVM formulation determines two nonparallel hyperplanes,

as determined by TWSVM in (1.1). The optimization problem of GEPSVM is given

as:

   min_{(w, b) ≠ 0}  (‖Aw + eb‖₂² / ‖[w, b]ᵀ‖₂²) / (‖Bw + eb‖₂² / ‖[w, b]ᵀ‖₂²),    (5.2)

where e is a vector of ones with proper dimension and ‖·‖ represents the L2-norm.

Here, it is assumed that (w, b) ≠ 0 ⇒ Bw + eb ≠ 0 [9]. The objective function (5.2)

is simplified and regularized by adding a term as proposed by Tikhonov [71]:

   min_{(w, b) ≠ 0}  (‖Aw + eb‖₂² + δ‖[w, b]ᵀ‖₂²) / ‖Bw + eb‖₂²,    (5.3)

where δ > 0 is the regularization parameter. This, in turn, takes the form of Rayleigh

Quotient [72]

   min_{z ≠ 0}  (zᵀPz) / (zᵀQz),    (5.4)

where P and Q are symmetric matrices in R(n+1)×(n+1) and are given as

   P = [A e]ᵀ[A e] + δ I  for some δ > 0,
   Q = [B e]ᵀ[B e],  and  z = [w, b]ᵀ.    (5.5)


Here, I is an identity matrix. Using the properties of the Rayleigh Quotient [9, 72],

we can get the solution of (5.4) by solving the following GEP

   Pz = µQz,  z ≠ 0,    (5.6)

where the solution of (5.4) is attained at the eigenvector corresponding to the smallest

eigenvalue µmin of (5.6). Therefore, if z1 is the eigenvector for µmin, then [w1, b1]T =

z1 is the plane xTw1 + b1 = 0 that is passing through the positive class. The other

minimization problem can be similarly defined by switching the roles of A and B.

The eigenvector z2 for the smallest eigenvalue of second GEP yields the hyperplane

xTw2 + b2 = 0, which is proximal to the negative class. The solution of GEPSVM

is generated by solving a system of linear equations of order n³ [9], where n is the feature dimension, whereas SVM solves a QPP of order m³ (m ≫ n). Therefore,

GEPSVM is computationally more efficient than SVM, although the classification

accuracy results are comparable to those of SVM.
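As an illustrative sketch (not the implementation used in the experiments), one GEPSVM hyperplane can be computed numerically as follows, using SciPy's generalized eigenvalue solver for Eqs. (5.3)-(5.6); `delta` is the Tikhonov parameter and its value here is only a placeholder.

```python
import numpy as np
from scipy.linalg import eig

def gepsvm_plane(A, B, delta=1e-3):
    """Return (w, b) for the plane proximal to class A via the GEP P z = mu Q z."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    G = np.hstack([A, e1])                          # [A e]
    H = np.hstack([B, e2])                          # [B e]
    P = G.T @ G + delta * np.eye(G.shape[1])        # Eq. (5.5), Tikhonov-regularized
    Q = H.T @ H
    vals, vecs = eig(P, Q)                          # generalized eigenvalue problem
    vals = np.real(vals)
    finite = np.isfinite(vals)                      # guard against infinite eigenvalues
    idx = np.flatnonzero(finite)[np.argmin(vals[finite])]
    z = np.real(vecs[:, idx])                       # eigenvector of the smallest eigenvalue
    return z[:-1], z[-1]                            # w, b
```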

5.3.2 Regularized GEPSVM

Guarracino et al. [42] modified the formulation of GEPSVM, so that a single GEP

can be used to generate both the hyperplanes. The GEP Pz = µQz is transformed

as P ∗z = µQ∗z where

   P* = τ1 P − δ1 Q,   Q* = τ2 Q − δ2 P.    (5.7)

The parameters τ1, τ2, δ1 and δ2 are selected such that the matrix

   Ω = [ τ2  δ1
         δ2  τ1 ]    (5.8)

is nonsingular.

As discussed in [42], the problem P*z = µQ*z generates the same eigenvectors as Pz = µQz. An eigenvalue λ* of the new problem is related to an eigenvalue λ of the initial problem by

   λ = (τ2 λ* + δ1) / (τ1 + δ2 λ*).    (5.9)


By setting τ1 = τ2 = 1 and ν1 = −δ1, ν2 = −δ2, the problem is stated as

   min_{(w, b) ≠ 0}  (‖Aw + eb‖₂² + ν1‖Bw + eb‖₂²) / (‖Bw + eb‖₂² + ν2‖Aw + eb‖₂²).    (5.10)

When Ω is nonsingular and ν1, ν2 are non-negative, the eigenvectors corresponding to the minimum and maximum eigenvalues of (5.10) are the same as those obtained by solving the two GEPSVM problems [42]. In terms of learning time, RegGEPSVM

outperforms GEPSVM and SVM as RegGEPSVM [42] solves one GEP instead of

two. In [42], the authors show that RegGEPSVM outperforms SVM with a linear kernel, but with a Gaussian kernel SVM achieves better performance than

RegGEPSVM and GEPSVM.

5.3.3 Improved GEPSVM

Improved GEPSVM (IGEPSVM) [43] replaces the generalized eigenvalue decompo-

sition with standard eigenvalue problems, which results in solving two optimization problems that are simpler than the GEP. A parameter is introduced into the objective function to improve the generalization ability. IGEPSVM formulates the two prob-

lems as

   min_{(w, b) ≠ 0}  ‖Aw + eb‖₂² / (‖w‖₂² + b²)  −  ν ‖Bw + eb‖₂² / (‖w‖₂² + b²),    (5.11)

where ν > 0 arbitrates the terms in the objective functions. Thus, IGEPSVM has a

bias factor that can be adjusted by the user and is particularly useful when working

with imbalanced data. By introducing a Tikhonov regularization term [71] and

solving its Lagrange function [13], we get

   ((Mᵀ + δI) − νQᵀ) z = λz,    (5.12)

where M = [A e]ᵀ[A e], Q = [B e]ᵀ[B e], z = [w, b]ᵀ and λ is the Lagrange multiplier.

The second problem can be defined similar to (5.11) by switching the roles of A and

B, as discussed for GEPSVM. IGEPSVM replaces GEP with a standard eigenvalue

problem and hence, results in a lighter optimization problem [43]. It also avoids the

possible singularity condition by adding a regularization term.
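Analogously, one IGEPSVM hyperplane can be sketched as a standard symmetric eigenvalue problem following Eq. (5.12). The sketch below assumes the eigenvector of the smallest eigenvalue is selected, and the values of `nu` and `delta` are illustrative only.

```python
import numpy as np

def igepsvm_plane(A, B, nu=0.1, delta=1e-3):
    """Return (w, b) for the plane proximal to class A via Eq. (5.12)."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    M = np.hstack([A, e1]).T @ np.hstack([A, e1])   # [A e]^T [A e]
    Q = np.hstack([B, e2]).T @ np.hstack([B, e2])   # [B e]^T [B e]
    S = (M + delta * np.eye(M.shape[0])) - nu * Q   # (M + delta*I) - nu*Q
    vals, vecs = np.linalg.eigh(S)                  # S is symmetric
    z = vecs[:, np.argmin(vals)]                    # eigenvector of the smallest eigenvalue
    return z[:-1], z[-1]
```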


5.4 Extension of NHCAs for Multi-category Classifica-

tion

In this thesis, we present an extension of GEPSVM, RegGEPSVM, IGEPSVM and

TWSVM using OAA, BT and TDS approaches for multi-category classification.

Extending NHCA Classifiers using One-Against-All Approach

In order to solve a K-class classification problem using OAA multi-category ap-

proach, we construct K binary NHCA classifiers on the lines of OAA-TWSVM

(Section 1.3.1). With m data patterns ((xj , yj), j = 1 to m), the matrices A =

xp : yp = i and B = xq : yq 6= i are created. The patterns of A and B are

assigned labels +1 and −1 respectively. This data is used as input for GEPSVM in

(5.3), RegGEPSVM in (5.10), IGEPSVM in (5.11) and TWSVM in (1.4) to generate

K classifiers. Testing a pattern is based on minimum distance and is done as given

in Eq. (1.28).
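The data preparation for this OAA extension is straightforward; a minimal sketch of the split for the ith binary subproblem is shown below (purely illustrative).

```python
import numpy as np

def ova_split(X, y, i):
    """Matrices A (class i) and B (all other classes), with OAA labels +1 / -1."""
    A = X[y == i]
    B = X[y != i]
    labels = np.where(y == i, +1, -1)
    return A, B, labels
```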

Extending NHCA Classifiers through Binary Tree-based Approach

Binary Tree (BT) of TWSVM classifiers is explained in Section 5.2.1. Using this

approach, the hyperplanes for all four NHCAs are determined by: use (5.3) for

GEPSVM, (5.10) for RegGEPSVM, (5.11) for IGEPSVM and (1.4) for TWSVM.

Extending NHCA Classifiers through Ternary Decision Structure

In order to extend the capability of NHCA classifiers to handle multi-category data,

we present their use in TDS framework. To find the three hyperplanes, use (5.3) for

GEPSVM, (5.10) for RegGEPSVM, (5.11) for IGEPSVM and (1.4) for TWSVM.

Fig.5.3 shows three-class classification problem and the hyperplanes obtained with

OAA and TDS. In Fig.5.3a, the shaded area shows the confusion, which is resolved

in Fig.5.3b.


Figure 5.3: Three-class problem classified by (a) OAA and (b) TDS

5.5 Experimental Results

To compare the four NHCAs i.e. GEPSVM, RegGEPSVM, IGEPSVM and TWSVM,

we implemented their extended version in multi-category framework with OAA, BT

and TDS approaches. The experiments are performed with ten benchmark UCI

datasets [52] and the competence of these algorithms is measured in terms of clas-

sification accuracy and computational efficiency in learning the model. The ex-

periments are conducted with 5-fold cross validation [49]. Ten multi-category UCI

datasets used for experiments.

5.5.1 Multi-category Classification Results: UCI Datasets

Table 5.1 shows the classification results of the four NHCAs with three multi-

category approaches on ten UCI datasets. The table lists the datasets along with

their dimension as m×n×K, where m, n, K are the number of data patterns, fea-

tures and classes respectively. For each multi-category classifier, we have reported

classification accuracy (Acc in %) along with standard deviation (SD) across the

five folds. The table also shows the learning time (in seconds) for each of these algorithms.


Table 5.1: Comparison of NHCAs with linear classifiers

Within each group the columns are OAA, BT and TDS; the groups appear in the order IGEPSVM | GEPSVM | RegGEPSVM | TWSVM. For every dataset (dimension m × n × K) the three rows give Accuracy (%), Standard Deviation and Time (sec).

Iris (150 × 4 × 3)
  Accuracy (%)   96.00  90.00  89.33 | 96.67  95.33  96.00 | 96.67  95.33  96.00 | 95.33  97.33  97.33
  SD             5.96   3.33   6.41  | 3.33   2.98   2.79  | 3.33   2.98   2.79  | 3.80   1.49   1.49
  Time (sec)     3.3333 0.0006 0.0013 | 3.2055 0.0007 0.0011 | 2.3570 0.0006 0.0007 | 4.9441 0.0996 0.2021

Seeds (210 × 7 × 3)
  Accuracy (%)   80.89  89.05  88.57 | 92.38  93.33  94.29 | 92.38  93.33  94.29 | 93.81  93.80  92.38
  SD             3.21   2.71   1.99  | 3.53   1.99   2.13  | 3.53   1.99   2.13  | 4.87   3.61   4.57
  Time (sec)     1.9920 0.0006 0.0068 | 2.0137 0.0006 0.0069 | 1.7546 0.0004 0.0051 | 2.3810 0.0983 0.2971

Derm (366 × 34 × 6)
  Accuracy (%)   88.19  89.77  89.75 | 84.42  84.37  86.32 | 84.42  86.62  86.32 | 94.82  92.38  95.08
  SD             5.15   4.36   4.33  | 5.88   4.45   5.54  | 5.88   4.13   5.54  | 4.57   4.57   3.90
  Time (sec)     8.0134 0.0059 0.0078 | 3.1642 0.0051 0.0072 | 5.3190 0.0037 0.0108 | 4.9702 0.3337 0.5164

Wine (178 × 13 × 3)
  Accuracy (%)   85.52  92.70  96.59 | 92.73  93.32  93.87 | 87.14  94.35  94.43 | 96.43  94.96  97.17
  SD             5.95   4.21   3.73  | 3.01   6.68   9.08  | 5.66   6.33   7.85  | 4.07   3.09   2.02
  Time (sec)     0.0356 0.0007 0.0082 | 0.0218 0.0006 0.0068 | 0.0150 0.0004 0.0054 | 0.3767 0.0790 0.3056

Zoo (101 × 16 × 7)
  Accuracy (%)   85.10  86.10  87.10 | 87.10  85.05  87.05 | 89.10  92.05  93.05 | 93.14  94.05  93.04
  SD             4.69   9.00   9.79  | 6.76   10.06  6.78  | 5.50   2.81   2.78  | 6.41   6.52   5.72
  Time (sec)     0.0346 0.0027 0.0197 | 0.0286 0.0022 0.0172 | 0.0203 0.0018 0.0113 | 1.0799 0.1918 0.2993

Ecoli (327 × 7 × 5)
  Accuracy (%)   80.25  83.45  82.63 | 74.14  81.52  81.11 | 74.14  80.21  82.32 | 82.02  82.88  84.42
  SD             2.19   4.02   2.49  | 4.93   3.24   12.98 | 4.93   2.24   2.86  | 3.51   1.91   3.67
  Time (sec)     0.0463 0.0023 0.0119 | 0.0428 0.0017 0.0089 | 0.0259 0.0009 0.0066 | 0.6168 0.1647 0.4852

Glass (214 × 9 × 6)
  Accuracy (%)   50.36  56.29  51.37 | 52.91  58.32  54.78 | 52.89  57.35  53.79 | 57.48  58.80  57.83
  SD             4.28   3.89   2.16  | 3.87   3.50   4.80  | 2.98   3.78   4.92  | 5.04   3.26   3.40
  Time (sec)     0.04   0.0018 0.0187 | 0.35   0.0016 0.0163 | 0.0229 0.0011 0.0116 | 0.6497 0.2083 0.4245

PB (5473 × 10 × 5)
  Accuracy (%)   90.55  87.81  87.81 | 88.09  89.92  90.44 | 88.09  90.54  90.44 | 87.55  93.09  93.13
  SD             0.81   3.17   3.17  | 3.55   1.51   1.90  | 3.55   1.59   1.65  | 2.89   0.88   1.40
  Time (sec)     2.01754 0.0162 0.0257 | 1.9726 0.0113 0.0176 | 1.3161 0.008 0.013 | 563.64 77.1921 109.3996

MF (2000 × 649 × 10)
  Accuracy (%)   82.50  85.25  90.35 | 82.25  84.75  75.60 | 83.35  84.40  84.70 | 97.60  96.35  96.25
  SD             2.80   4.65   1.97  | 1.25   4.51   6.36  | 4.67   4.83   2.79  | 0.87   1.92   2.45
  Time (sec)     121.6527 24.8009 44.864 | 86.8096 20.8418 36.8624 | 70.8630 15.1547 28.2157 | 520.0735 15.1251 10.2126

OD (5620 × 64 × 10)
  Accuracy (%)   88.25  90.94  90.64 | 89.54  92.43  90.23 | 89.84  91.68  92.46 | 88.25  90.94  90.64
  SD             0.81   0.47   1.70  | 1.65   0.88   1.45  | 2.31   0.52   1.06  | 1.03   0.47   1.70
  Time (sec)     4.75   0.0543 0.1659 | 3.89   0.0412 0.0973 | 2.8513 0.0323 0.0680 | 1263.4512 0.0326 5.2645

Average
  Accuracy (%)   82.76  85.14  85.41 | 84.02  85.83  84.97 | 83.88  86.59  86.78 | 88.94  89.46  89.73
  SD             3.58   3.98   3.77  | 3.78   3.98   5.38  | 4.24   3.12   3.44  | 3.13   2.77   3.03
  Time (sec)     14.19  2.49   4.51  | 10.15  2.09   3.70  | 8.45   1.52   2.83  | 236.22 9.35   12.74

From Table 5.1, it is evident that the linear TDS-TWSVM outperforms

the other multi-category classifiers in terms of classification accuracy and achieves

89.73% accuracy over the 10 UCI datasets. ‘Win-Loss-Tie’ (W-L-T) ratio gives a

count of wins, losses and ties for an algorithm in comparison to other algorithms.

From Table 5.1, the W-L-T counts for TWSVM and the GEPSVM-based classifiers are 8-2-0 and 2-8-0 respectively, for classification accuracy. This shows that TWSVM outperforms

GEPSVM-based classifiers. Also, W-L-T for OAA, BT and TDS are 1-9-0, 2-7-1

and 6-3-1, which demonstrates that TDS excels over the other two approaches in terms of

classification accuracy. TWSVM is a constrained optimization problem, whereas

GEPSVM is an unconstrained optimization problem which makes TWSVM more

adapted to the dataset [11]. As demonstrated in [59], tree-based multi-category

approaches give better results than OAA. The same is observed in Table 5.1, where

tree-based approaches (BT and TDS) are more accurate and efficient than OAA in

learning the classifier. Hence, TDS-TWSVM is the best among all combinations

for generalization ability. BT-RegGEPSVM takes the minimum learning time (1.52

sec), computed as the average over the 10 datasets. Further, GEPSVM-based classifiers are faster than TWSVM.


Table 5.2: Comparison of NHCAs with nonlinear classifiers

Within each group the columns are OAA, BT and TDS; the groups appear in the order IGEPSVM | GEPSVM | RegGEPSVM | TWSVM. For every dataset the two rows give Accuracy (%) and Standard Deviation.

Iris
  Accuracy (%)   92.00 94.67 94.67 | 89.33 96.67 96.67 | 93.33 96.00 98.00 | 94.00 96.67 97.33
  SD             6.91  2.74  1.83  | 7.23  2.36  2.36  | 2.36  2.79  1.82  | 4.34  2.35  1.49

Seeds
  Accuracy (%)   90.86 89.52 90.48 | 89.90 90.86 90.48 | 92.86 93.33 94.29 | 93.33 94.28 93.80
  SD             3.22  3.61  3.76  | 2.11  3.22  2.63  | 3.37  4.26  2.13  | 3.10  3.61  4.32

Derm
  Accuracy (%)   81.78 95.82 94.30 | 82.77 93.53 95.05 | 84.66 95.05 95.83 | 92.80 96.96 96.59
  SD             4.24  3.14  3.26  | 6.17  3.70  2.56  | 4.71  2.56  2.06  | 4.67  1.67  2.45

Wine
  Accuracy (%)   84.52 90.48 96.59 | 80.39 85.86 83.38 | 93.76 97.75 98.86 | 98.32 98.88 99.43
  SD             3.11  2.91  3.73  | 2.71  4.02  10.00 | 5.51  2.38  2.56  | 2.49  1.52  1.27

Zoo
  Accuracy (%)   82.56 95.05 86.10 | 87.45 90.05 88.05 | 89.05 90.05 91.05 | 96.04 97.04 97.04
  SD             5.50  3.54  9.67  | 4.32  2.12  5.77  | 5.54  5.06  4.24  | 4.17  2.69  2.69

Ecoli
  Accuracy (%)   80.32 84.02 85.05 | 85.64 82.15 85.26 | 77.38 84.52 85.26 | 76.76 82.88 87.47
  SD             3.61  3.46  14.76 | 3.01  2.31  12.45 | 4.19  1.25  3.26  | 3.99  1.91  3.19

Glass
  Accuracy (%)   65.23 68.52 69.23 | 65.12 66.48 69.66 | 64.32 71.23 70.12 | 62.68 70.54 69.17
  SD             3.62  4.32  4.56  | 3.21  3.11  2.95  | 2.01  3.14  3.12  | 3.67  4.90  4.95

PB
  Accuracy (%)   92.02 92.38 92.33 | 92.12 92.56 92.56 | 93.49 94.62 96.04 | 94.66 92.89 92.85
  SD             0.63  0.66  0.78  | 1.34  0.98  0.98  | 1.71  1.92  0.76  | 0.54  3.64  0.33

MF
  Accuracy (%)   80.55 84.65 90.40 | 87.20 82.10 83.00 | 87.50 83.25 87.35 | 98.20 82.85 96.75
  SD             4.09  4.94  1.77  | 5.41  6.58  3.64  | 6.36  5.48  3.15  | 0.54  3.10  0.85

OD
  Accuracy (%)   96.56 94.15 97.25 | 94.82 96.23 95.16 | 94.13 96.55 98.21 | 92.94 90.94 98.68
  SD             1.23  1.54  0.89  | 1.33  0.97  1.25  | 1.33  1.34  1.21  | 1.33  0.46  0.64

Average
  Accuracy (%)   84.64 88.93 89.64 | 85.48 87.65 87.93 | 87.05 90.23 91.50 | 89.97 90.39 92.91
  SD             3.62  3.09  4.50  | 3.68  2.94  4.46  | 3.71  3.02  2.43  | 2.88  2.59  2.22


The comparison results of nonlinear classifiers, on UCI datasets, are listed in

Table 5.2. The table demonstrates that the nonlinear classifiers achieve better accuracy than the linear ones. The classification results of TDS-TWSVM are the best among all the algorithms over the ten datasets, with a mean accuracy of 92.91%. The W-L-T counts for

TWSVM and GEPSVM-based classifiers, considering classification accuracy, are 5-

5-0 and 5-5-0. This shows that TWSVM and other GEPSVM-based classifiers have

comparable performance with nonlinear classifiers. Also, W-L-T for OAA, BT and

TDS are 0-10-0, 2-7-1 and 7-2-1, which demonstrates that TDS excels other two

approaches by bagging maximum wins, considering classification accuracy.

5.6 Applications

In order to check the efficacy of our method, we have conducted classification and

retrieval experiments on several benchmark image datasets: the Wang's Color,

Corel 5K, MIT VisTex texture and Oliva and Torralba (OT) Scene datasets (for details, please refer to Appendix E). The TDS algorithm is used for image classification and

retrieval. We first determine the CR-LBP-Co and ART features for Wang’s, Corel


5K and OT-Scene datasets. For VisTex, we use only the CR-LBP-Co features as it is a

texture database and does not contain significant shape information.

Parameter Setting

Through validation set, we have determined the optimal values for the parameters

used in this work. For CR-LBP-Co features, α is 8 and the local window is of

radius one with eight neighbours per pixel. For ART feature, n = 3 and m = 8. For

TWSVM classifier, we use radial basis function (RBF) kernel, with kernel parameter

set to 0.47. The penalty parameter, Ci is 0.1.

5.6.1 Color Image Classification

Image classification is an automated technique of associating a class label with an

image, based on its visual content. The multi-category classification task includes

training a classifier model for all the image classes and evaluating the performance

of the classifier by computing its accuracy on unseen data. Table 5.3 shows the

classification accuracy for various benchmark image databases with OAA-TWSVM,

BT-TWSVM, TT-TWSVM and TDS-TWSVM. The result is given as average ac-

curacy over 5-folds with standard deviation.

Table 5.3: Classification accuracy on different image datasets

TWSVM

Dataset OAA BT TT TDS

Wang’s 84.62 ± 2.63 83.30 ± 3.45 84.6 ± 2.73 85.50 ± 3.26

Corel 5K 62.19 ± 1.44 61.78 ± 1.91 61.5 ± 1.43 63.10 ± 2.14

VisTex 94.30 ± 1.96 95.03 ± 1.38 96.09 ± 1.48 96.88 ± 1.14

OT-Scene 74.99 ± 1.89 74.70 ± 1.67 75.26 ± 1.66 75.30 ± 1.63

5.6.2 Content-based Image Retrieval

The performance of a retrieval system can be measured in terms of its precision-

recall (P-R) ratio. Recall is the ratio of the number of relevant images retrieved to

the total number of relevant images in the database. Precision is the ratio of the

number of relevant images retrieved to the total number of images retrieved. For image

retrieval, we have used the Average Retrieval Rate (ARR) [73]. (Please refer to Appendix

E for details).
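As a rough sketch of these measures for a single query (the exact ARR formula is given in Appendix E and is not reproduced here), the following assumes a fixed number of retrieved images per query; the function and variable names are illustrative only.

```python
import numpy as np

def precision_recall(retrieved_labels, query_label, n_relevant):
    """Precision and recall for one query, given the labels of the retrieved images."""
    retrieved_labels = np.asarray(retrieved_labels)
    hits = np.sum(retrieved_labels == query_label)   # relevant images retrieved
    precision = hits / len(retrieved_labels)         # hits / total retrieved
    recall = hits / n_relevant                       # hits / total relevant in database
    return precision, recall

# Usage sketch: for a query class with 100 relevant images in the database and
# 20 retrieved images, call precision_recall(labels_of_20, query_label, 100).
```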


Result on Wang’s Dataset

The retrieval precision is calculated at recall 20 for each class. Precision is reported

as percent (%). Table 5.4 shows that TDS-TWSVM outperforms various image

retrieval systems. ARR for TDS-TWSVM on Wang’s database is 85.56%. Fig.5.4

shows the image retrieval result for a query image taken from Wang’s Color dataset.

Table 5.4: Average Retrieval Rate (%) for Wang’s Color Dataset

TWSVM

Class [74] [75] [76] [77] [78] OAA BT TT TDS

Africa 69.75 70.3 67.8 43.77 47.2 74.66 72.50 68.09 77.91

Beach 54.25 56.1 56.1 36.22 29.0 77.12 77.58 75.78 72.65

Building 63.95 57.1 61.1 34.74 44.4 84.26 87.36 89.38 88.51

Bus 89.65 87.6 95.7 66.29 67.4 90.81 99.00 96.13 98.13

Dinosaur 98.70 98.7 99.2 94.12 94.0 98.00 98.33 97.59 98.33

Elephant 48.80 67.5 67.4 74.56 31.1 79.80 73.72 82.08 80.54

Flower 92.30 91.4 88.6 47.86 72.1 95.69 97.69 93.69 99.12

Horse 89.45 83.4 75.9 42.49 77.6 91.26 87.94 89.05 88.88

Mountain 47.30 53.6 41.2 34.32 33.5 74.07 59.55 70.97 70.83

Food 70.90 74.1 74.9 45.24 56.3 80.27 80.88 79.00 80.66

Average 72.5 73.9 72.8 51.96 55.3 84.59 83.46 84.17 85.56

Result on COREL 5K Dataset

The ARR result on COREL 5K is shown in Table 5.5. The precision is calculated

at recall 20 for every image of the database. It is observed that TDS-TWSVM gives

the best performance with 64.15% ARR.

Table 5.5: Average Retrieval Rate (%) for COREL 5K Dataset

Method ARR %

[79] 48.8

[80] 62.35

OAA-TWSVM 63.85

BT-TWSVM 61.64

TT-TWSVM 61.75

TDS-TWSVM 64.15

Result on MIT VisTex Dataset

VisTex dataset contains 640 sub-images, categorized into 40 different classes. Every

image of this database is used as a query and we compute the ARR as in (A.6). Our algorithm achieves an ARR of 93.51%.


Figure 5.4: Image Retrieval Result for a Sample Query Image from Wang’s Dataset(a.) Query Image (b.) 20 Images retrieved by TDS-TWSVM

Table 5.6 shows the ARR (%) computed over all

the 40 classes of VisTex database, using various texture based retrieval methods.

Here precision at recall 4 is calculated, since each class has 16 images. The results

show that TDS-TWSVM outperforms other state-of-the-art methods.

Result on Oliva and Torralba Scene Dataset

ARR precision at recall 20 for OT-scene is given in Table 5.7. The table shows class-

wise accuracy as well as average accuracy over all the classes. The results prove that

TDS-TWSVM outperforms all other methods.

Time-based Comparison of OAA-TWSVM and TDS-TWSVM

The experiments performed on various benchmark image databases show that TDS-

TWSVM outperforms OAA-TWSVM, TT-TWSVM and BT-TWSVM. Also the

classification accuracy and retrieval precision achieved with TDS-TWSVM maintain a good margin over other well-known methods.


Table 5.6: Average Retrieval Rate (%) for MIT VisTex Dataset

Method ARR %

LBP(8,1) [81] 82.23

GLBP(8,1) [76] 91.28

RLBP(8,1) [76] 91.37

GRLBP(8,1) [76] 92.18

OAA-TWSVM 92.55

BT-TWSVM 92.30

TT-TWSVM 91.39

TDS-TWSVM 93.51

Table 5.7: Average Retrieval Rate(ARR) (%) for OT-Scene Dataset

TWSVM

Class Name OAA BT TT TDS

Coast and Beach 78.87 78.78 84.84 85.51

Open Country 79.13 82.79 88.81 89.15

Forest 75.22 75.92 80.03 79.25

Mountain 71.33 70.01 73.34 74.06

Highway 75.72 71.02 58.93 57.57

Street 78.17 71.60 70.90 70.63

City Center 47.26 65.55 55.09 55.86

Tall Building 79.00 77.69 82.37 82.80

Average 73.09 74.17 74.29 74.35

Another significant advantage

of TDS-TWSVM is the time required to build the classifier during learning phase.

Table 5.8 shows the time taken by TDS-TWSVM and OAA-TWSVM to generate

TWSVM-based classifier model using training dataset of Wang’s Image database.

We use 5-fold cross validation and the average time is shown in the table. To compare

these two approaches, we use a metric called speedup, which is defined as (5.13).

   Speedup = (Time taken by OAA-TWSVM) / (Time taken by TDS-TWSVM).    (5.13)

For MIT VisTex, TDS-TWSVM requires only 33.5% of the time taken by OAA-

TWSVM. It is observed that speedup value increases with the size of database.

TDS-TWSVM performs exceptionally well with huge-sized databases.

Fig.5.5 graphically shows the time-complexity comparison of OAA-TWSVM and

TDS-TWSVM learning time. It is evident from the graph that TDS-TWSVM is

able to handle large databases with extensive categories. But the time complexity

of OAA-TWSVM grows faster with the size of the database.


Table 5.8: Average Time (sec) required to build the classifier

Image Dataset TDS-TWSVM OAA-TWSVM Speedup

MIT VisTex (640) 3.667 10.961 2.98

Wang’s (1000) 2.409 5.743 2.38

OT-Scene (2688) 31.775 79.659 2.50

COREL 5K (5000) 195.360 2752.699 14.09

For a K-class problem, OAA-TWSVM requires K classifiers, where each classifier works with m × K patterns.

Here, m is the number of patterns in each class (assuming that each class has equal

number of patterns). In contrast, TDS-TWSVM determines root classifier, as shown

in Fig.5.1, with m × K patterns and divides the problem into three sets. At the

next level of the decision structure, TDS-TWSVM works with three smaller QPPs.

Here each problem deals with fewer classes than its parent problem. So, the number of patterns at the next level of the decision structure is approximately

m × (K/3). Therefore, TDS-TWSVM can efficiently handle large-sized problems

due to its divide-and-conquer approach.

Figure 5.5: Time Complexity Comparison of TDS-TWSVM and OAA-TWSVM

5.7 Conclusions

In this chapter, we have presented Ternary Decision Structure based Multi-category

Twin Support Vector Machine (TDS-TWSVM) classifier for classification and re-

trieval of color images. For a multi-category problem, TDS-TWSVM requires ⌈log₃ K⌉


TWSVM comparisons for evaluating a test pattern, as compared to ⌈log₂ K⌉ TWSVM

comparisons required by Binary Tree based TWSVM. Further we compared the per-

formance of TDS-TWSVM with One-Against-All (OAA), Ternary Tree-based (TT)

and Binary Tree-based (BT) TWSVM and have shown that TDS-TWSVM outper-

forms other well-established multi-category methods. TDS-TWSVM is tested on a

variety of benchmark image databases. The results reveal that TDS-TWSVM per-

forms exceptionally well in terms of classification accuracy, testing time and retrieval

precision.

In this thesis, we have also presented a comparative study of nonparallel hy-

perplanes classification algorithms (NHCAs) in multi-category framework. We have

extended Generalized eigenvalue proximal SVM (GEPSVM), Regularized GEPSVM

(RegGEPSVM), Improved GEPSVM (IGEPSVM) and Twin SVM (TWSVM) in

multi-category scenario, using One-Against-All (OAA), Binary Tree-based (BT) and

Ternary Decision Structure (TDS) approaches. The experiments are conducted with

ten benchmark UCI datasets. It is observed that TWSVM achieves higher classifi-

cation accuracy as compared to GEPSVM-based classifiers, but TWSVM is computationally more expensive than GEPSVM. The use of TWSVM is recommended when the

features are numerous and highly correlated; in such cases, GEPSVM-based classifiers

do not perform well. It is also ascertained that GEPSVM-based classifiers perform

better than TWSVM, with large datasets, in terms of learning time. The tree-based

multi-category approaches are more efficient than OAA, regarding classification ac-

curacy as well as learning and testing time. TDS requires ⌈log₃ K⌉ comparisons for evaluating test data, as compared to ⌈log₂ K⌉ comparisons required by BT and K

comparisons required by OAA approaches. Thus, TDS requires minimum testing

time. The experimental results show that TDS-TWSVM outperforms other meth-

ods in terms of classification accuracy and BT-RegGEPSVM takes the minimum

time for building the classifier.


Chapter 6

Tree-Based Localized Fuzzy Twin Support

Vector Clustering with Square Loss Function

6.1 Introduction

Clustering is an unsupervised learning task, which aims at partitioning data into

a number of clusters [20] based on feature similarity, e.g. K-means clustering [25] and hierarchical clustering [21]. Following the success of margin-based classifiers in

supervised learning, researchers have been trying to extend them to unsupervised

learning. Plane-based Clustering methods have been proposed such as K-plane

Clustering [82] by Bradley et al. and Proximal Plane Clustering [83] by Shao et

al. Recently, Xu et al. [27] proposed Maximum Margin Clustering (MMC) which

performs clustering in SVM framework and finds a maximum margin separating

hyperplane between clusters.

MMC based methods [84] resort to relaxing the non-convex clustering problem

as a semidefinite program (SDP) [85]. MMC cannot be used for very large datasets

because SDP is computationally expensive [86]. Zhang et al. [29] proposed a feasible

variation for MMC and implemented MMC as an Iterative Support Vector Machine

(iterSVM). Wang et al. proposed TWSVM for clustering (TWSVC) [28] that uses

information from both within cluster and between clusters. Recently, Khemchandani

et al. [87] proposed fuzzy least squares TWSVC (F-LS-TWSVC) that uses fuzzy

membership to create clusters, which are further obtained by solving systems of

linear equations only.

In this chapter, we present Tree-based Twin Support Vector Clustering (Tree-


TWSVC) which is motivated by MMC [27] and TWSVC [28]. In an unsupervised

scenario, it is not always possible to associate a given pattern with a unique cluster.

A pattern may belong to more than one cluster, based on distance or similarity measures. Hence, its membership in a cluster is best treated as a fuzzy quantity. So, we developed a fuzzy-membership-based clustering algorithm, Tree-

TWSVC, which has the following characteristics:

• Tree-TWSVC is a clustering algorithm which is built upon a TWSVM-like clas-

sifier. The novel classifier Localized Fuzzy Twin Support Vector Machine

(LF-TWSVM) is used in an iterative manner to identify two clusters from the

given data. These clusters can be further partitioned until the desired number

of clusters are obtained.

• Unlike MMC, which solves an expensive SDP problem, the novel clustering algo-

rithm Tree-TWSVC formulates convex optimization problems which are solved

as a system of linear equations. Also, MMC identifies only two clusters whereas

Tree-TWSVC identifies multiple clusters by building a tree with LF-TWSVM

classifiers at each level.

• Tree-TWSVC recursively divides the data to form the tree structure and it-

eratively generates the hyperplanes for the partitioned data, until the conver-

gence criterion is met. Due to its tree structure, Tree-TWSVC is much faster

than the classical approaches like OAA (used in TWSVC or F-LS-TWSVC),

for handling multi-cluster data. It can handle very large-sized datasets with

comparable or better clustering results than other TWSVM-based clustering

methods.

• At each node of the cluster tree, LF-TWSVM creates two clusters so that

the data points of one cluster are proximal to their cluster hyperplane and its

prototype, and the data points of the other cluster should be unit distance away

from this cluster plane. The prototype prevents the hyperplane from extending

indefinitely and keeps the hyperplane aligned locally to its cluster.

• Tree-TWSVC avoids the approximation through Taylor’s series expansion, as

required by TWSVC and F-LS-TWSVC due to their constraints with the mod (absolute value) function, and hence Tree-TWSVC gives more accurate results.

• Tree-TWSVC determines a fuzzy membership matrix for data samples, that

associates a membership value of each data sample to all the given clusters.

The initial fuzzy membership matrix is obtained by using Localized Fuzzy

Nearest Neighbor Graph (LF-NNG).

• We have used square loss function which is symmetric and allows the output

to change (i.e. flip from +1 to -1 or vice-versa), if required, in successive iter-

ations. By using the square loss function, the optimization problem is solved

as a system of linear equations; whereas TWSVC solves QPPs to generate the

hyperplanes.

The chapter is organized as follows: Section 6.2 presents the classifier LF-

TWSVM and the clustering algorithm Tree-TWSVC. The comparison of our algo-

rithm with other approaches is done in Section 6.3, which is followed by experimental

results in Section 6.4. The application of Tree-TWSVC is discussed in Section 6.5

and the chapter is concluded in Section 6.6.

6.2 Tree-based Localized Fuzzy Twin Support Vector

Clustering

Taking motivation from MMC [27] and TWSVC [28], we present Tree-TWSVC,

which is an iterative tree-based clustering procedure. Tree-TWSVC employs fuzzy

membership matrix to create clusters using LF-TWSVM. Our algorithm can effi-

ciently handle large multi-cluster datasets. For a K-cluster problem, Tree-TWSVC

initially generates a fuzzy membership matrix for two clusters by using Localized

Fuzzy Nearest Neighbor Graph (LF-NNG) initialization algorithm (discussed in Sec-

tion 6.2.3). Based on higher membership values, the data is partitioned into two

clusters. Since membership values are based on the proximity of data points, the pat-

terns of one cluster are similar to each other and distinct from the other cluster’s

patterns. Hence, Tree-TWSVC considers the inter-cluster and intra-cluster rela-

tionships. Each of the two clusters thus obtained can be recursively divided until K

clusters are obtained. With each partition, the size of data is reduced which makes


the procedure more time-efficient.

The algorithm Tree-TWSVC starts with initial labels (+1,−1), as generated by

LF-NNG. By using the initial labels, the data X with m points is divided into two

clusters, A and B, of size m1 and m2 respectively (where m = m1 +m2) as shown in

Fig.6.1. The group A can be further partitioned into A1 and A2, but Tree-TWSVC

does not consider data points of B at this stage. This is because, in the first partition

of dataset, the data points of A are separated from B, by considering inter-cluster

relationship. In the second partition, the algorithm concentrates on the data points

of A only and is able to generate more stable results in less time. Our algorithm is

more efficient than other plane-based clustering like TWSVC [28] and F-LS-TWSVC

[87] that use classical OAA multi-category approach and the same is established by

the results of numerical experiments in Section 6.4.

Figure 6.1: Illustration of tree of classifiers.

TWSVM [10] was initially proposed for classification problems. TWSVM deals

with an L1-norm error function which, when used in a clustering framework as done in TWSVC [28], could lead to premature convergence, since the error function does

not facilitate flipping of cluster labels, if required. The procedure gets stuck in a

poor local optimum and there is little or no change between initial and final labels.

This happens because the loss function is not symmetric and fails to change the

labels in successive iterations (Please see Appendix B.). To overcome this issue,

we present a new classifier LF-TWSVM, that efficiently handles the problem of

premature convergence and is used to build the cluster model of Tree-TWSVC.


6.2.1 Localized Fuzzy TWSVM Classifier (Linear version)

In this thesis, we present a novel classifier, termed as LF-TWSVM which we further

use in unsupervised framework. Unlike TWSVM, LF-TWSVM uses square loss

function and a cluster prototype. The prototype prevents the hyperplane from

extending indefinitely and keeps it aligned locally to the data points of its own

cluster. Let the dataset X consist of m points in n-dimensional space. The data is

divided into two clusters and hyperplanes are generated, which are given by (1.1).

LF-TWSVM employs the fuzzy membership matrix F ∈ Rm×2 generated by LF-

NNG and based on higher membership value, it partitions the data X into two

clusters A (positive cluster) and B (negative cluster), of size m1 and m2 respectively.

The hyperplanes for the two clusters A and B are obtained by solving the following

problems:

LF-TWSVM1:

   min_{w1, b1, ξ2, v1}  (1/2)‖SAA A w1 + e1 b1‖₂² + (c1/2)‖ξ2‖₂² + (c2/2)‖SAA A − e1 v1‖₂² + (c3/2)(‖w1‖₂² + b1²)
   subject to  −(SBA B w1 + e2 b1) + ξ2 = e2,    (6.1)

LF-TWSVM2:

   min_{w2, b2, ξ1, v2}  (1/2)‖SBB B w2 + e2 b2‖₂² + (c1/2)‖ξ1‖₂² + (c2/2)‖SBB B − e2 v2‖₂² + (c3/2)(‖w2‖₂² + b2²)
   subject to  (SAB A w2 + e1 b2) + ξ1 = e1.    (6.2)

The diagonal matrices SAA (size (m1 ×m1)) and SBA (size (m2 ×m2)) define the

membership values of the data points of A and B respectively in the positive cluster, taken

from matrix F . Similarly, the other two diagonal matrices SAB and SBB are defined

for the negative cluster. The primal problems in (6.1) and (6.2) are motivated

from TWSVM [10] and are modified on the lines of LS-TWSVM [14]. Thus, the

inequality constraints are replaced with equality constraints and L2-norm of error

variables ξ1 and ξ2 is used; c1 is the associated weight and e1 , e2 are vectors of


one’s of appropriate dimensions. The constraints of LF-TWSVM (6.1) and (6.2)

do not require mod (|.|) function as required in the constraints of TWSVC (1.25).

TWSVC determines the cluster hyperplanes using OAA multi-category approach

and considers all the data points to find the cluster planes. The data points of other

clusters may lie on both sides of the cluster hyperplane and hence, the constraints

with the mod function are required. For Tree-TWSVC, the data is divided into two clusters at each node; therefore, one cluster lies on only one side of the other cluster. Hence, the constraints of Tree-TWSVC are written without the mod function.

The first term in the objective function of (6.1) and (6.2) is the sum of squared

distances of the hyperplane to the data points of its own cluster. Thus, minimizing

this term tends to keep the hyperplane closer to the data points of one cluster (say

cluster A) and the constraints require the hyperplane to be at unit distance from

the points of other cluster (say cluster B). The error vectors ξ1 and ξ2 are used to

measure the error if the hyperplane is not unit distance away from data points of

other cluster. The second term of the objective function minimizes the squared sum

of error variables ξ1 and ξ2. The variable vi (i = 1, 2) is the prototype [88] of the ith

cluster and prevents the cluster hyperplane from extending infinitely and controls

its localization, proximal to the cluster. The parameter c2 is weight associated with

the proximal term. LF-TWSVM takes into consideration the principle of structural

risk minimization (SRM) [44] by introducing the term (wiᵀwi + bi²), i = 1, 2, in

the objective function and thus improves the generalization ability. It also takes

care of the possible ill-conditioning that might arise during matrix inversion. The

parameter c3 is chosen to be a very small value. The error function of (6.1) and (6.2)

is different from that of TWSVC (1.26) and has been modified for two major reasons.

First, to allow flipping of labels during subsequent iterations, which is otherwise

limited due to hinge loss function. This is required to minimize the total error. The

second reason for using square loss function is that it leads to solving system of

linear equations instead of QPPs. After substituting the equality constraints in the

objective function of (6.1) and (6.2), the problems become:


LF-TWSVM1:

   min_{w1, b1, v1}  P1 = (1/2)‖SAA A w1 + e1 b1‖₂² + (c1/2)‖SBA B w1 + e2 b1 + e2‖₂²
                          + (c2/2)‖SAA A − e1 v1‖₂² + (c3/2)(‖w1‖₂² + b1²),    (6.3)

LF-TWSVM2:

   min_{w2, b2, v2}  P2 = (1/2)‖SBB B w2 + e2 b2‖₂² + (c1/2)‖−(SAB A w2 + e1 b2) + e1‖₂²
                          + (c2/2)‖SBB B − e2 v2‖₂² + (c3/2)(‖w2‖₂² + b2²).    (6.4)

To get the solution of (6.3), set the gradients of P1 with respect to w1, b1 and v1 equal to zero. We get

   ∂P1/∂w1 = 0 ⇒ (SAA A)ᵀ(SAA A w1 + e1 b1) + c3 w1 + c1 (SBA B)ᵀ(SBA B w1 + e2 b1 + e2) = 0,    (6.5)

   ∂P1/∂b1 = 0 ⇒ e1ᵀ(SAA A w1 + e1 b1) + c3 b1 + c1 e2ᵀ(SBA B w1 + e2 b1 + e2) = 0,    (6.6)

   ∂P1/∂v1 = 0 ⇒ −c2 e1ᵀ(SAA A − e1 v1) = 0.    (6.7)

Let E = [SAA A  e1], F = [SBA B  e2] and z1 = [w1 b1]ᵀ. By combining (6.5) and (6.6) we obtain

   EᵀE z1 + c3 z1 + c1 FᵀF z1 + c1 Fᵀe2 = 0
   ⇒ z1 = −c1 (c1 FᵀF + EᵀE + c3 I)⁻¹ Fᵀe2.    (6.8)

Here, I is an identity matrix of appropriate dimensions. From (6.7),

   v1 = (e1ᵀ SAA A)/(e1ᵀ e1).    (6.9)


The second problem, i.e. LF-TWSVM2, can be solved in a similar manner. From (6.4), we get

   GᵀG z2 + c3 z2 + c1 HᵀH z2 − c1 Hᵀe1 = 0
   ⇒ z2 = c1 (c1 HᵀH + GᵀG + c3 I)⁻¹ Hᵀe1,    (6.10)

where G = [SBB B  e2], H = [SAB A  e1] and z2 = [w2 b2]ᵀ. The prototype variable v2 is obtained as

   v2 = (e2ᵀ SBB B)/(e2ᵀ e2).    (6.11)

The augmented vectors z1 and z2 can be obtained from (6.8) and (6.10) respec-

tively and are used to generate the hyperplanes, as given in (1.1). The prototypes

vi for the two clusters can be calculated by using (6.9) and (6.11) respectively. A

pattern x ∈ Rn is assigned to cluster i (i = 1, 2), depending on which of the two

hyperplanes given by (1.1) it lies closer to, i.e.

y = argmini

(‖wTi x+ bi‖22 + c2‖x− vi‖22). (6.12)

It finds the distance of point x from the hyperplane xTwi + bi = 0, where i = 1, 2

and also considers distance from the corresponding prototype. The predicted label

for pattern x is given by y.
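The closed-form computations of Eqs. (6.8)-(6.12) translate directly into a few lines of linear algebra. The sketch below is illustrative only; the parameter values are placeholders and the diagonal membership matrices SAA, SAB, SBA, SBB are assumed to be precomputed.

```python
import numpy as np

def lf_twsvm_linear(A, B, S_AA, S_AB, S_BA, S_BB, c1=1.0, c2=1.0, c3=1e-4):
    """Linear LF-TWSVM sketch following Eqs. (6.8)-(6.11)."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))

    E = np.hstack([S_AA @ A, e1])                  # [SAA A  e1]
    F = np.hstack([S_BA @ B, e2])                  # [SBA B  e2]
    G = np.hstack([S_BB @ B, e2])                  # [SBB B  e2]
    H = np.hstack([S_AB @ A, e1])                  # [SAB A  e1]
    I1 = np.eye(E.shape[1])

    # Eqs. (6.8) and (6.10): two systems of linear equations
    z1 = -c1 * np.linalg.solve(c1 * F.T @ F + E.T @ E + c3 * I1, F.T @ e2).ravel()
    z2 =  c1 * np.linalg.solve(c1 * H.T @ H + G.T @ G + c3 * I1, H.T @ e1).ravel()
    w1, b1 = z1[:-1], z1[-1]
    w2, b2 = z2[:-1], z2[-1]

    # Eqs. (6.9) and (6.11): cluster prototypes
    v1 = ((e1.T @ (S_AA @ A)) / (e1.T @ e1)).ravel()
    v2 = ((e2.T @ (S_BB @ B)) / (e2.T @ e2)).ravel()
    return (w1, b1, v1), (w2, b2, v2)

def assign_cluster(x, planes, c2=1.0):
    """Eq. (6.12): assign x to the closer (hyperplane, prototype) pair."""
    dists = [(x @ w + b) ** 2 + c2 * np.sum((x - v) ** 2) for (w, b, v) in planes]
    return int(np.argmin(dists))
```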

6.2.2 LF-TWSVM (Kernel version)

The results can be extended to non-linear version by considering the kernel-generated

surfaces given in (1.9) and (1.10). The primal QPP of the non-linear LF-TWSVM

corresponding to the first surface is given as

KLF-TWSVM1:

   min_{u1, b1, V1}  Q1 = (1/2)‖KA u1 + e1 b1‖₂² + (c1/2)‖KB u1 + e2 b1 + e2‖₂²
                          + (c2/2)‖KA − e1 V1‖₂² + (c3/2)(‖u1‖₂² + b1²),    (6.13)

where KA = SAA Ker(A, Cᵀ) and KB = SBA Ker(B, Cᵀ). The solution of problem (6.13) is obtained in a similar manner as in the linear case. The augmented vector


r1 = [u1 b1]ᵀ is given as

   r1 = −c1 (c1 KFᵀKF + KEᵀKE + c3 I)⁻¹ KFᵀe2.    (6.14)

Here, KE = [KA e1] and KF = [KB e2]. The identity matrix I is of appropriate

dimensions and the prototype V1 is determined as

   V1 = (e1ᵀ KA)/(e1ᵀ e1).    (6.15)

The second hyperplane can be retrieved in a similar manner from (6.16).

KLF-TWSVM2:

   min_{u2, b2, V2}  Q2 = (1/2)‖KB u2 + e2 b2‖₂² + (c1/2)‖−(KA u2 + e1 b2) + e1‖₂²
                          + (c2/2)‖KB − e2 V2‖₂² + (c3/2)(‖u2‖₂² + b2²).    (6.16)

Here KA = SAB Ker(A, Cᵀ) and KB = SBB Ker(B, Cᵀ). Once we obtain the surfaces,

a new pattern x ∈ Rn is assigned to class 1 or class -1 in a manner similar to the

linear case.

6.2.3 Clustering Algorithms: BTree-TWSVC and OAA-Tree-TWSVC

Tree-TWSVC is a multi-category clustering algorithm that creates a binary tree

of clusters by partitioning the data at multiple levels until the desired number of

clusters are obtained. Tree-TWSVC uses an iterative approach to generate two

cluster center hyperplanes, using LF-TWSVM at each node of the tree and updates

the hyperplane parameters in each iteration there by aligning the cluster hyperplane

along the data. Thus, it minimizes the empirical risk. Tree-TWSVC also minimizes

structural risk due to the regularization term added to its formulation. In this

thesis, we present two implementations for Tree-TWSVC namely BTree-TWSVC

and OAA-Tree-TWSVC.


Binary Tree-based Localized Fuzzy Twin Support Vector Clustering (BTree-

TWSVC)

BTree-TWSVC is an unsupervised learning procedure that creates K clusters from

m-data points. The algorithm takes two inputs: X ∈ Rm×n and K, where X

represents m data points in an n-dimensional feature space and K is the number of

clusters. The other symbols have the same meaning as given in Section 1.2.

Algorithm 8 for BTree-TWSVC generates final solution in the form of clusters

identified by LF-TWSVM at multiple levels, arranged as nodes of the tree. The root

node contains the entire data and leaf nodes correspond to the final clusters. Thus,

for a K-cluster problem, we obtain a tree with K leaf nodes and (K-1) internal

nodes. Generally, most of the clustering algorithms like K-Means [25] and KPC

[82] initiate with randomly generated labels which leads to unstable results due to

their dependency on the initial labels. For Tree-TWSVC, we use an initialization

algorithm based on K-Nearest Neighbor Graph [89], termed as Localized Fuzzy

NNG (LF-NNG), discussed in Section 6.2.3. BTree-TWSVC generates the fuzzy

membership matrix F2 ∈ Rm×2 through LF-NNG and assigns either of the cluster

labels (+1, −1) to all data points based on higher membership value towards cluster

1 or -1 respectively. Then, the two cluster center hyperplanes are determined and

the membership matrix F2 is updated. BTree-TWSVC alternatively determines the

cluster hyperplanes and membership matrix for data points until the convergence

criterion is met (Step 5d) and the two clusters Anew and Bnew are obtained. In order

to decide whether the obtained clusters, Anew and Bnew, can be further partitioned

or not, the BTree-TWSVC algorithm uses K-Means clustering [25] to get K clusters and labels Yk ∈ {1, ..., K} for all the data points. It then determines whether the clusters Anew or Bnew are associated with more than one label from Yk. This can be done by counting the number of samples of the ith cluster (i = 1, ..., K) distributed over the new cluster groups Anew and Bnew. Cluster i is associated with Anew or Bnew, depending on which of these has the higher percentage of samples from cluster i. If there

is more than one cluster in a cluster group Anew and/or Bnew, then it can be

further partitioned by recursively calling the same algorithm with new inputs.


Input   : The dataset X; the number of clusters K.
Output  : Hyperplane parameters for the internal and leaf nodes of the tree.
Process :
1. Select the values of the parameters c1, c2, c3, tol, kernel type and kernel parameter (only for the non-linear case).
2. Determine the initial labels, YK, for the data points using K-Means clustering [25] for K clusters.
3. Use LF-NNG to get the fuzzy membership matrix of all data points for two clusters, F2 ∈ Rm×2. Based on the higher membership value, assign labels (+1, −1) to each data sample and partition X into two clusters A and B. Also determine the diagonal matrices SAA, SAB, SBA, SBB from F2. Here, SAA and SBA define the membership values of the data points of A and B for the positive cluster; SAB and SBB are defined analogously.
4. Find the initial hyperplanes [w1, b1] and [w2, b2] for the two clusters by solving equations (6.8) and (6.10) respectively. Get the cluster prototypes v1, v2 from (6.9) and (6.11).
5. Repeat
   a. Determine the distance of each data point from the two clusters and update the membership values as
      Fnew(i, j) = 1/d(i, j),  where i = 1, ..., m and j = 1, 2.
      Here, d(i, j) is the distance of the ith data point from the jth cluster hyperplane, given by
      d(i, j) = ‖wjᵀxi + bj‖₂² + c2‖xi − vj‖₂²,
      where ‖·‖₂² is the squared L2-norm.
   b. Create two modified clusters Anew and Bnew based on Fnew.
   c. Update the hyperplanes [w1new, b1new], [w2new, b2new] and prototypes v1new, v2new for the new clusters Anew and Bnew respectively, by solving equations (6.8)-(6.11).
   d. If ‖F2 − Fnew‖₂² < tol, then break.
   e. Set wi = winew, bi = binew, vi = vinew (i = 1, 2) and F2 = Fnew.
6. Use YK to determine whether Anew and Bnew can be further partitioned, i.e. whether they contain labels from more than one cluster. If required, recursively partition Anew and Bnew by calling
      BTree-TWSVC(Anew, K1);  BTree-TWSVC(Bnew, K2);
   where K = K1 + K2.
7. End.

Algorithm 8: BTree-TWSVC


With the new inputs, the size of the data is approximately reduced to half (assuming Anew and Bnew contain approximately equal numbers of data points), due to the partitioning.

Thus, the input data diminishes in size as we traverse down the cluster tree and a

tree of height ⌈log₂ K⌉ is created.
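The alternating loop of Steps 5a-5e of Algorithm 8 can be sketched as below. This is only an outline: `lf_twsvm_fit` is a hypothetical helper that returns the two (w, b, v) triples from Eqs. (6.8)-(6.11) for the current membership matrix, and the small constant added to the distances merely guards against division by zero.

```python
import numpy as np

def btree_node_clusters(X, F2, lf_twsvm_fit, c2=1.0, tol=1e-4, max_iter=50):
    """One node of BTree-TWSVC: alternate LF-TWSVM fitting and membership updates."""
    for _ in range(max_iter):
        planes = lf_twsvm_fit(X, F2)                            # Step 5c
        d = np.stack([(X @ w + b) ** 2 + c2 * np.sum((X - v) ** 2, axis=1)
                      for (w, b, v) in planes], axis=1)         # distances d(i, j)
        F_new = 1.0 / (d + 1e-12)                               # Step 5a: F(i, j) = 1 / d(i, j)
        converged = np.sum((F2 - F_new) ** 2) < tol             # Step 5d
        F2 = F_new
        if converged:
            break
    labels = np.argmax(F2, axis=1)                              # higher membership wins
    return labels, planes, F2
```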

One-Against-All Tree-based Localized Fuzzy Twin Support Vector Clus-

tering (OAA-Tree-TWSVC)

OAA-Tree-TWSVC is another tree-based implementation of Tree-TWSVC and is

explained in Algorithm 9. The algorithm for OAA-Tree-TWSVC generates cluster

model by arranging LF-TWSVM generated clusters in the form of a tree. At each

internal node, one cluster is separated from the rest of the clusters. Hence, this method represents a modified One-Against-All (OAA) multi-category strategy. The height of

the cluster tree is (K − 1).

Binary tree vs. One-Against-All tree

In this thesis, we have presented two implementations for Tree-TWSVC as discussed

in Section 6.2.3 and 6.2.3. Out of the two approaches, BTree-TWSVC is more

robust and achieves better clustering accuracy than OAA-Tree-TWSVC, as shown experimentally in Section 6.4. One such scenario is presented in Table 6.1, i.e. a

clustering problem with four clusters (a.). Here, we present the clustering result with

TWSVC, OAA-Tree-TWSVC and BTree-TWSVC. For TWSVC, the hyperplanes

are obtained using OAA strategy (b.) and this leads to an ambiguous region i.e.

the data points lying in this region might be wrongly clustered. With OAA-Tree-

TWSVC, one of the clusters obtained with LF-NNG initialization is selected as

positive cluster (green squares) and remaining are regarded as negative cluster (red

frames, violet triangles and blue dots), as shown in (c.). The localized hyperplanes

are generated using LF-TWSVM, but it still leads to some ambiguity. Once the

green cluster is identified, we apply the same procedure on remaining data points,

as presented in (d-e.).The final OAA-tree thus obtained is shown in (e.). For BTree-

TWSVC, LF-NNG is used to identify two clusters at a time, as demonstrated in

(f.), which separates the blue-violet points from red-green points. BTree-TWSVC is

able to generate a stable clustering model as depicted in (f-h.) and has got better

clustering ability.


Input: The dataset X; the fuzzy membership matrix FK ∈ R^(m×K) (from LF-NNG, giving K initial clusters); the number of clusters K.
Output: Hyperplane parameters for the internal and leaf nodes of the tree.
Process:
1. Determine the initial labels, Yini, for the data points using FK.
2. Select the values of the parameters c1, c2, c3, tol, the kernel type and the kernel parameter (only for the non-linear case).
3. All the data points with Yini = 1 are selected as patterns of the positive cluster, whereas the rest of the points form the negative cluster. Determine a new fuzzy membership matrix for two clusters, F2 ∈ R^(m×2), from FK. Here, F2(j, 1) = FK(j, 1) and F2(j, 2) = Σ_{i=2}^{K} FK(j, i), where j = 1, ..., m.
4. Based on the higher membership value, get the initial labels (+1, −1) from F2 and partition the data into two sets A and B. Also determine the diagonal matrices SAA, SAB, SBA, SBB.
5. Find the initial hyperplanes [w1, b1], [w2, b2] and prototypes v1, v2 for the two clusters by solving equations (6.8)-(6.11).
6. Repeat
   a. Determine the distance of each data point from the two hyperplanes [w1, b1] and [w2, b2] and update the membership values as
      Fnew_{i,j} = 1 / d_{i,j},
      where i = 1, ..., m and j = 1, 2. Here, d_{i,j} represents the distance of the ith data point from the jth cluster and is given by
      d_{i,j} = ‖wj^T xi + bj‖² + c2 ‖xi − vj‖².
   b. Create two modified clusters Anew and Bnew based on Fnew.
   c. Update the hyperplanes [w1_new, b1_new], [w2_new, b2_new] and prototypes v1_new, v2_new for the new clusters Anew and Bnew respectively, by solving equations (6.8)-(6.11).
   d. If ‖F2 − Fnew‖² < tol, then break.
   e. Set wi = wi_new, bi = bi_new, vi = vi_new (i = 1, 2) and F2 = Fnew.
7. Use Yini to determine whether Bnew can be further partitioned. If required, recursively partition Bnew by calling
   OAA-Tree-TWSVC(Bnew, Fnew, K − 1),
   where Fnew = FK(i, j), with i being the indexes of the data points of Bnew and j = 2, ..., K.
8. End.

Algorithm 9: OAA-Tree-TWSVC
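
Step 3 of Algorithm 9 only folds the K-column membership matrix returned by LF-NNG into a two-column one. A minimal NumPy sketch of that folding is given below; the function name and the assumption that column 1 of FK corresponds to the current positive cluster are illustrative.

    import numpy as np

    def collapse_memberships(F_K):
        # Step 3 of Algorithm 9: keep the current positive cluster in column 1 and
        # aggregate the remaining K-1 clusters into column 2.
        F2 = np.empty((F_K.shape[0], 2))
        F2[:, 0] = F_K[:, 0]                  # F2(j,1) = FK(j,1)
        F2[:, 1] = F_K[:, 1:].sum(axis=1)     # F2(j,2) = sum_{i=2..K} FK(j,i)
        return F2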


Initialization with Localized Fuzzy Nearest Neighbor Graph (LF-NNG)

Wang et al. presented an NNG-based initialization method for TWSVC [28], and F-LS-TWSVC [87] used a Fuzzy NNG (FNNG) to generate the membership values. For this work, we present the Localized Fuzzy Nearest Neighbor Graph (LF-NNG), which generates a membership matrix F. This matrix is used to obtain the initial data labels used by Tree-TWSVC. The steps involved in LF-NNG for obtaining the initial labels of K clusters are given in Algorithm 10.

6.3 Discussion

In this section, we have discussed the comparison of our clustering algorithm with

MMC [29], TWSVC [28] and F-LS-TWSVC [87].

Tree-TWSVC vs. MMC

The clustering algorithm Tree-TWSVC determines multiple clusters by solving systems of linear equations, whereas MMC is a binary clustering method that solves a non-convex optimization problem, relaxed to an expensive SDP. Unlike MMC, Tree-TWSVC does not use alternating optimization to determine w and b; instead, they are obtained as the vector z_i = [w_i  b_i]^T, (i = 1, 2), by solving the systems of linear equations (6.8) and (6.10). Therefore, Tree-TWSVC is more time-efficient than MMC. Also, Tree-TWSVC uses a fuzzy nearest-neighbor based initialization method, which improves its clustering accuracy.

Tree-TWSVC vs. TWSVC

TWSVC involves constraints with the modulus function (|.|) and uses a Taylor series expansion to approximate this function. In contrast, Tree-TWSVC considers only two clusters at each level of the tree, and these clusters lie on either side of the mean cluster plane, so it does not require constraints with the modulus function. Also, the Tree-TWSVC formulation involves a square loss function, which results in solving a series of systems of linear equations, whereas TWSVC solves a series of QPPs using the concave-convex procedure. Therefore, Tree-TWSVC is more efficient than TWSVC in terms of computational effort as well as clustering accuracy. Also, TWSVC is


based on the OAA strategy, whereas Tree-TWSVC uses a tree-based approach. Tree-TWSVC uses a better initialization algorithm (i.e. LF-NNG), and its decision function for test data also takes into account the distance from the cluster prototype. Hence, Tree-TWSVC achieves better results in less time than TWSVC.

Table 6.1: Clustering with TWSVC and Tree-TWSVC for four clusters

(a.) Dataset with four clusters (b.) Clustering model with TWSVC

(c.)-(d.) Clustering with OAA-Tree-TWSVC and resulting OAA-tree

(e.) Final OAA-tree (f.) Clustering with BTree-TWSVC

(g.)-(h.) Clustering with BTree-TWSVC and resulting binary tree


Input: The dataset X; the number of clusters K; the number of nearest neighbors p.
Output: The membership matrix F.
Process:
1. For the given dataset X ∈ R^(m×n) and a parameter p, construct a p-nearest-neighbor undirected graph whose edges represent the distance between a pattern xi (i = 1, ..., m) and its p nearest neighbors.
2. From the graph, identify t clusters (C1, ..., Ct) by associating the nearest samples, i.e. neighbors must fall in the same cluster.
3. If the current number of clusters t is equal to K, then construct a fuzzy membership matrix F_{i,j}, where i = 1, ..., m and j = 1, ..., t, given by
   F_{i,j} = 1 / d_{i,j}.
   Here d_{i,j} is the distance of the ith sample from the jth cluster prototype and is given by
   d_{i,j} = ‖xi − vj‖²,    (6.17)
   where vj is the cluster prototype, given as
   vj = (e^T Cj) / (e^T e),
   and Cj represents the data points in the jth cluster. Go to Step 6. Else, go to Step 4 or 5.
4. If t < K, disconnect the two connected samples with the maximum distance and go to Step 2.
5. If t > K, compute the Hausdorff distance [90] between every two clusters among the t clusters and sort all pairs in ascending order. Merge the nearest pair of clusters into one, until K clusters are formed, where the Hausdorff distance between two sets S1 and S2 is defined as
   h(S1, S2) = max{ max_{i∈S1} min_{j∈S2} ‖i − j‖, max_{i∈S2} min_{j∈S1} ‖i − j‖ }.    (6.18)
6. End.
Algorithm 10: Localized Fuzzy Nearest Neighbor Graph (LF-NNG) based cluster membership
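
The merging criterion in Step 5 is the Hausdorff distance of equation (6.18). A brute-force NumPy sketch of that distance between two clusters, each given as an array of points, could look as follows; it is a plain O(|S1||S2|) computation and not an optimized implementation.

    import numpy as np

    def hausdorff(S1, S2):
        # Equation (6.18): h(S1, S2) = max{ max_i min_j ||i - j||, max_j min_i ||i - j|| },
        # with S1, S2 given as arrays of points (one point per row).
        D = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=2)   # pairwise distances
        return max(D.min(axis=1).max(),    # farthest point of S1 from S2
                   D.min(axis=0).max())    # farthest point of S2 from S1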


Tree-TWSVC vs. F-LS-TWSVC

F-LS-TWSVC solves a series of systems of linear equations to get the cluster hyperplanes but, similar to TWSVC, it uses a Taylor series approximation for the constraints, and therefore the results may not be accurate. Tree-TWSVC formulates a convex optimization problem which is solved as a series of systems of linear equations. F-LS-TWSVC is based on the OAA multi-category strategy, whereas Tree-TWSVC handles OAA using a tree-based approach, which is more efficient. In addition, Tree-TWSVC offers the BTree-TWSVC variant, which is even faster than OAA-Tree-TWSVC.

Complexity analysis

The strength of Tree-TWSVC is the tree-based approach, which reduces the complexity of the algorithm. The size of the data diminishes as it is partitioned to obtain the clusters. This characteristic is of utmost importance for non-linear (kernel) classifiers, where the complexity depends on the size of the data. For a K-cluster problem, the OAA multi-category approach uses the entire dataset K times to determine the cluster planes. Assume that all clusters have equal size, i.e. m/K, where m is the number of data points. If any TWSVM-based classifier is used with OAA (as done in TWSVC), then the algorithm solves K QPPs, each of size ((K − 1)/K) · m. Hence, the complexity of the TWSVM-based clustering algorithm is given by

T_OAA = K · c · (((K − 1)/K) · m)^3 ≃ K · c · m^3,    (6.19)

where c is a constant that includes the count of the maximum number of iterations for finding the final cluster planes. So, the complexity of OAA TWSVM-based clustering is T_OAA = O(m^3).

In BTree-TWSVC, the optimization problem is solved as a system of linear equations. For the linear case, LF-TWSVM finds the inverse of two matrices, each of dimension (n+1)×(n+1), where n is the number of features, for each internal node of the binary tree. As we traverse down the tree, the size of the data is approximately halved. Thus, the complexity of BTree-TWSVC can be recursively defined


as

T(m) = c(n+1)^3 + 2 · T(m/2),
T(m/K) = 1,    (6.20)

where m is the number of data points and c is the complexity constant. We assume that the data is divided into two clusters of almost equal size. The base condition T(m/K) = 1 represents the cost of a leaf node, which contains data from one cluster only.

The time complexity of (6.20) is given as [91]

T(m) = c(n+1)^3 + 2 · c(n+1)^3 + ... + 2^(h−1) · c(n+1)^3 + 2^h · 1,    (6.21)

where h = ⌈log2 K⌉. The height of the tree h depends on the number of clusters

K. The above equation can be solved as

T(m) = c(n+1)^3 (1 + 2 + 4 + ... + 2^(h−1)) + 2^h
     = c(n+1)^3 (2^h − 1) + 2^h
     = c(n+1)^3 (K − 1) + K
     ≤ c K (n+1)^3 + K.    (6.22)

Therefore, the complexity of linear BTree-TWSVC implemented as a Binary Tree (BT) is T_BT = O(K n^3), which is independent of the size of the data. For large-sized datasets (m ≫ n), the efficiency of BTree-TWSVC is not much affected, but for TWSVC (implemented using OAA-TWSVM) the learning time increases with the size of the data.
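
As a rough illustration of this gap, the closed forms (6.19) and (6.22) can be evaluated for dataset sizes of the order used in our experiments. The snippet below drops the constant c and uses the Pen digits dimensions from Table 6.2 purely to show the scaling; it does not reproduce measured learning times.

    # Unit-less comparison of the closed forms (constant c dropped); the numbers
    # only illustrate the scaling argument, not measured learning times.
    def t_oaa(m, K):            # OAA TWSVM-based clustering, eq. (6.19)
        return K * (((K - 1) / K) * m) ** 3

    def t_btree_linear(n, K):   # linear BTree-TWSVC, eq. (6.22)
        return K * (n + 1) ** 3 + K

    # Pen digits-sized problem from Table 6.2: m = 10992 points, n = 16 features, K = 9
    print(t_oaa(10992, 9) / t_btree_linear(16, 9))   # OAA is many orders of magnitude costlier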

For the kernel version, the complexity of BTree-TWSVC can be recursively defined as

T(m) = c(m+1)^3 + 2 · T(m/2),
T(m/K) = 1,    (6.23)

where m is the number of data points and c is the complexity constant. The complexity (6.23) can be written as [91]

T(m) = c(m+1)^3 + (1/4) c(m+1)^3 + (1/16) c(m+1)^3 + ... + (1/4^(h−1)) c(m+1)^3 + 2^h,    (6.24)

where h = ⌈log2 K⌉. The above equation can be solved as

T(m) = c(m+1)^3 (1 + 1/4 + 1/16 + ... + 1/4^(h−1)) + 2^h
     ≤ (4/3) c(m+1)^3 + K
     ≃ (4/3) c(m+1)^3.    (6.25)

So, the complexity of kernel BTree-TWSVC is independent of the number of clusters K. BTree-TWSVC is more time-efficient than OAA multi-category clustering for both the linear and kernel versions. The time complexity of OAA-Tree-TWSVC can be discussed in a similar way to that of TWSVC, as both are based on the OAA strategy; however, OAA-Tree-TWSVC is more time-efficient than TWSVC because the number of data points diminishes as we traverse down the OAA-tree. To validate the efficiency of our method, we have compared the learning time of OAA-Tree-TWSVC with TWSVC (also based on the OAA strategy) in Section 6.4.

6.4 Experimental Results

In this section, we compare the performance of two variations of Tree-TWSVC i.e.

BTree-TWSVC and OAA-Tree-TWSVC, with other clustering methods and inves-

tigate their accuracy and computational efficiency. The other clustering methods

used for comparison are Fuzzy C-means (FCM) clustering [26], TWSVC [28] and

F-LS-TWSVC [87]. We have also implemented a non-fuzzy version of OAA-Tree-

TWSVC, which is referred as OAA-T-TWSVC. These two algorithms are compared

to study the effect of adding fuzziness to the clustering model. For OAA-T-TWSVC,

the initial clusters are generated using NNG [89]. The experiments are conducted

on benchmark UCI datasets [52]. In all experiments, the focus is on the comparison

of our clustering approach with clustering methods listed above. The parameters

c1 and c2 are selected in the range 0.01 to 1; c3 ∈ {10^−i : i = 1, ..., 5}; and tol is set to a very small value of the order of 10^−5. The kernel parameter is tuned in


the range 0.1 to 1. The metric Clustering-Accuracy [92] is used to measure the

performance of clustering methods.

Out-of-Sample Testing

In an unsupervised framework, the clustering model is generally built using the entire dataset. However, the formulation of Tree-TWSVC allows it to obtain the clustering model with learning data, and the accuracy of the model can be examined using Out-of-Sample (OoS) or unseen test data [93, 94]. The clustering model is built with some part of the learning data provided as input to the Tree-TWSVC algorithm and is used to predict the labels of the unseen OoS data. This feature is particularly useful when working with very large datasets, where the clustering model can be built with a few samples and the rest of the samples are assigned labels using the OoS approach. Tree-TWSVC also takes advantage of the tree structure and the LF-TWSVM formulation and generates the results in much less time. In our simulations with the UCI datasets, we report the results both with the entire dataset and with OoS testing of the clustering model. For OoS, 80% of the samples are randomly selected from the entire data for learning the model and the remaining 20% are used to determine the clustering accuracy.
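
A minimal sketch of this OoS protocol, assuming an 80/20 random split as described above, is shown below; the function name and the random seed are illustrative choices.

    import numpy as np

    def out_of_sample_split(X, frac_learn=0.8, seed=0):
        # OoS protocol: build the cluster model on a random 80% of the data and keep
        # the remaining 20% as unseen test points.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(X.shape[0])
        cut = int(frac_learn * X.shape[0])
        return X[idx[:cut]], X[idx[cut:]]    # (learning set, OoS test set)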

6.4.1 Clustering Results: UCI Datasets

We have selected 14 UCI multi-category datasets [52] for the experiments.

Results for linear case

The simulation results for UCI datasets with linear clustering methods are recorded

in Table 6.2 for FCM, TWSVC, F-LS-TWSVC, OAA-T-TWSVC, OAA-Tree-TWSVC

and BTree-TWSVC. The simulation results demonstrate that both versions of Tree-

TWSVC i.e. BTree-TWSVC and OAA-Tree-TWSVC, outperform FCM, TWSVC

and F-LS-TWSVC for clustering accuracy. In Table 6.2, the entire dataset is used

for building the clustering model. For 13 out of 14 UCI datasets, one of the two

versions of Tree-TWSVC achieves the highest accuracy. This can be attributed

to the fact that a good initialization algorithm can improve the accuracy of the

clustering algorithm. It is also observed that Binary Tree based algorithm (BTree-

TWSVC) generates better result than OAA-Tree-TWSVC. The table demonstrates


that OAA-Tree-TWSVC (fuzzy version) achieves better clustering accuracy than

OAA-T-TWSVC (non-fuzzy version). Table 6.3 shows the accuracy results with OoS clustering for OAA-T-TWSVC, OAA-Tree-TWSVC and BTree-TWSVC. The clustering algorithms achieve better results when the entire dataset is used for clustering.

Table 6.2: Clustering accuracy for UCI datasets (Linear version)

Data (m×n, K)   FCM   TWSVC   F-LS-TWSVC   OAA-T-TWSVC   OAA-Tree-TWSVC   BTree-TWSVC

Clustering-Accuracy (%)

Zoo (101× 16, 7 ) 85.70 88.20 92.16 95.00 95.23 95.49

Iris (150× 4, 3) 89.88 89.88 94.61 89.23 93.33 95.33

Wine (178 × 13, 3) 89.18 73.46 88.65 89.82 90.06 90.84

Seeds (210× 7, 3) 83.93 75.14 86.74 86.36 88.02 91.20

Segment (210× 19, 7) 71.43 77.29 82.65 84.28 86.86 88.56

Glass (214×9, 6) 54.21 68.08 69.02 69.29 71.07 73.25

Dermatology (366× 34, 6) 55.89 82.31 91.44 93.33 93.79 94.30

Ecoli (336×7, 8) 79.59 83.60 86.24 80.90 84.05 83.93

Compound (399×2, 6) 82.85 86.53 88.70 87.38 89.06 90.06

Libra (360× 90, 15) 64.82 88.06 90.14 85.26 89.12 92.76

Large Datasets

Pageblocks (5473× 10, 5) 90.50 62.35 81.01 86.03 91.78 92.56

Optical digits (5620× 64, 9) 42.15 48.45 80.17 78.74 81.76 82.44

Satimage (6435× 36, 7) 73.07 59.95 75.29 73.96 80.65 79.18

Pen digits (10992× 16, 9) 59.74 50.25 63.45 66.07 66.26 68.78

Average Clustering-Accuracy 73.07 73.83 83.61 83.26 85.79 87.05

Table 6.3: OoS Clustering accuracy for UCI datasets (Linear version)

Data OAA-T-TWSVC OAA-Tree-TWSVC BTree-TWSVC

Clustering-Accuracy (%)

Zoo 93.23 93.16 93.18

Iris 86.29 91.28 93.56

Wine 87.16 88.52 89.71

Seeds 81.84 87.65 88.90

Segment 81.51 83.75 85.27

Glass 65.26 65.72 72.82

Dermatology 90.56 91.84 91.14

Ecoli 78.19 82.65 79.61

Compound 85.88 86.34 86.41

Libra 82.45 88.45 91.02

Large Datasets

Pageblocks 83.25 87.33 88.73

Opt.digits 75.28 77.86 79.16

Satimage 71.87 78.52 77.29

Pendigits 63.84 64.52 65.29

Average Clustering-Accuracy 80.47 83.40 84.43


Results for non-linear case

Our clustering approach is extended using the non-linear LF-TWSVM classifier, and Table 6.4 compares the performance of Tree-TWSVC (both versions) with that of TWSVC, F-LS-TWSVM and FCM using the RBF kernel, Ker(x, x′) = exp(−σ‖x − x′‖²). The table shows the clustering accuracy of these algorithms on UCI datasets. The results illustrate that both versions of Tree-TWSVC achieve better accuracy for most of the datasets. It is also observed that the clustering results of the non-linear versions are better than those of the linear ones. Table 6.5 shows the accuracy results with OoS clustering for the non-linear versions of OAA-T-TWSVC, OAA-Tree-TWSVC and BTree-TWSVC.

Table 6.4: Clustering accuracy for UCI datasets (Non-linear version)

Data   FCM   TWSVC   F-LS-TWSVM   OAA-T-TWSVC   OAA-Tree-TWSVC   BTree-TWSVC

Clustering-Accuracy (%)

Zoo 83.17 89.18 95.14 96.15 97.31 97.13

Iris 91.33 92.67 96.66 92.55 95.68 97.83

Wine 91.57 95.59 94.66 95.43 96.27 96.13

Seeds 86.52 84.76 88.37 86.62 88.11 92.85

Segment 70.95 80.32 84.61 85.49 86.35 88.87

Glass 55.61 69.04 70.96 69.56 71.87 73.56

Dermatology 87.45 86.71 93.22 93.81 94.26 95.30

Ecoli 77.37 85.45 90.17 84.83 87.40 87.40

Compound 81.45 96.19 95.38 93.52 94.47 96.43

Libra 77.89 90.08 92.01 91.10 92.857 93.64

Large Datasets

Pageblocks 92.56 64.01 82.38 91.69 93.65 94.51

Opt.digits 55.29 45.28 82.14 86.72 88.59 91.69

Satimage 78.82 77.29 81.02 80.94 87.42 88.69

Pendigits 63.85 53.94 62.27 66.29 73.51 73.56

Average Clustering-Accuracy 78.13 79.32 86.35 86.76 89.12 90.54

Learning time:

We have compared the learning time (i.e. time for building the clustering model)

of OAA-Tree-TWSVC with TWSVC for UCI datasets in Fig.6.2. In this figure,

Derm, OD, SI, PB and PD represent Dermatology, Optical digits, Satimage, Page-

blocks and Pen digits datasets respectively. Although both of these clustering


Table 6.5: OoS Clustering accuracy for UCI datasets (Non-linear version)

Data OAA-T-TWSVC OAA-Tree-TWSVC BTree-TWSVC

Clustering-Accuracy (%)

Zoo 93.72 94.67 95.80

Iris 89.10 92.85 95.09

Wine 93.72 95.03 94.51

Seeds 84.50 87.91 89.03

Segment 82.11 84.82 86.95

Glass 65.42 67.55 72.88

Dermatology 91.12 92.17 92.54

Ecoli 82.49 84.52 84.66

Compound 88.50 90.28 91.56

Libra 90.46 91.14 92.19

Large Datasets

Pageblocks 87.51 88.75 90.03

Opt.digits 83.46 86.15 87.79

Satimage 76.22 84.91 86.34

Pendigits 64.27 68.59 71.94

Average Clustering-Accuracy 83.76 86.38 87.95

methods are based on the OAA multi-category strategy, OAA-Tree-TWSVC takes much less time to build the tree-based model than TWSVC. The efficiency of OAA-Tree-TWSVC is significant for datasets with a large number of classes, i.e. Libra, Compound, Satimage and Pen digits, where OAA-Tree-TWSVC is much faster than TWSVC. For the Pen digits dataset, OAA-Tree-TWSVC is almost 16 times faster than TWSVC. The learning times of the non-linear versions of OAA-Tree-TWSVC and TWSVC are compared in Fig. 6.3. It is observed that OAA-Tree-TWSVC is very efficient in dealing with large datasets, whereas the learning time of TWSVC is highly affected by the size of and number of classes in the dataset.

6.4.2 Clustering Results: Large Sized Datasets

In order to demonstrate the scalability and effectiveness of Tree-TWSVC, we performed experiments on the large UCI datasets, i.e. Optical digits, Satimage, Pen digits and Pageblocks. It is observed that the performance of TWSVC deteriorates as the size of the data increases, whereas Tree-TWSVC can efficiently handle large datasets. Similarly, FCM fails to give good accuracy for Pen digits and Optical digits. From Table 6.2, there is a significant difference between the clustering accuracy achieved by Tree-TWSVC (both versions) and that of FCM and TWSVC for the above-mentioned large datasets. Tree-TWSVC scales well on these datasets and is not much affected


Figure 6.2: Learning time (Linear)

Figure 6.3: Learning time (Non-linear)

by the number of classes.

6.5 Application: Image Segmentation

To evaluate the performance of Tree-TWSVC on large datasets, we present its ap-

plication on image segmentation which is a clustering problem. The image is par-

titioned into non-overlapping regions that share certain homogeneous features. For

the experiments, we have taken color images from Berkeley image segmentation

dataset (BSD) [57]. We use a dynamic method to determine the number of regions

for each image. The histogram for image is generated and the prominent peaks are

identified. The number of prominent peaks determine the number of regions (L)

6.6. Conclusions 161

in the image. The color image is then partitioned using minimum variance color

quantization with L levels. For the experiments, we have taken a combination of

color and texture features. The image features used for this work are Gabor texture

features [56] and RGB color value of the pixel. Gabor features are extracted with

4-orientation (0, 45, 90, 135) and 3-scale (0.5, 1.0, 2.0) sub-bands and the maximum

of the 12 coefficients determine the orientation at a given pixel location.
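
For illustration, the per-pixel feature vector described above can be assembled as in the following NumPy sketch. It assumes the 12 Gabor coefficient maps (4 orientations × 3 scales) have already been computed with the filter bank of [56] and takes the strongest response as the texture component, one possible reading of "the maximum of the 12 coefficients"; the exact feature encoding used in the thesis may differ.

    import numpy as np

    def pixel_features(rgb, gabor_stack):
        # rgb:         H x W x 3 colour image
        # gabor_stack: 12 x H x W precomputed Gabor responses (4 orientations x 3 scales)
        # Returns one feature row per pixel: the RGB values plus the strongest of
        # the 12 Gabor coefficients at that pixel.
        h, w, _ = rgb.shape
        texture = gabor_stack.max(axis=0)                      # dominant response per pixel
        return np.concatenate([rgb.reshape(h * w, 3),
                               texture.reshape(h * w, 1)], axis=1)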

The segmentation model is built using the OoS approach with BTree-TWSVC: 1% of the pixels are randomly selected from the image for learning, and the rest of the image pixels are used for testing the model. The images are segmented using BTree-TWSVC and the results are compared with the linear TWSVC and Multi-class Semi-supervised Kernel Spectral Clustering (MSS-KSC) [95] segmentation methods, as shown in Fig. 6.4. MSS-KSC uses a few labeled pixels to build the clustering model with the Kernel Spectral Clustering approach. It is observed that the segmentation results of BTree-TWSVC are visually more accurate than those of the other algorithms. For TWSVC, the image is over-segmented, which results in the formation of multiple smaller regions within one large region. For BSD images, the ground-truth segmentations are known, and the images segmented by BTree-TWSVC and TWSVC are compared with the ground truth. To statistically evaluate the segmentation algorithms, two evaluation criteria are used: F-measure (FM) and error rate (ER). These measures are calculated with respect to the ground-truth boundaries, and the results are presented in Fig. 6.4 and Table 6.6. BTree-TWSVC achieves better F-measure and error-rate values than TWSVC and MSS-KSC.

6.6 Conclusions

In this chapter, we present Tree-based Localized Fuzzy Twin Support Vector Clustering (Tree-TWSVC), an iterative algorithm that extends the novel classifier, Localized Fuzzy Twin Support Vector Machine (LF-TWSVM), to the unsupervised framework. Since a pattern cannot always be associated with a unique cluster, our algorithm determines fuzzy membership values for the training patterns. LF-TWSVM is a binary classifier that generates the non-parallel hyperplanes by solving systems of linear equations. Tree-TWSVC develops a tree-based clustering model which consists


Figure 6.4: Segmentation results on BSD images (a.) Original image (b.) MSS-KSC (c.) TWSVC (d.) BTree-TWSVC

of several LF-TWSVM classifiers. In this chapter, we present two implementations of Tree-TWSVC, namely Binary Tree-TWSVC and One-Against-All Tree-TWSVC. Our clustering algorithm outperforms other TWSVM-based clustering methods like TWSVC and F-LS-TWSVC, which are based on the classical One-Against-All multi-category approach and use a Taylor series to approximate the constraints of the optimization problem. Experimental results show that Tree-TWSVC has superior clustering accuracy and efficient learning time on UCI datasets as compared to FCM, TWSVC and F-LS-TWSVC. Our clustering algorithm is also extended to image segmentation problems.


Table 6.6: Segmentation result for BSD color images

Image    L    F-measure (MSS-KSC / TWSVC / BTree-TWSVC)    Error rate (MSS-KSC / TWSVC / BTree-TWSVC)

385039   5    0.49 / 0.52 / 0.69    0.0726 / 0.0818 / 0.0499
8049     4    0.71 / 0.63 / 0.78    0.0784 / 0.0800 / 0.0676
100007   3    0.57 / 0.57 / 0.66    0.0774 / 0.0798 / 0.0463
295087   5    0.62 / 0.69 / 0.76    0.0910 / 0.0844 / 0.0527
372019   4    0.49 / 0.47 / 0.54    0.0624 / 0.0755 / 0.0420
388067   5    0.62 / 0.65 / 0.76    0.0980 / 0.1185 / 0.0887
55067    3    0.58 / 0.55 / 0.61    0.0214 / 0.0234 / 0.0201
113044   3    0.71 / 0.63 / 0.73    0.0348 / 0.0400 / 0.0312
118035   3    0.72 / 0.69 / 0.74    0.0513 / 0.0473 / 0.0431
124084   3    0.54 / 0.48 / 0.69    0.0637 / 0.0818 / 0.0577
161062   4    0.62 / 0.58 / 0.74    0.0343 / 0.0639 / 0.0168
198023   4    0.57 / 0.58 / 0.78    0.0235 / 0.0363 / 0.0228
388016   3    0.46 / 0.41 / 0.62    0.0806 / 0.1369 / 0.0490
51084    4    0.66 / 0.64 / 0.68    0.0695 / 0.0743 / 0.0613
196027   4    0.63 / 0.45 / 0.67    0.0294 / 0.0359 / 0.0159


Chapter 7

Concluding Remarks

In this thesis, we have developed novel supervised and unsupervised learning algorithms to perform tasks such as classification and clustering. To show the practical application of our work, we have extended these machine learning techniques to perform image processing tasks. Our work is motivated by the nonparallel-hyperplane classifier Twin Support Vector Machine (TWSVM). The objective is to develop time-efficient algorithms whose classification accuracy is better than or comparable to that of TWSVM-based algorithms.

With the emergence of new database technologies, an enormous amount of data can be collected at very low cost. The decision of whether a piece of data is relevant or not is deferred, and everything is stored. This leads to a massive corpus for which machine learning algorithms are required. The aim of our work is to develop algorithms which can process huge amounts of data in less time than existing algorithms. In this chapter, we conclude our work with a discussion of its advantages, pitfalls to be avoided and future directions for extending this work.

7.1 Advantages of our Work

This thesis introduced time-efficient TWSVM-based classifiers which can effectively handle large datasets. After an analysis of TWSVM, it is observed that there is scope for improvement in terms of learning time and generalization ability. Since TWSVM solves a pair of quadratic programming problems (QPPs), which also involves finding the inverse of a matrix, its learning time can be improved if the optimization problems are formulated in a different manner. Working on this idea, we presented Improvements on ν-Twin Support Vector Machine (Iν-TWSVM), which avoids solving QPPs and instead solves unconstrained minimization problems (UMPs). It requires less time to solve a UMP, as it leads to solving a system of linear equations, whereas solving a QPP is more expensive. Iν-TWSVM considers ρ-distance separability between

patterns of one class and hyperplane of other class. Angle-based Twin Support Vector

Machine (ATWSVM) is another classifier which is developed on similar lines and

avoids solving QPPs. It is based on the concept of maximizing the angle between

two nonparallel hyperplanes.

In another binary classifier i.e. Angle-based Twin Parametric-margin Support

Vector Machine (ATP-SVM), only one optimization problem is formulated which

simultaneously determines both the nonparallel hyperplanes. The problem is for-

mulated so that it avoids computing the inverse of matrices in the dual problem. The

learning time of this classifier is further improved by considering only representative

patterns while learning. These patterns are selected so that they can represent the

entire training set and therefore achieve good classification results. Since ATP-SVM

generates parametric-margin hyperplanes, it can efficiently handle heteroscedastic

noise.

Our work also includes a ternary tree based multi-category classification algo-

rithm, termed as Reduced Tree for Ternary Support Vector Machine (RT-TerSVM).

This algorithm uses a novel 3-class classifier, Ternary Support Vector Machine. This

work also includes development of algorithms which could extend existing binary

classifiers so as to handle multi-category data. These approaches are Ternary Deci-

sion Structure (TDS) and Binary Tree (BT) for extension of binary classifiers like

TWSVM, Generalized Eigenvalue Proximal SVM (GEPSVM) etc., to multi-category

scenario. All these algorithms are more efficient than classical multi-category algo-

rithms like One-Against-All, in terms of learning time and classification accuracy.

The success of plane-based classifiers motivated us to develop clustering algo-

rithms that use these classifiers in an iterative manner. Hence, we developed a

clustering algorithm based on TWSVM, termed as Tree-based Localized Fuzzy Twin

Support Vector Clustering (Tree-TWSVC). Tree-TWSVC builds the cluster model as

a Binary Tree of novel TWSVM-based classifier, termed as Localized Fuzzy TWSVM

(LF-TWSVM). Tree-TWSVC has efficient learning time, achieved due to tree struc-


ture and the formulation that leads to solving a series of systems of linear equations.

All the above mentioned algorithms have been successfully applied to perform

image processing tasks like image classification, content-based image retrieval and

segmentation.

7.2 Utility and Comparative Analysis of Algorithms

1. Iν-TWSVM improves the time complexity of TWSVM-based classifiers by replacing QPPs with UMPs and solving systems of linear equations. Hence,

it can be used effectively to handle large datasets. The formulation of Iν-

TWSVM is attractive for handling unbalanced datasets. Another contribution

of this work is that Iν-TWSVM uses class representatives instead of considering

all the data patterns.

2. The concept of class representative is further improved in ATP-SVM, where a

subset of class patterns is created in a way that it represents the entire class.

These representative points capture the geometry of the class. Therefore, ATP-SVM is a more robust classifier than Iν-TWSVM. Also, ATP-SVM generates both the hyperplanes simultaneously by solving one optimization problem and is an effective classifier for handling noisy data. ATP-SVM therefore does not require parallel processing to generate the hyperplanes, which would otherwise be required if both hyperplanes are to be generated at the same time. Since the dual problem of ATP-SVM does not require a matrix inverse, it solves an efficient optimization problem.

3. ATWSVM presents a generic approach to transform any TWSVM-based clas-

sifier so that its time complexity can be improved without much affecting the

accuracy. It can efficiently handle large datasets. Its time-complexity is better

than that of ATP-SVM, whereas ATP-SVM is more robust than ATWSVM.

4. TerSVM is a ternary classifier and, if required, can be used as a binary classifier

also. TerSVM formulates all three problems as UMPs and can therefore handle

large-sized datasets. RT-TerSVM extends TerSVM to manage large number of

classes. TerSVM does not require dual formulation as required by ATP-SVM


and solves systems of linear equations to get the solution.

5. TDS is a generic multi-category approach which can be used with any TWSVM-

based classifier. Its time-complexity is better than that of RT-TerSVM.

6. Tree-TWSVC is an unsupervised algorithm developed for this work. It is used

to cluster data when labels are not available. Since patterns can not be crisply

associated with one cluster, Tree-TWSVC uses fuzzy memberships to initially

associate patterns with clusters.

7.3 Pitfalls to be Avoided

As discussed in the previous section, our work involves development of new classifi-

cation and clustering algorithms which are on the lines of TWSVM and its variants.

The success of machine learning projects depends on the quality of underlying classi-

fication and clustering algorithms. It is very important that these algorithms should

be carefully designed and implemented; otherwise the comparative study of these

algorithms can result in statistically wrong conclusions. The following section dis-

cusses the common pitfalls and the ways to avoid them.

Wrong selection of datasets

To authenticate the efficacy of our algorithm, we use empirical validation. The

machine learning community has developed and maintained some benchmark data

repositories like UC Irvine (UCI) Machine Learning Repository [52], which currently

has 378 datasets. These datasets are available to perform numerical experiments to

validate different algorithms. The datasets should be carefully chosen for performing

experiments and should not be restricted to only four or five. There should be a right

mix of datasets, varying from small-sized (i.e. number of instances) to large-sized;

those with few attributes to large number of attributes. The synthetic datasets

should be cautiously picked or created, which can highlight the contribution of our

work.

There are a few datasets in the UCI repository with missing values, like Dermatology, Hepatitis etc. To use such datasets, some value must be assigned for the missing data. Do not simply set '0' for numerical attributes. A better option is to fill the missing values of numerical attributes with the mean of the corresponding feature's remaining values, while missing logical values can be replaced with the value that appears the maximum number of times for that attribute, as sketched below.
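
A minimal sketch of this imputation rule, assuming missing entries are encoded as NaN and logical attributes are numerically encoded, is:

    import numpy as np

    def impute(column, is_numeric):
        # Missing entries are assumed to be encoded as NaN.
        col = np.asarray(column, dtype=float).copy()
        missing = np.isnan(col)
        if is_numeric:
            col[missing] = col[~missing].mean()                  # feature mean
        else:
            values, counts = np.unique(col[~missing], return_counts=True)
            col[missing] = values[counts.argmax()]               # most frequent value
        return col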

Incomplete comparative study

The proposed algorithms are compared with existing methods to prove the efficacy

of our work. At first glance, these studies may appear to be quite easy to do, but in reality they require considerable skill and a thorough knowledge of the underlying mathematics to be successful at both improving known algorithms and designing the

experiments. The empirical validation is necessary but not sufficient to establish the

efficacy of our algorithm. Some statistical tests should also be performed to compare

multiple algorithms. In our work, we have compared algorithms using Friedman test

[54] and Holm-Bonferroni test [55].

Random tuning of parameters

Researchers tune the parameters repeatedly, so that the algorithms can perform

optimally on the chosen datasets. While experimenting with the algorithms, a great

deal of time is spent in determining the optimal parameter values. During parameter

tuning, every change should be considered as a separate experiment. Instead of

randomly selecting the parameters, grid search method [50] should be used.

Conventional validation, which partitions the dataset into training and test sets in a 70:30 ratio, is not appropriate for conducting experiments. For a model with one or more user-defined parameters and a fixed training set, the tuning process optimizes the model parameters so that the model fits the training data as accurately as possible. If we then take an unseen sample of test data, it is observed that the model does not fit the test data as well as it fits the training data. This is called over-fitting, and it generally happens when the size of the training dataset is small or when the number of parameters in the model is large. To generate unbiased results, the training and test sets should be created using cross validation [49]. It segments the dataset so that in each fold a different segment is held out as the test set while the remaining segments form the training set. When the parameters reach their optimal settings, accuracy can be measured on the test data. This gives the researcher an opportunity to test the algorithm with unseen data. The final accuracy should be the mean accuracy over all the folds.
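
The two recommendations above (grid search [50] and cross validation [49]) can be combined as in the following sketch; fit_predict() is a placeholder for any classifier discussed in this thesis, and the fold count and seed are illustrative choices.

    import numpy as np
    from itertools import product

    def cv_grid_search(X, y, fit_predict, param_grid, k=5, seed=0):
        # fit_predict(X_tr, y_tr, X_te, params) -> predicted labels for X_te.
        # Every parameter setting is treated as a separate experiment; the reported
        # score is the mean accuracy over the k folds.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), k)
        best_params, best_acc = None, -1.0
        for values in product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            accs = []
            for i in range(k):
                te = folds[i]
                tr = np.hstack([folds[j] for j in range(k) if j != i])
                accs.append(np.mean(fit_predict(X[tr], y[tr], X[te], params) == y[te]))
            if np.mean(accs) > best_acc:
                best_params, best_acc = params, float(np.mean(accs))
        return best_params, best_acc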

7.4 The Road-map Ahead

The following future trends of our research are identified, which can extend our work

in new directions.

• In our thesis, supervised and unsupervised machine learning algorithms have

been developed. There are various real-time applications, like video surveillance, medical imaging etc., where the data is available in large amounts but it is expensive to obtain labels. For such applications, semi-supervised classification algorithms can be developed. Our binary classifiers can be used in a semi-supervised framework, as Qi et al. proposed the Laplacian Twin Support Vector Machine for semi-supervised classification [96].

• Working in the spirit of Khemchandani et al. [97], future line of work could be

to discuss the kernel selection problem of our binary classifiers like Iν-TWSVM,

ATP-SVM etc. over the convex set of finitely many basic kernels, which can

be formulated as an iterative alternating optimization problem.

• Most of the SVM and TWSVM based classification algorithms assume that

the entire training data can fit into main memory. With the growth of businesses, the amount of data exceeds the memory available to the learning systems. Hence, there is a need to identify scalable algorithms

for classification. These algorithms could be implemented in the framework of

incremental learning.

• When the optimization problem does not involve matrix inverses, it can be solved by the efficient Sequential Minimal Optimization (SMO) [62] technique. Since the dual of ATP-SVM does not involve matrix inverses, it could

be implemented using SMO. Another work could be to explore angle-based

methodology like ATP-SVM and ATWSVM, for regression problems.


• In order to apply our algorithms to large-scale data mining processes, there

is a need to make them even more computationally efficient. The potential

direction in this regard is the parallelization of training phase especially for

multi-category classification algorithms.

• Future work could be to develop a sparse algorithm version for our classifiers

like Iν-TWSVM, ATP-SVM, TerSVM etc.

• We could also explore the option of combining two or more machine learning

algorithms to get a new, efficient algorithm.


References

[1] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3)

(1995) 273–297.

[2] C. J. Burges, A tutorial on support vector machines for pattern recognition,

Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.

[3] V. N. Vapnik, An overview of statistical learning theory, IEEE Transactions

on Neural Networks, 10 (5) (1999) 988–999.

[4] V. Vapnik, The nature of statistical learning theory, Springer Science & Busi-

ness Media, 2000.

[5] O. L. Mangasarian, D. R. Musicant, Lagrangian support vector machines, The

Journal of Machine Learning Research 1 (2001) 161–177.

[6] Y.-J. Lee, O. L. Mangasarian, Ssvm: A smooth support vector machine for

classification, Computational Optimization and Applications 20 (1) (2001) 5–

22.

[7] J. A. Suykens, J. Vandewalle, Least squares support vector machine classifiers,

Neural Processing Letters 9 (3) (1999) 293–300.

[8] O. L. Mangasarian, E. W. Wild, Proximal support vector machine classifiers,

in: Proceedings KDD-2001: Knowledge Discovery and Data Mining, Citeseer,

2001.

[9] O. L. Mangasarian, E. W. Wild, Multisurface proximal support vector ma-

chine classification via generalized eigenvalues, IEEE Transactions on Pattern

Analysis and Machine Intelligence, 28 (1) (2006) 69–74.


[10] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for

pattern classification, IEEE Transactions on Pattern Analysis and Machine

Intelligence 29 (5) (2007) 905–910.

[11] R. Khemchandani, Mathematical programming applications in machine learn-

ing, Ph.D. thesis, Indian Institute of Technology Delhi New Delhi-110016,

India (2008).

[12] Jayadeva, R. Khemchandani, S. Chandra, Twin Support Vector Machines:

Models, Extensions and Applications, Vol. 659, Springer, 2016.

[13] O. L. Mangasarian, Nonlinear programming, Vol. 10, SIAM, 1993.

[14] M. A. Kumar, M. Gopal, Least squares twin support vector machines for

pattern classification, Expert Systems with Applications 36 (4) (2009) 7535–

7543.

[15] Y.-H. Shao, C.-H. Zhang, X.-B. Wang, N.-Y. Deng, Improvements on twin sup-

port vector machines, IEEE Transactions on Neural Networks, 22 (6) (2011)

962–968.

[16] X. Peng, Tpmsvm: a novel twin parametric-margin support vector machine

for pattern recognition, Pattern Recognition 44 (10) (2011) 2678–2692.

[17] P.-Y. Hao, New support vector algorithms with parametric insensitive/margin

model, Neural Networks 23 (1) (2010) 60–73.

[18] X. Peng, A ν-twin support vector machine (ν-tsvm) classifier and its geometric

algorithms, Information Sciences 180 (20) (2010) 3863–3875.

[19] Y.-J. Tian, X.-C. Ju, Nonparallel support vector machine based on one opti-

mization problem for pattern recognition, Journal of the Operations Research

Society of China 3 (4) (2015) 499–519.

[20] T. Hastie, R. Tibshirani, J. Friedman, Unsupervised learning, Springer, 2009.

[21] A. K. Jain, R. C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc.,

1988.


[22] A. Y. Ng, M. I. Jordan, Y. Weiss, et al., On spectral clustering: Analysis and

an algorithm, Advances in Neural Information Processing Systems 2 (2002)

849–856.

[23] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions

on Pattern Analysis and Machine Intelligence, 22 (8) (2000) 888–905.

[24] W. Wu, H. Xiong, S. Shekhar, Clustering and information retrieval, Vol. 11,

Springer Science & Business Media, 2013.

[25] A. K. Jain, Data clustering: 50 years beyond k-means, Pattern Recognition

Letters 31 (8) (2010) 651–666.

[26] X. Wang, Y. Wang, L. Wang, Improving fuzzy c-means clustering based on

feature-weight learning, Pattern Recognition Letters 25 (10) (2004) 1123–1132.

[27] L. Xu, J. Neufeld, B. Larson, D. Schuurmans, Maximum margin clustering,

Advances in Neural Information Processing Systems 17 (2004) 1537–1544.

[28] Z. Wang, Y.-H. Shao, L. Bai, N.-Y. Deng, Twin support vector machine for

clustering, Neural Networks and Learning Systems, IEEE Transactions on

26 (10) (2015) 2583–2588.

[29] K. Zhang, I. W. Tsang, J. T. Kwok, Maximum margin clustering made prac-

tical, IEEE Transactions on Neural Networks, 20 (4) (2009) 583–596.

[30] A. L. Yuille, A. Rangarajan, The concave-convex procedure, Neural Compu-

tation 15 (4) (2003) 915–936.

[31] A. J. Smola, B. Scholkopf, Learning with kernels, Citeseer, 1998.

[32] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector

machines, IEEE Transactions on Neural Networks, 13 (2) (2002) 415–425.

[33] J. C. Platt, N. Cristianini, J. Shawe-Taylor, Large margin DAGs for multiclass classification, in: Advances in Neural Information Processing Systems (NIPS), Vol. 12, 1999, pp. 547–553.

[34] Jayadeva, R. Khemchandani, S. Chandra, Fuzzy linear proximal support vec-

tor machines for multi-category data classification, Neurocomputing 67 (2005)

426–435.


[35] H. Lei, V. Govindaraju, Half-against-half multi-class support vector machines,

in: Multiple Classifier Systems, Springer, 2005, pp. 156–164.

[36] Y.-H. Shao, W.-J. Chen, W.-B. Huang, Z.-M. Yang, N.-Y. Deng, The best sep-

arating decision tree twin support vector machine for multi-class classification,

Procedia Computer Science 17 (2013) 1032–1038.

[37] J. Xie, K. Hone, W. Xie, X. Gao, Y. Shi, X. Liu, Extending twin support

vector machine classifier for multi-category classification problems, Intelligent

Data Analysis 17 (4) (2013) 649–664.

[38] C. Angulo, X. Parra, A. Catala, K-svcr. a support vector machine for multi-

class classification, Neurocomputing 55 (1) (2003) 57–77.

[39] Y. Xu, R. Guo, L. Wang, A twin multi-class classification support vector

machine, Cognitive computation 5 (4) (2013) 580–588.

[40] B. Scholkopf, P. L. Bartlett, A. J. Smola, R. Williamson, Shrinking the tube:

a new support vector regression algorithm, Advances in Neural Information

Processing Systems (1999) 330–336.

[41] B. Scholkopf, A. J. Smola, R. C. Williamson, P. L. Bartlett, New support

vector algorithms, Neural Computation 12 (5) (2000) 1207–1245.

[42] M. R. Guarracino, C. Cifarelli, O. Seref, P. M. Pardalos, A classification

method based on generalized eigenvalue problems, Optimisation Methods and

Software 22 (1) (2007) 73–81.

[43] Y.-H. Shao, N.-Y. Deng, W.-J. Chen, Z. Wang, Improved generalized eigen-

value proximal support vector machine, Signal Processing Letters, IEEE 20 (3)

(2013) 213–216.

[44] S. R. Gunn, et al., Support vector machines for classification and regression,

ISIS technical report 14.

[45] J. Nocedal, S. Wright, Numerical optimization, Springer Science & Business

Media, 2006.


[46] S. Chandra, Jayadeva, A. Mehra, Numerical optimization with Applications,

Alpha Science International, 2009.

[47] J. A. Hartigan, M. A. Wong, Algorithm as 136: A k-means clustering algo-

rithm, Applied Statistics (1979) 100–108.

[48] V. J. Hodge, J. Austin, A survey of outlier detection methodologies, Artificial

Intelligence Review 22 (2) (2004) 85–126.

[49] R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification, John Wiley &

Sons, 2012.

[50] C. J. Lin, C.-W. Hsu, C.-C. Chang, A practical guide to support vector classification, National Taiwan University, 2003. URL www.csie.ntu.edu.tw/cjlin/papers/guide/guide.pdf

[51] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric

framework for learning from labeled and unlabeled examples, The Journal of

Machine Learning Research 7 (2006) 2399–2434.

[52] C. Blake, C. J. Merz, UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html

[53] D. Musicant, Ndc: normally distributed clustered datasets, Computer Sciences

Department, University of Wisconsin, Madison 1998.

URL <http://www.cs.wisc.edu/~musicant/data/ndc>

[54] J. Demsar, Statistical comparisons of classifiers over multiple data sets, The

Journal of Machine Learning Research 7 (2006) 1–30.

[55] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian

journal of statistics (1979) 65–70.

[56] J. F. Khan, R. R. Adhami, S. M. Bhuiyan, A customized gabor filter for

unsupervised color image segmentation, Image and Vision Computing 27 (4)

(2009) 489–501.


[57] P. Arbelaez, C. Fowlkes, D. Martin, The Berkeley segmentation dataset and benchmark, see http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds.

[58] Y.-H. Shao, W.-J. Chen, N.-Y. Deng, Nonparallel hyperplane support vector

machine for binary classification problems, Information Sciences 263 (2014)

22–35.

[59] R. Khemchandani, P. Saigal, Color image classification and retrieval through

ternary decision structure based multi-category twsvm, Neurocomputing 165

(2015) 444–455.

[60] R. Khemchandani, P. Saigal, S. Chandra, Improvements on ν-twin support

vector machine, Neural Networks 79 (2016) 97–107.

[61] Y. Tian, X. Ju, Z. Qi, Y. Shi, Improved twin support vector machine, Science

China Mathematics 57 (2) (2014) 417–432.

[62] X. Peng, Tsvr: an efficient twin support vector machine for regression, Neural

Networks 23 (3) (2010) 365–372.

[63] B. D. Ripley, Pattern recognition and neural networks, Cambridge university

press, 2007.

[64] Y.-J. Lee, O. L. Mangasarian, Rsvm: Reduced support vector machines., in:

SDM, Vol. 1, 2001, pp. 325–361.

[65] B. S. Manjunath, W.-Y. Ma, Texture features for browsing and retrieval of

image data, Pattern Analysis and Machine Intelligence, IEEE Transactions on

18 (8) (1996) 837–842.

[66] W. Tao, H. Jin, Y. Zhang, Color image segmentation based on mean shift and

normalized cuts, IEEE Transactions on Systems, Man, and Cybernetics, Part

B (Cybernetics) 37 (5) (2007) 1382–1389.

[67] P. Simard, Y. LeCun, J. S. Denker, Efficient pattern recognition using a new

transformation distance, Advances in neural information processing systems

(1993) 50–50.


[68] P. Y. Simard, Y. A. LeCun, J. S. Denker, B. Victorri, Transformation in-

variance in pattern recognitiontangent distance and tangent propagation, in:

Neural networks: tricks of the trade, Springer, 1998, pp. 239–274.

[69] D. Keysers, T. Deselaers, C. Gollan, H. Ney, Deformation models for image

recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence

29 (8).

[70] Y. Liu, D. Zhang, G. Lu, W.-Y. Ma, A survey of content-based image retrieval

with high-level semantics, Pattern recognition 40 (1) (2007) 262–282.

[71] A. N. Tikhonov, V. I. Arsenin, Solutions of ill-posed problems, Vh Winston,

1977.

[72] B. N. Parlett, The symmetric eigenvalue problem, Vol. 7, SIAM, 1980.

[73] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, A. Yamada, Color and texture

descriptors, IEEE Transactions on circuits and systems for video technology

11 (6) (2001) 703–715.

[74] M. Subrahmanyam, Q. J. Wu, R. Maheshwari, R. Balasubramanian, Modified

color motif co-occurrence matrix for image indexing and retrieval, Computers

& Electrical Engineering 39 (3) (2013) 762–774.

[75] M. E. ElAlami, A novel image retrieval model based on the most relevant

features, Knowledge-Based Systems 24 (1) (2011) 23–32.

[76] S. Murala, Q. J. Wu, Expert content-based image retrieval system using robust

local patterns, Journal of Visual Communication and Image Representation

25 (6) (2014) 1324–1334.

[77] E. Walia, A. Pal, D. Pandian, D. Lohani, Variant of completed robust lbp

for two-level probabilistic content based image retrieval, Proceedings of Com-

puter and Advanced Technology in Education http://dx. doi. org/10.2316/P

3 (2014) 2014.


[78] A. H. Reddy, N. S. Chandra, Local oppugnant color space extrema patterns for

content based natural and texture image retrieval, AEU-International Journal

of Electronics and Communications 69 (1) (2015) 290–298.

[79] S. Murala, R. Maheshwari, R. Balasubramanian, Directional local extrema

patterns: a new descriptor for content based image retrieval, International

journal of multimedia information retrieval 1 (3) (2012) 191–203.

[80] K. P. Jasmine, P. R. Kumar, Color and local maximum edge patterns his-

togram for content based image retrieval, International Journal of Intelligent

Systems and Applications 6 (11) (2014) 66.

[81] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation

invariant texture classification with local binary patterns, IEEE Transactions

on pattern analysis and machine intelligence 24 (7) (2002) 971–987.

[82] P. S. Bradley, O. L. Mangasarian, k-plane clustering, Journal of Global Opti-

mization 16 (1) (2000) 23–32.

[83] Y.-H. Shao, L. Bai, Z. Wang, X.-Y. Hua, N.-Y. Deng, Proximal plane clustering

via eigenvalues, Procedia Computer Science 17 (2013) 41–47.

[84] H. Valizadegan, R. Jin, Generalized maximum margin clustering and unsuper-

vised kernel learning, in: Advances in Neural Information Processing Systems,

2006, pp. 1417–1424.

[85] S. Boyd, L. Vandenberghe, Convex optimization, Cambridge university press,

2004.

[86] M. S. Lobo, L. Vandenberghe, S. Boyd, H. Lebret, Applications of second-

order cone programming, Linear Algebra and its Applications 284 (1) (1998)

193–228.

[87] R. Khemchandani, A. Pal, S. Chandra, Fuzzy least squares twin support vector

clustering, Neural Computing and Applications (2016) 1–11.

[88] Z.-M. Yang, Y.-R. Guo, C.-N. Li, Y.-H. Shao, Local k-proximal plane cluster-

ing, Neural Computing and Applications 26 (1) (2015) 199–211.


[89] D. T. Larose, k-nearest neighbor algorithm, Discovering Knowledge in Data:

An Introduction to Data Mining (2005) 90–106.

[90] D. P. Huttenlocher, G. A. Klanderman, W. J. Rucklidge, Comparing images

using the hausdorff distance, IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 15 (9) (1993) 850–863.

[91] T. H. Cormen, Introduction to algorithms, MIT press, 2009.

[92] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005.

[93] C. Alzate, J. A. Suykens, Multiway spectral clustering with out-of-sample ex-

tensions through weighted kernel pca, IEEE Transactions on Pattern Analysis

and Machine Intelligence, 32 (2) (2010) 335–347.

[94] C. Alzate, J. A. K. Suykens, Out-of-sample eigenvectors in kernel spectral

clustering, in: The 2011 International Joint Conference on Neural Networks

(IJCNN),, IEEE, 2011, pp. 2349–2356.

[95] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, J. Suykens, et al., Multiclass

semisupervised learning based upon kernel spectral clustering, IEEE Transac-

tions on Neural Networks and Learning Systems, 26 (4) (2015) 720–733.

[96] Z. Qi, Y. Tian, Y. Shi, Laplacian twin support vector machine for semi-

supervised classification, Neural Networks 35 (2012) 46–53.

[97] R. Khemchandani, Jayadeva, S. Chandra, Optimal kernel selection in twin

support vector machines, Optimization Letters 3 (1) (2009) 77–88.

[98] R. J. Simes, An improved bonferroni procedure for multiple tests of signifi-

cance, Biometrika 73 (3) (1986) 751–754.

[99] Y.-H. Shao, N.-Y. Deng, Z.-M. Yang, Least squares recursive projection twin

support vector machine for classification, Pattern Recognition 45 (6) (2012)

2299–2307.

[100] N. L. Johnson, S. Kotz, N. Balakrishnan, Lognormal distributions, Continuous

Univariate Distributions, 1, 1994.


[101] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern

operator for texture classification, IEEE Transactions on Image Processing

19 (6) (2010) 1657–1663.

[102] Y. Zhao, W. Jia, R.-X. Hu, H. Min, Completed robust local binary pattern

for texture classification, Neurocomputing 106 (2013) 68–76.

[103] A. Amanatiadis, V. Kaburlasos, A. Gasteratos, S. Papadakis, Evaluation of

shape descriptors for shape-based image retrieval, IET Image Processing 5 (5)

(2011) 493–499.

[104] J. Z. Wang, J. Li, G. Wiederhold, Simplicity: Semantics-sensitive integrated

matching for picture libraries, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 23 (9) (2001) 947–963.

[105] Corel 10000 image database.

URL wang.ist.psu.edu/docs/related

[106] Mit vistex database of textures, media laboratory,.

URL http://vismod.media.mit.edu/vismod/imagery/VisionTexture/

[107] Oliva and torralba scene dataset,.

URL http://cvcl.mit.edu/database.htm

Appendices


Appendix A

Evaluation Measures

1. Accuracy:

‘Accuracy’ of a classification algorithm is defined as follows

Accuracy = (TP + TN) / (TP + FP + TN + FN).    (A.1)

Here TP, TN, FP, and FN are the number of true positive, true negative, false

positive and false negative respectively.

2. Friedman test:

The Friedman test [54] is a non-parametric test for comparing three or more related samples and it makes no assumptions about the underlying distribution of the data. The data is set out in a table comprising n1 rows and n2 columns. It ranks the algorithms for each dataset separately, the best performing algorithm getting rank 1, the second best rank 2, and so on. In case of ties, average ranks are assigned. It then compares the average ranks of the algorithms, R_j = (1/N) Σ_i r_i^j, where r_i^j is the rank of the jth algorithm on the ith of the N datasets.
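
As an illustration, the average ranks (and the Friedman statistic via SciPy) can be computed as below; the small accuracy matrix is just a few entries taken from Table 6.2 (Zoo, Iris, Seeds with FCM, TWSVC, BTree-TWSVC) and is not a full comparison.

    import numpy as np
    from scipy.stats import rankdata, friedmanchisquare

    # Rows = datasets, columns = algorithms; higher accuracy is better, so the
    # negated accuracies are ranked (rank 1 = best, ties get average ranks).
    acc = np.array([[85.70, 88.20, 95.49],
                    [89.88, 89.88, 95.33],
                    [83.93, 75.14, 91.20]])
    ranks = np.vstack([rankdata(-row) for row in acc])
    print("average ranks R_j:", ranks.mean(axis=0))
    print(friedmanchisquare(*acc.T))          # Friedman statistic and p-value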

3. p-value and Holm-Bonferroni test:

The p-value [54] is calculated by performing a pairwise t-test. The null hypothesis assumes that the data of the two-sample t-test come from independent random samples with equal means and equal but unknown variances. The p-values are tested at a significance level α = 0.05.

To analyze the performance of multiple algorithms, the Holm-Bonferroni method [98] is used. It compares the algorithms and tests their hypotheses. Let there be N statistics T1, ..., TN with corresponding p-values P1, ..., PN for the test hypotheses H1, ..., HN. For a significance level α, the Holm-Bonferroni test orders the p-values from minimum to maximum as P(1), ..., P(N) with corresponding null hypotheses H(1), ..., H(N). It rejects the null hypotheses H(1), ..., H(k−1) and does not reject H(k), ..., H(N), where k is the smallest index such that

P(k) > α / (N + 1 − k).
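
A small sketch of this step-down rule, assuming the p-values are already computed, is:

    import numpy as np

    def holm_bonferroni(p_values, alpha=0.05):
        # Walk through the sorted p-values and stop at the first position k with
        # P_(k) > alpha / (N + 1 - k); everything before it is rejected.
        p = np.asarray(p_values)
        N = len(p)
        reject = np.zeros(N, dtype=bool)
        for rank, idx in enumerate(np.argsort(p), start=1):
            if p[idx] > alpha / (N + 1 - rank):
                break
            reject[idx] = True
        return reject     # True where the corresponding null hypothesis is rejected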

4. F-measure (FM) and error rate (ER):

To evaluate the segmentation algorithms statistically, the F-measure is used, which is the harmonic mean of precision and recall and is given by

FM = (2 × Precision × Recall) / (Precision + Recall),    (A.2)

and ER is given by

ER = (FP + FN) / (Total number of patterns),    (A.3)

where Precision and Recall are defined as

Precision = TP / (TP + FP),
Recall = TP / (TP + FN).

TP, FP, TN, FN are true-positive, false-positive, true-negative and false-

negative respectively. For color image segmentation problems, these measures

are calculated with respect to ground-truth boundaries of images.

5. Average Retrieval Rate:

The (P-R) ratio is measured using (A.4) and (A.5).

Precision = (Number of relevant images retrieved) / (Number of retrieved images),    (A.4)

Recall = (Number of relevant images retrieved) / (Total number of relevant images).    (A.5)


Average Retrieval Rate (ARR) [73] is a robust metric for comparison of image

retrieval methods. It can be computed using (A.6).

ARR =1

N

NQ∑q=1

RR(q) (A.6)

where NQ represents the number of queries that are used for the purpose of

verifying the descriptor in some dataset. RR(q) represents the retrieval rate of

a single query and RR(q) ≤ 1. In our experiments, each image in the database

is treated as the query image and average over the entire database gives the

performance of the retrieval method.

6. Clustering-Accuracy:

The metric Clustering-Accuracy [92] is used to measure the performance of clustering methods. To find the accuracy of a clustering algorithm, a similarity matrix $S \in R^{m \times m}$ is computed from the given data labels $y_i, y_j \in \{1, \ldots, K\}$, $i, j = 1, \ldots, m$, where

$S(i, j) = \begin{cases} 1, & \text{if } y_i = y_j \\ 0, & \text{otherwise.} \end{cases}$

Let $S_t$ and $S_p$ be the similarity matrices computed from the true cluster labels and the predicted labels respectively. The accuracy of the clustering method is defined as the Rand statistic [92] and is given as

$\text{Clustering-Accuracy} = \dfrac{n_{zeros} + n_{ones} - m}{m^2 - m} \times 100$,  (A.7)

where $n_{zeros}$ is the number of zeros at corresponding indices in both $S_t$ and $S_p$, and $n_{ones}$ is the number of ones at corresponding indices in both $S_t$ and $S_p$ (this computation is sketched below).
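For concreteness, the following short sketches illustrate the measures defined above. They are rough, self-contained Python illustrations, not the code used in the experiments, and all inputs shown are hypothetical. The first sketch computes the confusion-matrix measures (A.1)-(A.3):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy (A.1), precision, recall, F-measure (A.2) and error rate (A.3)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fm = 2 * precision * recall / (precision + recall)
    er = (fp + fn) / total
    return accuracy, precision, recall, fm, er

# Hypothetical counts for one segmented image versus its ground truth.
print(confusion_metrics(tp=820, fp=60, tn=900, fn=45))
```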
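The Friedman average ranks and the Holm-Bonferroni step-down rule can be sketched as follows; the accuracy matrix and p-values below are hypothetical:

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

# Rows = datasets, columns = algorithms; rank 1 goes to the best accuracy.
acc = np.array([[0.91, 0.89, 0.93],
                [0.84, 0.86, 0.85],
                [0.78, 0.75, 0.80]])
ranks = np.vstack([rankdata(-row) for row in acc])
print("Average ranks R_j:", ranks.mean(axis=0))
print("Friedman test:", friedmanchisquare(*acc.T))

def holm_bonferroni(p_values, alpha=0.05):
    """True where the null hypothesis is rejected by the step-down rule."""
    p = np.asarray(p_values, dtype=float)
    N = len(p)
    reject = np.zeros(N, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):   # P_(1) <= ... <= P_(N)
        if p[idx] > alpha / (N - rank):          # threshold alpha / (N + 1 - k)
            break                                # keep H_(k), ..., H_(N)
        reject[idx] = True
    return reject

print(holm_bonferroni([0.001, 0.04, 0.03, 0.20]))
```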
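The average retrieval rate (A.6) can be sketched from per-query hit sets; the helper function and the toy queries are illustrative only:

```python
import numpy as np

def average_retrieval_rate(retrieved, relevant):
    """ARR as in (A.6): mean retrieval rate RR(q) over the N_Q queries."""
    rr = [len(retrieved[q] & relevant[q]) / len(relevant[q])
          for q in range(len(relevant))]
    return float(np.mean(rr))

# Two hypothetical queries: RR = 2/4 and 3/4, so ARR = 0.625.
retrieved = [{1, 2, 3, 4}, {5, 6, 7, 8}]
relevant = [{1, 2, 9, 10}, {5, 6, 7, 11}]
print(average_retrieval_rate(retrieved, relevant))
```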
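Finally, a sketch of Clustering-Accuracy (A.7) built from the pairwise similarity matrices $S_t$ and $S_p$; note that predicted labels which agree with the true partition up to a renaming of the clusters score 100:

```python
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Rand-statistic accuracy of (A.7) from similarity matrices S_t and S_p."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    m = len(y_true)
    St = (y_true[:, None] == y_true[None, :])    # true-label similarity matrix
    Sp = (y_pred[:, None] == y_pred[None, :])    # predicted-label similarity matrix
    n_ones = np.sum(St & Sp)                     # ones at corresponding indices
    n_zeros = np.sum(~St & ~Sp)                  # zeros at corresponding indices
    # Subtract m to discount the diagonal; m^2 - m index pairs remain.
    return (n_zeros + n_ones - m) / (m ** 2 - m) * 100.0

print(clustering_accuracy([1, 1, 2, 2, 3], [2, 2, 3, 3, 1]))   # 100.0
```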


Appendix B

Loss Function of TWSVM

TWSVM uses the hinge loss function, which is given by

$L_h = \begin{cases} 0, & y_i f_i \geq 1 \\ 1 - y_i f_i, & \text{otherwise.} \end{cases}$

For any SVM- or TWSVM-based clustering method, the clustering error or the hyperplanes change little after the initial labeling or during subsequent iterations. This arises due to the hinge loss function, as shown in Fig. B.1a, where the classifier tries to push $y_i f_i$ beyond the point $y_i f_i = 1$ (towards the right) [29]. Here, the solid line shows the loss with the initial labels and the dotted line shows the loss after flipping of the labels. As observed from the empirical margin distribution of $y_i f_i$, most of the patterns have margins $y_i f_i \geq 1$. If the label of such a pattern is changed, the loss will be very large and the classifier is unwilling to flip the class labels. So, the procedure gets stuck in a local optimum and adheres to the initial label estimates. To prevent the premature convergence of the iterative procedure, the loss function is changed to the square loss, given as $L_s = (1 - y_i f_i)^2$. This loss function is symmetric around $y_i f_i = 1$, as shown in Fig. B.1b, and penalizes preliminary wrong predictions. Therefore, it permits flipping of labels if needed and leads to a significant improvement in the clustering performance.


Figure B.1: Flipping of labels. (a) Hinge loss function; (b) Square loss function.
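As a minimal numeric illustration of the two loss functions compared above (not the thesis code), consider a pattern whose margin changes sign when its label is flipped:

```python
import numpy as np

def hinge_loss(margin):
    """L_h = max(0, 1 - y_i f_i): zero once the margin exceeds 1."""
    return np.maximum(0.0, 1.0 - margin)

def square_loss(margin):
    """L_s = (1 - y_i f_i)^2: symmetric around the margin y_i f_i = 1."""
    return (1.0 - margin) ** 2

# A confidently classified pattern (margin 2) versus the same pattern after its
# label is flipped (margin -2).  Unlike the hinge loss, the square loss is
# symmetric around 1 and already penalizes the over-confident margin, which, as
# argued above, is what permits labels to be flipped during the iterations.
for m in (2.0, -2.0):
    print(m, hinge_loss(m), square_loss(m))
```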

Appendix C

UCI Datasets

The binary UCI datasets [52] used for numerical experiments in our work are listed below:

1. Australian Credit Approval Dataset (ACA) (690 × 14): This dataset has 690

instances with 14 attributes. The two classes contain 383 and 307 patterns,

related to credit card applications. It has a variety of attributes - continuous,

nominal and binary.

2. Breast Cancer Wisconsin Prognostic (WPBC) (698 × 34): For this dataset,

the problem is to predict accurately the presence or absence of a malignant

tumor, with 458 benign and 240 malignant patterns.

3. BUPA Liver Disorders Data Set (345 × 7): Each record represents the blood

test reports of a single male individual. The first five attributes are tests which

are sensitive to liver disorders arising from excessive alcohol consumption.

The patterns are distributed as 200 and 145 in two classes.

4. Cleveland Heart Disease Dataset (Heart-C) (303 × 13): The original database

has 76 attributes, but most of the published experiments use a subset of 14

attributes. The problem is to predict the presence of heart disease in a patient

(“goal” values 1,2,3,4) or absence (value 0). The dataset contains 164 positive

and 139 negative instances.

5. Congressional Voting Records Dataset (Votes) (435 × 16): This dataset consists of the 1984 United States Congressional Voting Records with two classes: Republican (267) and Democratic (168).


6. Connectionist Bench Sonar Dataset(208 × 60): The problem is to classify

sonar signals bounced off a metal cylinder and those bounced off a roughly

cylindrical rock. There are 111 patterns obtained by bouncing sonar signals off

a metal cylinder at various angles and 97 patterns obtained from rocks under

similar conditions. Each pattern has 60 numeric attributes.

7. Contraceptive Method Choice Dataset (CMC) (1473 × 9): This dataset is a

subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The

patterns were married women who were either not pregnant or did not know whether they were pregnant at the time of the interview. The problem is to predict the current

contraceptive method choice of a woman based on her demographic and socio-

economic characteristics. The two classes represent contraceptive users (844)

and non-users (629).

8. German Credit Dataset (1000 × 20): This numeric dataset contains credit

data and is produced by Strathclyde University. All the categorical attributes

have been coded as integers. The instances are labeled as “good” (700) and

“bad” (300).

9. Heart Statlog Dataset (Heart-S) (270 × 13): This dataset contains 13 real,

binary and nominal attributes, with 150 positive and 120 negative patterns.

10. Ionosphere Dataset (351 × 34): This dataset contains radar data, collected by

a system in Goose Bay, Labrador. It classifies free electrons in the ionosphere

as “Good” (225) when radar returns show evidence of some type of structure

in the ionosphere and “Bad” (126) otherwise.

11. Pima Indians Diabetes Dataset (768 × 8): This data is collected by the National

Institute of Diabetes and Digestive and Kidney Diseases. All the patients

are females at least 21 years old of Pima Indian heritage. 500 instances are

negative while 268 are positive ones.

12. Thyroid Disease Dataset (215 × 5): This data is collected by Stefan Aeberhard and has 150 negative and 65 positive patterns.

13. Two-norm Dataset (400 × 20): This is a dataset with two classes of 351 and


49 instances.

The UCI multi-category datasets [52] used for numerical experiments are:

1. Dermatology Dataset (366 × 34, 6 classes): This dataset is used to predict the

type of Erythemato-Squamous Disease in dermatology. It has 34 attributes, 33

of which are linear valued and one is nominal.

2. Ecoli Dataset (336× 7, 8 classes): This dataset contains records for protein

localization sites and is created by Kenta Nakai.

3. Glass Identification Dataset (Glass) (214× 9, 6 classes): This data is generated

by USA Forensic Science Service. It has instances for six types of glasses

defined in terms of their oxide content. The study of glass was motivated by criminological investigation, where the glass left behind at a crime scene can be used as evidence if it is correctly identified.

4. Iris Dataset (150 × 4, 3 classes): This is a well known dataset often used to

test the efficacy of pattern recognition methods, first used by Fisher in 1936.

The dataset has 3 classes of 50 instances each, where each class refers to a

type of iris plant. One class is linearly separable from the other 2; the latter

are NOT linearly separable from each other.

5. Libras (360 × 91, 15 classes): The dataset contains 15 classes of 24 instances

each. Each class refers to a hand movement type in LIBRAS (Portuguese 'Língua BRAsileira de Sinais', the official Brazilian sign language).

6. Multiple Features Dataset (MF) (2,000 × 649, 10 classes): This dataset con-

sists of features of handwritten numerals (‘0’-‘9’) extracted from a collection

of Dutch utility maps. 200 patterns per class (for a total of 2,000 patterns)

have been digitized in binary images.

7. Optical Recognition of Handwritten Digits Dataset (Optical digits or OD)

(5,620 × 64, 10 classes): This data describes bitmaps of handwritten digits.

43 people contributed to the data collection. The 32 × 32 bitmaps are divided into non-overlapping 4 × 4 blocks and the number of on pixels is counted in each block.


8. Page Blocks Classification Dataset (PB) (5,473 × 10, 5 classes): The problem

is to classify all the blocks of the page layout of a document that has been

detected by a segmentation process. It has 5473 patterns taken from 54 distinct

documents, where each instance concerns one block.

9. Satimage (6435 × 36, 7 classes): Multi-spectral values of pixels in 3 × 3 neigh-

borhoods in a satellite image, and the classification associated with the cen-

tral pixel in each neighborhood. The original Landsat data for this database

was generated from data purchased from NASA by the Australian Centre for

Remote Sensing, and used for research at The Centre for Remote Sensing,

University of New South Wales, Australia.

10. Seeds Dataset (210 × 7, 3 classes): This data contains records of geometrical

properties of kernels belonging to three different varieties of wheat. The pat-

tern instances are generated by soft X-ray technique and GRAINS package to

construct 7 real-valued attributes.

11. Statlog Image Segmentation Dataset (Segment) (210× 19, 7 classes): This

dataset is an image segmentation database where the instances are randomly

selected from a database of seven outdoor images. The images are hand-

segmented to create a classification for every pixel.

12. Wine Dataset(178× 13, 3 classes): This data is generated by chemical anal-

ysis of wines grown in the same region in Italy but raised by three different

cultivators. It has 13 attributes listing the quantity of constituents found in

each pattern of wine.

13. Zoo Dataset (101 × 16, 7 classes): This database of animal types in a zoo contains boolean-valued attributes.

Appendix D

Synthetic Datasets

Dataset 1: Cross planes

The cross-planes data [8] consists of data points lying near two intersecting lines. It can be considered as a perturbed generalization of the exclusive-OR (XOR) classification problem. Fig. D.1a illustrates the cross planes data with 200 training points. The red 'dots' and blue 'plus' marks represent the data points of the two classes.

Figure D.1: Synthetic datasets. (a) Cross planes data; (b) Syn1 data; (c) Ripley's data; (d) Complex XOR data.


Dataset 2: Syn1 data

The dataset Syn1, as shown in Fig. D.1b, has patterns in $R^2$, created as

$+1 : x = [\rho\cos\theta,\ \rho\sin\theta], \quad \rho \sim U(0, 1),\ \theta \sim U(0, \pi/3)$,
$-1 : x = [\rho\cos\theta,\ \rho\sin\theta], \quad \rho \sim U(0, 1),\ \theta \sim U(\pi/2, 5\pi/6)$,  (D.1)

where $U(a, b)$ denotes the uniform distribution on $(a, b)$ and 100 data points are randomly created for each of the two classes.
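A minimal sketch of how such data can be generated per (D.1); the random seed and the way the two classes are stacked are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def syn1_class(n, theta_low, theta_high):
    """Sample n points x = [rho cos(theta), rho sin(theta)] as in (D.1)."""
    rho = rng.uniform(0.0, 1.0, n)
    theta = rng.uniform(theta_low, theta_high, n)
    return np.column_stack([rho * np.cos(theta), rho * np.sin(theta)])

X_pos = syn1_class(100, 0.0, np.pi / 3)             # +1 class
X_neg = syn1_class(100, np.pi / 2, 5 * np.pi / 6)   # -1 class
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(100), -np.ones(100)])
```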

Dataset 3: Ripley’s data

Ripley's dataset is an artificially generated binary dataset [63] which includes 250 training points and 1000 test points, as shown in Fig. D.1c.

Dataset 4: Complex XOR

Mangasarian et al. [9] proposed the nonparallel-hyperplanes classifier GEPSVM and established its efficacy using the cross-planes data. This dataset is generated as data points lying near two intersecting lines and is considered a perturbed generalization of the exclusive-OR (XOR) classification problem. We have performed experiments with the complex XOR dataset [99], which is a generalization of the XOR problem with added white Gaussian noise. Fig. D.1d shows the nonparallel planes obtained with the linear versions of ATWSVM and TBSVM. The data consists of 120 patterns in $R^2$, where red 'dots' (80) and blue 'stars' (40) represent the data points of the positive and negative classes respectively.

Dataset 5: Two-moons

The two moons dataset [51], as shown in Fig. D.2, consists of 200 data points in $R^2$, belonging to two classes.

Figure D.2: Two moons dataset

Dataset 6: NDC and Exp-NDC

NDC data is generated using David Musicant's NDC Data Generator [53] and is normally distributed. The NDC datasets are skewed by making use of the log-normal distribution [100]. In probability theory, a continuous random variable whose logarithm is normally distributed is said to possess a log-normal distribution. Thus, if the random variable $X_1$ is log-normally distributed, then $X_2 = \ln(X_1)$ has a normal distribution. Similarly, if $X_2$ has a normal distribution, then $X_1 = \exp(X_2)$ has a log-normal distribution and is skewed towards the right. We have termed the skewed NDC data Exp-NDC. The feature dimension of the Exp-NDC data is 32 and the size varies from 500 to 100,000 data points.
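The skewing step can be sketched as follows; the normally distributed matrix below is a hypothetical stand-in for the output of the NDC generator, which is an external tool:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for NDC output: 500 normally distributed patterns with 32 features.
X_ndc = rng.normal(loc=0.0, scale=1.0, size=(500, 32))

# Exp-NDC: exponentiating a normal variable gives a log-normal, right-skewed one.
X_exp_ndc = np.exp(X_ndc)

print(np.mean(X_exp_ndc), np.median(X_exp_ndc))   # mean > median => right skew
```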


Appendix E

Image Features and Datasets

We have used Complete Robust Local Binary Pattern with Co-occurrence Matrix

(CR-LBP-Co) and Gabor Texture Features, as well as Angular Radial Transform

(ART) shape descriptors to efficiently capture the texture and shape information of

color images. These features are discussed in the following section.

E.1 Image Descriptors

Gabor Texture Features

Texture is a low-level image feature and, in our work, we have used the Gabor filter [65] to extract texture features from color images. The Gabor filter [65, 56] belongs to a class of oriented filters in which a filter of arbitrary orientation and scale is synthesized as a linear combination of a set of "basis filters". Edges located at different orientations and scales in an image can be detected by splitting the image into orientation and scale sub-bands obtained with the basis filters. This allows one to adaptively steer a filter to any orientation and scale, and to determine analytically the filter output as a function of orientation and scale. A two dimensional Gabor filter $g(x, y)$ is an oriented sinusoidal grating modulated by a two dimensional Gaussian function $h(x, y)$ as follows:

$g(x, y) = h(x, y)\, \exp(2\pi j W x)$, where $h(x, y) = \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right]$.  (E.1)

Here $\sigma_x^2$ and $\sigma_y^2$ are user-defined parameters. For the mother Gabor filter $g(x, y)$, its children Gabor filters $g_{m,n}(x, y)$ are defined to be its scaled and rotated versions:

$g_{m,n}(x, y) = a^{-2m}\, g(x', y'), \quad a \geq 1, \qquad \begin{bmatrix} x' \\ y' \end{bmatrix} = a^{-m} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$,  (E.2)

where a is a fixed scale factor, m is the scale parameter, n is the orientation pa-

rameter, K is the total number of scales, and L is the total number of orientations.

In our experiments, we set the Gabor function parameters as follows: $W = 1$, $a = 2$, $\sigma_x = \sigma_y = \frac{1}{2\pi}$, $K = 3$, $L = 4$. The Gabor-filtered output of an $M \times N$ image can be obtained by convolving the image with the Gabor filter at a given scale and orientation.
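A rough sketch of the resulting filter bank is given below (not the thesis implementation). The orientation step $\theta = n\pi/L$ and the normalisation of the coordinates to $[-0.5, 0.5]$, which makes the values $W = 1$ and $\sigma_x = \sigma_y = 1/(2\pi)$ meaningful, are assumptions:

```python
import numpy as np

def gabor_kernel(size=15, W=1.0, sigma=1.0 / (2 * np.pi), m=0, n=0, a=2.0, L=4):
    """Child Gabor filter g_{m,n}(x, y) following (E.1)-(E.2).

    Coordinates are normalised to [-0.5, 0.5]; this normalisation and the
    orientation step theta = n*pi/L are assumptions, not stated in the text.
    """
    theta = n * np.pi / L
    coords = np.linspace(-0.5, 0.5, size)
    x, y = np.meshgrid(coords, coords)
    # Rotate and scale the coordinates as in (E.2).
    xr = a ** (-m) * (x * np.cos(theta) + y * np.sin(theta))
    yr = a ** (-m) * (-x * np.sin(theta) + y * np.cos(theta))
    h = np.exp(-0.5 * (xr ** 2 + yr ** 2) / sigma ** 2)       # 2-D Gaussian of (E.1)
    return a ** (-2 * m) * h * np.exp(2j * np.pi * W * xr)    # modulated grating

# Filter bank with K = 3 scales and L = 4 orientations, as used in the thesis.
bank = [gabor_kernel(m=m, n=n) for m in range(3) for n in range(4)]

# The Gabor-filtered output of an image is its 2-D convolution with each kernel,
# e.g. scipy.signal.convolve2d(image, kernel.real, mode='same').
```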

Complete Robust-Local Binary Pattern with Co-occurrence Matrix

Local Binary Pattern (LBP) is a well-known texture feature [81], but it has a few limitations, such as sensitivity to noise, and it sometimes tends to associate different structural patterns with the same binary code, which reduces its discriminating ability. A variant of LBP called Complete LBP (CLBP) is proposed in [101], where local image differences are decomposed into two complementary components, i.e., sign and magnitude. In [102], Completed Robust LBP (CR-LBP) is proposed, which overcomes both the limitations of LBP. CR-LBP measures three components, i.e., pattern, magnitude and center information, for each pixel of the image. Complete Robust-Local Binary Pattern with Co-Occurrence Matrix (CR-LBP-Co) is proposed by Walia et al. [77]. Here, the value of each center pixel in a $3 \times 3$ local window is replaced by its average local gray level, which is more robust to noise and illumination than the center pixel value. CR-LBP-Co determines the features as given by CR-LBP, modified to capture texture information from color images. The resulting values are quantized for the computation of four-directional co-occurrence matrices to get the texture descriptor. The CR-LBP-Co descriptor for a color RGB image is computed as follows:


1. Find the magnitude for the image pixels as given in (E.3):

   $val = \sqrt{R^2 + G^2 + B^2}$  (E.3)

2. Compute the pattern, $CR\text{-}LBP$, as in (E.4):

   $CR\text{-}LBP = \sum_{p=0}^{P-1} s\!\left(val_p - \dfrac{G_c}{8 + \alpha}\right) 2^p$  (E.4)

   where

   $G_c = \sum_{i=1}^{8} val_{ci} + \alpha\, val_c, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$  (E.5)

   Here, $P$ is the total number of neighbors, $val_p$ is the color magnitude of the $p$th neighbor of the center pixel $c$, and $\alpha$ is a parameter that gives weight to the center pixel.

3. Compute $m_p$ for each neighbor pixel:

   $m_p = \dfrac{1}{8 + \alpha}\, |G_p - G_c|$  (E.6)

   where

   $G_p = \sum_{i=1}^{8} val_{pi} + \alpha\, val_p$  (E.7)

   Here, $G_p$ is computed as in (E.7) and $val_{pi}$ is the color value of the $i$th neighbour of $val_p$.

4. The CR-LBP magnitude for each pixel is given as follows:

   $CR\text{-}LBP_{mag} = \sum_{p=0}^{P-1} s(m_p - c)\, 2^p$  (E.8)

   where $c$ is a threshold set to the mean of $m_p$ over the entire image. It captures the local variance of the local color information.

5. The local central information ($CR\text{-}LBP_{CI}$) is extracted as in (E.9):

   $CR\text{-}LBP_{CI} = s(WC_c - c_i)$  (E.9)

   where

   $WC_c = \dfrac{G_c}{8 + \alpha}$.  (E.10)

   Here, $c_i$ is the average local color value of the entire image.

6. Compute the histograms for all three components, i.e., pattern, magnitude and center information. Quantize each of these components into 16 bins.

7. Compute the co-occurrence matrix for the quantized features in four directions: horizontal, vertical, diagonal-45 and diagonal-135.

8. Add the corresponding direction matrices to get 4 matrices of size 16 × 16. The matrices thus obtained are symmetric about the diagonal, so extract the 136 non-redundant values from each matrix and add them.

The feature dimension is 136 for each image. This descriptor works for color images.
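For illustration, the first two steps of the descriptor can be sketched as below. This is not the implementation of Walia et al. [77]; the neighbour ordering and the border handling are assumptions:

```python
import numpy as np

def crlbp_pattern(val, alpha=1.0):
    """Sketch of steps 1-2: the CR-LBP pattern code of (E.4)-(E.5).

    `val` holds the per-pixel colour magnitudes sqrt(R^2 + G^2 + B^2) of (E.3).
    Border pixels are skipped for brevity.
    """
    H, W = val.shape
    # The 8 neighbours of a centre pixel, in a fixed (assumed) order.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((H, W), dtype=np.int32)
    for r in range(1, H - 1):
        for c in range(1, W - 1):
            neigh = [val[r + dr, c + dc] for dr, dc in offsets]
            Gc = np.sum(neigh) + alpha * val[r, c]                # (E.5)
            thresh = Gc / (8.0 + alpha)
            bits = [1 if v - thresh >= 0 else 0 for v in neigh]   # s(.) in (E.4)
            codes[r, c] = sum(b << p for p, b in enumerate(bits))
    return codes

rgb = np.random.rand(16, 16, 3)            # toy colour image
val = np.sqrt((rgb ** 2).sum(axis=2))      # (E.3)
print(crlbp_pattern(val).shape)
```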

Angular Radial Transform

Human perception is based on the shape of an object, and shape alone can be used to recognize objects. So, shape-based features are considered the most relevant among all the visual features. Shape descriptors can be classified into contour-based descriptors, which extract features from the outer boundary, and region-based descriptors, which extract features from the entire region. Region-based descriptors contain more information as they deal with the boundary as well as the interior of the object. The important region-based shape descriptors are Zernike Moments (ZMs), Angular Radial Transform (ART), Geometric Moments, Moment Invariants, etc. [103]. ART possesses characteristics such as compact size, robustness to noise, invariance to rotation and the ability to describe complex objects. This descriptor is a complex orthogonal unitary transform defined on a unit disk, based on complex orthogonal sinusoidal basis functions in polar co-ordinates [103].

The ART coefficients $F_{nm}$ of order $n$ and $m$ are defined as in (E.11):

$F_{nm} = \int_{0}^{2\pi} \int_{0}^{1} V^{*}_{n,m}(r, \theta)\, f(r, \theta)\, r\, dr\, d\theta$  (E.11)

where $f(r, \theta)$ is the image intensity function in polar coordinates and $V^{*}_{n,m}(r, \theta)$ is the ART basis function, the complex conjugate of $V_{n,m}(r, \theta)$, which is separable along the angular and radial directions, as given below:

$V_{n,m}(r, \theta) = R_n(r)\, A_m(\theta)$  (E.12)

with

$A_m(\theta) = \dfrac{1}{2\pi} e^{jm\theta}$  (E.13)

and

$R_n(r) = \begin{cases} 1, & n = 0 \\ 2\cos(\pi n r), & n > 0 \end{cases}$

where $n$ and $m$ are the order and repetition of ART, respectively.
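A rough numerical sketch of (E.11)-(E.13) on a discrete image is given below (illustration only; the Cartesian discretisation of the unit disk, with the area element $dx\,dy$ playing the role of $r\,dr\,d\theta$, is an assumption):

```python
import numpy as np

def art_coefficient(img, n, m):
    """Numerical approximation of F_{nm} in (E.11) on the unit disk.

    `img` is a square grayscale image treated as f over [-1, 1] x [-1, 1];
    pixels outside the unit disk are ignored.
    """
    N = img.shape[0]
    coords = np.linspace(-1.0, 1.0, N)
    x, y = np.meshgrid(coords, coords)
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = r <= 1.0

    # Basis V_{n,m}(r, theta) = R_n(r) * A_m(theta), per (E.12)-(E.13).
    Rn = np.ones_like(r) if n == 0 else 2.0 * np.cos(np.pi * n * r)
    Am = np.exp(1j * m * theta) / (2.0 * np.pi)
    V = Rn * Am

    dA = (2.0 / N) ** 2                      # Cartesian area element
    return np.sum(np.conj(V[inside]) * img[inside]) * dA

img = np.random.rand(64, 64)                 # toy image
print(abs(art_coefficient(img, n=1, m=2)))
```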

Berkeley Segmentation Dataset

The Berkeley segmentation dataset (BSD) [57] provides an empirical basis for research on image segmentation and boundary detection. There are 500 color images in this dataset, each of size 481 × 321 or 321 × 481, i.e., 154,401 pixels per image. The dataset also contains ground-truth images, which are used to determine the accuracy of a segmentation algorithm.

Color Image Datasets

Wang’s Color Dataset

Wang's Color database is provided by Wang et al. [104]. It is a subset of the COREL image database. It contains 1,000 images, which are equally divided into ten different categories: African people, beach, building, bus, dinosaur, elephant, flower, horse, mountain and food. Each image is of size 256 × 384 or 384 × 256 pixels. Fig. E.1 shows some sample images from this database.

Figure E.1: Sample Wang's Color Images

COREL 5K Dataset

To further confirm the efficiency of our method, we evaluate it with the COREL 5K image database. COREL 5K is a large database of color images and is a subset of the COREL 10K dataset [105]. This dataset contains 50 categories covering 5000 images, including diverse content such as fireworks, bark, cars, sculptures, feasts, horses, buildings, flags, microscopy images, tiles, trees, waves, pills, stained glass, etc. Every category contains 100 images of size 192 × 128 or 128 × 192 in JPEG format. Fig. E.2 shows sample images from this database.

Figure E.2: Sample COREL 5K Images

MIT VisTex Texture Dataset

The VisTex texture database is a collection of 40 color texture images created by the MIT Media Lab [106]. The database was created to provide a set of high-quality texture images for computer vision applications. Each image is a square image of 512 × 512 pixels. These images are divided into 16 sub-images, each of size 128 × 128, so there are 640 sub-images in the dataset. Fig. E.3 shows some sample sub-images.

Oliva and Torralba Scene Dataset

We also evaluated the proposed algorithm for scene classification [107]. The OT-scene dataset consists of 2,688 color images from eight scene categories: coast (360 samples), forest (328 samples), mountain (374 samples), open country (410 samples), highway (260 samples), inside city (308 samples), tall building (456 samples) and street (292 samples). Fig. E.4 shows some sample images from this database.


Figure E.3: Sample MIT VisTex Sub-images

Figure E.4: Sample OT-scene Images

Hand-written Digit Recognition: USPS Dataset

The US Postal (USPS) handwritten digit dataset (available at http://www.cs.toronto.edu/roweis/data.html) is a benchmark digit recognition dataset. USPS consists of gray-scale handwritten digit images from 0 to 9, as shown in Figure E.5. It is the output of a project under the US Postal Department for recognizing handwritten digits on mail envelopes. The dataset consists of 11,000 images, where each digit has 1,100 images. The size of each image is 16 × 16 pixels with 256 gray levels.


Figure E.5: Sample USPS digits (0-9)