
 

Copyright Undertaking

This thesis is protected by copyright, with all rights reserved.

By reading and using the thesis, the reader understands and agrees to the following terms:

1. The reader will abide by the rules and legal ordinances governing copyright regarding the use of the thesis.

2. The reader will use the thesis for the purpose of research or private study only and not for distribution or further reproduction or any other purpose.

3. The reader agrees to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

IMPORTANT

If you have reasons to believe that any materials in this thesis are deemed not suitable to be distributed in this form, or a copyright owner having difficulty with the material being included in our database, please contact [email protected] providing details. The Library will look into your claim and consider taking remedial action upon receipt of the written requests.

Pao Yue-kong Library, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

http://www.lib.polyu.edu.hk


ESTIMATION OF AIRCRAFT TRIP FUEL

CONSUMPTION: A NOVEL SELF-ORGANIZING

MACHINE LEARNING CONSTRUCTIVE NEURAL

NETWORK

WAQAR AHMED KHAN

PhD

The Hong Kong Polytechnic University

2020


The Hong Kong Polytechnic University

Department of Industrial and Systems Engineering

Estimation of Aircraft Trip Fuel Consumption: A

Novel Self-Organizing Machine Learning

Constructive Neural Network

WAQAR AHMED KHAN

A thesis submitted in partial fulfilment of the

requirements for the degree of Doctor of Philosophy

May 2020


Certificate of Originality

I hereby declare that this thesis is my own work and that, to the best of my knowledge and

belief, it reproduces no material previously published or written, nor material that has been

accepted for the award of any other degree or diploma, except where due acknowledgement

has been made in the text.

__________________________ (Signed)

KHAN, Waqar Ahmed (Name of student)


Abstract

Accurate estimation of aircraft fuel consumption is critical for airlines in terms of both safety and profitability. The recent worldwide growth of the aviation industry has made accurate fuel estimation an important research topic. Passenger and cargo demand increased by 7.4% and 3.4%, respectively, in 2018 compared with 2017. Similarly, the airline industry consumed 69 billion gallons of fuel in 2006, and consumption was forecast to reach 97 billion gallons in 2019. Despite these favourable economic conditions for airlines, rising jet fuel prices and restrictions aimed at preventing environmental degradation pose considerable challenges. Jet fuel prices were forecast to rise by 31.18% in 2019 compared with 2015. International authorities stress reducing carbon dioxide (CO2) emissions by 50% by 2050 and cutting fuel consumption by 1.5% per year to avoid ozone depletion. In the future, international authorities plan to make it mandatory for airlines to certify their aircraft according to CO2 certification standards. These challenges and restrictions force airline operating organizations to control excess fuel consumption. Furthermore, among the various airline operating expenses, fuel cost makes the largest contribution, at 28.2% of the total operating cost. A slight change in fuel prices can therefore have an enormous impact on airline operating expenses, which makes the topic all the more worth studying. The increasing awareness of environmental protection among international authorities, together with growing fuel prices and booming demand from tourism, is encouraging airline operating companies to adopt competitive fuel management strategies that control excess fuel consumption and support long-term sustainability.

In current practice, fuel consumption for a flight trip is usually estimated by energy balance approaches (EAs). However, according to the existing literature, the information needed to determine the coefficients is not always available in a real scenario, and a great deal of flight testing needs to be performed to generate data, which may make EAs expensive. The unavailability of data may result in the use of global parameters with default values rather than local parameters, which may lead to inaccurate results. The complex mathematical computation, high testing and consultation costs, expert involvement, and estimation errors, which may become worse under certain conditions, limit the applicability of EA methods. Later, machine learning backpropagation neural networks (BPNNs) were proposed as an alternative to EAs. The main limitation of earlier works on the application of BPNNs to fuel estimation is that they covered only a small number of aircraft types with limited flight data. The reasons for this may be the weak generalization performance and slow convergence of a BPNN, whose best hyperparameters are selected by trial and error. Besides, existing BPNN-based fuel estimation models considered only low-level operational parameters, with the future recommendation of incorporating more parameters that may have a significant effect on fuel consumption. Other than EAs and BPNNs, many other fuel estimation models have been proposed in the literature for distinct flight phases. Their cumulative effect when estimating fuel for the whole journey may even result in suboptimal estimation.

The application of neural networks (NNs) is gaining much popularity in the airline sector for improving various operations and enhancing services. Limiting our study scope to EA and BPNN based fuel estimation models, the actual performance of an aircraft usually deviates from such estimation. The quantity of fuel required for a safe journey depends on many operational parameters and on the estimation method. Loading suboptimal or abundant fuel in an aircraft results in a deviation of consumption from the estimate. The fuel deviation may be either positive or negative, known as overestimation or underestimation, respectively. Therefore, because of low confidence in the estimation methods, extra fuel is loaded in the discrepancy reservoirs based on experience, to meet unforeseen conditions and to account for aircraft deterioration. This increases the weight of the aircraft, requiring more thrust to balance drag and weight. Ultimately, more fuel is consumed, and more frequent aircraft maintenance is required than planned.

To overcome the above limitations of trip fuel estimation, our objectives are threefold: i) to formulate a model and define an objective function of minimizing fuel deviation; ii) to propose a novel self-organizing constructive neural network (CNN) featuring a cascade topology and capable of analytically calculating connection weight coefficients to achieve better generalization performance and faster learning speed; and iii) to apply the novel CNN to minimize fuel deviation and compare its performance with the existing airline energy balance approach (AEA) and the BPNN. The purpose is to achieve better estimation by adding high-level operational parameters, avoiding the use of global operational parameters, eliminating the need for a trial-and-error approach, and reducing the number of hyperparameter adjustments and the degree of expert involvement. We consider that insufficient attempts have been reported in the literature concerning the estimation of trip fuel using CNNs together with high-dimensional data covering the entire trip flight phases collectively. A comparative study of the proposed CNN with the existing AEA and the BPNN gives important managerial insights. The numerical results demonstrate that trip fuel estimation by the proposed CNN achieves better results than the AEA and the BPNN while requiring less learning time than the BPNN. The significant improvement in trip fuel estimation creates greater confidence in the proposed CNN, given that it may eliminate the need to add more fuel based solely on experience.


List of Publications

International Journal Publications

Khan, W.A., Chung, S.H., Ma, H.L., Liu, S.Q. and Chan, C.Y. (2019), “A novel self-

organizing constructive neural network for estimating aircraft trip fuel consumption”,

Transportation Research Part E: Logistics and Transportation Review, Vol. 132, pp. 72–

96.

Khan, W.A., Chung, S.H., Awan, M.U. and Wen, X. (2019a), “Machine learning facilitated

business intelligence (Part I): Neural networks learning algorithms and applications”,

Industrial Management & Data Systems, Vol. 120 No. 1, pp. 164–195.

Khan, W.A., Chung, S.H., Awan, M.U. and Wen, X. (2019b), “Machine learning facilitated

business intelligence (Part II): Neural networks optimization techniques and

applications”, Industrial Management & Data Systems, Vol. 120 No. 1, pp. 128–163.

International Conference Proceedings

Khan, W.A., Chung, S.H. and Chan, C.Y. (2018), “Cascade principal component least squares

neural network learning algorithm”, 2018 24th International Conference on Automation

and Computing (ICAC), IEEE, pp. 1–6.

Khan, W.A., Chung, S.H., Awan, M.U. and Wen, X. (2019c), “Improving the convergence

and extracting useful information from high dimensional big data with neural networks”,

2019 International Conference on Business, Big-Data, and Decision Sciences (ICBBD),

pp. 40–41.


Acknowledgment

Firstly, I would like to express my sincere gratitude to my supervisor, Dr. Sai-Ho Chung, for his trust, patience, support, motivation, and encouragement during my Ph.D. period. His guidance helped me throughout my research work and the writing of this thesis. I could not have imagined a better supervisor and mentor for my Ph.D. study. Honestly, without his support, I would not have been able to achieve this milestone.

My sincere thanks also go to my co-supervisor, Dr. Ching Yuen Chan, who provided me with the opportunity to join his team. Without his precious support, it would not have been possible to conduct this research.

Many thanks to Prof. Shi Qiang Liu, Dr. Muhammad Usman Awan, Dr. MA Hoi Lam, and Dr. Wen Xin for their discussions, feedback, and suggestions for improving the quality of this research.

Last, but not least, I wish to express special thanks to my parents, wife, and friends for supporting me spiritually throughout my Ph.D. period and my life in general.


Table of Contents

Certificate of Originality ........................................................................................................ iii

Abstract .................................................................................................................................... iv

List of Publications ................................................................................................................ vii

Acknowledgment ................................................................................................................... viii

Table of Contents .................................................................................................................... ix

List of Figures ......................................................................................................................... xii

List of Tables ......................................................................................................................... xiv

List of Abbreviations ............................................................................................................ xvi

Chapter 1. Introduction .......................................................................................................... 1

1.1 Research background ..................................................................................................... 1

1.2 Problem statements ........................................................................................................ 5

1.3 Research objectives ........................................................................................................ 7

1.4 Research contributions ................................................................................................... 8

1.5 Research scope and significance .................................................................................. 10

1.6 Structure of the thesis................................................................................................... 11

Chapter 2. Literature Review ............................................................................................... 13

2.1 Aircraft fuel estimation ................................................................................................ 13

2.1.1 EAs based fuel estimation ................................................................................ 14

2.1.2 BPNNs based fuel estimation ........................................................................... 16

2.1.3 Estimation methods other than NN-based machine learning methodologies .. 18

2.2 Neural networks learning algorithms and optimization techniques ............................. 19

2.2.1 Survey methodology ........................................................................................ 21

2.2.1.1 Source of literature ............................................................................... 21

2.2.1.2 The philosophy of the review work ...................................................... 25

2.2.1.3 Classification schemes ......................................................................... 26

2.2.2 FNN: An overview ........................................................................................... 31

2.2.3 Learning algorithms and applications .............................................................. 33

2.2.3.1 Gradient learning algorithms for network training .............................. 33

2.2.3.2 Gradient free algorithms ....................................................................... 41

2.2.4 Optimization techniques and applications ........................................................ 51


2.2.4.1 Optimization algorithms for learning rate ............................................ 51

2.2.4.2 Bias and variance (underfitting and overfitting) minimization

algorithms ............................................................................................. 58

2.2.4.3 Constructive topology FNN ................................................................. 66

2.2.4.4 Metaheuristic search algorithms ........................................................... 71

2.3 Discussion .................................................................................................................... 75

Chapter 3. Trip Fuel Model Formulation for Minimizing Fuel Deviation ....................... 82

3.1 Introduction .................................................................................................................. 82

3.2 Fuel consumption and deviation problem framework ................................................. 84

3.3 Model formulation ....................................................................................................... 87

3.3.1 Complex high-level operational parameters ................................................... 87

3.3.1.1 Statistical analysis and extraction ........................................................ 87

3.3.1.2 Influence of extracted parameters on the fuel consumption ................ 88

3.3.2 Mathematical and data-driven solution methods ............................................. 91

3.3.2.1 Relationship between mathematical and data-driven methods ............ 97

3.3.2.2 Significance of the proposed method ................................................. 100

3.3.3 Trip fuel model formulation ........................................................................... 103

3.4 Summary .................................................................................................................... 108

Chapter 4. A Novel Self-Organizing Constructive Neural Network Algorithm ............ 111

4.1 Introduction ................................................................................................................ 111

4.2 Convergence limitations of CasCor and its extensions .............................................. 112

4.3 Orthogonal linear transformation and ordinary least squares .................................... 115

4.4 Cascade principal component least squares neural network ...................................... 116

4.4.1 Lemma and statement ..................................................................................... 117

4.4.2 Analytically determining hidden units input connection weights .................. 117

4.4.3 Analytically determining hidden units output connection weights ................ 119

4.4.4 Adding hidden layers ..................................................................................... 120

4.4.5 Hyperparameters ............................................................................................ 120

4.4.6 Convergence theoretical justification ............................................................. 121

4.5 Experimental work ..................................................................................................... 125

4.5.1 Artificial benchmarking problems ................................................................. 126

4.5.1.1 Two Spiral classification task ............................................................ 126


4.5.1.2 SinC function approximation regression task .................................... 128

4.5.2 Real-world problems ...................................................................................... 129

4.5.3 Selection of hidden units in the hidden layers ................................................ 134

4.5.4 Role of connecting only newly added hidden layer to the output layer ......... 137

4.6 Summary .................................................................................................................... 140

Chapter 5. Reducing the Negative Role of Global Parameters with Default Values and

Varying Operational Parameters Effect on the Trip Fuel Estimation ........ 142

5.1 Introduction ................................................................................................................ 142

5.2 Problem context ......................................................................................................... 143

5.3 Data analytics for fuel consumption .......................................................................... 143

5.4 Solution methods and computational results ............................................................. 144

5.4.1 Proposed method and existing methods ......................................................... 144

5.4.2 Trip fuel estimation methods performance analysis and discussion .............. 145

5.4.2.1 Combined model ................................................................................ 145

5.4.2.2 Individual models ............................................................................... 149

5.5 Summary .................................................................................................................... 160

Chapter 6. Conclusion and Future Work .......................................................................... 162

6.1 Conclusion ................................................................................................................. 162

6.2 Future work ................................................................................................................ 169

Appendix A. Neural Networks Learning Algorithms and Optimization Techniques

Applications ................................................................................................ 175

References ............................................................................................................................. 192


List of Figures

Figure Title Page

Figure 1.1. Fuel cost (US$ billion) and cost per barrel (US$) 2

Figure 1.2. Passenger demand (RPK) 2

Figure 1.3. Cargo demand (FTK) 3

Figure 1.4. Fuel consumption and carbon dioxide emission 3

Figure 2.1. Articles distribution 22

Figure 2.2. Articles published over time 25

Figure 2.3. Algorithms distribution categories wise 29

Figure 2.4. Algorithms proposed over time 30

Figure 2.5. Papers distribution categories wise 30

Figure 3.1. Fuel estimation before flight and consumption after the flight 85

Figure 3.2. BPNN learning algorithm network 96

Figure 4.1. CasCor learning algorithm network 114

Figure 4.2. CPCLS learning algorithm network 118

Figure 4.3. Two spiral classification task 126

Figure 4.4. SinC function regression task 128

Figure 4.5. Smooth and stable convergence of CPCLS 133

Figure 4.6. Hidden units generation effect on generalization performance of CPCLS 135

Figure 4.7. Hidden units generation effect on learning of CPCLS 136

Figure 4.8. Hidden units generation effect on network of CPCLS 136

Figure 4.9. Shuffled dataset results of LHL and AHL 138

Figure 4.10. Different test size dataset results of LHL and AHL 139


Figure 5.1. Fuel deviation (combined model) 148

Figure 5.2. Fuel deviation interval (combined model) 148

Figure 5.3. Fuel deviation (𝑆𝑅𝐻𝑈𝑆𝐼) 153

Figure 5.4. Fuel deviation (𝑆𝑅𝐻𝐷) 153

Figure 5.5. Fuel deviation (𝑆𝑅𝑈𝑆1𝐻 ) 153

Figure 5.6. Fuel deviation (𝑆𝑅𝑈𝑆2𝐻 ) 154

Figure 5.7. Fuel deviation (𝑆𝑅𝑈𝐾𝐻 ) 154

Figure 5.8. Fuel deviation (𝑆𝑅𝐻𝑈𝑆2) 154

Figure 5.9. Fuel deviation (𝑆𝑅𝐻𝑈𝐾) 155

Figure 5.10. Fuel deviation (𝑆𝑅𝐻𝐴) 155

Figure 5.11. Fuel deviation interval (𝑆𝑅𝐻𝑈𝑆𝐼) 156

Figure 5.12. Fuel deviation interval (𝑆𝑅𝐻𝐷) 156

Figure 5.13. Fuel deviation interval (𝑆𝑅𝑈𝑆1𝐻 ) 156

Figure 5.14. Fuel deviation interval (𝑆𝑅𝑈𝑆2𝐻 ) 157

Figure 5.15. Fuel deviation interval (𝑆𝑅𝑈𝐾𝐻 ) 157

Figure 5.16. Fuel deviation interval (𝑆𝑅𝐻𝑈𝑆2) 157

Figure 5.17. Fuel deviation interval (𝑆𝑅𝐻𝑈𝐾) 158

Figure 5.18. Fuel deviation interval (𝑆𝑅𝐻𝐴) 158


List of Tables

Table Title Page

Table 2.1. Articles source description 23

Table 2.2. Classification of FNN published algorithms 27

Table 3.1. Correlation analysis of selected operational parameters with trip fuel consumption 88

Table 3.2. Relationship between the mathematical method and data-driven method 99

Table 3.3. Relationship among existing BPNN-based fuel estimation models and the proposed method 102

Table 4.1. Generalization accuracy and learning speed of two spiral classification task 127

Table 4.2. Generalization performance and learning speed of SinC regression task 128

Table 4.3. Dataset extracted from UCI 130

Table 4.4. Algorithms comparison on real world regression problems 131

Table 4.5. Algorithms comparison on real world classification problems 132

Table 4.6. CPCLS and CasCor extensions comparison 134

Table 4.7. Connecting hidden layers to output unit in CPCLS 138

Table 5.1. Trip fuel deviation by AEA, BPNN and CPCLS (combined model) 146

Table 5.2. Performance improvement comparison (combined model) 146

Table 5.3. Trip fuel deviation by the AEA, BPNN and CPCLS (individual models) 151

Table 5.4. Performance accuracy comparison (individual models) 151


Table A.1. Applications of gradient learning algorithms 176

Table A.2. Applications of gradient free learning algorithms 179

Table A.3. Applications of optimization algorithms for learning rate 184

Table A.4. Applications of bias and variance minimization algorithms 186

Table A.5. Applications of constructive algorithms 188

Table A.6. Applications of metaheuristic search algorithms 190


List of Abbreviations (Alphabetical order)

AdaGrad Adaptive Gradient algorithm

Adam Adaptive Moment estimation

AEA Airline Energy balance Approach

AFBM Advanced Fuel Burn Model

AFCM Aircraft Fuel Consumption Model

AHL All Hidden Layers

APSO Adaptive Particle Swarm Optimization

BADA Base of Aircraft Data

B-ELM Bidirectional Extreme Learning Machine

BFGS Broyden–Fletcher–Goldfarb–Shanno

BP Backpropagation

BPNN Backpropagation Neural Network

CasCor Cascade-Correlation neural network

CCOEN Cascade Correntropy Network

CG Conjugate Gradient method

COG Centre of Gravity

CI-ELM Convex Incremental Extreme Learning Machine

CNN Constructive Neural Network

CNNE Constructive Neural Network Ensemble

CO2 Carbon Dioxide

CPCLS Cascade Principal Component Least Squares Neural Network

DAOI-ELM Orthogonal Incremental Extreme Learning Machine based on the

Driving Amount


EA Energy balance Approach

ECNN Evolving Cascade Neural Network

EI-ELM Enhanced Incremental Extreme Learning Machine

ELM Extreme Learning Machine

EMA Exponential Moving weighted Average

EM-ELM Error Minimized Extreme Learning Machine

Eurocontrol European Organization for the Safety of Air Navigation

FAA Federal Aviation Administration

FCNN Faster Cascade Neural Network

FNN Feedforward Neural Network

FTKs Freight Tonne Kilometers

GA Genetic Algorithm

GANN Genetic Algorithm based Feedforward Neural Network

GD Gradient Descent

GHG Greenhouse Gases

GN Gauss-Newton

GPR Gaussian Process Regression

GRNN General Regression Neural Network

H-ELM Hierarchical Extreme Learning Machine

IATA International Air Transport Association

I-ELM Incremental Extreme Learning Machine

IFNNRWs Iterative Feedforward Neural Networks with Random Weights

IPSO-EM-ELM Particle Swarm Optimization Error Minimized Extreme Learning

Machine

LHL Last Hidden Layer


LM Levenberg-Marquardt

ML-ELM Multilayer Extreme Learning Machine

NBN Neuron by Neuron

NFL No Free Lunch theorem

NM Newton Method

NN Neural Network

No-Prop No-Propagation

OAA Optimized Approximation Algorithm

OI-ELM Orthogonal Incremental Extreme Learning Machine

OLSCN Orthogonal Least Squares based Cascade Network

OPF Operational Performance Files

OS-ELM Online Sequential Extreme Learning Machine

PNN Probabilistic Neural Network

PSO Particle Swarm Optimization

PSONN Particle Swarm Optimization based Feedforward Neural Network

QP Quickprop

quasi NM Quasi Newton Method

RPKs Revenue Passenger Kilometers

RProp Resilient Propagation

SGD Stochastic Gradient Descent

SLFN Single Layer Feedforward Neural network

SNRF Signal to Noise Ratio Figure

𝑆𝑅𝐻𝐴 Sector route from departure airport at Australia and arrival airport at

Hong Kong


𝑆𝑅𝐻𝐷 Sector route from departure airport at Dubai and arrival airport at

Hong Kong

𝑆𝑅𝑈𝐾𝐻 Sector route from departure airport at Hong Kong and arrival airport

at the United Kingdom

𝑆𝑅𝐻𝑈𝐾 Sector route from departure airport at the United Kingdom and

arrival airport at Hong Kong

𝑆𝑅𝑈𝑆1𝐻 Sector route from departure airport at Hong Kong and arrival

airport-one at the United States of America

𝑆𝑅𝐻𝑈𝑆𝐼 Sector route from departure airport-one at United States of America

and arrival airport at Hong Kong

𝑆𝑅𝑈𝑆2𝐻 Sector route from departure airport at Hong Kong and arrival

airport-two at the United States of America

𝑆𝑅𝐻𝑈𝑆2 Sector route from departure airport-two at United States of America

and arrival airport at Hong Kong

TOW Takeoff Weight

TSFC Thrust-Specific Fuel Consumption

W-ELM Weighted Extreme Learning Machine

WOA Whale Optimization Algorithm


Chapter 1. Introduction

This chapter introduces the thesis and is organized as follows. Section 1.1 presents the research background. Section 1.2 concisely states the problems by highlighting existing research gaps. Section 1.3 sets out the research objectives for adding knowledge to the existing literature. Section 1.4 explains the research contributions. Section 1.5 briefly discusses the research scope and significance, and Section 1.6 elaborates on the structure of the thesis.

1.1 Research background

The calculation of the amount of trip fuel is essential for an aircraft to safely reach the

destination. Consequently, it receives much attention in the aviation sector. Controlling excess

fuel consumption has become one of the major concerns for airline operating organizations

given that it contributes to increasing operating expenses (Sheng et al., 2019). Globally, during the last decade, fuel cost has accounted for an average of 28.2% of the total operating cost among the various airline operating expenses. Therefore, it is critical to design methodologies

for accurately planning the trip fuel required for each flight (IATA, 2019). The required

quantity of trip fuel loaded in an aircraft depends on many operational parameters and

estimation methods. Loading suboptimal trip fuel may result in the utilization of fuel from the

supplementary reservoir, whereas abundant trip fuel may increase the ramp weight. Both

situations are undesirable for the smooth operation of an airline. Utilizing supplementary

reservoir fuels, which are reserved to meet unexpected flight conditions such as bad weather,

diversion to an alternate airport, airport congestion, and holding, may create uncertainty for the

flight crew. Conversely, loading abundant fuel may ensure a safe journey but with an additional

cost in terms of excess fuel consumption and early aircraft maintenance. In both scenarios, the

actual trip fuel consumed during each flight significantly deviates from the estimated trip fuel.


Fuel deviation, defined as the difference between the actual trip fuel consumed during a flight

and the trip fuel estimated before that flight, may take either negative or positive values

corresponding to underestimation or overestimation, respectively. A low confidence in

estimation methods and the need to meet unforeseen flight issues require adding an extra

amount of fuel in the discrepancy – rather than contingency or final emergency – reservoir

based on experience. This makes the situation worse for airlines. First, it increases the total

weight of the aircraft, requiring more thrust to balance weight and drag in combination with

other atmospheric and physical factors (Irrgang et al., 2015). Taking more fuel onboard not

only increases the weight of an aircraft but also affects the performance of its engines in the

long run by burning more fuel per unit distance. This ultimately shortens the lifetime of the engines, which then need more frequent maintenance than planned (Abdelghany et al., 2005).

Figure 1.1. Fuel cost (US$ billion) and cost per barrel (US$)

Figure 1.2. Passenger demand (RPK)

Moreover, an aged aircraft may burn more fuel than a brand-new one, and the pilots may be unaware of the actual amount of wear (deterioration) in the aircraft or may not know how

the wear is calculated by particular estimation methods. As a result, the pilots will demand

more fuel as a buffer in the discrepancy reservoir of the aircraft (Irrgang et al., 2015).

Therefore, the objective of reaching the destination with the smallest possible amount of fuel

left in the trip tank or utilized from the supplementary tank constitutes a challenging task to

accomplish.
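
To make the notion of fuel deviation concrete, one possible formalization is sketched below; the sign convention is an assumption chosen so that positive values correspond to overestimation, matching the wording above, while the thesis's formal definition is developed in Chapter 3:

\[
\Delta F \;=\; F^{\mathrm{est}}_{\mathrm{trip}} - F^{\mathrm{act}}_{\mathrm{trip}}, \qquad
\Delta F > 0 \;\Rightarrow\; \text{overestimation}, \qquad
\Delta F < 0 \;\Rightarrow\; \text{underestimation},
\]

where \(F^{\mathrm{est}}_{\mathrm{trip}}\) is the trip fuel estimated before the flight and \(F^{\mathrm{act}}_{\mathrm{trip}}\) is the trip fuel actually consumed during the flight.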

Figure 1.3. Cargo demand (FTK)

Figure 1.4. Fuel consumption and carbon dioxide emission

Excess fuel consumption is disadvantageous for airlines both economically and

environmentally. Economically, the increasing trend of fuel prices and a low confidence in estimation methods may lead to higher costs for airline operating organizations striving to ensure smooth operations while meeting the growing demand for passenger and cargo transport. Figure 1.1 shows the trend of increasing jet fuel prices. Fuel prices were forecast to increase by 31.18% in 2019 as compared to those in 2015 (IATA, 2019). This forces airlines to adopt risk management strategies such as fuel hedging and fuel ferrying, which may not be useful in many cases (Merkert and Swidan, 2019; Sibdari et al., 2018). Fuel hedging and fuel ferrying are certainly good strategies, but they become expensive when fuel prices fall (fuel hedging) and because of the high aircraft maintenance cost associated with loading excess fuel from inexpensive stations (fuel ferrying). Figures 1.2 and 1.3 show data on traveling passengers and

cargo movement in the airline sector over the past decade. The demand, in terms of revenue

passenger kilometres (RPKs), increased by 7.4% in 2018 as compared to 2017. The cargo demand, in terms of freight tonne kilometres (FTKs), rose to its highest level in the last ten years with a growth rate of 9.7% in 2017 as compared to 2016, more than double the 3.6% growth rate of 2016. Demand then grew by a further 3.4% in 2018 as compared to 2017 (IATA, 2019). Environmentally, higher fuel consumption contributes to ozone depletion through greater emissions of carbon dioxide (CO2) and greenhouse gases (GHG) (Pagoni and Psaraki-Kalouptsidi, 2017; Seufert et al., 2017). Figure

1.4 shows the trend of fuel consumption and carbon dioxide emission. Indeed, the International Air Transport Association (IATA) stresses avoiding ozone depletion through joint efforts to reduce CO2 emissions by 50% by 2050 with respect to the 2005 level. In the future, international authorities are planning to make it compulsory for airlines to certify their aircraft, based on size and weight, according to CO2 certification standards (IATA, 2018). The

increasing awareness of environmental protection among international authorities, in conjunction with growing fuel prices and booming demand from tourism, is encouraging airline operating companies to adopt competitive fuel management strategies to control excess fuel consumption for long-term sustainability.

1.2 Problem statements

Accurately estimating the trip fuel for an aircraft has become an important research topic

because of profitability and safety issues. Several models have been studied in the literature; however, many gaps still exist, leaving room to improve upon and contribute to the existing literature. The research gaps that need to be addressed are:

1 In current practice, fuel consumption for a flight trip is usually estimated by energy balance mathematical approaches (EAs). The most popular of these are the Simmod simulation program, developed from the so-called advanced fuel burn model (AFBM) (Collins, 1982) by the Federal Aviation Administration (FAA) (Baklacioglu, 2016), and the base of aircraft data (BADA) developed by the European Organization for the Safety of Air Navigation (Eurocontrol) (Nuic, 2014). However, the actual performance of a flight usually deviates from such estimation. Firstly, the information needed to determine fuel flow is not always available in a real scenario, and a great deal of flight testing is needed to generate the parameter data, which may make EAs an expensive technique to adopt (Trani and Wing-Ho, 1997). The parameters include engine constants, airspeed, aircraft mass, atmospheric density, extra thrust needed for ascending, and wing area (Huang et al., 2017). Secondly, aircraft fuel estimation may not be accurate in EA methods owing to outdated existing databases and the unavailability of data for all types of aircraft and engines (Yanto and Liem, 2018). The unavailability of data may result in the usage of more global parameters with default values rather than local parameters, which in turn may lead to inaccurate results (Pagoni and Psaraki-Kalouptsidi, 2017). Insufficient attempts have been made to study the effect of using global parameters on the deviation of fuel consumption from estimation.

2 The applications of machine learning neural networks (NNs) are gaining significant interest among researchers for improving the various operations of airlines. To provide an alternative to, and simplify, EA based fuel estimation methods, the application of a backpropagation neural network (BPNN) was suggested in the literature. A BPNN is a fixed-topology NN, and the adjustment of its various hyperparameters depends on user expertise. The trial-and-error experimental work and the problem of local minima cannot assure that the selected network is optimal, causing the weak generalization performance and slow convergence of the BPNN (Huang et al., 2006; Kapanova et al., 2018; Krogh and Hertz, 1992; Liew et al., 2016; Srivastava et al., 2014). In existing works, the BPNN is applied to estimate fuel without addressing these key issues.

3 In the existing literature, operational parameters are selected based on prior experience and knowledge. Many operational parameters that may contribute to fuel consumption have previously been ignored. The existing BPNN fuel estimation models consider only low-level operational parameters and cover a small number of aircraft types with limited flight data, with the future recommendation of incorporating more parameters. A gap exists in extracting and adding high-level operational parameters that contribute to fuel consumption but have previously been ignored.

4 Other than EAs and NNs, researchers have also developed other models to achieve better fuel estimation. Similar to EAs and BPNNs, the limitation of these models is that many of them are proposed for distinct flight phases, e.g., take-off, cruise, and descent. The weakness is that such models cannot be used to estimate fuel for a whole flight journey. Taking the cumulative effect of the phases to estimate fuel for the journey may even lead to higher fuel deviation. It should be noted that complete flight trip fuel consumption depends on many operational parameters and their changing behaviour at the departure airport, during the cruise phase, and at the arrival airport.

5 Lastly, insufficient work has studied the effect on fuel estimation of dividing the high-dimensional historical data into chunks (portions) of data, considering all phases collectively, using machine learning NNs (a toy illustration of such route-wise chunks follows this list). Machine learning algorithms are highly dependent on historical datasets, and the application of an algorithm that gives better results regardless of changes in data structure and size is gaining importance.
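
The route-wise chunking mentioned in gap 5 can be pictured with the toy sketch below (Python); the table, column names, and routes are assumptions for illustration only and do not reflect the airline data used later in the thesis.

```python
# Illustrative only: splitting a flight-record table into route-wise "chunks"
# (individual models) versus keeping all records together (combined model).
import pandas as pd

flights = pd.DataFrame({
    "sector_route":      ["HKG-LHR", "HKG-LHR", "DXB-HKG", "DXB-HKG", "SYD-HKG"],
    "takeoff_weight_kg": [351_000, 348_500, 340_200, 342_900, 355_100],
    "trip_fuel_kg":      [98_400, 97_100, 76_300, 77_800, 88_900],
})

combined = flights                                               # one estimator over all routes
individual = {r: g for r, g in flights.groupby("sector_route")}  # one estimator per route

for route, chunk in individual.items():
    print(route, len(chunk), "flights")   # each chunk would be modelled separately
```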

1.3 Research objectives

The purpose of this work is to develop a novel machine learning NN algorithm with better generalization performance and learning speed that can efficiently handle real, large-scale aircraft trip fuel estimation problems. The aim is to eliminate the need for a trial-and-error approach and to reduce the number of hyperparameter adjustments, the usage of global parameters with default values, the reliance on outdated databases, and the degree of expert involvement. We consider that insufficient attempts

have been reported in the literature concerning the estimation of trip fuel using feedforward

constructive neural networks (CNNs) along with high-dimensional data for the entire trip flight

phases collectively. More specifically, the research objectives can be concisely described

below:

1 To formulate the trip fuel estimation model and define an objective function of minimizing fuel deviation (an illustrative form of this objective is sketched after this list). Relevant operational parameters that can contribute to fuel consumption will be extracted from the real historical high-dimensional data of one of the international airlines operating in Hong Kong. The extracted operational parameters will be incorporated into the newly formulated model to estimate trip fuel for all phases collectively. This objective will facilitate addressing the first four research gaps identified in Section 1.2.

2 To develop a novel machine learning NN algorithm with better generalization performance and faster learning speed requiring fewer hyperparameter adjustments. The roadmap is to conduct a comprehensive review of the existing literature of the last three decades proposing improvements to feedforward neural network (FNN) learning algorithms and optimization techniques. Based on an extensive review and understanding of the merits, limitations, recent research trends, and research gaps in FNNs, we plan to develop a novel CNN. The basic idea is to generate linearly independent hidden units in each hidden layer through analytical calculation from the input units, and to connect only the last hidden layer to the output unit through output connection weights with the maximum error convergence property. This objective will facilitate addressing the second and fifth research gaps identified in Section 1.2.

3 To apply the novel CNN to minimize fuel deviation. A performance comparison of the novel CNN with an existing airline energy balance approach (AEA) and a traditional BPNN will be made to understand the effectiveness of the proposed CNN in minimizing fuel deviation. This objective will mainly address all of the research gaps identified in Section 1.2.
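
A minimal way to express the objective in research objective 1 is sketched below; this is an illustrative form only, since the exact formulation, constraints, and parameter set are developed in Chapter 3:

\[
\min_{\theta}\;\; \frac{1}{N}\sum_{i=1}^{N}\Big|\, f_{\theta}(\mathbf{x}_i) - F^{\mathrm{act}}_i \,\Big|,
\]

where \(\mathbf{x}_i\) collects the operational parameters of historical flight \(i\), \(f_{\theta}\) is the trip fuel estimation model (the AEA, a BPNN, or the proposed CNN) with parameters \(\theta\), \(F^{\mathrm{act}}_i\) is the trip fuel actually consumed on that flight, and \(N\) is the number of flights.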

1.4 Research contributions

Early attempts to estimate trip fuel by EAs gained popularity. However, the successful wide use of machine learning in many applications has shifted the interest of airline operating organizations. The unavailability of data related to local parameters, complex mathematical formulations, and the required expert involvement limit the applicability of EAs. The application of BPNNs as an alternative to EAs constitutes an attempt to overcome these limitations with improved results. However, the existing BPNN-based fuel estimation models only cover a small number of aircraft types with limited flight data. The possible reasons are the random generation and iterative tuning of the connection weights on both sides of BPNNs by gradient-based learning algorithms. This may create a complex co-adaptation by generating redundant hidden units, which contribute little to error convergence. The adjustment of local and global hyperparameters together with the connection weights may require many experimental trials to obtain an optimal fixed-topology network. These problems increase the expertise requirement as well as causing weak generalization performance and slow convergence.

The application of traditional machine learning NNs for airline trip fuel estimation is not

straightforward. Extensive knowledge is required to fit the algorithm to the model. The varying

nature of airline input parameters and the high dimensionality of data might be challenging for

traditional BPNNs to accurately estimate the trip fuel with the aforementioned limitations. The

same problem has been observed in our experimental work (discussed in Chapter 5). Similarly,

in the current literature, the application of fixed-topology BPNNs for fuel estimation is limited

in scope by considering a few aircraft types, flight data, and operational parameters. This means

that the application of machine learning NNs on the varying nature of operational parameters

and the high dimensionality of data is still an open area. To overcome the limitations of BPNNs,

this study proposes a CNN featuring a self-organizing cascading architecture and capable of

analytically calculating connection weights. These characteristics generate linearly

independent (non-redundant) hidden units having the capability of maximum error reduction.

The proposed CNN thus achieves a better generalization performance and converges

significantly faster than traditional BPNNs.
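
The self-organizing, analytic construction described above can be pictured with the minimal sketch below (Python/NumPy). It is illustrative only and assumes that hidden-unit input weights are obtained from a principal component analysis (an orthogonal linear transformation) of the current layer input, and that the output weights of the newly added layer are solved by ordinary least squares; the actual CPCLS algorithm, its stopping rules, and its convergence guarantees are given in Chapter 4.

```python
# Minimal sketch of a constructive cascade in the spirit of CPCLS (assumptions
# noted in the text above); not the thesis's exact algorithm.
import numpy as np

def add_cascade_layer(Z, y, n_units):
    """One constructive step: analytic input weights via PCA, OLS readout."""
    Zc = Z - Z.mean(axis=0)                               # centre before the orthogonal transform
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)     # principal directions of Z
    W_in = Vt[:n_units].T                                 # analytically determined input weights
    H = np.tanh(Zc @ W_in)                                # linearly independent hidden units
    Hb = np.column_stack([H, np.ones(len(H))])            # bias column for the readout
    w_out, *_ = np.linalg.lstsq(Hb, y, rcond=None)        # ordinary least squares output weights
    return H, Hb @ w_out                                  # hidden activations and fitted output

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                             # stand-in flight parameters
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)   # stand-in trip fuel target

Z, best = X, np.inf
for _ in range(5):                                        # grow layers while training error improves
    H, y_hat = add_cascade_layer(Z, y, n_units=4)
    err = np.mean((y - y_hat) ** 2)
    if err >= best:
        break
    best = err
    Z = np.column_stack([X, H])                           # cascade: next layer sees inputs and H
print(f"training MSE after construction: {best:.4f}")
```

Note that, in this sketch, only the readout of the newly added layer connects to the output, mirroring the idea of connecting only the last hidden layer to the output unit.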


The application of the proposed CNN with varying operational parameters and high-

dimensional data for airline trip fuel estimation can bring a breakthrough improvement

compared to existing fuel estimation methods. Unlike EAs, the advantages of the proposed

CNN are that it requires neither the involvement and consultation of an expert for a complex

mathematical formulation nor the cost to determine the coefficients. In addition, it can work

with any type of available data. Compared to BPNNs, the proposed CNN avoids fixed-topology trial-and-error experimental work with excessive adjustment of hyperparameters and iterative tuning of the connection weights. These advantages simplify the implementation of

high-dimensional data with the objective of achieving better trip fuel estimation with fast

convergence. A comparative study has been performed among the proposed self-organizing

CNN, AEA, and a BPNN to estimate airline trip fuel. To make the study more valuable, the

comparison is not limited to high-dimensional data and its varying operational parameters

(referred to as a combined model in Section 5.4.2.1). We also include the portion of data

resulting from dividing high-dimensional data into a small dataset representing the same

behaviour within operational parameters (referred to as individual models in Section 5.4.2.2).

The significant improvements in trip fuel estimation and learning speed in both models (the combined model and the individual models) increase confidence that the proposed CNN may eliminate the need to add more fuel based solely on experience.

1.5 Research scope and significance

The main focus of the research is to formulate a machine learning CNN model for aircraft trip

fuel estimation that is able to handle practical large-scale problems. The work will help to add

knowledge to the existing literature from two different viewpoints: academically and

practically. Academically, the research work will help to fill the identified research gaps in the

area of aircraft trip fuel estimation and machine learning as discussed in Section 1.2.


Practically, the research findings will provide insights for the aviation industry on controlling airlines' excess fuel consumption. Furthermore, the work will contribute to the global goal of IATA of achieving an average improvement in fuel efficiency of 1.5% per year (IATA, 2018).

1.6 Structure of the thesis

The thesis is organized as follows:

▪ Chapter 2 presents the literature review. The chapter is divided into two major sections followed by a discussion. Section 2.1 reviews mainly the existing aircraft fuel estimation models. Section 2.2 reviews mainly FNN learning algorithms and optimization techniques, with their merits, limitations, shifts in research trends, and real-world applications. Section 2.3 summarizes the research gaps.

▪ Chapter 3 explains the framework of fuel consumption and its deviation from the

estimation. The relevant operational parameters that may contribute to fuel

consumption are extracted and their relationships to fuel consumption are explained in

detail. The relationship between mathematical and data-driven methods is constructed

and the significance of the proposed method is elaborated. Then, the trip fuel estimation

model is formulated, and the objective function is defined.

▪ Chapter 4 presents the novel self-organizing CNN. The chapter first explains the convergence limitations of the existing CNN and its extensions. Then, a novel self-organizing CNN learning algorithm is proposed and explained with a supporting lemma and theoretical justification. Finally, various sets of artificial benchmarking and real-world application datasets are used to demonstrate the generalization performance and learning speed of the proposed CNN.


▪ Chapter 5 is about aircraft trip fuel estimation and comparison. The chapter describes the data and pre-processing techniques. Experimental work is then carried out by estimating trip fuel with the novel CNN and comparing it with the AEA and the BPNN. Major findings are discussed in detail to provide better insight.

▪ Chapter 6 concludes the thesis by explaining the limitations of existing work and

possible future research directions.

▪ The applications of FNNs to solving real-world problems, described in Chapter 2, are presented in Appendix A.


Chapter 2. Literature Review

In Chapter 2, a literature review is carried out on two topics: aircraft fuel estimation, and FNN learning algorithms and optimization techniques. Firstly, for aircraft fuel estimation (Section 2.1), the existing fuel estimation models based on EAs (Subsection 2.1.1) and BPNNs (Subsection 2.1.2) are reviewed. Other estimation methods are also reviewed in Subsection 2.1.3 to enrich the content. Secondly, for neural network learning algorithms and optimization techniques (Section 2.2), we conducted a comprehensive review to understand the noteworthy contributions made in the area of FNNs, together with their merits and limitations, and to identify new research directions that will help in designing a novel and efficient algorithm. The survey methodology of the comprehensive review is discussed in Subsection 2.2.1. Subsection 2.2.2 gives a brief introduction to the FNN. Learning algorithms and their applications are reviewed in Subsection 2.2.3, whereas optimization techniques and their applications are reviewed in Subsection 2.2.4. Finally, in Section 2.3, the literature review is summarized and the research gaps in fuel estimation and FNNs are identified and elaborated.

2.1 Aircraft fuel estimation

Trip fuel, which contains the largest share of the loaded fuel weight, must be considered for improvement. A small improvement in fuel estimation can bring significant savings in fuel

consumption and substantial cost benefit (Jensen et al., 2013). Irrgang et al. (2015) concluded

from their work on aircraft fuel optimization analytics that the optimal amount of fuel should

be accurately determined, resulting in the minimum amount of fuel remaining at the destination

airport. Taking extra fuel increases the weight of the aircraft, which results in more fuel

consumption than required. Turgut et al. (2014) worked on the fuel flow during the cruise

phase. They suggested that reducing the aircraft weight by 1 tonne, increasing the flying


altitude by 100 ft, and reducing the cruise speed by 1 knot may result in a reduction of per-hour

fuel consumption of 15-21 kg, 26-28 kg, and 7.7-8.7 kg, respectively. Taking more fuel

increases the weight of the aircraft, and in consequence, more fuel is consumed with higher

maintenance costs (Abdelghany et al., 2005). Therefore, more efficient and effective methods

are needed to make the estimation models applicable (Choi et al., 2016).

2.1.1 EAs based fuel estimation

Early attempts made use of EAs to estimate the fuel for aircraft. EAs are based on extensive

mathematical formulations and are derived from the basic concept of energy balance. They

assume static constants of aircraft performance and a dynamic input of the path profile. The

energy balance can be expressed in terms of the energy the aircraft gains and loses as it travels

along the prescribed path profile. During each flight operation, as a result of the change in

kinetic and potential energies, the aircraft suffers energy losses because of drag. These losses

are adjusted by a certain amount of thrust to maintain the energy balance. Thus, an aircraft can

be viewed as a system undergoing energy losses and gains that should be continuously balanced

by the consumption of fuel energy. Collins (1982) derived an algorithm for fuel estimation

based on such energy balance approach by considering the description of aircraft configuration,

weight, and path profile. The path profile depends on a change in true airspeed, altitude, and

flight time. Collins (1982) explained that various aircraft-specific constants need to be

determined for each aircraft to define the relationship between drag and lift/weight coefficients

and to define the relationship between fuel consumption and energy gain from the thrust as a

function of velocity and altitude. Experimental work demonstrated that the algorithm could

result in saving more than two million US gallons of fuel annually. The accuracy check of the

algorithm with Eastern Airlines demonstrated its effectiveness with a difference of less than

3%. This algorithm, known as AFBM, was developed by the Mitre corporation under the FAA


and was incorporated in their simulation software Simmod to predict the fuel consumption for

each flight (Baklacioglu, 2016; Schilling, 1997). Similarly, BADA, developed by Eurocontrol

(Nuic, 2014), is another energy balance-based fuel estimation method that estimates thrust-

specific fuel consumption (TSFC) as a function of true airspeed. The fuel for different phases

and engine types (Jet, Turbo, and Piston) can be calculated from coefficients defined for each

aircraft type in operational performance files (OPF) of BADA. The BADA database holds a

set of files explaining the performance and operating procedure coefficients of many different types of aircraft. The coefficients are used to calculate drag, thrust, and fuel flow. As a result,

climb, nominal cruise, and descent speeds for the targeted aircraft are provided.
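
To illustrate the energy balance idea just described, a schematic form of the balance (not the exact AFBM or BADA formulation, and with symbols chosen here for illustration) equates the rate at which fuel energy is released to the rate of change of kinetic and potential energy plus the power dissipated by drag:

\[
\eta\,\dot{m}_f\,H \;=\; \frac{d}{dt}\!\left(\tfrac{1}{2}\,m V^{2}\right) \;+\; m g \,\frac{dh}{dt} \;+\; D\,V,
\]

where \(\dot{m}_f\) is the fuel flow, \(H\) the fuel heating value, \(\eta\) the overall propulsive efficiency, \(m\) the aircraft mass, \(V\) the true airspeed, \(h\) the altitude, and \(D\) the drag. In the BADA family, for jet engines the thrust-specific fuel consumption is modelled as an affine function of true airspeed, roughly \(\eta_{\mathrm{TSFC}} = C_{f1}\,(1 + V_{\mathrm{TAS}}/C_{f2})\), so that the nominal fuel flow is \(\eta_{\mathrm{TSFC}}\) multiplied by the thrust, with aircraft-specific coefficients \(C_{f1}\) and \(C_{f2}\) taken from the OPF; the exact expressions depend on the BADA revision.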

The AFBM Simmod model is based on various parameters to estimate fuel flow. The

parameters include engine constants, airspeed, aircraft mass, atmospheric density, extra thrust

needed for ascending, and wing area (Huang et al., 2017). Trani and Wing-Ho (1997) argued

that the information needed to determine fuel flow is not always available in a real scenario

and a great deal of flight testing is needed to generate the parameter data, which may make the

Simmod method expensive. The Simmod method has certainly improved the measurement of

airline fuel efficiency (Baklacioglu, 2016; Schilling, 1997; Senzig et al., 2009; Trani et al.,

2004). However, Senzig et al. (2009) reported that BADA is an effective method that has

gained much more popularity than Simmod. Indeed, BADA can estimate fuel consumption

with an average deviation of 3% during the cruise phase. For the terminal areas, such as takeoff

and descent, BADA loses accuracy. An average difference of -22.3% was observed between

the BADA model and the fuel consumption reported by an airline during the takeoff phase.

Moreover, according to recent studies, aircraft fuel estimation may not be accurate in EA

methods (Simmod and BADA) owing to outdated existing databases and the unavailability of

data for all types of aircraft and engines (Yanto and Liem, 2018). The unavailability of data

may result in the usage of more global parameters with default values rather than local


parameters, which in turn may lead to inaccurate results (Pagoni and Psaraki-Kalouptsidi,

2017). Complex mathematical computation, high testing and consultation cost, expert

involvement, and estimation errors, which may become worse in certain conditions, limit EA

applicability.

2.1.2 BPNN-based fuel estimation

In the literature, many efforts have been made to simplify EA-based fuel estimation methods

by means of BPNNs. Schilling (1997) proposed to estimate aircraft fuel consumption by a BPNN in

conjunction with the AFBM energy balance theory to simplify the computational capability

and improve the estimation accuracy of the AFBM. The engine-specific constant in AFBM

was determined by multivariate regression with an average accuracy error of 4% (Schilling,

1997). Their proposed aircraft fuel consumption model (AFCM) comprises two systems: 1)

initial fuel estimation by AFBM; and 2) information generated by AFBM, such as altitude,

velocity, and estimated fuel, mapped into the BPNN. Two inputs with 600 flight instances were

trained and a network with 2 layers and 7 hidden units in each layer was selected through a

trial-and-error approach among six candidate BPNNs. They employed various activation

functions with a backpropagation Levenberg-Marquardt (LM) learning algorithm. Boeing 747-

100 and 767-200, Bombardier Dash-7, DC 10-30, and Jetstar were the aircraft used for

performance comparison. Schilling (1997) demonstrated that AFCM achieved better results

compared to AFBM. AFCM was modelled based on low-level input variables such as altitude

and velocity with the future recommendation of incorporating more operational parameters.

Trani and Wing-Ho (1997) pointed out that EAs are challenging to implement because the

information required to define the constants and to determine the coefficients demands great

effort in addition to testing and expertise. These disadvantages are the main constraints on their

expansion. Trani et al. (2004) proposed a BPNN as an alternative approach for fuel estimation


during the climb, cruise, and descent phases of a flight. Eight different sizes of BPNN candidate

networks for fuel estimation were constructed according to four operational parameters,

namely Mach number, weight, temperature, and altitude, covering the cruise phase in a dataset

comprising 1,610 flight instances performed by Fokker F-100 aircraft. They suggested that a

BPNN trained by LM with three layers is the best approach for fuel estimation with eight

hidden units in each of the first two layers and one unit in the output layer. They employed various

activation functions with a maximum of 10,000 training epochs. The BPNN was trained

for the cruise phase with a target vector of a specific air range rather than consumed fuel. The

trained BPNN model of the cruise phase was also implemented for the climb and descent flight

phases. Actually, the climb and descent phases consisted of a fuel burn output vector. This

implies that this research work is more focused on engine performance over distance traveled

during the cruise phase. The work does not clearly distinguish the estimation in different flight

phases because the specific air range is defined as the distance covered per unit of consumed

fuel (Nautical Miles/ Kilogram or NM/kg), whereas fuel burn may be related to fuel

consumption (kg) during flight. There exist many similarities between the suggestions of

(Schilling, 1997) and (Trani et al., 2004) concerning BPNNs. These similarities are based on

the same assumptions, involving trial-and-error approaches to find hidden units for two hidden

layers along with different activation functions.

Baklacioglu (2016) improved the work reported in (Trani et al., 2004) and proposed a so-called

genetic algorithm optimized neural network (GANN) to determine the network

hyperparameters and number of hidden layers to achieve the best model with less effort. Such

networks considered two input operational parameters, i.e., altitude and airspeed, as a function

of fuel flow rate, with a dataset composed of 347, 404, and 483 flight instances for the climb,

cruise, and descent phases, respectively. The targeted aircraft was a medium-weight Boeing

737-800. The work indicated that GANNs with one hidden layer are better than those with two


hidden layers and achieve improved results for the climb and descent phases compared to a

previously suggested model (Trani et al., 2004).

2.1.3 Estimation methods other than NN-based machine learning

methodologies

Apart from NN machine learning methodologies, researchers have also developed models to

achieve a better fuel estimation. Turgut and Rosen (2012) proposed a genetic algorithm fuel

consumption model for commercial flights in a Boeing B737-800 by studying the relationship

between altitude and fuel flow during four different descents within the descent flight phase.

This model suggests avoiding low-level flight and maintaining higher altitude if possible

during the descent phase for greater fuel saving. When a delay condition occurs, aircraft

holding at a higher altitude could save substantial fuel rather than holding the flight at a lower

altitude. The results demonstrate that performing low-level flights at an altitude 1000 ft higher

could substantially decrease fuel consumption. Chati and Balakrishnan (2017) studied the

impact of takeoff weight (TOW) on aircraft fuel consumption and proposed a Gaussian process

regression (GPR) statistical approach to determine TOW using observed data from takeoff

ground roll. The estimated TOW model was used as an input to study its effect on fuel

consumption during ascent. In particular, GPR-estimated TOW was averaged over eight

different types of aircraft for 874 flight instances. The predicted TOW was used as an input

feature to estimate the fuel flow rate during the ascent phase. Ryerson et al. (2011) studied the

impact of three performance metrics, namely schedule padding, airborne delay, and departure

delay, on the fuel consumption of two aircraft, the Boeing 757-200 and 737-800. They found that

airborne delays burn 50-60 lbs/minute of fuel compared with schedule padding, which is 4.5-

12 lbs/minute, and a departure delay, which is 2.3-4.6 lbs/minute. Additionally, a congested

airport terminal area increases fuel consumption by 16%. Improving these performance

measurements by eliminating the above three delays through econometric methods can reduce

airborne fuel consumption.

2.2 Neural networks learning algorithms and optimization

techniques

The importance of FNNs is increasing every day due to their ability to process big nonlinear

data in a way inspired by the human brain. An FNN can discover hidden patterns in the data by taking raw data

at the input and passing it layer by layer in the forward direction until it arrives at the output. The

model is trained to correctly estimate unseen data (also known as test data); this ability is referred to as the

generalization performance of the FNN. The ideal FNN is considered to have better generalization

performance and may require less learning time (also known as convergence rate) to train the

model. Generalization performance can be defined as the ability of an algorithm to accurately

predict values on previously unseen samples (Yeung et al., 2007), whereas, learning time can

be defined as the ability of the algorithm to train a model quickly. Both “generalization

performance” and “learning time” are key performance indicators for FNNs and are used by

researchers to demonstrate the effectiveness of their proposed algorithms. Some of the

drawbacks that may affect the generalization performance and learning speed of FNNs include

local minima, saddle points, plateau surfaces, hyperparameters adjustment, trial and error

experimental work, tuning connection weights, deciding hidden units and layers, and many

others. The drawbacks that limit FNN applicability may become worse with inappropriate user

expertise and insufficient theoretical information. For instance, how to define network size,

hidden units, hidden layers, connection weights, learning rate, topology, and many others.

In the literature, the answers to the above problems are not so straightforward. Researchers

have proposed several learning algorithms and optimization techniques that help improve

FNNs, with the main motivation of achieving better generalization performance in the

shortest possible network training time. In the existing literature surveys, several authors have

reviewed FNN algorithms by performing a comparative study of different algorithms within

the same class (for instance: constructive algorithms comparison based on data, and many

others), studying the application area (for instance: business, engineering, and many others)

and specific class surveys (for instance: network ensemble survey and many others). For

instance, Zhang (2000) focused on and surveyed the recent development of neural networks

for classification problems. The review included the link between the neural and conventional

classifiers and demonstrated that neural networks are a competitive alternative to the traditional

classifiers. Other contributions include examining the issues of posterior probability

estimation, feature selection, and the trade-off between learning and generalization. Hunter et

al. (2012) performed a comparative study among different types of learning algorithms and

network topology so as to select a proper neural network size and architecture for better

generalization performance. LeCun et al. (2015) reviewed deep learning and provided in-depth

knowledge of backpropagation, convolutional neural network, and recurrent neural networks.

The success of deep learning is that it requires little engineering by hand, and new algorithms

will accelerate its progress further. Tkáč and Verner (2016) provided a systematic review

of neural network applications over two decades and disclosed that most of the application

areas included financial distress and bankruptcy. Cao et al. (2018) presented a survey on tuning

free random weights neural networks from the perspective of deep learning. The traditional

deep learning iterative algorithms are far slower and have the problem of local minima. The

survey suggests that the computing efficiency of deep learning increases due to the combination

of traditional deep learning and tuning free random weights neural networks.

In the above studies, the focus is only on the specific type of algorithms or their applications

which limits their scope. The existing studies are more focused on comparing and selecting a

suitable algorithm within their class, which is solely based on expertise and available

application data. They do not clearly identify the research directions over the past decades.

Researchers have made efforts to reduce the drawbacks of FNNs, however, a comprehensive

review is missing, and an open challenge is to gather the answers for the drawbacks in one

platform. Therefore, we carried out a comprehensive literature review and classified it into six

categories based on the algorithms proposed, and investigated their applications in real-world

management, engineering, and health sciences problems, so as to understand the researchers’

current interests and directions in overcoming FNN drawbacks. Our review contributes to the

existing literature not only by summarizing the recent developments in FNN algorithms and

classifying them into six categories according to the nature of algorithms, but also by exploring

the applications of the proposed algorithms in solving real-world management, engineering,

and health sciences problems and demonstrating the great potential for their practical

utilization. Moreover, we propose several interesting and crucial future research directions

regarding FNNs which are believed to be useful for the development of the area.

2.2.1 Survey methodology

2.2.1.1 Source of literature

The objective of the study is to identify and classify the learning algorithms and optimization

techniques that have contributed to improving the generalization performance and learning

speed of FNNs. Therefore, a comprehensive review has been conducted to get in-depth

knowledge of the existing work and to understand researchers’ contributions and work

directions. Furthermore, the authors discuss future research directions so as to contribute to

strengthening the literature. To accomplish these objectives, the literature surveyed in the study

was explored from seven different sources: IEEE Xplore (IEEE), ScienceDirect (Elsevier),

Emerald Insight, arXiv (Cornell University), SpringerLink (Springer), Taylor & Francis, and

Google Scholar. The survey is based on articles in journals, conference proceedings, archives,

technical reports, books, and academic lectures. The focus was to select articles published in

the last three decades mainly in the period 1986-2018. However, the articles that contribute

significant knowledge to the existing literature but fall outside this time frame (for instance,

1985 and earlier) are also included to support and deepen the review.

The research contributions and its applications in FNN are numerous and cannot be covered in

one study. Four keywords that are related to FNN were used to search for articles in the above-

mentioned databases: “generalization performance”, “learning rate”, “overfitting”, and “fixed

and cascade architecture”. Moreover, the combination of keywords was also used to get

relevant articles. Duplicated articles across the databases, non-English articles, and articles that matched the

keywords but were out of scope were discarded. The screening process was limited to articles

belonging to the Q1 category of ranked journals, issued by either “Scientific Journal Rankings

- SJR” or “Journal Citation Reports - JCR” in the year 2018. However, to strengthen the review,

a small number of highly cited conference papers and articles belonging to Q2, Q3, and Q4

journals with more than 500 citations, and articles from other sources (such as online archives,

books, technical reports, and websites) with more than 100 citations were also considered. All

the searched article’s abstracts and the conclusions were completely reviewed along with the

full text to screen high-quality, relevant literature. This resulted in a total of 80 articles. Figure
2.1 shows the distribution of the articles along with the number and percentage in each
category. Of the 80 articles, 63 (78.75%) are journal papers, 10 (12.50%) conference papers, 3
(3.75%) online arXiv archives, 2 (2.50%) books, 1 (1.25%) a technical report, and 1 (1.25%) an
online academic lecture.

Figure 2.1. Articles distribution (journal articles: 63, 78.75%; conference proceedings: 10, 12.50%; arXiv online archives: 3, 3.75%; books: 2, 2.50%; technical report: 1, 1.25%; academic lecture: 1, 1.25%)

Table 2.1 lists the journals, conferences, archives, books, technical reports, and academic
lectures used in the literature, along with the publisher, the number of papers extracted, and the
number of citations. The content of the table illustrates the importance of the screened articles
not only in journals but also in conferences and other sources. The main idea was to include
highly cited articles published in reputable journals; however, a small number of articles from
conferences and other sources with a high citation rate and unique ideas are also considered as
part of the survey to enrich the content.

Table 2.1. Articles source description (source: number of papers, number of citations)

1) Journal Article: 63, 88144
   Artificial Intelligence: 42, 48707
      IEEE Transactions on Neural Networks and Learning Systems: 19, 18778
      Journal of Machine Learning Research: 2, 11513
      Neurocomputing: 6, 8123
      Neural Networks: 6, 5650
      IEEE Transactions on Pattern Analysis and Machine Intelligence: 3, 4411
      Artificial Intelligence Review: 1, 140
      Neural Computing and Applications: 3, 62
      Information Sciences: 2, 30
   Multidisciplinary: 2, 25654
      Nature: 2, 25654
   Applied Mathematics: 6, 6219
      Mathematics of Computation: 1, 3093
      Technometrics: 1, 1752
      SIAM Review: 1, 636
      Applied Mathematics and Computation: 2, 543
      Mathematical Programming: 1, 195
   Arts and Humanities (Miscellaneous): 1, 3491
      Neural Computation: 1, 3491
   Computer Science Applications: 7, 3226
      IEEE Transactions on Cybernetics: 2, 2762
      IEEE Transactions on Industrial Informatics: 1, 180
      IEEE Transactions on Industrial Electronics: 1, 134
      Journal of Chemical Information and Computer Sciences: 1, 88
      IEEE Access: 1, 32
      Industrial Management & Data Systems: 1, 30
   Engineering (Miscellaneous): 1, 434
      Advances in Engineering Software: 1, 434
   Computer Networks and Communication: 2, 315

      IEEE Intelligent System: 1, 261
      Neural Processing Letters: 1, 54
   Statistics and Probability: 1, 76
      American Statistician: 1, 76
   Electrical and Electronics Engineering: 1, 22
      IEEE Transactions on Circuits and Systems I: Regular Papers: 1, 22
2) Conference Proceedings: 10, 41712
   IEEE: 4, 32574
      International Symposium on Micro Machine and Human Science: 1, 13257
      IEEE International Conference on Evolutionary Computation: 1, 11520
      IEEE International Conference on Neural Networks: 1, 4793
      International Joint Conference on Neural Networks: 1, 3004
   MIT Press: 4, 7492
      Advances in Neural Information Processing Systems: 4, 7492
   IMLS: 1, 1054
      International Conference on Machine Learning: 1, 1054
   Morgan Kaufmann: 1, 592
      Proceedings of the 1988 Connectionist Models Summer School: 1, 592
3) arXiv Archive: 3, 21519
   Cornell University Library: 3, 21519
4) Book: 2, 14323
   The MIT Press: 1, 14009
   Morgan & Claypool Publishers: 1, 314
5) Report: 1, 1238
   School of Computer Science, Carnegie Mellon University: 1, 1238
6) Webpage: 1, 118
   Coursera: 1, 118
Grand Total: 80, 167054

The 80 articles published over time are shown in Figure 2.2. It illustrates that in the year 1989

and earlier, the FNN was not the main research area because of the unavailability of efficient

computational resources. In 1989-1994, it gained importance because of the explanation of the

theory of backpropagation (BP) (Hecht-Nielsen, 1989). This created a significant interest in

the topic and researchers identified the new research gaps to improve the existing BP by

proposing new learning algorithms, for instance: cascade correlation learning (Fahlman and

Lebiere, 1990), probabilistic neural network (Specht, 1990) and general regression neural

network (Specht, 1990). Although FNN dates back before the ’50s, but it gained importance in

the ’90s. In the modern era, the development of more efficient computational resources and the

availability of big data make it a more promising research area which is evident by the growth

rate increase from 2001 onwards (Choi et al., 2018; Shen, Choi and Chan, 2019; Shen, Choi

and Minner, 2019; Shen and Chan, 2017).

Figure 2.2. Articles published over time (number of articles and percentage contribution per three-year period, from before 1986 to 2016-2018)

2.2.1.2 The philosophy of the review work

The review work was conducted in five steps:

Step-1) Relevant literature explaining the learning algorithms and optimization techniques

proposed to improve the generalization performance and learning speed of FNN was

identified based on the popular keywords used in FNN.

Step-2) The algorithms were classified into six categories. The algorithms were assigned to a

category based on problem identification, mathematical model, technical reasoning,

and proposed solution.

Step-3) The six categories were further classified into two main parts. Part I reviews the two

categories covering learning algorithms, whereas the remaining four categories, covering

optimization algorithms (techniques), are reviewed in Part II.

Step-4) The algorithms are explained with their merits and technical limitations to suggest

future research directions in FNN.

Step-5) The applications of the proposed algorithms in the real-world are identified to show the

success of FNN in management, engineering, and health sciences problem-solving.

2.2.1.3 Classification schemes

The classification scheme in the existing literature surveyed in FNN is mainly focused on the

comparative study of different algorithms within the same class (for instance: constructive

algorithms comparison based on data, etc), studying the application area (for instance: business,

engineering, etc) and specific class surveys (for instance: network ensemble survey, etc). This

study classification is unique as its focus is on learning algorithms and optimization techniques

recommended in the last three decades for improving the generalization performance and

learning speed of FNN. The algorithms are classified into six categories and further divided

into two main parts. The six categories are:

Part I: Learning algorithms (Gradient learning algorithm, Gradient-free learning algorithm)

Part II: Optimization techniques (Optimization algorithms for learning rate, Bias and Variance (Underfitting and Overfitting) minimization algorithms, Constructive topology FNN, Metaheuristic Search Algorithms)

Categories one and two belong to Part I and are considered learning algorithms. The first category covers gradient learning algorithms that need first-order or second-order gradient information to build FNNs, and the second category covers gradient-free learning algorithms that determine connection weights analytically rather than by first- or second-order gradient tuning. Categories three to six belong to Part II and are considered optimization techniques. The third category contains optimization algorithms that vary the learning rate at different iterations to improve generalization performance by avoiding divergence from a minimum of the loss function; the fourth category contains algorithms that avoid underfitting and overfitting to improve generalization performance at the cost of additional training time; the fifth category contains constructive topology learning algorithms that avoid the need for trial-and-error approaches for determining the number of hidden units in fixed-topology networks; and the sixth category contains metaheuristic global optimization algorithms that search the loss function in the global search space instead of a local space.

Table 2.2. Classification of FNN published algorithms

1. Gradient learning algorithms for Network Training
   Algorithms: Gradient descent, stochastic gradient descent, mini-batch gradient descent, Newton method, Quasi-Newton method, conjugate gradient method, Quickprop, Levenberg-Marquardt Algorithm, Neuron by Neuron
   References: Hecht-Nielsen (1989), Bianchini and Scarselli (2014), LeCun et al. (2015), Wilamowski and Yu (2010), Rumelhart et al. (1986)*, Wilson and Martinez (2003), Wang et al. (2017), Hinton et al. (2012)*, Ypma (1995), Zeiler (2012)*, Shanno (1970), Lewis and Overton (2013), Setiono and Hui (1995)*, Fahlman (1988), Hagan and Menhaj (1994), Wilamowski et al. (2008), Hunter et al. (2012)*

2. Gradient-free learning algorithms
   Algorithms: Probabilistic Neural Network, General Regression Neural Network, Extreme learning machine (ELM), Online Sequential ELM, Incremental ELM (I-ELM), Convex I-ELM, Enhanced I-ELM, Error Minimized ELM (EM-ELM), Bidirectional ELM, Orthogonal I-ELM (OI-ELM), Driving Amount OI-ELM, Self-adaptive ELM, Incremental Particle Swarm Optimization EM-ELM, Weighted ELM, Multilayer ELM, Hierarchical ELM, No propagation, Iterative Feedforward Neural Networks with Random Weights
   References: Huang et al. (2015), Ferrari and Stengel (2005), Specht (1990), Specht (1991), Huang, Zhu and Siew (2006), Huang, Zhou, Ding and Zhang (2012), Liang et al. (2006), Huang, Chen and Siew (2006), Huang and Chen (2007), Huang and Chen (2008), Feng et al. (2009), Yang et al. (2012), Ying (2016), Zou et al. (2018), Wang et al. (2016), Han et al. (2017), Zong et al. (2013), Kasun et al. (2013), Tang et al. (2016), Widrow et al. (2013), Cao et al. (2016)

3. Optimization algorithms for learning rate
   Algorithms: Momentum, Resilient Propagation, RMSprop, Adaptive Gradient Algorithm, AdaDelta, Adaptive Moment Estimation, AdaMax
   References: Gori and Tesi (1992), Lucas and Saccucci (1990), Rumelhart et al. (1986)*, Qian (1999), Riedmiller and Braun (1993), Hinton et al. (2012)*, Duchi et al. (2011), Zeiler (2012)*, Kingma and Ba (2014)


4. Bias and Variance (Underfitting and Overfitting) minimization algorithms
   Algorithms: Validation, n-fold cross-validation, weight decay ($l_1$ and $l_2$ regularization), Dropout, DropConnect, Shakeout, Batch normalization, Optimized approximation algorithm (Signal to Noise Ratio Figure), Pruning Sensitivity Methods, Ensemble Methods
   References: Geman et al. (1992), Zaghloul et al. (2009), Reed (1993), Seni and Elder (2010), Setiono and Hui (1995)*, Krogh and Hertz (1992), Srivastava et al. (2014), Wan et al. (2013), Kang et al. (2017), Ioffe and Szegedy (2015), Liu et al. (2008), Karnin (1990), Kovalishyn et al. (1998)*, Han et al. (2015), Hansen and Salamon (1990), Krogh and Vedelsby (1995), Islam et al. (2003)

5. Constructive topology Neural Networks
   Algorithms: Cascade Correlation Learning, Evolving Cascade Neural Network, Orthogonal Least Squares-based Cascade Network, Faster Cascade Neural Network, Cascade Correntropy Network
   References: Hunter et al. (2012)*, Kwok and Yeung (1997), Fahlman and Lebiere (1990), Lang (1989), Hwang et al. (1996), Lehtokangas (2000), Kovalishyn et al. (1998)*, Schetinin (2003), Farlow (1981), Huang, Song and Wu (2012), Qiao et al. (2016), Nayyeri et al. (2018)

6. Metaheuristic Search Algorithms
   Algorithms: Genetic algorithm, Particle Swarm Optimization (PSO), Adaptive PSO, Whale Optimization Algorithm (WOA), Lévy flight WOA
   References: Mitchell (1998), Mohamad et al. (2017), Liang and Dai (1998), Ding et al. (2011), Eberhart and Kennedy (1995), Shi and Eberhart (1998), Zhang et al. (2007), Shaghaghi et al. (2017), Mirjalili and Lewis (2016), Ling et al. (2017)


Figure 2.3 illustrates the classification of the algorithms into the six categories. We identified a total
of 54 unique algorithms proposed in the 80 articles. Among the 54 unique algorithms, 27 belong
to learning algorithms and the remaining 27 to optimization techniques. Figure 2.4 illustrates the
number of algorithms identified in each category over time. Other than the proposed
algorithms, a small number of articles that support or criticize the identified algorithms are also
included to widen the review. The unique algorithms together with the supporting and critical articles result in a
total of 80 articles. Figure 2.5 illustrates the total number of articles reviewed in each category.

Figure 2.3. Algorithms distribution categories wise (9 gradient learning, 18 gradient-free, 7 learning-rate optimization, 10 bias and variance minimization, 5 constructive topology, and 5 metaheuristic algorithms)

Table 2.2 provides a detailed summary of the algorithms identified in each category along with

references to the total number of papers reviewed. In the table, the references of the articles
reviewed in each specific category are given; however, articles referenced in more than one
category are identified with an asterisk (*). For instance, the articles (Rumelhart et al., 1986),
(Setiono and Hui, 1995), (Hinton et al., 2012), (Zeiler, 2012) and (Hunter et al., 2012) appear
in the first, third and fifth categories. The major contributions of (Rumelhart et al., 1986) and
(Setiono and Hui, 1995) are in the first category, (Hinton et al., 2012) and (Zeiler, 2012) in
the third category, and (Hunter et al., 2012) in the fifth category. To avoid repetition, Figure

2.5 is plotted showing the major contribution of articles in the respective categories.

Figure 2.4. Algorithms proposed over time (number of algorithms proposed in each category per six-year period, from before 1989 to 2013-2018)

Figure 2.5. Papers distribution categories wise (14 gradient learning, 21 gradient-free, 8 learning-rate optimization, 15 bias and variance minimization, 12 constructive topology, and 10 metaheuristic articles)

The

distribution in Figures 2.3-2.5 and the content in Table 2.2 show the researchers' interest and trend
in specific categories. The distribution and content show that the research directions are
changing from complex gradient-based algorithms to gradient-free algorithms, from fixed topology
to constructive topology, from initial guesswork for hyperparameters to analytical calculation, and
towards algorithms that converge to a global minimum rather than a local minimum. In another sense,
the categories may be considered as an opportunity for discovering research gaps, where further
improvement can bring a significant contribution.

2.2.2 FNN: An overview

An FNN is a parallel information processing structure consisting of processing elements known
as neurons (hidden units), interconnected by unidirectional distributed channels
known as connections. Each processing neuron receives an incoming connection from all input
features, sums and activates them using a nonlinear activation function, and branches the result to as many
connections as desired. The processing neuron output, which can be of any mathematical type,
depends upon the input features through their weighted sum and the activation function (Hecht-Nielsen,
1989). The FNN concept was motivated by the functioning of the neuron system in the human
brain. The human brain has approximately 100 billion neurons that communicate through

thousands of electro-chemical connections and send signals to other neurons if the sum of

connections exceeds a certain threshold. The number of hidden layers in FNN determines its

architecture. The input layer, consisting of input features $x$ with an added bias $b_u$, is connected
to the hidden layers $h$ through the input connection weights $w_{icw}$. The hidden layer sums the
product ($w_{icw} x$) and squashes it through the nonlinear activation function $f_{hu}(z)$. The hidden
layer, with an added bias $b_o$, and the output layer are connected by the output connection weights
$w_{ocw}$. The output layer sums the product ($w_{ocw} f_{hu}(z)$) and squashes it through the nonlinear
activation function $f_{ou}(z)$ to estimate the output vector $\hat{y}$. The values of $\hat{y}$ at the output layer are
compared with the target vector $y$ and the loss function $E$ is determined. All these steps proceed
in a forward phase and are known as forward propagation. This can be expressed
mathematically as:

\hat{y} = f_{ou}(w_{ocw} f_{hu}(w_{icw} x + b_u) + b_o)    (2.1)

Such that:

f(z) = \frac{1}{1 + e^{-z}}    (2.2)

Equation (2.2) is a commonly used type of nonlinear activation function known as the sigmoid

activation function. Various other activation functions are differentiable, such as the
hyperbolic tangent, rectified linear unit, leaky rectified linear unit, SoftMax and many others,
or nondifferentiable, such as the threshold function and many others. The suitable choice of activation
function in the hidden unit changes with the application problem under consideration. The FNN
attempts to minimize the network loss $E$ by making $\hat{y}$ approximately equal to $y$:

E = \frac{1}{m}\sum_{h=1}^{m}(\hat{y}_h - y_h)^2    (2.3)

At each instance $h$ the error $e_h$ can be expressed as:

e_h = \hat{y}_h - y_h    (2.4)

If $E$ is larger than the predefined expected error $\varepsilon$, the connection weights are backpropagated by
taking the derivative of $E$ with respect to each weight in the direction of the descending gradient.
This update of the connection weights in the gradient descent direction is such that $\hat{y}$ starts to
become closer to $y$. The backward steps of calculating gradient information and updating the
weights to minimize the error function are known as backpropagation (BP). The forward


propagation and backpropagation complete one iteration $i$ and are known as FNN training.
After each iteration, the error function $E_i$ is recalculated and compared with $E_{i-1}$. If $i$ reaches
its predefined maximum limit, or $E_i$ converges or starts increasing, the training is stopped; otherwise,
it is continued. The training of FNN is influenced by several factors, as highlighted in the
introduction section, which may be overcome by various experimental trial-and-error work with
appropriate learning algorithms and optimization techniques for fast and efficient convergence.
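To make the above steps concrete, the following is a minimal NumPy sketch of the forward propagation and loss computation of Equations (2.1)-(2.4); it is an illustration rather than any cited implementation. The layer sizes, the random placeholder data, and the use of the sigmoid at both layers are assumptions made only for the example, while the variable names (w_icw, b_u, w_ocw, b_o) follow the thesis notation.

```python
import numpy as np

def sigmoid(z):
    # Equation (2.2): sigmoid activation function
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder data: 5 instances, 3 input features, 1 output (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # input features
y = rng.normal(size=(5, 1))          # target vector

# Randomly initialised parameters for one hidden layer with 4 units (assumed sizes)
w_icw = rng.normal(size=(3, 4))      # input connection weights
b_u   = np.zeros((1, 4))             # hidden-layer bias
w_ocw = rng.normal(size=(4, 1))      # output connection weights
b_o   = np.zeros((1, 1))             # output-layer bias

# Forward propagation, Equation (2.1)
h     = sigmoid(x @ w_icw + b_u)     # hidden-layer output f_hu(z)
y_hat = sigmoid(h @ w_ocw + b_o)     # estimated output f_ou(z)

# Loss, Equation (2.3), built from the per-instance errors of Equation (2.4)
e = y_hat - y
E = np.mean(e ** 2)
print("mean squared error:", E)
```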

2.2.3 Learning algorithms and applications

In this section, the authors made an effort to explain the merits and technical limitations of FNN
learning algorithms in order to understand the FNN drawbacks and the required user expertise, so

as to identify research gaps and future directions. The proposed algorithms in the selected

literature were reviewed to understand the inspiration, research gaps, merits, limitations, and

application areas. In current practice, a single algorithm in FNN is not sufficient for all types

of applications. There is always a trade-off between network generalization performance and

learning speed. Some algorithms have the advantage of being more efficient than others but

may be constrained by memory requirements, complex architecture, and/or more learning time.

Since the early FNN development, many improvements have been made and many of them are

mentioned below:

2.2.3.1 Gradient learning algorithms for network training

The first category is the learning algorithms proposed based on the BP gradient information

concept, which is considered the reason for creating significant interest in the FNN topic. The

BP gradient learning algorithms can be further subcategorized into two types: first-order and

second-order. The first-order gradient trains the FNN by calculating the gradient information

and updating the weights to reach a minimum of the loss function. The first-order derivative of the

error with respect to each weight is calculated at the output layer at each iteration and is distributed

back to the whole network. However, first-order BP is considered slow because of the

computation of the first-order gradient information at each iteration. This increases the learning

time and the possibility of the algorithm becoming stuck in a local minimum. Researchers made efforts

to improve the learning speed by incorporating second-order gradient information to reach the

loss function faster. Wilamowski and Yu (2010) explained that first-order learning methods

might need an excessive number of hidden units and iterations for convergence which can

reduce the generalization performance for unseen data. In contrast, second-order learning

algorithms are powerful in learning, but their complexity increases with increasing network size.

A lot of computational memory is needed to store the Jacobian 𝐉 and Hessian Matrix 𝐇𝐧 along

with their inverses, which can make it difficult for large training datasets.

a) First-order gradient algorithms

The most popular and well-known first-order learning algorithm in the BP family is
gradient descent (GD). It backpropagates the error with respect to the connection weights $w$
through the layers to minimize a loss function $E$ until it converges to a minimal error (Rumelhart et
al., 1986):

\nabla E = \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial out_i}\,\frac{\partial out_i}{\partial net_i}\,\frac{\partial net_i}{\partial w_i}    (2.5)

where $\nabla E$ is the partial derivative of the error with respect to the weight $w$, $out_i$ is the activation
output and $net_i$ is the weighted sum of the inputs of the hidden unit. The updated weight $w_{i+1}$
needed for the next iteration can be expressed as:

w_{i+1} = w_i - \alpha \nabla E    (2.6)


where $\alpha$ is the learning rate hyperparameter. For the sake of simplicity in this study, the
connection weights $w$ in the equations refer to all types of connection weights in the network,
unless otherwise specified. The convergence of GD is considered to be slow because the loss
function is computed based on the whole training dataset. An increase in data size makes it
slower and more time consuming for large application datasets (Wilson and Martinez, 2003).
Therefore, an improved form of GD known as stochastic GD (SGD) was proposed to compute
the gradient on each training data instance. If the loss function is convex, it converges faster
than GD to a global minimum; otherwise, a local minimum is guaranteed (Wang et al., 2017).
The issue with SGD is that its convergence is not smooth compared to GD and it overshoots
during training on some iterations, which may be controlled to some extent by adjusting the
learning rate. The drawbacks of both GD and SGD can be overcome by mini-batch GD, which
takes equal-size mini-batches $N$ of the training data instead of the whole training data or a single training
instance. Mini-batch GD is more favorable due to its hybrid characteristics: a stable and better
convergence rate (Hinton et al., 2012).

For GD:

w_{i+1} = w_i - \alpha \nabla E, \quad (x, a)    (2.7)

For SGD:

w_{i+1} = w_i - \alpha \nabla E, \quad (x_h, a_h)    (2.8)

For mini-batch GD:

w_{i+1} = w_i - \alpha \nabla E, \quad (x_{h:h+N}, a_{h:h+N})    (2.9)

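A schematic NumPy sketch of the three update rules in Equations (2.7)-(2.9) is given below. The linear model, the placeholder gradient function grad_E, the learning rate, and the mini-batch size are all illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def grad_E(w, x, a):
    # Placeholder gradient of a squared-error loss for a linear model x @ w fitted to targets a
    return 2 * x.T @ (x @ w - a) / len(x)

rng = np.random.default_rng(0)
x, a = rng.normal(size=(100, 3)), rng.normal(size=(100, 1))
w, alpha, N, h = np.zeros((3, 1)), 0.1, 20, 0

# Equation (2.7): batch GD computes the gradient over the whole training set
w = w - alpha * grad_E(w, x, a)

# Equation (2.8): SGD computes the gradient on a single training instance h
w = w - alpha * grad_E(w, x[h:h+1], a[h:h+1])

# Equation (2.9): mini-batch GD computes the gradient on N instances at a time
w = w - alpha * grad_E(w, x[h:h+N], a[h:h+N])
```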

The drawback of first-order GD and its variant algorithms (SGD and mini-batch GD) is that

the number of iterations is comparatively large, which makes them far slower and liable to become

stuck at a local minimum.

b) Second-order gradient algorithms

The convergence problem of the first-order gradient was improved by applying the second-

order form. The Newton method (NM), which uses second-order derivatives, was proposed to increase the
convergence rate by augmenting GD with the second-order inverse Hessian matrix $\mathbf{Hn}^{-1}$ along with
the first-order $\nabla E$ to take larger steps towards the minimum of the objective function (Ypma,
1995):

w_{i+1} = w_i - \alpha\, \mathbf{Hn}^{-1} \nabla E    (2.10)

NM makes use of the second-order $\mathbf{Hn}$ and its inverse $\mathbf{Hn}^{-1}$ to minimize the loss function,
which makes it computationally expensive and unfeasible for real, large model applications
(Zeiler, 2012). Small networks with fewer parameters may be trained with NM to take
advantage of its better convergence speed compared to GD. The Quasi-Newton method (quasi-
NM) was proposed to address this drawback of NM; it is simplified by approximating the inverse
of $\mathbf{Hn}$ from the first-order derivative. It updates the approximated $\mathbf{Hn}$ and its inverse $\mathbf{Hn}^{-1}$
after each iteration, which makes it computationally less expensive compared to NM (Shanno,
1970). Among several techniques, Broyden–Fletcher–Goldfarb–Shanno (BFGS) has gained
much popularity in applications (Lewis and Overton, 2013). It computes $\mathbf{Hn}$ and $\mathbf{Hn}^{-1}$
as expressed below:

\mathbf{Hn}_{i+1} = \mathbf{Hn}_i + \frac{o_i o_i^T}{o_i^T \sigma_i} - \frac{\mathbf{Hn}_i \sigma_i \sigma_i^T \mathbf{Hn}_i^T}{\sigma_i^T \mathbf{Hn}_i \sigma_i}    (2.11)

Similarly, its inverse $\mathbf{Hn}^{-1}$ can be calculated from the Sherman-Morrison formula:

\mathbf{Hn}_{i+1}^{-1} = \left(I - \frac{\sigma_i o_i^T}{o_i^T \sigma_i}\right)\mathbf{Hn}_i^{-1}\left(I - \frac{o_i \sigma_i^T}{o_i^T \sigma_i}\right) + \frac{\sigma_i \sigma_i^T}{o_i^T \sigma_i}    (2.12)

Such that:

o_i = \nabla E_{i+1} - \nabla E_i    (2.13)

\sigma_i = w_{i+1} - w_i    (2.14)

w_{i+1} = w_i + \alpha d_i    (2.15)

d_i = -\mathbf{Hn}_i^{-1} \nabla E_i    (2.16)

The algorithm is initialized with initial values of $w_0$ and $\mathbf{Hn}_0$. Usually, $\mathbf{Hn}_0$ is initially set to
the identity matrix, $\mathbf{Hn}_0 = I$. The algorithm first computes the direction $d_0$, as
shown in Equation (2.16), from the initial inverse $\mathbf{Hn}_0^{-1}$ and $\nabla E_0$, and then determines the new weights,
as shown in Equation (2.15), based on an optimal step size $\alpha$. The change in weights $\sigma_i$, as shown
in Equation (2.14), and the change in the first-order derivative $o_i$, as shown in Equation (2.13), are
used to approximate $\mathbf{Hn}$ and its inverse $\mathbf{Hn}^{-1}$, as shown in Equations (2.11) and (2.12),
respectively. This algorithm continues until $w_i$ converges.
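The BFGS recursion of Equations (2.11)-(2.16) can be sketched in a few lines of NumPy. The quadratic loss, the fixed step size used in place of a proper line search, and the number of iterations below are assumptions made only for this illustration.

```python
import numpy as np

# Placeholder quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose gradient is A w - b
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A @ w - b

w = np.zeros(2)
Hn_inv = np.eye(2)        # Hn_0 = I, the usual initial inverse Hessian approximation
alpha = 1.0               # fixed step size; a line search would normally choose this

for _ in range(20):
    g = grad(w)
    d = -Hn_inv @ g                        # Equation (2.16): search direction
    w_new = w + alpha * d                  # Equation (2.15): weight update
    sigma = w_new - w                      # Equation (2.14): change in weights
    o = grad(w_new) - g                    # Equation (2.13): change in gradient
    rho = 1.0 / (o @ sigma)
    I = np.eye(2)
    # Equation (2.12): update of the approximated inverse Hessian
    Hn_inv = (I - rho * np.outer(sigma, o)) @ Hn_inv @ (I - rho * np.outer(o, sigma)) \
             + rho * np.outer(sigma, sigma)
    w = w_new
```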

The quasi-NM is more efficient than the NM but requires computational memory, which limits its applicability to medium-sized

problems. To overcome the memory problem and contribute to improving the convergence rate

(Setiono and Hui, 1995), a conjugate gradient method (CG) was proposed. For a conjugate, the

gradients of the two vectors need to be orthogonal to reach a minimum of the cost function. If they are

not orthogonal, the second vector needs to travel along the previous vector to get nearer

to the minimum point. In GD, the algorithm often takes slightly larger steps and deviates from the minimal

point. The objective of conjugate gradient descent is to take a step along the gradient so that


the next gradient vector should be orthogonal and nearer to zero error. The first orthogonal
vector $d_0$ can be computed from the initial guess such that:

d_0 = \nabla E_0    (2.17)

The next orthogonal vector $d_{i+1}$ can be expressed as:

d_{i+1} = \nabla E_{i+1} + \beta d_i    (2.18)

where $\beta$ is used to calculate the new orthogonal vector direction and is known as the conjugate
hyperparameter. The weights are updated according to the rule below:

w_{i+1} = w_i + \alpha d_i    (2.19)

The conjugate gradient descent iteratively finds the best orthogonal gradient vector based on

the previous vector with an inner product equal to zero and then updates the weight parameters.
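As an illustration of this iterative procedure, the direction update of Equations (2.17)-(2.19) may be sketched as follows. The Fletcher-Reeves formula for the conjugate factor beta, the sign convention of moving along the negative gradient, the quadratic loss, and the fixed step size are assumptions made for the example, since the text above does not fix these choices.

```python
import numpy as np

# Placeholder quadratic loss with gradient A w - b (assumed for illustration)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda w: A @ w - b

w = np.zeros(2)
g = grad(w)
d = -g                       # initial direction from the first gradient, cf. Equation (2.17)
alpha = 0.1                  # fixed step size for illustration

for _ in range(25):
    w = w + alpha * d                     # Equation (2.19): weight update along d
    g_new = grad(w)
    beta = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves conjugate hyperparameter (assumed)
    d = -g_new + beta * d                 # Equation (2.18): new conjugate direction
    g = g_new
```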

The advantage of CG over GD is that it converges faster but compared to other methods such

as quasi-NM and Levenberg-Marquardt (LM), its convergence rate is lower. In terms of

memory, due to its second-order nature, it needs more memory compared to GD and less memory

compared to quasi-NM and LM (Hagan and Menhaj, 1994). To overcome the problems of both

the slow convergence of GD and the memory requirements of second-order gradient algorithms, Quickprop (QP) was

proposed. QP is a second-order iterative learning algorithm based on Newton's method, used

to find a minimum of the loss function. The purpose of QP development was to speed up the
convergence process by taking much larger steps to reach a minimum of the loss function, rather
than the infinitesimally small steps of GD in the weight space (Fahlman, 1988). The algorithm proceeds
like GD, but for each weight it keeps a copy of $\nabla E_{i-1}$ and $\Delta w_{i-1}$:

\Delta w_i = \frac{\nabla E_i}{\nabla E_{i-1} - \nabla E_i}\,\Delta w_{i-1}    (2.20)


This assumes that the error versus weight curve can be approximated by an upward parabola and that the change in the

slope of the error curve for each weight is not affected by the other weights that are changing

at the same time. For each weight, the algorithm uses the gradient change and weight change to

determine the parabola and rapidly jumps to the minimum of this parabola.
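A minimal sketch of the per-weight Quickprop update of Equation (2.20) is shown below; the numerical values in the usage line are made up purely for illustration.

```python
def quickprop_step(grad_curr, grad_prev, delta_w_prev):
    # Equation (2.20): approximate the error curve of one weight by an upward parabola
    # and jump towards the parabola's minimum using two successive gradients.
    return grad_curr / (grad_prev - grad_curr) * delta_w_prev

# Example: the previous step changed a weight by -0.4 while its gradient fell from 1.0 to 0.2
print(quickprop_step(grad_curr=0.2, grad_prev=1.0, delta_w_prev=-0.4))  # prints -0.1
```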

To make learning faster, the Levenberg-Marquardt algorithm (LM) was proposed by
combining both Gauss-Newton (GN) and GD to compute the best gradient direction. It is a
method for solving nonlinear least-squares problems by minimizing the sum of the squared errors
(Hagan and Menhaj, 1994). Instead of directly computing $\mathbf{Hn}$, it works with the gradient vector
and $\mathbf{J}$. The gradient vector $\nabla E$ of the loss function can be computed as:

\nabla E = 2\mathbf{J}^T e    (2.21)

The $\mathbf{Hn}$ can be approximated from the equation below:

\nabla^2 E = \mathbf{Hn} = 2\mathbf{J}^T \mathbf{J}    (2.22)

The weight parameter improvement process in LM is iterative, such that:

\Delta w_i = [\mathbf{J}^T\mathbf{J} + \mu I]^{-1}\mathbf{J}^T e    (2.23)

where $I$ is an identity matrix and $\mu$ is a damping hyperparameter. The value of $\mu$ is adjusted in each
iteration to create a balance between the GN and GD methods. If the objective function is
decreasing quickly, $\mu$ is divided by some factor to bring the algorithm closer to GN, and if the
objective function is decreasing slowly in each iteration, $\mu$ is multiplied by some factor so as
to move towards GD.
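One LM iteration following Equation (2.23), together with the damping adjustment just described, may be sketched as below. The residual and Jacobian functions, the adjustment factor of 10, and the linear least-squares usage example are assumptions for illustration, not the exact scheme of the cited works.

```python
import numpy as np

def lm_step(w, residual, jacobian, mu):
    # Equation (2.23): delta_w = [J^T J + mu I]^{-1} J^T e
    e = residual(w)
    J = jacobian(w)
    delta_w = np.linalg.solve(J.T @ J + mu * np.eye(len(w)), J.T @ e)
    w_new = w - delta_w
    # Damping adjustment: towards Gauss-Newton when the loss decreases, towards GD otherwise
    if np.sum(residual(w_new) ** 2) < np.sum(e ** 2):
        return w_new, mu / 10.0           # the factor of 10 is an illustrative choice
    return w, mu * 10.0

# Illustrative usage on a small linear least-squares problem with placeholder data
rng = np.random.default_rng(0)
X, t = rng.normal(size=(30, 2)), rng.normal(size=30)
w, mu = np.zeros(2), 1e-2
for _ in range(10):
    w, mu = lm_step(w, residual=lambda v: X @ v - t, jacobian=lambda v: X, mu=mu)
```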

In many applications, LM is very fast and converges to a local
minimum rapidly, which may not be the global minimum. LM has the disadvantage that it
cannot be used with other loss functions such as cross-entropy and cannot be applied to
constructive types of neural networks (Hunter et al., 2012). The $\mathbf{J}$ becomes large and needs a


lot of memory as the network size increases, limiting its application to large datasets.

Therefore, the learning speed advantage of LM over GD is less evident when the network size

increases (Wilamowski and Yu, 2010). LM is limited in application to a fixed-topology neural

network. Therefore, Neuron by Neuron (NBN) was proposed to compute the gradient vector

and the 𝐉 for arbitrarily constructive neural networks. Wilamowski et al. (2008) highlighted

that several improvements have been proposed in second-order learning algorithms, but much

better results can be achieved from Newton and LM methods. The NBN was proposed to

simplify the Jacobian calculation and make it workable for constructive algorithms. The

Jacobian is expressed as a matrix of the first-order partial derivatives:

\mathbf{J} = \frac{\partial e}{\partial w}    (2.24)

NBN calculates the Jacobian in gradient vector form instead of the matrix by performing: 1)

Forward computation, 2) Backward computation, and finally, 3) Calculating the Jacobian

elements. In the forward computation, the inputs are processed to get the neuron output, which

is further processed to get the target output. During forward computation, the value of the slope

of the neuron activation function is stored for the backward stage. In the backward step, the

Jacobian element is computed by multiplying the neuron delta with its slope and input weights.

Finally, instead of using $\mathbf{J}$ to store the values, they are summed into a gradient vector:

\nabla E = \frac{\partial e}{\partial w}\, e    (2.25)

This enables NBN to be used with the constructive algorithm but with the additional cost of

more memory requirement compared to LM. Hunter et al. (2012) argued that NBN is not

perfect, but it can compete with other similar algorithms. Their experimental work showed that

NBN achieved much better results than GD and almost similar results to LM.


c) Application of Gradient Learning Algorithms

Various applications of gradient learning algorithms are explained and listed in Appendix A.

2.2.3.2 Gradient free algorithms

The gradient learning algorithms randomly assign connection weights to the network and

iteratively tune them to get the optimal weight for a network with better error minimization

capability. The disadvantages associated with gradient learning algorithms are that they require

user expertise to build an optimal FNN. A better choice of hyperparameters, such as the learning

rate, momentum, regularization, and initial weights, may improve the convergence but needs a

lot of trial and error to get the best possible parameters. This may cause gradient

learning algorithms to trap the FNN at a local minimum which may be far away from the global

minimum and may affect the generalization performance. The convergence rate can

be increased by assigning a large learning rate; however, the algorithm will then become unstable,

whereas with a low learning rate it may converge slowly and may take hours, days, or even

months to solve a large dataset with complex patterns.

Researchers made an extensive effort to improve the gradient learning algorithms. Huang et al.

(2015) comment that improvement in gradient learning leads to faster learning speed and better

generalization performance, but still most of them cannot guarantee a global solution.

Researchers proposed that the issues associated with gradient learning algorithms can be fixed

in a state-of-the-art way by learning FNN without gradient learning algorithms, known as

gradient-free learning in the forward steps. The gradient-free algorithms eliminate the need for

the chain delta rule to calculate the derivative of the loss function with respect to each weight and

to update the weights for the next iteration. Some of the advantages of gradient-free algorithms are:

1) they are simple, fast, and do not need complex hyperparameter adjustments; 2) there is no need for
backpropagating the error; and 3) they can work directly with both differentiable and non-differentiable
activation functions (Ferrari and Stengel, 2005). The gradient-free algorithms are useful in that
they reduce training time significantly; however, a drawback is that they increase network
complexity. The number of generated hidden units increases many times, which may cause
overfitting. The following are popular algorithms that eliminate the use of gradient information.

a) Probability and General Regression

The Probabilistic Neural Network (PNN) was proposed for classification problems (Specht, 1990)
and the General Regression Neural Network (GRNN) for regression problems (Specht, 1991).
Both algorithms were proposed to address the slowness of the backpropagation FNN (BPNN)
by performing one-pass learning. They are similar in structure, do not need iterative tuning, and
work in a highly parallel structure. PNN finds the decision boundaries between patterns, whereas
GRNN estimates a continuous dependent variable. The network architecture consists of four
layers: input, pattern, summation, and output. The output layer can be expressed as:

\hat{y}_h = \frac{\sum y_h\, e^{-(d_h^2 / 2\sigma^2)}}{\sum e^{-(d_h^2 / 2\sigma^2)}}    (2.26)

where $d_h^2$ is the squared Euclidean distance and $e^{-(d_h^2 / 2\sigma^2)}$ is the activation function. The best value of the kernel

𝜎 is estimated by a holdout or validation method. PNN and GRNN are considered faster than

the BPNN and learn in one pass, that is, in the forward direction only. The major drawbacks are

that they require more memory space compared to the BPNN and that separate algorithms are built for

classification and regression problems.
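A compact sketch of the GRNN prediction of Equation (2.26) is given below; the kernel width sigma and the placeholder data are assumptions for illustration, and in practice sigma would be chosen by the holdout or validation procedure mentioned above.

```python
import numpy as np

def grnn_predict(x_query, x_train, y_train, sigma):
    # Equation (2.26): kernel-weighted average of the training targets, where d2 holds
    # the squared Euclidean distances between the query and every pattern unit.
    d2 = np.sum((x_train - x_query) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.sum(y_train * k) / np.sum(k)

# Illustrative usage with placeholder data
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 2))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=50)
print(grnn_predict(np.array([0.2, -0.1]), x_train, y_train, sigma=0.5))
```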

b) Extreme Learning Machine


Huang, Zhu, et al. (2006) mentioned that the learning speed of traditional FNNs is far slower
than required, which limits their applicability in many application areas. Two possible reasons are: 1) the slow
learning ability of gradient-based BP algorithms, and 2) the iterative tuning of all parameters in
the network by gradient learning algorithms. They proposed a simple algorithm known as the
extreme learning machine (ELM) for a single-hidden-layer FNN (SLFN). It randomly chooses the hidden
units $H$ and analytically determines only the output connection weights $w_{ocw}$. $H$ is considered
to be in a linear relationship with the output unit $y$, and $w_{ocw}$ is calculated from the expression:

H w_{ocw} = y_h    (2.27)

w_{ocw} = H^{\dagger} y_h    (2.28)

where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore-Penrose generalized inverse of $H$. The computation proceeds in the forward direction only, and no

BP gradient information is needed to compute weights. Their experimental results based on

artificial and real-world regression and classification problems demonstrated that ELM can

achieve better generalization performance in most cases and learn many times faster than

traditional learning algorithms of FNN. In addition, more stable results and better

generalization can be achieved by adding a positive value $(1/\lambda)$ to the diagonal of $(HH^T)$ or

$(H^T H)$, such that (Huang, Zhou, et al., 2012):

when $h < r$:

w_{ocw} = H^T\left(\frac{1}{\lambda} + HH^T\right)^{-1} y_h    (2.29)

when $h > r$:

w_{ocw} = \left(\frac{1}{\lambda} + H^T H\right)^{-1} H^T y_h    (2.30)

where $r$ is the number of hidden units.
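The whole ELM training procedure reduces to a few lines, sketched below with NumPy. The sigmoid hidden activation, the ridge-style solution of Equation (2.30) for the case h > r, and the placeholder data and sizes are assumptions made for the illustration.

```python
import numpy as np

def elm_train(x, y, r, lam=1e3, seed=0):
    # Randomly chosen hidden-layer parameters that are never tuned
    rng = np.random.default_rng(seed)
    w_icw = rng.normal(size=(x.shape[1], r))
    b_u = rng.normal(size=(1, r))
    H = 1.0 / (1.0 + np.exp(-(x @ w_icw + b_u)))          # hidden-layer output matrix
    # Output weights from Equation (2.30): (1/lambda + H^T H)^{-1} H^T y
    w_ocw = np.linalg.solve(np.eye(r) / lam + H.T @ H, H.T @ y)
    return w_icw, b_u, w_ocw

def elm_predict(x, w_icw, b_u, w_ocw):
    H = 1.0 / (1.0 + np.exp(-(x @ w_icw + b_u)))
    return H @ w_ocw

# Illustrative usage with placeholder data
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = np.sin(x[:, :1])
print(elm_predict(x, *elm_train(x, y, r=40)).shape)
```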

ELM limits its applicability to batch learning, whereas

one-by-one or chunk-by-chunk data (mini-batches) of fixed or varying size can be learned

through the Online Sequential ELM (OS-ELM) algorithm (Liang et al., 2006). The OS-ELM

working methodology idea is similar to ELM. The hidden units’ parameters are randomly

generated, and output weights are analytically calculated. Unlike other GD sequential

algorithms with many hyperparameters, OS-ELM only specifies the number of hidden units.

Like BP, the limitation of ELM is that the optimal number of hidden units is selected based

on a trial-and-error approach. The ELM is learned with some initial guess of the hidden units,

and experimental trials are then performed with different numbers of hidden units to select the best

network, i.e., the one whose hidden units are capable of maximum error reduction. Although the learning

speed of ELM is much faster than that of traditional BP, the initial setup to find the optimal

hidden units may increase the total trial-and-error time (Han et al., 2017). Incremental ELM (I-

ELM), an extension of ELM, was proposed to solve the problem of hidden units allocation

(Huang, Chen, et al., 2006). The key difference is that I-ELM is a constructive topology type,

whereas ELM is a fixed-topology FNN. I-ELM initializes with one hidden unit and adds
hidden units one by one until the error converges or the maximum number of hidden units is reached.
The output weight for the new hidden unit can be computed from the expression:

w_r^{ocw} = \frac{E H_r^T}{H_r H_r^T}    (2.31)

Initially, the error $E$ is set to $y_h$ such that $E = y_h$, and after adding a new hidden unit, the error
is recalculated as:

E = E - w_r^{ocw} H_r    (2.32)

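The I-ELM loop built around Equations (2.31) and (2.32) can be sketched as follows. The sigmoid activation, the single-output setting, and a stopping rule based only on a maximum number of units are simplifying assumptions for the example.

```python
import numpy as np

def ielm_train(x, y, max_units=50, seed=0):
    rng = np.random.default_rng(seed)
    E = y.copy()                        # residual error, initially E = y
    units, betas = [], []
    for _ in range(max_units):
        # Randomly generate one new hidden unit (input weights and bias)
        w = rng.normal(size=x.shape[1])
        b = rng.normal()
        H_r = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        beta = (E @ H_r) / (H_r @ H_r)  # Equation (2.31): output weight of the new unit
        E = E - beta * H_r              # Equation (2.32): updated residual error
        units.append((w, b))
        betas.append(beta)
    return units, betas, E
```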

Adding hidden units one by one results in redundant hidden units in I-ELM, which makes the

network large and complex. The contribution of some hidden units to error reduction

might be very low, and such units can be omitted. Efforts were made to compact the I-ELM network size

without losing generalization accuracy. Convex I-ELM (CI-ELM) was proposed to recalculate

the 𝑤𝑜𝑐𝑤 based on Barron’s convex optimization technique to improve the convergence rate

of I-ELM (Huang and Chen, 2007). In CI-ELM, 𝑤𝑟𝑜𝑐𝑤 for the randomly generated hidden unit

is calculated as:

w_r^{ocw} = \frac{E \cdot [E - (F - H_r)]^T}{[E - (F - H_r)] \cdot [E - (F - H_r)]^T}    (2.33)

where $F = y$ is the target vector. If $r > 1$, CI-ELM recalculates the $w^{ocw}$ of all existing hidden units
and the error as expressed below:

w_i^{ocw} = (1 - w_r^{ocw})\, w_i^{ocw}    (2.34)

E = (1 - w_r^{ocw})E - w_r^{ocw}(F - H_r)    (2.35)

CI-ELM can converge faster with more compact architecture while maintaining I-ELM

simplicity and efficiency. Similarly, Enhanced I-ELM (EI-ELM) was proposed to compact I-

ELM by adding some sets of candidate units and selecting a candidate unit as a hidden unit

having a maximum capability of error reduction (Huang and Chen, 2008). The hidden unit

addition in I-ELM might take it nearer or farther away from the loss function. The hidden units

farther away from the loss function may not contribute to error reduction and can be omitted.

The EI-ELM adds some candidate units and one nearer to the loss function is selected as a new

hidden unit and added to the network. In such a case, the number of hidden units in EI-ELM


will be less and the network size will be more compact compared to I-ELM with the same

amount of training time.
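The EI-ELM idea of trying several candidate units and keeping the most useful one can be sketched by extending the I-ELM step above; the number of candidates k, the sigmoid activation, and the reuse of Equations (2.31) and (2.32) for each candidate are illustrative assumptions.

```python
import numpy as np

def ei_elm_add_unit(x, E, k=10, seed=0):
    # Generate k candidate hidden units and keep the one giving the largest error reduction
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(k):
        w = rng.normal(size=x.shape[1])
        b = rng.normal()
        H_r = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        beta = (E @ H_r) / (H_r @ H_r)     # output weight, as in Equation (2.31)
        E_new = E - beta * H_r             # residual, as in Equation (2.32)
        if best is None or np.sum(E_new ** 2) < np.sum(best[3] ** 2):
            best = (w, b, beta, E_new)
    return best  # (input weights, bias, output weight, updated residual)
```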

Feng et al. (2009) addressed two main issues of ELM: 1) How to choose optimal hidden units

in ELM, and 2) whether the computational complexity of ELM can be further reduced when large

training sets require many hidden units. The issues were addressed by proposing Error

Minimized ELM (EM-ELM) to automatically determine the number of hidden units rather than

relying on a trial-and-error approach. It works by adding hidden units one by one or group by group

(with varying group size) and updating the output weights in a fast recursive way. The advantage

of EM-ELM is that it reduces the computational complexity by only updating the output

weights incrementally each time, unlike ELM, which needs to recalculate the entire output

weights when the architecture is changed. Yang et al. (2012) added that the learning speed of

ELM and I-ELM is fast; however, there are two major unsolved problems: 1) for ELM, the
selection of an optimal number of hidden units is still unknown and trial-and-error approaches
are adopted, and 2) I-ELM has solved this problem of ELM by adding hidden units one by one;
however, the learning time of I-ELM increases many times compared with ELM. Yang et al.
(2012) proposed an incremental learning algorithm known as Bidirectional ELM (B-ELM) to
compact the I-ELM architecture without affecting learning effectiveness. In B-ELM, some of
the hidden units are not randomly generated; instead, B-ELM tries to find the best hidden unit parameters
$(w_{icw}, b_u)$ that reduce $E$ as quickly as possible. In B-ELM, when the hidden unit index $r \in \{2n+1, n \in \mathbf{Z}\}$,
the hidden unit parameters are randomly generated as in I-ELM, whereas when
the hidden unit index $r \in \{2n, n \in \mathbf{Z}\}$, the hidden unit parameters are calculated instead of being
randomly generated, in order to converge faster.

Ying (2016) highlighted that the merits of I-ELM are obvious, but it has four drawbacks which

need to be improved: 1) the generation of redundant units; 2) the number of hidden units is sometimes

larger than the number of training examples; 3) the solution is not of a least-squares type, indicating that it is not

optimal; and 4) it is rarely used to solve multiclass classification problems. The proposed CI-

ELM and EI-ELM may learn faster and build a more compact architecture; however, the

drawbacks are not settled. Ying (2016) proposed Orthogonal I-ELM (OI-ELM) by

incorporating a Gram-Schmidt orthogonalization method in I-ELM to obtain the least-squares

solution. It randomly generates one hidden unit similar to I-ELM and calculates its output 𝐻𝑟.

The Gram-Schmidt orthogonalization method is applied to the hidden unit output to determine

the orthogonal vector 𝑉𝑟 and if its norm is greater than the predefined value, it is added, else

eliminated. For the 𝑤𝑟𝑜𝑐𝑤 calculation, the basic idea is similar to I-ELM with the replacement

of V_r with the H_r vector in Equation (2.31). The experimental work of Ying (2016) demonstrates that OI-ELM achieved a more compact network and faster convergence compared to ELM, I-ELM, CI-ELM, and EI-ELM. Inspired by the idea of Ying (2016), Zou et al. (2018) proposed a

new algorithm called OI-ELM based on the driving amount (DAOI-ELM) to obtain a better

generalization performance with a more compact architecture. DAOI-ELM determines V_r in the same way as OI-ELM but modifies w_r^{ocw}: it adds E_{r-1} to V_r while calculating w_r^{ocw}. The comparison of DAOI-ELM with I-ELM, OI-ELM and B-ELM on several benchmark and real-world datasets demonstrated the effectiveness of DAOI-ELM.

Similarly, for ELM, Wang et al. (2016) explained that ELM is sensitive to the selection of an

optimal number of hidden units in the layer and improper hidden units can lead to suboptimal

accuracy. They proposed Self-adaptive ELM (SaELM) to find the best possible number of

hidden units for the network. SaELM initializes by defining the minimum and maximum

possible hidden units with its interval, width factor 𝑄 and scale factor 𝐿. Han et al. (2017) argue

that much effort has been dedicated to the convergence accuracy of I-ELM, whereas its

numerical stability (condition) is generally ignored. The numerical stability is directly related

to the input weight and hidden biases. The issue was addressed by combining particle swarm


optimization (PSO) and EM-ELM, called IPSO-EM-ELM. The algorithm adds hidden units one by one to the existing network. PSO is used to optimize the input weight

and hidden bias in the new hidden unit. The optimal hidden unit is selected based on not only

the minimum error of training data but also considering the condition value of the hidden unit

output matrix. The output weight needs to be incrementally updated similar to EM-ELM.

Zong et al. (2013) highlighted that ELM provides better performance, however, none of the

work in ELM mentioned the problem of unbalanced data distribution. Typically, imbalance

class distributions are balanced by adopting either sampling techniques (oversampling or

undersampling) or algorithmic approaches. They proposed an algorithm named as Weighted

ELM (W-ELM) to handle both binary and multi-class imbalanced data problems. Unlike ELM, which considers all training examples as equal, W-ELM adds a penalty term to the errors corresponding to different inputs. Similar to Equations (2.29) and (2.30), two versions of w^{ocw} were derived:

when h < r:

w^{ocw} = U^T\left(\tfrac{1}{\lambda} + WHH^T\right)^{-1}W y_h    (2.36)

when h > r:

w^{ocw} = \left(\tfrac{1}{\lambda} + H^T W H\right)^{-1}H^T W y_h    (2.37)

where W is a diagonal weight matrix with one entry per training example. It determines the degree of rebalancing the user is concerned with and how far the decision boundary can be pushed towards the majority class. When a training example comes from a minority class, it is

assigned a relatively higher value of 𝑊 than the others. The experimental work demonstrates

that W-ELM not only obtains better generalization performance compared to ELM on the


imbalanced dataset by allocating importance to minority class compared to majority class but

also maintains good performance on the well-balanced dataset.
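For illustration, the following is a minimal Python sketch of the weighted output-weight solution in Equation (2.37) (the case with more training examples than hidden units); it is not taken from Zong et al. (2013). It assumes a sigmoid random hidden layer, a toy class-imbalanced binary dataset, per-class weights of one over the class count, and interprets the 1/λ term as (1/λ)I; all names and values are illustrative.

import numpy as np

def welm_output_weights(H, y, sample_w, lam=1.0):
    # Weighted regularized least squares in the spirit of Eq. (2.37):
    # w_ocw = ((1/lambda) I + H^T W H)^(-1) H^T W y
    W = np.diag(sample_w)
    A = np.eye(H.shape[1]) / lam + H.T @ W @ H
    return np.linalg.solve(A, H.T @ W @ y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])   # imbalanced classes (90 vs 10)
y = np.hstack([-np.ones(90), np.ones(10)])
a, b = rng.uniform(-1, 1, (2, 40)), rng.uniform(-1, 1, 40)               # random hidden layer
H = 1.0 / (1.0 + np.exp(-(X @ a + b)))
sample_w = np.where(y > 0, 1.0 / 10, 1.0 / 90)                           # minority class gets a larger weight
w_ocw = welm_output_weights(H, y, sample_w)
print("training accuracy:", (np.sign(H @ w_ocw) == y).mean())

The per-class weighting simply gives the minority-class errors more influence on the least-squares solution, which is the rebalancing effect described above.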

ELM and its variants are mainly focused on classification and regression problems and still encounter difficulties with natural scenes (e.g., signals and visual data) and practical applications (voice recognition and image classification) due to their shallow architecture, which is unable to learn features even with a large number of hidden units. In many cases, a multilayer solution

is required for feature learning before classification is performed (Tang et al., 2016). Kasun et

al. (2013) proposed Multilayer ELM (ML-ELM) for classification based on the ELM

autoencoder (ELM-AE). ELM was modified to ELM-AE by keeping the output the same as

the input for autoencoder and estimating the weight for the hidden layer. The number of

successive hidden layers was calculated in the same manner as the ELM methodology to create

layer weights for ML-ELM. Finally, the output layer weight for ML-ELM is calculated using

regularized least squares. Tang et al. (2016) highlighted that the encoded output from ELM-

AE is directly fed into the last layer for decision making before applying the least-squares,

without random feature mapping which violates the ELM universal approximation-based

theories. Tang et al. (2016) proposed a new Hierarchical ELM (H-ELM) consisting of two

parts: unsupervised feature encoding based on the new 𝑙1 regularized ELM autoencoder to

extract multilayer sparse features of input data, and supervised feature classification based on

ELM is applied for decision making. H-ELM is based on the universal approximation

capability theories of ELM and the results demonstrate its superior performance over ELM and

other FNN autoencoders.

c) Semi Gradient and Iterative Algorithms

The No-propagation (No-Prop) algorithm simplifies the learning mechanism of multi-layer BPFNN by randomly generating the input weights w^{icw} and hidden connection weights w^{hcw} and iteratively training only the output weights w^{ocw}


by the BP learning algorithm (Widrow et al., 2013). The algorithm cannot be considered as

complete gradient-free learning because it uses gradient information in its last layer. However,

due to its random generation of 𝑤𝑖𝑐𝑤 and 𝑤ℎ𝑐𝑤, and only tuning the last layer 𝑤𝑜𝑐𝑤, it is named

as No-Prop. No-Prop is guaranteed to minimize the loss function when the number of training patterns is less than or equal to the number of output weights w^{ocw} connecting the last hidden layer to the output units. This

criterion is referred to as the least mean square error capacity (LMS capacity). The No-Prop

algorithm explains that when the training pattern is under or at LMS capacity, the output unit

will deliver the desired output pattern perfectly and the generalization performance will be like

BP with much faster results. However, if the training pattern is overcapacity, BP works better

than No-Prop. In such a case, increasing the number of hidden units in the last layer will

increase the number of output weights and again the training pattern will become under or at

capacity, and the performance of No-Prop will increase.

Cao et al. (2016) argued that the random generation of hidden unit parameters and the analytical calculation of output weights become infeasible, and the generalization performance drops, when the dataset is extremely large. Iterative Feedforward Neural Networks with Random Weights (IFNNRWs) were proposed to overcome the issues arising from random generation. The IFNNRWs algorithm iteratively tunes the output connection weights based on an l2-regularized model. It randomly generates the input weights and hidden units but iteratively calculates w^{ocw} as expressed below:

w_{i+1}^{ocw} = \frac{1}{1 + \lambda}\left((I - H^T H)\,w_i^{ocw} + H^T H\right)    (2.38)

The advantage of this algorithm is that l2 regularization improves its generalization performance. Unlike algorithms based on the Moore-Penrose generalized inverse of the hidden unit outputs, which consume a lot of memory as the number of examples increases, IFNNRWs is more stable and is not adversely affected by increasing the number of hidden units.

d) Application of Gradient free learning Algorithms

Various applications of gradient-free learning algorithms are explained and listed in Appendix

A.

2.2.4 Optimization techniques and applications

This section covers the remaining four categories (i.e., Optimization algorithms for learning

rate, Bias and Variance (Underfitting and Overfitting) minimization algorithms, Constructive

topology Neural Networks, Metaheuristic search algorithms) explaining mainly the

optimization techniques as summarized in Table 2.2. The optimization algorithms (techniques)

are explained along with their merits and technical limitations in order to understand the FNN

drawbacks and needed user expertise, so as to identify research gaps and future directions. The

list of applications mentioned in each category gives an indication of the successful

implementation of optimization techniques in various real-world management, engineering,

and health sciences problems. The sections below explain the algorithms categories and their

subcategories:

2.2.4.1 Optimization algorithms for learning rate

BP gradient learning algorithms are powerful and have been extensively studied to improve

the convergence. The important global hyperparameter that sets the position of the new weight for BP is the learning rate α. If α is too large, the algorithm may overshoot the minimum of the loss function and oscillate around it; however, if α is too small it will converge very slowly. A suboptimal α can cause the FNN to become trapped in a local minimum or at a saddle point. A saddle point


can be defined as a point where the function has both minimum and maximum directions (Gori

and Tesi, 1992).

The algorithms in this category, that improve the convergence of BP by optimizing ∝, are based

on knowledge of exponential moving weighted average (EMA) statistical technique (Lucas and

Saccucci, 1990) and its bias correction (Kingma and Ba, 2014). EMA is best suited to noisy data: it helps to denoise the data by taking a moving average of the previous values to define the next value in the sequence. With an initial guess of Δw_0 = 0, the next value in the sequence can be computed from the expression:

\Delta w_i = \beta \Delta w_{i-1} + (1 - \beta)\nabla E_i    (2.39)

where 𝛽 is a moving exponential hyperparameter with value [0,1]. Hyperparameter 𝛽 plays an

important role in EMA. Firstly, the moving average Δw_i effectively averages over roughly 1/(1 − β) previous values, which means that the higher the value of β, the more values are averaged and the less noisy the trend becomes, whereas a lower value of β averages fewer values and the trend fluctuates more.

Secondly, for a higher value of 𝛽, more importance will be given to the previous weights as

compared to the derivative values. Moreover, it is discovered that during the initial sequence,

the trend is biased and is further away from the original function due to the initial guess ∆𝑤𝑜 =

0. This results in much lower values which can be improved by computing the bias such that:

\Delta w_i = \frac{\beta \Delta w_{i-1} + (1 - \beta)\nabla E_i}{1 - \beta^i}    (2.40)

The effect of the 1 − β^i term in the denominator decreases with increasing iteration i. Therefore, 1 − β^i has more influence on the starting iterations and can generate a sequence with better results.
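As a small numerical illustration (not taken from the cited works), the following Python sketch applies the EMA of Equation (2.39) with the bias correction of Equation (2.40) to a synthetic noisy sequence standing in for the gradient values ∇E_i; β = 0.9 and the data are arbitrary illustrative choices.

import numpy as np

def ema_with_bias_correction(grads, beta=0.9):
    # Exponentially weighted moving average of a noisy sequence, Eq. (2.39),
    # with the bias correction of Eq. (2.40) applied at every step.
    dw, smoothed = 0.0, []
    for i, g in enumerate(grads, start=1):
        dw = beta * dw + (1.0 - beta) * g
        smoothed.append(dw / (1.0 - beta**i))   # divide by (1 - beta^i) to correct the early bias
    return np.array(smoothed)

rng = np.random.default_rng(2)
noisy = np.sin(np.linspace(0, 3, 100)) + rng.normal(0, 0.2, 100)
print(ema_with_bias_correction(noisy)[:5])

Without the division by 1 − β^i, the first few smoothed values would be biased towards the zero initialization, which is exactly the effect described above.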

When α is small, the gradient moves slowly towards the minimum, whereas a large α makes the gradient oscillate along the short axis of the error valley while progressing only slowly along the long axis that points towards the minimum. Momentum damps the oscillation along the short axis while accumulating contributions along the long axis, moving the gradient in larger steps towards the minimum in fewer iterations (Rumelhart et

al., 1986). Equation (2.39) can be modified for momentum:

\Delta w_i^a = \beta^a \Delta w_{i-1}^a + (1 - \beta^a)\nabla E_i    (2.41)

w_{i+1} = w_i - \alpha\,\Delta w_i^a    (2.42)

where 𝛽𝑎 is the momentum hyperparameter controlling the exponential decay. The new weight

is a linear function of both the current gradient and weight change during the previous step

(Qian, 1999). Riedmiller and Braun (1993) argued that in practice it is not always true that momentum makes learning more stable. The momentum parameter β^a is just as problematic to choose as the learning rate α, and no general improvement can be achieved. Besides α, the second factor that affects Δw_i is the unforeseeable behaviour of the derivative ∇E_i, whose magnitude differs for different weights. The Resilient

Propagation (RProp) algorithm was developed to avoid the problem of blurred adaptivity of

∇𝐸𝑖 and changes the size of ∆𝑤𝑖 directly without considering ∇𝐸𝑖. For each weight 𝑤𝑖+1, the

size of the update value ∆𝑖 is computed based on the local gradient information:

\Delta_i = \begin{cases} \eta^{+}\cdot\Delta_{i-1}, & \text{if } \nabla E_{i-1}\cdot\nabla E_i > 0 \\ \eta^{-}\cdot\Delta_{i-1}, & \text{if } \nabla E_{i-1}\cdot\nabla E_i < 0 \\ \Delta_{i-1}, & \text{otherwise} \end{cases}    (2.43)

where 𝜂+ and 𝜂− are increasing and decreasing hyperparameters. The basic intention is that

when the algorithm jumps over the minimum, ∆𝑖 is decreased by the 𝜂− factor and if it retains

the sign it is accelerated by the 𝜂+ increasing factor. The weight update ∆𝑤𝑖 is decreased if the

derivative is positive and increased if the derivative is negative such that:


\Delta w_i = \begin{cases} -\Delta_i, & \text{if } \nabla E_i > 0 \\ +\Delta_i, & \text{if } \nabla E_i < 0 \\ 0, & \text{otherwise} \end{cases}    (2.44)

w_{i+1} = w_i + \Delta w_i    (2.45)
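A minimal Python sketch of the sign-based RProp update of Equations (2.43)-(2.45) is given below for illustration (not taken from Riedmiller and Braun's implementation). The quadratic toy loss, the initial step size of 0.1, and the common choices η+ = 1.2 and η− = 0.5 are illustrative assumptions.

import numpy as np

def rprop_step(w, grad, prev_grad, delta, eta_plus=1.2, eta_minus=0.5):
    # Adapt the per-weight step size, Eq. (2.43): grow it while the gradient
    # keeps its sign, shrink it when the sign flips (a minimum was jumped over).
    sign_change = grad * prev_grad
    delta = np.where(sign_change > 0, delta * eta_plus,
            np.where(sign_change < 0, delta * eta_minus, delta))
    # Move each weight against the sign of its gradient, Eqs. (2.44)-(2.45).
    dw = np.where(grad > 0, -delta, np.where(grad < 0, delta, 0.0))
    return w + dw, delta, grad

# toy quadratic loss E(w) = 0.5 * ||w||^2, so the gradient is simply w
w = np.array([2.0, -3.0])
delta, prev_grad = np.full_like(w, 0.1), np.zeros_like(w)
for _ in range(50):
    w, delta, prev_grad = rprop_step(w, w, prev_grad, delta)
print("w after RProp:", w)

Only the sign of the gradient enters the weight change, which is why weights far from the output receive steps of the same magnitude as the others.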

RProp has an advantage over BP in that it gives equal importance to all weights during learning.

In BP the magnitude of the derivative decreases for weights that are far from the output, so they are modified less than other weights, which makes learning slower. RProp depends on the sign rather than the magnitude of the derivative, which gives equal importance to far-away weights. Hinton et al. (2012) noted that RProp does not work with mini-batches because it treats all mini-batches as equivalent. They proposed RMSProp by combining the robustness of RProp, the efficiency of mini-batches, and gradient averaging over mini-batches. RMSProp is a mini-batch version of RProp that keeps a moving average of the squared gradient such that:

\Delta w_i^b = \beta^b \Delta w_{i-1}^b + (1 - \beta^b)(\nabla E_i)^2    (2.46)

w_{i+1} = w_i - \frac{\alpha}{\sqrt{\Delta w_i^b}}\,\nabla E_i    (2.47)

Like 𝛽𝑎, 𝛽𝑏 is a hyperparameter that controls the exponential decay. When the gradient is in

the short axis, the moving average will be large which will slow down the gradient, whereas,

in the long axis, the moving average will be small which will accelerate the gradient. For

optimal results, Hinton et al. (2012) recommended using 𝛽𝑏 = 0.9.
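For illustration, the following minimal Python sketch (not the original implementation) applies the RMSProp update of Equations (2.46)-(2.47) to a toy quadratic loss; α = 0.01, β^b = 0.9 (the value Hinton et al. recommend), and the small ε guarding the division are illustrative assumptions.

import numpy as np

def rmsprop_step(w, grad, avg_sq, alpha=0.01, beta_b=0.9, eps=1e-8):
    # Moving average of the squared gradient, Eq. (2.46).
    avg_sq = beta_b * avg_sq + (1.0 - beta_b) * grad**2
    # Scale the step by the root of that average, Eq. (2.47); eps avoids division by zero.
    w = w - alpha * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

w, avg_sq = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(500):
    grad = w                        # gradient of E(w) = 0.5 * ||w||^2
    w, avg_sq = rmsprop_step(w, grad, avg_sq)
print("w after RMSProp:", w)

Dividing by the root of the moving average damps directions with persistently large squared gradients (the short axis) and accelerates directions with small ones (the long axis), as described above.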

On the contrary, the Adaptive Gradient Algorithm (AdaGrad) performs informative gradient-

based learning by incorporating knowledge of the geometry of the data observed during each

iteration (Duchi et al., 2011). The features that occur infrequently are assigned a larger ∝


because they are more informative, while features that occur frequently are given a small ∝.

AdaGrad computes matrix 𝐺 as the 𝑙2 norm of all previous gradients:

\Delta w_i = \frac{\alpha}{\sqrt{G_i}}\,\nabla E_i    (2.48)

w_{i+1} = w_i - \Delta w_i    (2.49)

This algorithm is more suitable for high-dimensional sparse data. The above equations increase the learning rate for sparser features and decrease it for denser ones. The learning rate α is scaled based on G and ∇E_i to give a parameter-specific learning rate. The main

drawback of AdaGrad is that it accumulates the squared gradient which grows after each

iteration. This shrinks the ∝ and it becomes infinitesimally small and limits the purpose of

gaining additional information (Zeiler, 2012). Therefore, AdaDelta was proposed based on the

idea extracted from AdaGrad to improve two main issues in learning: 1) shrinkage of ∝, and

2) manually selecting the ∝ (Zeiler, 2012). Accumulating the sum of the squared gradients

shrinks the ∝ in AdaGrad and this can be controlled by restricting the previous gradient to a

fixed size 𝑠. This restriction will not accumulate gradient to infinity and more importance will

go to the local gradient. The gradient can be restricted by accumulating an exponential decaying

average of the square gradient. Assume at 𝑖 the running average of the gradient is 𝐸[∇𝐸2]𝑖

then:

E[\nabla E^2]_i = \beta^c E[\nabla E^2]_{i-1} + (1 - \beta^c)(\nabla E_i)^2    (2.50)

Like AdaGrad, taking the square root RMS of the parameter E[\nabla E^2]_i:

RMS_i = \sqrt{E[\nabla E^2]_i + \epsilon}    (2.51)


where the constant 𝜖 is added to condition the denominator for approaching zero. The update

parameter ∆𝑤𝑖 can be expressed as:

\Delta w_i = \frac{\alpha}{RMS_i}\,\nabla E_i    (2.52)

w_{i+1} = w_i + \Delta w_i    (2.53)

The performance of this method on FNN is better, with no manual setting of α, insensitivity to hyperparameters, and robustness to large gradients and noise. However, it requires some extra

computation per iteration over GD and requires expertise in selecting the best hyperparameters.

Similarly, the concept of both first and second moments of gradients was incorporated in

Adaptive Moment Estimation (Adam). It requires first-order gradient information to compute

the individual ∝ for parameters from an exponential moving average of first and second

moments of gradients (Kingma and Ba, 2014). For parameter updating, it combines the idea of

both the exponential moving average of gradients like momentum and an exponential moving

average of square gradients like RMSProp and AdaGrad. The moving average of the gradient

is an estimate of the 1st moment (mean) and the squared gradient is an estimate of the 2nd moment (the uncentered variance) of the gradient. Like momentum and RMSProp, 1st-moment

estimation and 2nd-moment estimation can be expressed in Equations (2.41) and (2.46)

respectively. The above estimations are initialized with 0’s vectors which make it biased

towards zero during the initial iterations. This can be corrected by computing the bias

correction such as:

\Delta w_i^{a'} = \frac{\Delta w_i^a}{1 - (\beta^a)^i}    (2.54)

\Delta w_i^{b'} = \frac{\Delta w_i^b}{1 - (\beta^b)^i}    (2.55)

The parameters are updated by:

\Delta w_i = \frac{\alpha}{\sqrt{\Delta w_i^{b'}} + \epsilon}\,\Delta w_i^{a'}    (2.56)

w_{i+1} = w_i - \Delta w_i    (2.57)

The parameter update rule in Adam is based on scaling their gradient to be inversely

proportional to the 𝑙2 norm of their individual present and past gradients. In AdaMax, the 𝑙2

norm based update rule is extended to the 𝑙𝑝 norm based update rule (Kingma and Ba, 2014).

For a more stable solution and avoiding the instability of large 𝑝, let 𝑝 → ∞ then:

\Delta w_i^d = \max\left(\beta^d \Delta w_{i-1}^d,\; |\nabla E_i|\right)    (2.58)

where ∆𝑤𝑖𝑑 is exponentially weighted infinity norm. The updated parameter can be expressed

as:

\Delta w_i = \left(\frac{\alpha}{1 - (\beta^a)^i}\right)\frac{\Delta w_i^a}{\Delta w_i^d}    (2.59)

w_{i+1} = w_i - \Delta w_i    (2.60)

The advantage of Adam and AdaMax is that they combine the characteristics of AdaGrad in dealing with sparse gradients and of RMSProp in dealing with non-stationary objectives.

Adam and its extension AdaMax require less memory and are suitable for both convex and

non-convex optimization problems.
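To make the Adam update concrete, the following minimal Python sketch (not taken from Kingma and Ba's implementation) combines the first moment of Equation (2.41), the second moment of Equation (2.46), the bias corrections of Equations (2.54)-(2.55), and the update of Equations (2.56)-(2.57) on a toy quadratic loss; β^a = 0.9, β^b = 0.999, α = 0.01 and ε = 1e-8 are the commonly used defaults, used here only as illustrative assumptions.

import numpy as np

def adam_step(w, grad, m, v, i, alpha=0.01, beta_a=0.9, beta_b=0.999, eps=1e-8):
    m = beta_a * m + (1.0 - beta_a) * grad          # 1st moment estimate, Eq. (2.41)
    v = beta_b * v + (1.0 - beta_b) * grad**2       # 2nd moment estimate, Eq. (2.46)
    m_hat = m / (1.0 - beta_a**i)                   # bias correction, Eq. (2.54)
    v_hat = v / (1.0 - beta_b**i)                   # bias correction, Eq. (2.55)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update, Eqs. (2.56)-(2.57)
    return w, m, v

w, m, v = np.array([2.0, -3.0]), np.zeros(2), np.zeros(2)
for i in range(1, 1001):
    grad = w                                        # gradient of E(w) = 0.5 * ||w||^2
    w, m, v = adam_step(w, grad, m, v, i)
print("w after Adam:", w)

The bias corrections matter mainly in the first iterations, when both moment estimates are still close to their zero initializations.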

a) Application of optimization algorithms for learning rate


Various applications of optimization algorithms for learning rate are explained and listed in

Appendix A.

2.2.4.2 Bias and variance (underfitting and overfitting) minimization

algorithms

The best FNN should be able to truly approximate the training and testing data. Researchers

extensively studied FNN to avoid high bias and variance to improve convergence

approximations. High bias corresponds to the problem known as underfitting and high variance to overfitting. Underfitting occurs when the network is not properly trained and

the patterns in the training data are not fully discovered. It occurs before the convergence point

and, in this case, the generalization performance (also known as test data performance) does

not deviate much from the training data performance. That is the reason that the underfitting

region has high bias and low variance. In contrast, overfitting occurs after the convergence

point when the network is overtrained. In this scenario, the training error is in a continuous

state of decreasing whereas the testing error starts increasing. The overfitting is known as

having the properties of low bias and high variance. The high variance in training and the

testing error occurs because the network trains the noise in the training data which might be

not part of the actual model data. The best choice is to choose a model that balances bias and

variance, which is known as bias and variance trade-off. Bias and variance trade-off allows

FNN to discover all patterns in the training data and simultaneously give better generalization

performance (Geman et al., 1992).

For simplicity, the bias-and-variance trade-off point is referred to as the convergence point.

The FNN should be stopped from further training when it approaches the convergence point to

balance bias (underfitting) and variance (overfitting). The most popular method to identify and

stop the FNN at the convergence point is to select suitable validation data. Validation datasets


and test datasets are used interchangeably and are unseen data held back from the training data

to check the network performance. The error estimated on the training data is not a useful estimator, and comparing it with the generalization performance on the validation dataset can help to identify the convergence point (Zaghloul et al., 2009). One of the

drawbacks associated with the validation technique is that the dataset should be large enough

to split it (Reed, 1993). This makes it unsuitable for small datasets especially complicated data

with few instances and many variables. When sufficient data is not available, the n-fold cross-

validation technique can be considered as an alternative to the validation technique (Seni and

Elder, 2010). The n-fold cross-validation technique splits the data randomly into n equal-sized subsamples. The FNN training is repeated n times, each time allocating one subsample for testing and the remaining n-1 subsamples for training; each subsample is used exactly once for testing. Finally, the n results are averaged to obtain a single accurate estimate. The suitable selection of

n-fold size is based on dataset complexity and network size which make this technique

challenging for complicated applications.
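As a concrete illustration of the n-fold procedure described above, the following minimal Python sketch (not taken from any cited work) runs 5-fold cross-validation with a simple ridge-regression model standing in for the FNN; the model, the synthetic data, and the number of folds are illustrative assumptions.

import numpy as np

def n_fold_cv(X, y, n_folds, fit, score, seed=0):
    # Shuffle indices, split them into n roughly equal subsamples, and use each
    # subsample once for testing while the remaining n-1 are used for training.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return np.mean(scores)          # average the n results into a single estimate

# illustrative model: ridge regression fit and RMSE score
fit = lambda X, y: np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ y)
score = lambda w, X, y: np.sqrt(np.mean((X @ w - y) ** 2))
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, 120)
print("5-fold CV RMSE:", n_fold_cv(X, y, 5, fit, score))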

The underfitting problem is easily detectable and therefore the literature has focused on

proposing techniques for improving overfitting. A large size network has a greater possibility

to overfit as compared to a smaller size (Setiono and Hui, 1995). Similarly, overfitting chances

increase when the network degree of freedom (such as weights) increases significantly from

the training sample (Reed, 1993). These two problems have gained greater attention and led to

the development of the techniques, explained below, to avoid overfitting analytically rather

than trial and error approaches. The algorithms for this category are subcategorized into

regularization, pruning, and ensembles.

a) Regularization Algorithms


Overfitting occurs when the information in the training examples does not match the network complexity. The network complexity depends upon the free parameters, such as the number of weights and added biases. It is often desirable to lessen the network complexity by using fewer or smaller weights to train the network. This can be done by limiting weight growth with a technique known as weight decay (Krogh and Hertz, 1992). It discourages weights from growing too large by adding a penalty term to the loss function:

\sum_{h=1}^{m}(y_h - \hat{y}_h)^2 + \lambda \sum_{j=1}^{p}(w_j)^2    (2.61)

where \sum_{h=1}^{m}(y_h - \hat{y}_h)^2 is the error function and w_j is the weight in position j, which is penalized by the parameter λ. The selection of λ determines how heavily large weights are penalized.

The above equation shows that if the training examples are more informative than the network

complexity, the regularization influence will be weak and vice versa. The above weight decay

method is also known as 𝑙2 regularization. Similarly, for 𝑙1 regularization:

\sum_{h=1}^{m}(y_h - \hat{y}_h)^2 + \lambda \sum_{j=1}^{p}|w_j|    (2.62)

l1 regularization adds the absolute value of the weights, while l2 regularization adds their squared value to the loss function. The experimental work of Krogh and Hertz (1992) demonstrated that an FNN with the weight decay technique can avoid overfitting and improve generalization performance, although the improvement reported in their paper was modest. Several higher-level techniques, as discussed below, were proposed to achieve better results.
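For illustration, the short Python sketch below (not taken from any cited work) evaluates the penalized losses of Equations (2.61) and (2.62) for a given weight vector; the targets, predictions, weights and λ = 0.01 are arbitrary illustrative values.

import numpy as np

def l2_penalized_loss(y, y_hat, w, lam):
    # Sum-of-squares error plus lambda times the squared weights, Eq. (2.61).
    return np.sum((y - y_hat) ** 2) + lam * np.sum(w ** 2)

def l1_penalized_loss(y, y_hat, w, lam):
    # Sum-of-squares error plus lambda times the absolute weights, Eq. (2.62).
    return np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

y = np.array([1.0, 0.0, 2.0])
y_hat = np.array([0.9, 0.2, 1.7])
w = np.array([0.5, -1.5, 0.05])
print(l2_penalized_loss(y, y_hat, w, lam=0.01), l1_penalized_loss(y, y_hat, w, lam=0.01))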

A renowned regularization technique Dropout was proposed for large fully connected networks

to overcome the issues of overfitting and the need for FNN ensembles (Srivastava et al., 2014).

With a limited dataset, the estimation of the training data with noise does not approximate the


testing data which leads to overfitting. Firstly, this happens because each parameter during the

gradient update influences the other parameters, which creates a complex coadaptation among

the hidden units. This coadaptation can be broken by removing some hidden units from the

network. Secondly, during FNN ensembles, combining the average of many outputs of a

separately trained FNN is an expensive task. Training many FNN requires many trial and error

approaches to find the best possible hyperparameters, which makes it a daunting task and needs

a lot of computation efforts. Dropout avoids overfitting by temporarily dropping the units

(hidden and visible) along with all connections with certain random probability during the

network forward steps. Training an FNN with dropout amounts to training 2^n thinned networks, where n is the number of units in the FNN. At testing time, it is not practical to average the predictions of all thinned trained FNNs; instead, a single FNN is used without dropout, whose weights are a scaled-down version of the trained weights. If the unit output is denoted by y_i', the input by x_i' and the weight by w', then dropout is expressed as:

y_i' = r * f(w' x_i')    (2.63)

where 𝑟 is a vector of the independent Bernoulli random variable with * denoting an element

wise product. The downside of dropout is that its training time is 2-3 times longer than the

standard FNN for the same architecture. However, the experimental results achieved by

dropout compared to FNN are remarkable. DropConnect, a generalization of Dropout, is

another regularization technique for a large fully connected layer FNN to avoid overfitting

(Wan et al., 2013). The key difference between Dropout and DropConnect is in their dropping

mechanism. The Dropout temporarily drops the units along with their connection weights,

whereas DropConnect randomly drops a subset of weight connections with a given probability during the forward propagation of the network. The output y_i' in DropConnect can be expressed as:

y_i' = f((r * w') x_i')    (2.64)
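The following minimal Python sketch (not from the original papers) contrasts the two masks of Equations (2.63) and (2.64): Dropout samples a Bernoulli mask r over unit activations, whereas DropConnect samples it over individual connection weights. The sigmoid activation, the keep probability of 0.5, and the layer sizes are illustrative assumptions.

import numpy as np

def dropout_forward(x, w, keep_prob, rng):
    # Eq. (2.63): drop whole units by masking their activations.
    r = rng.binomial(1, keep_prob, size=w.shape[1])
    return r * (1.0 / (1.0 + np.exp(-(x @ w))))

def dropconnect_forward(x, w, keep_prob, rng):
    # Eq. (2.64): drop individual connections by masking the weight matrix.
    r = rng.binomial(1, keep_prob, size=w.shape)
    return 1.0 / (1.0 + np.exp(-(x @ (r * w))))

rng = np.random.default_rng(4)
x = rng.normal(size=(5, 8))     # mini-batch of 5 inputs with 8 features
w = rng.normal(size=(8, 6))     # weights of a layer with 6 units
print(dropout_forward(x, w, 0.5, rng).shape, dropconnect_forward(x, w, 0.5, rng).shape)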

The training time of DropConnect is slightly higher than No-Drop and Dropout due to the

feature extractor bottleneck in large models. The advantage of DropConnect is that it allows

training a large network without overfitting with better generalization performance. The

limitation of both Dropout and DropConnect are that they are suitable only for fully connected

layer FNN. Other regularization techniques include Shakeout that was proposed to remove

unimportant weights to enhance FNN prediction performance (Kang et al., 2017). The

performance of FNN may not be severely affected if most of the unimportant connection

weights are removed. One technique is to train a network, prune the connections and finetune

the weights. However, this technique can be simplified by imposing sparsity-inducing penalties

during the training process. During implementation, Shakeout randomly chooses to reverse or

enhance the unit contribution to the next layer in the training forward stage to avoid overfitting.

Dropout can be considered a special case of Shakeout obtained by setting the enhancement factor to one and the reverse factor to zero. Shakeout induces l0 and l1 regularization, which

penalizes the magnitude of the weight and leads to a sparse weight that truly represents a

connection among units. This sparsity induced penalty (𝑙0 and 𝑙1 regularization) is combined

with 𝑙2 regularization to get a more effective prediction. It randomly modifies the weight based

on 𝑟 and can be expressed as:

w_{i+1}' = \begin{cases} -c\,s, & \text{if } r = 0 \\ \dfrac{w_i' + c\tau s}{1 - \tau}, & \text{otherwise} \end{cases}    (2.65)

where s = sgn(w_i') with value ±1, or 0 if w_i' = 0, c > 1 is a constant and τ ∈ [0,1]. When the hyperparameter

𝑐 is set to zero, shakeout can be considered a special case of Dropout.

Besides dropping hidden units or weights, the batch normalization method avoids overfitting and improves the learning speed by preventing the internal covariate shift problem, which occurs due to changes in the parameters during the training process of an FNN (Ioffe and Szegedy, 2015). Updating the parameters during training changes the distribution of each layer's inputs, which makes training deeper networks slower and requires more careful hyperparameter initialization. The batch normalization technique eliminates the internal covariate shift and accelerates network training. This simple technique improves each layer's input distribution by normalizing the activations of the layer to zero mean and unit variance.

Unlike BP with a higher learning rate which results in gradient overshoot or diverges and gets

stuck in local minima, the stable distribution of the activation values by batch normalization

during training can tolerate larger learning rates.
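A minimal Python sketch of the normalization step described above is given below for illustration (it is not the full training-time algorithm of Ioffe and Szegedy, which also tracks running statistics for inference). The layer activations are standardized over the mini-batch and then rescaled by learnable parameters; the names gamma and beta, the batch size, and ε are illustrative assumptions.

import numpy as np

def batch_norm_forward(a, gamma, beta, eps=1e-5):
    # Normalize each unit's activation over the mini-batch to zero mean and unit
    # variance, then apply the learnable scale (gamma) and shift (beta).
    mu = a.mean(axis=0)
    var = a.var(axis=0)
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta

rng = np.random.default_rng(5)
a = rng.normal(3.0, 2.0, size=(32, 10))      # pre-activations of one layer for a mini-batch of 32
out = batch_norm_forward(a, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))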

Similarly, when validation and its variant techniques fail to capture overfitting, it may go undetected, which is costly. The Optimized Approximation Algorithm (OAA) avoids overfitting without any need for a separate validation set (Liu et al., 2008). It

uses stopping criteria formulated in their original paper known as Signal to Noise Ratio Figure

(SNRF) to detect overfitting during the training error measurement. It calculates SNRF during

each iteration and if it is less than the defined threshold value, the algorithm is stopped.

b) Pruning FNN

Training a larger network requires more time because of the many hyperparameter adjustments, and it can lead to overfitting if not initialized and controlled properly, whereas a smaller network will learn faster but may converge to a poor local minimum or be unable to learn the data. Therefore, it is often desirable to obtain an optimal network size with no overfitting,


better generalization performance, fewer hyperparameters, and fast convergence. An effective

way is to train a network with a large possible architecture and remove the parts that are

insensitive to performance. This approach is known as pruning because it prunes unnecessary

units from the large network and reduces its depth to the most optimal possible small network

with few parameters. This technique is considered effective in eliminating the overfitting

problem and in improving generalization performance, but at the same time, it requires a lot of

effort and training time to construct a large network and then to reduce it. The simplest pruning technique involves training a network, pruning the unnecessary parts, and retraining. The training and retraining of weights in pruning will substantially increase the FNN training time. The training time can be reduced to some extent by calculating the sensitivity of each

weight with respect to the global loss function (Karnin, 1990). The weight sensitivity 𝑆 can be

calculated from the expression:

S_i = \sum_{i=0}^{n-1}[\Delta w_i]^2\,\frac{w_{i+1}}{\alpha\,(w_{i+1} - w_i)}    (2.66)

The idea is to calculate sensitivity concurrently during the training process without interfering

with the learning process. Reed (1993) explained that besides sensitivity methods, penalty

methods (weight decay or regularization) are also other types of pruning. In sensitivity

methods, weight or units are removed, whereas, in the penalty method, weights or units are set

to zero, which is like removing them from the FNN. Kovalishyn et al. (1998) applied the pruning technique to a constructive FNN algorithm and demonstrated that the generalization performance of constructive algorithms (also known as cascade algorithms) improves significantly compared to fixed-topology FNNs. Han et al. (2015) demonstrated the effectiveness of the pruning method by performing experimental work on AlexNet and VGG-16. The pruning method


reduced the parameters from 61 million to 6.37 million and 138 million to 10.3 million for

AlexNet and VGG-16 respectively without losing accuracy.
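To make the sensitivity-based pruning idea concrete, the following minimal Python sketch (not taken from the cited works) accumulates weight-change information during plain gradient-descent training of a linear model in the spirit of Equation (2.66) and then zeroes out the least sensitive half of the weights. The synthetic data, the learning rate, the pruning fraction, and the small constant guarding the denominator are illustrative assumptions.

import numpy as np

def karnin_sensitivity(delta_w_sq_sum, w_final, w_init, alpha):
    # In the spirit of Eq. (2.66): accumulated [delta w]^2 scaled by
    # w_final / (alpha * (w_final - w_init)); a small constant guards the denominator.
    return delta_w_sq_sum * w_final / (alpha * (w_final - w_init) + 1e-12)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0])          # three weights are truly irrelevant
w = rng.normal(0, 0.1, 6)
w_init = w.copy()
alpha, s = 0.01, np.zeros(6)
for _ in range(300):                                        # plain gradient descent on squared error
    grad = 2.0 * X.T @ (X @ w - y) / len(X)
    dw = -alpha * grad
    s += dw ** 2                                            # accumulate sensitivity concurrently with training
    w += dw
sens = np.abs(karnin_sensitivity(s, w, w_init, alpha))
pruned = w * (sens > np.quantile(sens, 0.5))                # remove (zero out) the least sensitive half
print("kept weights:", np.nonzero(pruned)[0])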

c) Ensembles of FNNs

Another effective method of avoiding overfitting is the ensembles of FNNs, which means

training multiple networks and combining their predictions, as opposed to using a single FNN. The combination is done through averaging in regression, majority voting in classification, or a weighted combination of FNNs. More advanced techniques include stacking, boosting and bagging,

and many others. The error and variance obtained by the collective decision of ensembles are

considered to be less as compared to individual networks (Hansen and Salamon, 1990). Many

ensembles combine three stages: defining the structure of FNN in ensembles, training each

FNN, and combining the results. Krogh et al. (1995) explained that the generalization error of

ensembles is less than network individual errors for uniform weights. Islam et al. (2003)

explained that the drawback with existing ensembles is that the number of FNNs in the

ensemble and the number of hidden units in individual FNN need to be predefined for FNN

ensemble training. This makes it suitable for applications where prior rich knowledge and FNN

experts are available. The above-mentioned drawbacks can be overcome by the constructive

NN ensemble (CNNE) by determining both the number of FNNs in the ensemble and their

hidden units in individual FNNs based on negative correlation learning. FNN ensembles are

still being used by researchers in many applications, with different strategies. The above discussion covers some of the techniques for effective ensembles; more can be found in the literature.

d) Applications of bias and variance minimization algorithms

Various applications of bias and variance minimization algorithms are explained and listed in

Appendix A.


2.2.4.3 Constructive topology FNN

An FNN with a fixed number of hidden layers and units is known as a fixed-topology network, which can be either shallow (a single hidden layer) or deep (more than one) depending on the application task. In contrast, an FNN that starts with the simplest network and then adds hidden units until the error converges is known as a constructive (or cascade) topology network. The drawbacks associated with fixed-topology FNNs are that the hidden layers and units need to be predefined

before training initialization. This needs a lot of trial and error to find the optimal hidden units

in the layers. If too many hidden layers and units are selected, the training time will increase,

whereas, if fewer layers and hidden units are selected, it might result in poor convergence.

Constructive topology networks have the advantage over fixed topologies that they start with a minimal simple network and then add hidden layers, each with some predefined hidden units, until error convergence occurs. This eliminates the need for trial and error approaches and automatically constructs the network. Recent studies have shown that constructive algorithms

are more powerful than fixed algorithms. Hunter et al. (2012) in their FNNs comparative study

explained that a constructive network can solve the Parity-63 problem with only 6 hidden units

compared to a fixed network that required 64 hidden units. The training time is proportional to

the hidden unit’s quantity. Adding more hidden units will cause the network to be slow because

of more computational work. The computational training time of constructive algorithms is

much better than in a fixed topology. However, it does not mean that the generalization

performance of constructive networks will be always superior to a fixed network. A more

careful approach is needed to handle hidden units with a lot of connections and parameters to

stop the addition of further hidden units to avoid decreasing generalization performance (Kwok

and Yeung, 1997).


The most popular algorithm for constructive topology networks in the literature is the Cascade-

Correlation neural network (CasCor) (Fahlman and Lebiere, 1990). CasCor was developed to

address the slowness of BPNN. The factor that contributes to the slowness of BPNN is that

each hidden unit faces a constantly changing environment when all weights in the network are

changed collectively. CasCor is initialized with a simple network by linearly connecting the input x to the output ŷ through the output connection weights w^{ocw}. Most often, the quickprop (QP) learning algorithm is applied for repetitive tuning of w^{ocw}. When training converges and E is still larger than ε, a new hidden unit h is added, receiving all input connections w^{icw} from x and from any pre-existing hidden units. The unit h is trained to maximize the magnitude of its covariance S^{cov} with E, as expressed below:

S_r^{cov} = \sum_{o}\left|\sum_{r}(h_r - \bar{h})(E_{r,o} - \bar{E}_o)\right|    (2.67)

where \bar{h} and \bar{E}_o are the mean values of the hidden unit output h_r and the output error E_{r,o}; o and r denote the network output and the hidden unit respectively. The magnitude is maximized by computing the

derivative of S^{cov} with respect to w^{icw} and performing gradient ascent. When there is no further increase in S^{cov}, the w^{icw} are kept frozen and h is connected to ŷ through w^{ocw}. The w^{ocw} are then retrained and E is recalculated. If E is less than ε, the algorithm is stopped; otherwise, hidden units are added one by one until the error is acceptable. CasCor suggests using the QP learning

algorithm for training because of its ability to take larger steps toward the minimum rather than the infinitesimally small steps of GD. Experimental work on the nonlinear two-spirals classification benchmark (Lang, 1989) demonstrated the training speed advantage of CasCor over BPFNN. The major advantage of CasCor is that it learns quickly and determines its own network size. However, studies show that CasCor is more suitable for classification tasks (Hwang et al., 1996). This narrows the scope of CasCor in other application


areas. Some studies have shown that the generalization performance of CasCor may not be optimal and even larger networks may be required (Lehtokangas, 2000). The experimental work of Kovalishyn et al. (1998) on quantitative structure-activity relationship (QSAR) data did not find any model where the performance of CasCor was significantly better than that of BPFNN; they optimized the performance of CasCor by introducing a pruning algorithm.
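To illustrate the candidate-scoring step of Equation (2.67), the following minimal Python sketch (not taken from Fahlman and Lebiere's implementation) evaluates the covariance magnitude between each randomly generated candidate unit and the residual output errors and reports the best-scoring candidate, which CasCor would then install. The synthetic data, the pool of eight candidates, and the tanh activation are illustrative assumptions, and the gradient-ascent tuning of the candidate's input weights is omitted.

import numpy as np

def cascor_score(h, E):
    # Eq. (2.67): sum over outputs of |covariance between the candidate's
    # activation h and the residual error of that output|.
    h_c = h - h.mean()
    E_c = E - E.mean(axis=0)
    return np.sum(np.abs(h_c @ E_c))

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 4))
E = np.column_stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]])   # residual errors for two outputs
scores = []
for _ in range(8):                                          # pool of 8 random candidate units
    a, b = rng.uniform(-1, 1, 4), rng.uniform(-1, 1)
    h = np.tanh(X @ a + b)
    scores.append(cascor_score(h, E))
print("best candidate index:", int(np.argmax(scores)))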

The Evolving cascade Neural Network (ECNN) was proposed to select the most informative

features from the data to resolve the issue of overfitting in CasCor (Schetinin, 2003). Overfitting in CasCor results from noise and redundant features in the training data, which affect its

performance. ECNN selects the most informative features by adding initially one input unit

and the network is evolved by adding a new input unit as well as a hidden unit. The final ECNN

has a minimal number of hidden units and the most informative input features in the network.

The selection criteria for a neuron is based on the regularity criterion 𝐶 extracted from the

Group Method of Data Handling (GMDH) algorithm (Farlow, 1981). The higher the value of

𝐶, the less informative the hidden unit will be. The selection criteria for hidden units can be

expressed as:

if C_r < C_{r-1}, accept the r-th hidden unit    (2.68)

The selection criteria illustrate that if 𝐶𝑟 calculated for the current hidden unit is less than the

previous hidden unit 𝐶𝑟−1, it means that the current hidden unit is more informative and

relevant than the previous unit and will be then added in the network.

Huang, Song, et al. (2012) explained that the CasCor covariance objective function is maximized

by training 𝑤𝑖𝑐𝑤 which cannot guarantee a maximum error reduction when a new hidden unit

is added. This will slow down the convergence and more hidden units will be needed which

will reduce the generalization performance. In addition to this, the repetitive tuning of


w^{ocw} after each hidden unit addition is more time consuming. They proposed an algorithm

named as Orthogonal Least Squares-based Cascade Network (OLSCN), which used an

orthogonal least squares technique and derived a new objective function 𝑆𝑜𝑙𝑠 for input training,

expressed as below:

S_r^{ols} = E_{r-1} - E_r = \frac{(\gamma_r^T y)^2}{\gamma_r^T \gamma_r}    (2.69)

where 𝛾 are the elements of the orthogonal matrix obtained by performing QR factorization of

the output of the 𝑟 hidden unit. This objective function is optimized iteratively by using the

second-order algorithm as expressed below:

w_{i+1(r)}^{icw} = w_{i(r)}^{icw} - [\mathbf{Hn}_r - \mu I_r]^{-1} g_r    (2.70)

where 𝐼 is an identity matrix and 𝜇 is damping hyperparameter factor, 𝑔𝑟 is the gradient of the

new objective function 𝑆𝑜𝑙𝑠 with respect to 𝑤𝑖𝑐𝑤 for hidden unit 𝑟. The information generated

from input training is further continued to calculate 𝑤𝑜𝑐𝑤 using the back substitution method

for linear equations. The w^{icw} values, after random initialization, are trained based on the above objective function; the w^{ocw} values, however, are calculated only after all hidden units have been added, and thus there is no need for any repeated retraining. The benefit of OLSCN is that it needs fewer hidden units than the original CasCor for the same training examples, with some improvement in

generalization performance.

In addition, the Faster Cascade Neural Network (FCNN) was proposed to improve the generalization performance of CasCor and to remedy two drawbacks of OLSCN: firstly, the linear dependence of candidate units on the existing hidden units h can cause errors in the new objective function of OLSCN; secondly, the w^{icw} modification by the modified Newton method is based

on a second-order 𝐇𝐧 which may result in a local minimum and slow convergence due to heavy

computation (Qiao et al., 2016). In FCNN, hidden nodes are generated randomly and remain


unchanged as inspired by randomly mapping algorithms. The 𝑤𝑜𝑐𝑤 connecting both input and

hidden units to output units are calculated after the addition of all the necessary input and

hidden units. The FCNN is initialized with no input or hidden units in the network. The bias unit is added to the network and input units are added one by one, with the error calculated for

each input unit. When there are no input units to be added, a pool of candidate units is randomly

generated. The candidate unit with the maximum capability to error reduction computed from

the reformulated modified index is added as a hidden unit to the network. When a maximum

number of hidden units or defined thresholds are achieved, the addition of hidden units is

stopped. Finally, the output units are calculated by back substitution computing like OLSCN.

The experimental comparison among FCNN, OLSCN, and CasCor demonstrated that FCNN achieved better generalization performance and fast learning; however, the network architecture grows many times larger.

Nayyeri et al. (2018) proposed to use a correntropy based objective function with a sigmoid

kernel to adjust the 𝑤𝑖𝑐𝑤 of the newly added hidden unit rather than a covariance (correlation)

objective function. The success of correlation heavily relies on the Gaussian and linearity

assumptions, whereas, the properties of correntropy are to make the network more optimal in

nonlinear and non-Gaussian terms. The new algorithm with a correntropy objective function

based on a sigmoid kernel is named as Cascade Correntropy Network (CCOEN). Similar to

other cascade algorithms described above, 𝑤𝑖𝑐𝑤 is optimized by using correntropy objective

function 𝑆𝑐𝑡 with sigmoid:

S_r^{ct} = \max_{U_r}\left(\sum_{i=1}^{r}\tanh\!\left(a_r\langle E_{r-1}, H_r\rangle + c\right) - C\,\|w_r^{icw}\|^2\right)    (2.71)


where tanh(а𝑟⟨. , . ⟩ + 𝑐) is defined as a sigmoid kernel with hyperparameters а and 𝑐. 𝐶 ∈

{0,1} is a constant. If the new hidden unit and residual error are orthogonal then 𝐶 = 0 ensures

algorithm convergence, otherwise 𝐶 = 1. The 𝑤𝑜𝑐𝑤 is adjusted as:

w_r^{ocw} = \frac{E_{r-1} H_r^T}{H_r H_r^T}    (2.72)

An experimental study was performed on regression problems, with and without noise and

outliers, by comparing CCOEN with six other objective functions as defined in (Kwok and

Yeung, 1997), the CasCor covariance objective function as shown in Equation (2.67), and one

hidden layer FNN. The study demonstrates that the CCOEN correntropy objective function, in most cases, achieved better generalization performance and increased robustness to noise and outliers compared to the other objective functions. The network size of CCOEN was slightly less compact than with the other objective functions; however, no specific objective function was found to give a more compact size in general.

a) Application of Constructive FNN

Various applications of constructive FNN are explained and listed in Appendix A.

2.2.4.4 Metaheuristic search algorithms

The major drawback of FNN is that it can become stuck at a local minimum with a poor convergence rate, and the situation becomes worse on plateau surfaces where the rate of change in the error is very slow at each iteration. This increases the learning time and the coadaptation among the hyperparameters.

Therefore, trial and error approaches are applied to find optimal hyperparameters, and this

makes gradient learning algorithms even more complex and it becomes difficult to select the

best FNN. Moreover, the unavailability of gradient information in some applications makes

FNN ineffective. To solve the two main issues of FNN that need to find the best optimal


hyperparameters and to make it useable with no gradient information, the application of

metaheuristic algorithms such as genetic algorithm, particle swarm optimization, and whale

optimization algorithm is implemented in combination with FNN. The major contribution of

the applications of a metaheuristic algorithm is that it may converge at a global minimum rather

than the local minimum by moving from a local search to a global search. Therefore, these are

more suitable for global optimization problems. The metaheuristic algorithms are good for

identifying the best hyperparameters and convergence at a global minimum, but they have

drawbacks in the high memory requirements and processing time.

a) Genetic Algorithm

Genetic Algorithm (GA) is a metaheuristic technique belonging to the class of evolutionary

algorithms (EA) inspired by Darwinian theory about evolution and natural selection and was

invented by John Holland in the 1960s (Mitchell, 1998). It is used to find a solution for

optimization problems by performing chromosome encoding, fitness selection, crossover, and

mutation. Like BP, it is used to train FNNs to find the best possible hyperparameters.

Researchers demonstrated that the generalization performance of GA is better than BP

(Mohamad et al., 2017) with the additional need for training time. Liang and Dai (1998)

highlighted the same issue by applying GA to FNN. They demonstrated that the performance

of GA is better than BP, but the training time increased. Ding et al. (2011) proposed that GA

performance can be improved by integrating the concept of both BP gradient information and

GA genetic information to train an FNN for superior generalization performance. GA is good

for global searching, whereas, BP is appropriate for local searching. The algorithm first uses

GA to optimize the initial weights by searching for better search space and then uses BP to fine

tune and search for an optimal solution. Experimental results demonstrated that FNN trained

on the hybrid approach of GA and BP achieved better generalization performance than


individual BPNN and GANN. Moreover, in terms of learning speed, a hybrid approach is better

than GANN compared to BPNN.

b) Particle Swarm Optimization

Eberhart and Kennedy (1995) argued that a GA used to find hyperparameters may suffer from its crossover operator: two chromosomes selected for their high fitness values might be very different from one another, and reproduction between them will not be effective. The Particle

Swarm Optimization (PSO) based FNNs (PSONN) can overcome this issue in GA by searching

for the best optimal solution through the movement of particles in space. PSO is an iterative

algorithm in which a particle 𝑤 adjusts its velocity 𝑣 during each iteration based on momentum,

its best possible position achieved 𝑝𝑝𝑏𝑒𝑠𝑡 so far and the best possible position achieved by

global search 𝑝𝑔𝑏𝑒𝑠𝑡. This can be expressed in mathematically as:

v_{i+1} = v_i + c_1 r (p_i^{pbest} - w_i) + c_2 r (p_i^{gbest} - w_i)    (2.73)

w_{i+1} = w_i + v_{i+1}    (2.74)

where r is a random number in [0,1] and c is an acceleration constant. A more robust generalization may be achieved by balancing the global and local search; PSO was modified into Adaptive PSO (APSO) by multiplying the current velocity with an inertia weight w^{inertia} such that (Shi and Eberhart, 1998):

v_{i+1} = w^{inertia} v_i + c_1 r (p_i^{pbest} - w_i) + c_2 r (p_i^{gbest} - w_i)    (2.75)
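For illustration, the following minimal Python sketch (not taken from the cited works) applies the velocity and position updates of Equations (2.73)-(2.75) to minimizing a toy sphere function. The swarm size, the acceleration constants c1 = c2 = 2, and the linearly decreasing inertia weight are common illustrative settings, not values prescribed by the text.

import numpy as np

def apso_minimize(f, dim, n_particles=20, iters=100, c1=2.0, c2=2.0, seed=8):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-5, 5, (n_particles, dim))            # particle positions
    v = np.zeros((n_particles, dim))
    pbest, pbest_val = w.copy(), np.array([f(p) for p in w])
    gbest = pbest[np.argmin(pbest_val)]
    for i in range(iters):
        inertia = 0.9 - 0.5 * i / iters                    # linearly decreasing inertia weight (APSO)
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        v = inertia * v + c1 * r1 * (pbest - w) + c2 * r2 * (gbest - w)   # Eq. (2.75)
        w = w + v                                          # Eq. (2.74)
        vals = np.array([f(p) for p in w])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = w[better], vals[better]
        gbest = pbest[np.argmin(pbest_val)]
    return gbest, f(gbest)

best, val = apso_minimize(lambda p: np.sum(p ** 2), dim=3)
print("best position:", best.round(3), "value:", round(val, 6))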

Zhang et al. (2007) argued that PSO converges faster in a global search but around the global

optimum, it becomes slow, whereas, BP converges faster at a global optimum due to efficient

local search. The faster property of APSO in a global search at the initial stages of learning and

the BP of local search motivates the use of a hybrid approach known as particle swarm


optimization backpropagation (PSO-BP) to train the FNN hyperparameters. Experimental

results indicate that PSO-BP generalization performance and learning speed are better than

both PSONN and BPNN. The critical point in PSO-BP is to decide when to shift learning from

PSO to BP during learning. This can be done by a heuristic approach: if the particle has not changed for a number of iterations, then learning should be shifted to gradient-based BP. One of the difficulties associated with PSO algorithms is the selection of optimal hyperparameters such as r, c and w^{inertia}. The performance of the PSO algorithm is highly dependent on hyperparameter adjustment and is currently being investigated by many researchers.

Shaghaghi et al. (2017) detailed a comparative analysis of a GMDH neural network optimized by GA and PSO, and concluded that GA is more efficient than PSO. However, in the literature, the performance advantage of GANN over PSONN, and vice versa, is still unclear.

c) Whale Optimization Algorithm

The application of GAs and APSO have been investigated widely in the literature, however,

according to the no free lunch (NFL) theorem, there is no algorithm that is superior for all

optimization problems. The NFL theorem is the motivation for proposing the whale

optimization algorithm (WOA) to solve the problem of local minima, slow convergence, and

dependency on hyperparameters. The WOA is inspired by the bubble net hunting strategy of

humpback whales (Mirjalili and Lewis, 2016). The idea of humpback whale hunting is

incorporated in WOA, which creates a trap using bubbles, moving in a spiral path around the

prey. It can be expressed mathematically as:

w_{i+1} = \begin{cases} w_i^* - A\,D, & p < 0.5 \\ D' e^{bl}\cos(2\pi l) + w_i^*, & p \geq 0.5 \end{cases}    (2.76)

where A = 2ar − a and C = 2r, such that a is linearly decreased from 2 to 0 and r is a random number in [0,1]; b is a constant that defines the shape of the spiral, l is a random number in [−1,1], and p is a random number in [0,1]. D' = |w_i^* − w_i| is the distance between the best solution obtained so far and the current position, and D = |C w_i^* − w_i|. The first part of the above

equation represents the encircling mechanism and the second part represents the bubble net

technique. Like GAs and APSO, the search process starts with candidate solutions and then

improves it until defined criteria are achieved. In WOA, search agents are assigned and their

fitness functions are calculated. The position of the search agent is updated by computing

Equation (2.76). The process is repeated until the maximum number of iterations is achieved.
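A simplified, illustrative Python sketch of the position update of Equation (2.76) is given below (it is not Mirjalili and Lewis's implementation, and the random-exploration branch used in the full WOA when |A| ≥ 1 is omitted). The sphere objective, the number of agents, and b = 1 are illustrative assumptions; the prey position w* is taken as the best solution found so far.

import numpy as np

def woa_minimize(f, dim, n_agents=20, iters=200, b=1.0, seed=9):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-5, 5, (n_agents, dim))
    best = min(w, key=f).copy()
    for i in range(iters):
        a = 2.0 - 2.0 * i / iters                       # a decreases linearly from 2 to 0
        for k in range(n_agents):
            r, p, l = rng.random(), rng.random(), rng.uniform(-1, 1)
            A, C = 2 * a * r - a, 2 * rng.random()
            if p < 0.5:
                D = np.abs(C * best - w[k])             # encircling mechanism, first case of Eq. (2.76)
                w[k] = best - A * D
            else:
                D_prime = np.abs(best - w[k])           # spiral bubble-net move, second case of Eq. (2.76)
                w[k] = D_prime * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            if f(w[k]) < f(best):
                best = w[k].copy()
    return best, f(best)

best, val = woa_minimize(lambda x: np.sum(x ** 2), dim=3)
print("best position:", best.round(3), "value:", round(val, 6))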

Ling et al. (2017) mentioned that WOA has the disadvantages of slow convergence, low precision, and being trapped in local minima because of a lack of population diversity. They argued that WOA is more useful for solving low-dimensional problems; however, when dealing with high-dimensional and multi-modal problems, its solution is not quite optimal. To overcome the

WOA drawbacks, an extension known as the Lévy flight WOA was proposed to enhance the

convergence by avoiding the local minimum.

d) Application of metaheuristic search algorithms

Various applications of metaheuristic search algorithms are explained and listed in Appendix

A.

2.3 Discussion

The application of NNs is gaining much popularity in the airline sector to improve its various

operations and enhance services. For instance, Lin and Vlachos (2018) proposed an analytical

framework, known as “Importance-Performance-Impact-Analysis”, to improve customer

satisfaction from airlines by incorporating techniques such as BPNNs, decision-making trail,

and evaluation laboratory. Cui and Li (2017) studied airline efficiency measures under “Carbon

Neutral Growth from 2020” and used a BPNN to predict the CO2 emission volume for an


individual airline. Khanmohammadi et al. (2016) studied the issue of nominal variables in data

and proposed to use a multilevel input layer BPNN to predict incoming flight delays for the

John F. Kennedy (JFK) International Airport in New York. The list of applications of neural

networks in the airline domain is long, thereby motivating the application of NN-based machine learning methods to study fuel estimation for flight trips.

We limit the scope of our study to EAs and BPNNs. EAs are disadvantageous in that they

require a lot of effort to perform flight testing, and they involve expertise for complex

mathematical formulation as well as cost to determine coefficients. All these aspects make EAs

difficult to implement (Pagoni and Psaraki-Kalouptsidi, 2017; Senzig et al., 2009; Yanto and

Liem, 2018). The efforts based on BPNNs to provide an alternative for EAs can be considered

as an improvement; however, existing research works are restricted to trial-and-error

approaches to determine the optimal hyperparameters, the number of hidden units and layers,

and the activation functions. The major drawbacks of the fixed topology of BPNNs are that it

involves iterative tuning of connection weights. This tuning may converge at local minima

when global minima are far away. As a result, learning becomes slow when the learning rate

is low and unstable when the learning rate is high (Huang, Zhu, et al., 2006). Other issues

include local and global hyperparameter adjustment and decision, for instance, setting the

learning rate, connection weights, hidden units, hidden layer, and gradient learning algorithm.

Too many hyperparameters and their adjustment with iterative tuning influence between each

other make the network complex, which leads to the problem known as weak generalization

performance and slow convergence (Huang, Chen, et al., 2006; Kapanova et al., 2018; Krogh

and Hertz, 1992; Liew et al., 2016; Srivastava et al., 2014). The aforementioned drawbacks

need to be resolved in a more innovative way to get an accurate trip fuel estimation for each

flight with less required expertise and faster learning speed. The in-depth literature review reveals the following research gaps:


1 EAs are based on complex energy balance mathematical formulation which needs a lot

of expertise and cost to determine many coefficients making it difficult to implement.

A lot of flight testing is needed to generate information about parameters and to

determine coefficients. EAs need information about aircraft types and engines to determine the amount of fuel required for a journey. The unavailability of such information, and the use of an outdated existing database containing global parameters with default values rather than local parameters, may result in inaccurate fuel estimation by EAs. In the literature, insufficient attempts have been made to

study the effect of using global parameters on the deviation of fuel from estimation

using a practical example.

2 Machine learning BPNNs have gained much popularity in the aviation sector, particularly for estimating fuel consumption, as an alternative to EAs. Regarding methodology, the pitfall in existing studies is that no clear direction exists explaining the reason for the recommended fixed topology architecture for

estimating fuel. The selection of hidden units, hidden layers, connection weight, and

learning rate requires much human intervention and inappropriate hyperparameter

selection may cause the algorithm to converge at suboptimal solution leading to higher

fuel deviation. A lot of trial-and-error experiments need to be performed to find the best hyperparameters, which becomes even more difficult when dealing with high-dimensional real data. Another technical drawback of BPNN is that it converges at local

minima when global minima are far away, and computational time is highly dependent

on gradient derivation and learning speed.

3 It is observed that the selection of operational parameters was based on prior experience and knowledge, whereas their relationship to fuel consumption went unexamined. Moreover, existing BPNN-based estimation models are trained on low-level operational parameters. For instance, the Schilling (1997) model was based on low-level operational parameters such as altitude and velocity, with a future recommendation of incorporating more parameters. Trani et al. (2004) also used low-level operational parameters such as Mach number, weight, temperature, and altitude, and in a more recent study, Baklacioglu (2016) again used low-level operational parameters such as altitude and velocity for estimating fuel consumption. A gap still exists in the literature for adding high-level operational parameters, previously ignored, that can contribute to minimizing fuel deviation.

4 Most studies formulated to overcome fuel estimation limitations consider only the take-off, cruise, or descent phase. The suggested models may fail to work efficiently on all phases. For instance, a model proposed for take-off may not perform accurately for descent because of a change in operational parameters. Some models estimate fuel for each phase separately and accumulate the results to obtain the final fuel needed for the journey. This approach can result in even higher deviation because the overestimation or underestimation in each phase may accumulate. Therefore, it is important to have a model that can efficiently estimate fuel for all phases collectively to ensure profitability and safety.

5 Once developed, a machine learning NN with a fixed architecture may no longer be suitable for learning the same application if the data structure and size change. How to improve existing machine learning algorithms so that they give equally accurate results regardless of changes in data structure and size is gaining attention. In the literature, fuel is estimated from the available small-dimensional data, and no effort has been made to check the performance of an algorithm on data of a varying nature. The comparison of fuel deviation when considering the high-dimensional data of all sectors versus reducing the dataset into small chunks (portions) needs considerable attention. An algorithm with a minimal difference in performance regardless of changes in data structure and size implies that its estimation accuracy does not decrease with changes in data dimensionality.

An important question arises: Which machine learning NN algorithm will be more suitable to

address the above-mentioned research gaps? A comprehensive review of FNNs in Section 2.2

shows that improving existing methods is not straightforward, and researchers are making
continuous efforts to propose algorithms that are computationally efficient and have better
generalization performance. By analyzing the existing research, we identify the following
seven research gaps that need considerable attention in order to improve existing FNNs:

1 Activation function: The importance of using nonlinear activation functions in the hidden layers of FNNs is clear; however, it is still unclear in the literature which activation function is more suitable for a particular application and data structure.

2 Efficient and compact algorithm with fewer hyperparameters: Gradient-free learning algorithms are simple to apply because they avoid backpropagation, whereas gradient learning algorithms have compact architectures. Efforts are needed to design an algorithm that has both the characteristics of gradient-free learning and a compact architecture.

3 Connection weight initialization: The purpose of an FNN is to find the optimal connection weights that generate the best results. The question arises: What should the best initial weights for the network be? A traditional FNN is dependent on the initial weight values because it calculates the derivative of the total error with respect to the weights to find a minimum of the objective function. Assigning suboptimal weights causes the network to require more iterations and subsequently decreases its performance. The issue is resolved to some extent by a newer approach that analytically calculates the connection weights on the output side while randomly generating hidden units. However, existing neural networks can be further strengthened by calculating all connection weights (both input and output connections) analytically to generate hidden units that explain the maximum variance in the dataset. This may also further improve the generalization and learning speed by compacting the size of the network.

4 Data structure: Future research may include designing FNN architectures with less
complexity and the ability to learn noisy data without overfitting. Few

attempts have been made to study the effect of data size and structure on FNN

algorithms. The availability of training data in many applications (for instance, medical

science and so on) is limited (small data rather than big data) and costly (Choi et al.,

2018; Shen, Choi and Minner, 2019). Once trained, the same model may become

unstable for less or more data within the same application and may result in a loss of

generalization ability. Each time, for the same application area and algorithm, a new

model is needed to be trained with a different set of parameters. Future research on

designing an algorithm that can approximate the problem task equally, regardless of

data size (instance) and shape (features), will be a breakthrough achievement.

5 Eliminating the need for data pre-processing and transformation: Incorrect application
of pre-processing and transformation techniques may produce suboptimal results.

Future algorithms are needed that may be less sensitive to outliers and noisy data and

do not need to reduce the magnitude of data.

6 A hybrid approach for real-world applications: Researchers most often use publicly available real-world data to compare their results with other popular algorithms. Frequent use of the same real-world application data may cause specific datasets to become de facto benchmarks. The best practice may be to use a hybrid approach that combines well-known datasets in the field with data from new applications during the comparative study of algorithms. This may create and maintain users’ interest in FNNs over time.

7 Number of hidden units needed: The analytical calculation of hidden units to deal with large, high-dimensional datasets, rather than trial and error, is gaining interest. Determining the number of hidden units needed in either single or multiple layers, such that no dependency exists between them, can ensure maximum error reduction for the network.

Addressing all the above FNN research gaps in the current work is difficult. This work focuses
on two important research gaps, “Connection weight initialization” and “Data structure”, which
need researchers' attention.

In summary, in order to address the above research gaps of fuel estimation and FNNs, the trip

fuel model is formulated, and a novel machine learning algorithm is proposed. A computational

comparison of the proposed algorithm is made with other known popular algorithms using

artificial benchmarking and real-world datasets to demonstrate its effectiveness. Lastly, the

proposed algorithm is applied to estimate trip fuel for airlines using real historical data. A
comparison of the proposed algorithm with existing AEA- and BPNN-based fuel models is
also made to enrich the literature.


Chapter 3: Trip Fuel Model Formulation for Minimizing

Fuel Deviation

3.1 Introduction

In this chapter, our main focus is to formulate a model for estimating trip fuel. Prior to

formulating the model and defining an objective function, the framework of fuel consumption

and its deviation from estimation is concisely explained. Before the commencement of each

flight, various amounts of fuel are loaded in the aircraft to meet certain requirements. The trip
fuel, which comprises the largest portion of the loaded fuel, ensures a smooth journey. The
difference between the trip fuel estimated and that consumed after the flight enables us to
understand the reasons for the overestimation and underestimation of fuel. The major causes
of higher fuel deviation are unreliable estimation techniques and the omission of some
operational parameters that contribute strongly to fuel consumption. The

methods (or estimation techniques) that are popular for estimating aircraft fuel are of two types:
mathematical methods and data-driven methods. Mathematical methods help to provide a
physical understanding of the system. However, in real applications, mathematical methods
can be difficult to implement and inaccurate because a lot of information is needed to represent
the relationships among variables (Kuo and Kusiak, 2019). Machine learning (data-driven)
methods are considered ideally suited to learning from historical data to make better-informed,
time- and cost-saving decisions. Among machine learning methods, NNs are the most popular
for solving problems because of their universal approximation capability (Ferrari and Stengel,
2005; Hornik et al., 1989; Huang, Chen, et al., 2006). They can solve complex nonlinear
problems more accurately than classical statistical techniques (Kumar et al., 1995; Tkáč and
Verner, 2016; Tu, 1996). The range of NN applications is broad; some areas include regression


estimation (Chung et al., 2017; Deng et al., 2019; Kummong and Supratid, 2016; Teo et al.,

2015), image processing (Dong et al., 2016; Mohamed Shakeel et al., 2019), image

segmentation (Chen et al., 2018), video processing (Babaee et al., 2018), speech recognition

(Abdel-Hamid et al., 2014), text classification (Kastrati et al., 2019; Zaghloul et al., 2009), face

classification and recognition (Yin and Liu, 2018), human action recognition (Ijjina and

Chalavadi, 2016), risk analysis (Nasir et al., 2019) and many others. With modern information

technology advancement, the challenging issue of high dimensional, non-linear, noisy and

unbalanced data is continuously growing and varying at a rapid rate so that it demands efficient

learning algorithms and optimization techniques. The data may become a costly resource if not

analysed properly. Machine learning is gaining popularity in all aspects from data gathering to

discovering knowledge and its role in enhancing business decisions is gaining significant

interest (Bottani et al., 2019; Hayashi et al., 2010; Kim et al., 2019; Lam et al., 2014; Li et al.,

2018; Mori et al., 2012; Wang et al., 2005; Wong et al., 2018).

Efforts are being made to overcome the challenges by building optimal NNs that may extract

useful patterns from the data and generate information for better-informed decision making.

Extensive knowledge and theoretical information are required to build NNs having the

characteristics of better generalization performance and learning speed. The generalization

performance and learning speed are the two criteria that play an essential role in deciding on

the use of learning algorithms and optimization techniques to build optimal NNs. Depending

upon the application and data structure, the user might prefer either better generalization

performance or faster learning speed, or a combination of both. In its simplest form, NN with

a single hidden layer is powerful in solving many problems, given that there are a sufficient

number of hidden units in the layer (Nguyen and Widrow, 1990). The application of NNs in

diverse topics is not simple and expertise is needed to build an optimal network to achieve the


intended results in the shortest possible time. More specifically, in our work, mathematical

methods can be referred to as EAs and data-driven methods can be referred to as NNs.

The use of global parameters rather than local parameters and the need for complex
mathematical calculations to determine the coefficients may cause a higher deviation of fuel
from estimation in EAs. The BPNN is a suitable alternative; however, its weak generalization
performance, slow learning speed, and required user expertise may limit its application for
estimating fuel consumption. Unlike existing estimation methods in which operational

parameters are selected based on experience and knowledge, this research work segregates and

incorporates all those parameters which have a relationship to fuel consumption. This work

also considered some other important operational parameters that were ignored in the existing

literature. The statistical analysis and explanation show that extracted operational parameters

have a strong effect on fuel consumption and cannot be ignored. In the end, a model is

formulated, and the objective function of minimizing fuel deviation is defined.

The Chapter is organized as follows. Section 3.2 concisely explains the framework of fuel

consumption and its deviation from estimation. Section 3.3 covers the extraction of complex
high-level operational parameters, shows their relationship to fuel consumption, and constructs
the relationship between the mathematical and data-driven methods. Furthermore, the

significance of the proposed method is also elaborated before formulating a model for trip fuel.

Section 3.4 summarizes the chapter.

3.2 Fuel consumption and deviation problem framework

In this study, we work on a dataset obtained from an international airline operating in Hong

Kong. Currently, it experiences the usual problem of excess fuel consumption that increases its

fuel expenses. Before each flight operation, the flight plan is prepared such that it contains


details about the amount of fuel to be loaded in the tank reservoirs. Figure 3.1 illustrates the

framework by showing various amounts of takeoff fuel (𝑡𝑜𝑓) needed for flight trips with fuel

deviation, including

𝑡𝑜𝑓 = 𝑓(𝑡𝑎𝑥𝑖, 𝑡𝑟𝑖𝑝 𝑓𝑢𝑒𝑙, 𝑐𝑜𝑛𝑡𝑖𝑛𝑔𝑒𝑛𝑐𝑦, 𝑒𝑥𝑡𝑟𝑎, 𝑑𝑖𝑠𝑐𝑟𝑒𝑝𝑎𝑛𝑐𝑦,

𝑎𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑒, 𝑓𝑖𝑛𝑎𝑙 𝑟𝑒𝑠𝑒𝑟𝑣𝑒) (3.1)

1) Taxi Fuel: Amount of fuel needed to operate an auxiliary power unit, start the engine,

and cover the ground distance before starting the takeoff.

2) Trip Fuel: Amount of fuel needed for normal flight operation from the takeoff at the

departure airport to landing at the arrival airport.

3) Contingency Fuel: Additional fuel loaded to meet holding and insufficient block fuel.

4) Extra Fuel: Additional fuel loaded to manage bad weather conditions and/or airport

congestion.

5) Discrepancy Fuel: Additional fuel loaded according to experience to meet unforeseen

conditions and/or account for aircraft deterioration.

6) Alternate Fuel: Additional fuel loaded to fly to an alternate airport if required.

7) Final Reserve Fuel: Last emergency reservoir to handle any uncertain situation.

Figure 3.1. Fuel estimation before flight and consumption after the flight
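For illustration only, the decomposition in Equation (3.1) can be sketched in a few lines of Python, taking f(.) as a plain sum of the loaded reservoirs; the quantities below are invented and do not come from the airline dataset.

    # Small illustration of the takeoff-fuel breakdown in Equation (3.1); the
    # quantities are invented and f(.) is taken as a plain sum for illustration.
    from dataclasses import dataclass

    @dataclass
    class FuelLoad:
        taxi: float
        trip: float
        contingency: float
        extra: float
        discrepancy: float
        alternate: float
        final_reserve: float

        def takeoff_fuel(self) -> float:
            """tof as the sum of the loaded reservoirs (kg)."""
            return (self.taxi + self.trip + self.contingency + self.extra
                    + self.discrepancy + self.alternate + self.final_reserve)

    load = FuelLoad(taxi=600, trip=52000, contingency=2600, extra=1000,
                    discrepancy=800, alternate=2500, final_reserve=2200)
    print(load.takeoff_fuel())  # 61700 kg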


The trip fuel accounts for the largest portion of the aircraft's fuel reservoir, whereas the other
fuels are loaded based on requirements and to meet international regulations for a safe journey.
Currently, the airline's trip fuel estimation is based on various factors such as flight time,
weight, wind speed, altitude, and speed. This estimated trip fuel is loaded into the aircraft with
the other additional reserves

for smooth flight operations. As illustrated in Figure 3.1, after flight operations, the actual trip

fuel usually deviates from estimated fuel. The difference between fuel consumed and estimated

is known as fuel deviation. The fuel deviation may be positive or negative. When the actual
fuel consumed is less than the fuel estimated, the fuel is considered overestimated, and the fuel
deviation is positive. Conversely, when the actual fuel consumed is more than the fuel
estimated, the fuel is considered underestimated, and the fuel deviation is negative. Both situations are

undesirable for the smooth operation of aircraft. Fuel comprises a major portion of the aircraft

weight and its overestimation can cause an increase in the total weight of the aircraft, which

needs more thrust and force to balance the increase in weight and drag with other features,

resulting in more fuel consumption. Similarly, underestimation can create safety issues.

Underestimated fuel may create problems in reaching the destination safely and may need

utilization of some fuel from the supplementary reservoirs which are held back for other

reasons and emergency purposes. This creates less confidence in the fuel estimation system

and additional fuel is loaded based on experience in the discrepancy reservoir.

Other than fuel deviation, suboptimal fuel loading has some other major drawbacks: adding
extra fuel in the discrepancy reservoir may result in worse management control and may require
more frequent aircraft maintenance than planned, which may shorten engine life and cause huge
economic losses to airlines. The major causes of higher fuel deviation from estimation are the
omission of operational parameters that contribute strongly to fuel consumption and unreliable
estimation methods.


3.3 Model formulation

This section is divided into three major subsections. In Subsection 3.3.1, the real historical data

is statistically analysed to extract complex high-level operational parameters that significantly

influence fuel consumption. In Subsection 3.3.2, the mathematical and data-driven methods are

elaborated, and their relationship is constructed. Furthermore, the significance of the proposed

method is also explained to understand the purpose and need of the work. Subsection 3.3.3 is

about the trip fuel model formulation. The model is formulated by defining an objective function
of minimizing fuel deviation and is designed to overcome the existing limitations.

3.3.1 Complex high-level operational parameters

3.3.1.1 Statistical analysis and extraction

To achieve the objectives of better fuel estimation with less expertise requirement and faster

learning speed, our study focuses on analysing high-dimensional data (Choi et al., 2018; Chung

et al., 2015). These high-dimensional data were provided by the airline and contain details

about real historical flights operated from April 2015 to March 2017. The operational
parameters that contribute to fuel consumption are extracted from the data. The extracted data
are used to estimate the trip fuel before each flight and compared with the actual consumed fuel to

measure fuel deviation in absolute percentage error. The comparative study is performed

among existing AEA, a BPNN, and the proposed CNN (discussed in Section 4.4). These

methods are used to estimate fuel for each flight. The one leading to a lesser fuel deviation is

considered for its effectiveness.

During each flight operation, the aircraft generates a considerable amount of high-dimensional

data with many operational parameters. The selection of relevant operational parameters plays


an important role in model performance (Guo et al., 2018). Instead of putting all the operational

parameters from the collected data into algorithms, a correlation analysis was performed among

operational parameters and the consumed fuel to identify and select the most relevant

operational parameters that significantly contribute to fuel consumption. Table 3.1 shows the

selected operational parameters. The abbreviations are further explained in
Section 3.3.3. The runway direction is the only categorical variable included, because it also

influences fuel consumption. Statistical work shows that takeoff from a runway that is opposite

in direction to the destination consumes more fuel because of the loop the aircraft must perform

to face towards the destination airport.
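As an illustration of this screening step, the sketch below shows how such a correlation-based selection could be performed with pandas; the DataFrame, the column names, and the 0.1 threshold are hypothetical and are not the exact procedure or values used to build Table 3.1.

    # Illustrative sketch of correlation-based parameter screening (hypothetical
    # column names); not the exact procedure used to build Table 3.1.
    import pandas as pd

    def select_parameters(df: pd.DataFrame, target: str = "trip_fuel",
                          threshold: float = 0.1) -> list:
        """Return the columns whose Pearson correlation with the target
        exceeds the threshold in absolute value."""
        corr = df.corr()[target].drop(target)
        return corr[corr.abs() > threshold].index.tolist()

    # Example usage with a toy frame
    toy = pd.DataFrame({
        "trip_fuel": [52000, 61000, 48000, 70000],
        "flight_time": [610, 700, 580, 780],
        "ramp_weight": [250000, 262000, 245000, 280000],
        "wind": [-20, 15, -5, 30],
    })
    print(select_parameters(toy))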

3.3.1.2 Influence of extracted parameters on the fuel consumption

The relationship between each extracted operational parameter and fuel consumption is clear:

flight time (𝑡) and distance travel (𝑑) are directly related to fuel consumption because the more

Table 3.1. Correlation analysis of selected operational parameters with trip fuel consumption

Parameter   SR_H^US1  SR_H^D   SR_US1^H  SR_US2^H  SR_UK^H  SR_H^US2  SR_H^UK  SR_H^A   All Sectors
t            0.364     0.863    0.735     0.940     0.645    0.782     0.524    0.046    0.766
m            0.897     0.681    0.536     0.850     0.825    0.713     0.812    0.888    0.783
w           -0.379    -0.876   -0.741    -0.922    -0.726   -0.712    -0.616   -0.226   -0.295
temp_a      -0.585     0.770    0.454     0.548     0.015   -0.416     0.157    0.131    0.021
temp_g      -0.081     0.802    0.283     0.455    -0.053   -0.293     0.173    0.641   -0.059
hl1         -0.833    -0.596   -0.551    -0.222    -0.570   -0.448    -0.111   -0.541   -0.577
hl2         -0.812    -0.506   -0.604    -0.177    -0.305   -0.451    -0.412   -0.584   -0.687
hl3         -0.589     0.056   -0.080    -0.041    -0.331   -0.457    -0.382   -0.506   -0.368
hl4         -0.141     0.331    0.114     0.112    -0.293   -0.287    -0.171   -0.052    0.151
hl5         -0.089     0.261    0.041    -0.152     0.071   -0.012     0.086    0.175    0.178
d            0.046     0.456    0.176    -0.279     0.111    0.180     0.098    0.073    0.687
perf_ac     -0.144    -0.014   -0.227     0.138    -0.131   -0.250    -0.047   -0.211   -0.317
mac_zfw     -0.047    -0.025    0.046     0.043    -0.033    0.067    -0.067   -0.468   -0.047
mac_tow     -0.469    -0.190   -0.436    -0.309    -0.240   -0.375    -0.268   -0.538   -0.344
mac_law     -0.029     0.205    0.056     0.042    -0.031    0.065    -0.066   -0.404   -0.118
v            0.155    -0.012   -0.165    -0.220     0.045    0.107    -0.099    0.105    0.422


the aircraft is airborne, the more the fuel it will consume. Flight duration and covered distance

significantly contribute to aircraft operational expenses. The cost index (CI) is used to adjust

the speed (𝑣) of aircraft to a trade-off between higher operational expenses or more fuel-saving

(Edwards et al., 2016). The operational expenses can include crew time, leasing rate, planned
maintenance, landing and take-off fees, and ground services. All of them play major roles in

deciding whether to keep the aircraft airborne to make the airline profitable. The CI ranges

from 0 to 999 and is expressed as

$CI = \dfrac{\text{Flying Cost}}{\text{Fuel Cost}}$   (3.2)

When fuel is expensive, the CI will be lower, which means a slower speed of the aircraft.

Operating an aircraft at low-speed results in a higher climb rate due to excessive engine thrust

and, simultaneously, it is recommended to fly at a higher altitude to lower fuel consumption.

Similarly, when fuel is cheap, the CI will be higher, and the aircraft will fly faster so that it
spends less time airborne, reducing operational cost at the expense of more consumed fuel. Note that

the CI shortens or lengthens the airborne phase by changing the aircraft speed depending on

fuel prices and operational expenses to cut overall expenses.
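A minimal sketch of the trade-off in Equation (3.2) is shown below; the cost figures are invented, and only the clamping to the documented 0–999 range is taken from the text.

    # Illustrative sketch of the cost-index trade-off in Equation (3.2); the
    # clamping to the 0-999 range follows the text, the cost figures are made up.
    def cost_index(flying_cost_per_hour: float, fuel_cost_per_kg: float) -> float:
        """CI = flying cost / fuel cost, clamped to the documented 0-999 range."""
        ci = flying_cost_per_hour / fuel_cost_per_kg
        return max(0.0, min(999.0, ci))

    print(cost_index(8000.0, 0.5))   # fuel cheap relative to time costs: CI clamps to 999 (fly faster)
    print(cost_index(8000.0, 50.0))  # fuel expensive: low CI of 160 (fly slower)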

The ramp weight (𝑚) can be expressed according to the following combination (Pagoni and

Psaraki-Kalouptsidi, 2017):

𝑚 = 𝑧𝑓𝑤 + 𝑡𝑜𝑓 (3.3)

Zero fuel weight (𝑧𝑓𝑤) is the total weight of an aircraft including crew, passengers, and

unusable fuel, minus the total weight of the usable fuel. During flight operation, the 𝑧𝑓𝑤

remains constant whereas 𝑡𝑜𝑓 weight decreases over time because of its continuous


consumption. This reduces the value of 𝑚 for an aircraft and ultimately decreases fuel

consumption over time.

A suitable choice of wind direction (𝑤) has a significant influence on fuel consumption. The

headwind that blows in a direction opposite to the aircraft is more favorable during takeoff and

landing. The aerofoils can generate more lift during takeoff and more induced drag during

landing. Tailwind that blows in the direction of the aircraft flight path is helpful during the

cruise phase as the aircraft travels faster and saves fuel at the same ground speed (Irrgang et

al., 2015). The ground speed is determined by the vector sum of aircraft speed, wind speed,

and direction. Headwind subtracts from the ground speed, while tailwind is added to the ground

speed.

The altitude (ℎ) can be used to assess the aerodynamic performance of an aircraft at certain

atmospheric conditions. Flying at higher altitudes can significantly save fuel consumption

because of less drag (Turgut and Rosen, 2012). Simultaneously, it may face a problem of low-

density air for fuel combustion that may result in more fuel flow to the engine. Therefore, the

altitude is adjusted for a non-standard temperature known as density altitude and can be

expressed as

$DA = PA + (118.8\ \mathrm{ft}/^{\circ}\mathrm{C})\,(OAT - ISA\ Temp)$   (3.4)

$PA = h + (30\ \mathrm{ft}/\mathrm{millibar})\,(1013\ \mathrm{millibar} - QNH)$   (3.5)

where 𝐷𝐴 is the density altitude in feet, 𝑃𝐴 is the pressure altitude in feet, 𝑄𝑁𝐻 is the

atmospheric pressure in millibar, 𝑂𝐴𝑇 is the outside air temperature in degree Celsius, and

𝐼𝑆𝐴 𝑇𝑒𝑚𝑝 is the international standard atmospheric temperature in degrees Celsius. Aircraft
are designed for a specific optimal altitude that minimizes fuel consumption. A change from
that altitude may cause an aircraft to burn more fuel (Diao and Chen, 2018).
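The two corrections in Equations (3.4) and (3.5) translate directly into a small helper, sketched below with inputs assumed to be in feet, millibars, and degrees Celsius; the example values are arbitrary.

    # A small helper implementing Equations (3.4)-(3.5) as written above; inputs
    # are assumed to be in feet, millibars and degrees Celsius.
    def pressure_altitude(h_ft: float, qnh_mb: float) -> float:
        """PA = h + (30 ft/millibar) * (1013 millibar - QNH)."""
        return h_ft + 30.0 * (1013.0 - qnh_mb)

    def density_altitude(h_ft: float, qnh_mb: float, oat_c: float, isa_temp_c: float) -> float:
        """DA = PA + (118.8 ft/degC) * (OAT - ISA Temp)."""
        return pressure_altitude(h_ft, qnh_mb) + 118.8 * (oat_c - isa_temp_c)

    # Example: altitude 2,000 ft, QNH 1003 mb, OAT 25 degC, ISA 11 degC
    print(density_altitude(2000.0, 1003.0, 25.0, 11.0))  # -> 3963.2 ft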


The weight and balance of aircrafts can be expressed in terms of the percentage of the mean

aerodynamic chord (𝑚𝑎𝑐) (Dancila et al., 2013).

$mac\% = \dfrac{\left(\dfrac{\sum_{i=1}^{N} w_i a_i}{m}\right) - lemac}{temac - lemac}$   (3.6)

where $\left(\sum_{i=1}^{N} w_i a_i\right)/m$ is defined as the center of gravity (COG), with $w$ denoting the
component weights and $a$ denoting the arm value, their product being the moment arm ($wa$). $lemac$ and $temac$ are

the leading-edge mean aerodynamic chord and tail-edge mean aerodynamic chord,

respectively. Improper distribution of aircraft weight may shift its COG forward and may need

more tail lift force for stable flight. The tail force ultimately increases the aircraft angle of

attack, which may cause the aircraft to face more induced drag. Therefore, the aircraft weight

balance (at the center of aircraft) should be maintained at zero fuel weight during takeoff and

landing phases to avoid the generation of excess induced drag and reduce fuel consumption.
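Equation (3.6) can likewise be transcribed directly; in the sketch below the component weights, arms, and chord positions are hypothetical values used only to show the calculation.

    # A direct transcription of Equation (3.6): the weight-and-balance position as
    # a fraction of the mean aerodynamic chord. Variable names mirror the text;
    # the example numbers are made up.
    from typing import Sequence

    def mac_percent(weights: Sequence[float], arms: Sequence[float],
                    ramp_weight: float, lemac: float, temac: float) -> float:
        """mac% = ((sum(w_i * a_i) / m) - lemac) / (temac - lemac)."""
        cog = sum(w * a for w, a in zip(weights, arms)) / ramp_weight  # center of gravity
        return (cog - lemac) / (temac - lemac)

    # Example usage (hypothetical component weights/arms)
    print(mac_percent(weights=[150000, 80000, 20000],
                      arms=[24.0, 26.5, 30.0],
                      ramp_weight=250000, lemac=22.0, temac=28.0))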

The location and direction of the runway have clear effects on fuel consumption (Singh and

Sharma, 2015). Runway directions are identified by a number within the range 1 to 36, which

are magnetic azimuths in decadegrees. A runway number 09 means pointing toward the east

(90°), 18 means pointing toward the south (180°), 27 means pointing toward the west (270°),

and 36 means pointing toward the north (360°). Furthermore, if there is more than one runway

in the same direction, then the location is differentiated with a letter as follows: L for left, C

for center, and R for right. It is more favorable to take off from a runway that faces the

headwind direction. However, it can also be a source of excess fuel consumption. If that

favorable runway is in the opposite direction to the allocated flight path, the aircraft must take

a loop to head towards the allocated path.

3.3.2 Mathematical and data-driven solution methods


In this section, the EA mathematical method and the BPNN data-driven method are concisely

explained to understand the complexity of both methods. The purpose is to reduce the

complexity of estimation methods to achieve minimum fuel deviation with less human

intervention.

EAs are based on the extensive mathematical formulation and are derived from the basic

concept of energy balance by accepting static constants of aircraft performance and the

dynamic input of the path profile (Collins, 1982). The energy balance can be expressed as

energy gain and loss by the aircraft as it travels along the path profile:

𝑒𝑛𝑒𝑟𝑔𝑦 𝑔𝑎𝑖𝑛 − 𝑒𝑛𝑒𝑟𝑔𝑦 𝑙𝑜𝑠𝑠 = 𝑒𝑛𝑒𝑟𝑔𝑦 𝑐ℎ𝑎𝑛𝑔𝑒 (3.7)

𝐸𝑇 − 𝐸𝐷 = △ 𝐾𝐸 + △ 𝑃𝐸 (3.8)

During each flight operation, the change in kinetic energy ΔKE and potential energy ΔPE,
together with the energy the aircraft loses to drag E_D, is balanced by the thrust energy E_T to

maintain the energy balance. Thus, aircraft can be considered a system with energy losses and

gains that should be continuously balanced by the consumption of fuel energy. Collins (1982)

derived an algorithm for fuel estimation based on the aircraft configuration, weight and path

profile. The path profile may be described by considering a change in true airspeed, altitude

and time, with the following expressions for fuel estimation:

$\hat{F} = \dfrac{t\,\bar{v}_t\,F_N}{v_t\,P} + t\,(LAM1)$   (3.9)

Such that

$F_N = \dfrac{R_1 K_1 \bar{\rho}\, S_w \bar{v}_t^{\,2}}{2} + \dfrac{2 R_2 K_2 m^2}{\bar{\rho}\, S_w \bar{v}_t^{\,2}} + \dfrac{m}{g\,t}\,(v_{t2} - v_{t1}) + \dfrac{m}{t\,v_t}\,(h_2 - h_1)$   (3.10)


$P = K_{10}\, e^{K_{11} \bar{v}_t} + K_{12}\left(\dfrac{h_1^2 + h_1 h_2 + h_2^2}{3}\right) + K_{13}\left(\dfrac{h_1 + h_2}{3}\right) + K_{14}$   (3.11)

$LAM1 = \begin{cases} 0 & \text{if } F_N \le K_7 \\ K_8 F_N + K_9 & \text{otherwise} \end{cases}$   (3.12)

where F̂ is the fuel estimate, 𝑡 is time, 𝑣 is true velocity, v̄ is average velocity, 𝐹𝑁 is thrust, 𝑃 is
thrust energy, 𝐿𝐴𝑀1 is the relationship between fuel flow and thrust, ρ̄ is the average atmospheric
density, 𝑆𝑤 is wing area, 𝑚 is weight, 𝑔 is the gravitational acceleration, and ℎ is altitude. In

the Equations (3.10), (3.11) and (3.12), Collins (1982) explained that 𝐾1, 𝐾2, 𝑅1, 𝑅2,

𝐾7, 𝐾8, 𝐾9, 𝐾10, 𝐾11, 𝐾12, 𝐾13, 𝐾14 and 𝑆𝑤 are aircraft-specific constants and need to be

determined for each aircraft. Constants 𝐾1, 𝐾2 and 𝑆𝑤 determine the relationship between drag

and lift/weight coefficient and constants 𝐾10, 𝐾11, 𝐾12, 𝐾13 and 𝐾14 determine the relationship

between fuel consumption and energy gain from the thrust as a function of velocity and altitude.

The drag increases with a change in configuration such as gear up or gear down and can be

expressed as:

$R_1 = GU_1 F^3 + GU_2 F^2 + GU_3 F + 1 \quad (\text{gear up})$   (3.13)

$R_1 = GD_1 F^3 + GD_2 F^2 + GD_3 F + 1 \quad (\text{gear down})$   (3.14)

$R_2 = FDM_1 F^3 + FDM_2 F^2 + FDM_3 F + 1$   (3.15)

where 𝐹 is flap angle, 𝐷 is drag and 𝑀 is Mach number. Experimental work demonstrates that

the algorithm could result in saving more than two million US gallons of fuel annually. The

accuracy check of the algorithm with Eastern Airlines demonstrates its effectiveness with a

difference of less than 3%. This algorithm, named AFBM, was developed by MITRE under the
FAA and was incorporated in their simulation software Simmod to predict the fuel consumption
for each flight (Baklacioglu, 2016; Schilling, 1997). Similarly, the BADA


developed by Eurocontrol (Nuic, 2014) is another mathematical-based fuel estimation method

that estimates thrust specific fuel consumption (TSFC) 𝜂 as a function of true airspeed 𝑣𝑇𝐴𝑆.

The BADA provides files on performance and operating procedure coefficients of various

aircraft. The coefficients are needed to determine fuel flow, drag and thrust needed to specify

cruise, climb and descent speeds. The fuel for different phases and engines can be expressed

as:

$\eta = \begin{cases} C_{f1}\left(1 + \dfrac{v_{TAS}}{C_{f2}}\right), & \text{Jet Engine} \\[6pt] C_{f1}\left(1 + \dfrac{v_{TAS}}{C_{f2}}\right)\left(\dfrac{v_{TAS}}{1000}\right), & \text{Turboprop Engine} \end{cases}$   (3.16)

$\hat{F}_{nom} = \eta\, Thr, \quad \text{Jet and Turbo Engine}$   (3.17)

$\hat{F}_{nom} = C_{f1}, \quad \text{Piston Engine}$   (3.18)

where F̂_nom is the nominal fuel estimate and 𝐶𝑓1, 𝐶𝑓2 are fuel flow coefficients for the specific

aircraft type and are derived in OPF of the BADA. Equations (3.16), (3.17) and (3.18) are used

for fuel estimation in all flight phases except during idle descent and cruise. The following

mathematical calculations need to be performed to estimate the fuel consumption for idle

descent phase of flight:

$\hat{F}_{min} = C_{f3}\left(1 + \dfrac{H_p}{C_{f4}}\right), \quad \text{Jet and Turbo Engine}$   (3.19)

$\hat{F}_{min} = C_{f3}, \quad \text{Piston Engine}$   (3.20)

where F̂_min is the minimum fuel estimate and 𝐶𝑓3, 𝐶𝑓4 are fuel flow coefficients. When an

aircraft switches from idle descent and reaches the approach 𝑎𝑝 and landing 𝑙𝑑 phase, the thrust

is increased. The calculation for fuel flow at approach and landing phase can be expressed as:


$\hat{F}_{ap/ld} = \mathrm{MAX}(\hat{F}_{nom}, \hat{F}_{min}), \quad \text{Jet and Turbo Engine}$   (3.21)

Cruise phase fuel is estimated by using 𝜂, cruise thrust 𝑇ℎ𝑟 and the cruise fuel flow factor 𝐶𝑓𝑐𝑟

for jet and turbo engine, and fuel coefficients 𝐶𝑓1 and 𝐶𝑓3 for piston engine:

$\hat{F}_{cr} = \eta \times Thr \times C_{fcr}, \quad \text{Jet and Turbo Engine}$   (3.22)

$\hat{F}_{cr} = C_{f1} \times C_{f3}, \quad \text{Piston Engine}$   (3.23)

The BADA manual contains detailed mathematical expressions needed to calculate the thrust

𝑇ℎ𝑟 for each flight phase of the climb, take-off, cruise, descent, approach, and landing.
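To make the structure of these expressions concrete, the sketch below implements the jet-engine cases of Equations (3.16)–(3.21) exactly as written above; the coefficients Cf1–Cf4 are placeholders and not real BADA values, which would come from the aircraft's OPF file.

    # Sketch of the jet-engine fuel-flow expressions (3.16)-(3.21) as written
    # above; Cf1..Cf4 are placeholder values, not real BADA coefficients.
    def tsfc_jet(v_tas: float, cf1: float, cf2: float) -> float:
        """Thrust-specific fuel consumption, Eq. (3.16), jet engine."""
        return cf1 * (1.0 + v_tas / cf2)

    def fuel_nominal_jet(v_tas: float, thrust: float, cf1: float, cf2: float) -> float:
        """Nominal fuel flow, Eq. (3.17): eta * Thr."""
        return tsfc_jet(v_tas, cf1, cf2) * thrust

    def fuel_minimum_jet(hp: float, cf3: float, cf4: float) -> float:
        """Idle-descent fuel flow, Eq. (3.19)."""
        return cf3 * (1.0 + hp / cf4)

    def fuel_approach_landing(v_tas, thrust, hp, cf1, cf2, cf3, cf4) -> float:
        """Approach/landing fuel flow, Eq. (3.21): max of nominal and minimum."""
        return max(fuel_nominal_jet(v_tas, thrust, cf1, cf2),
                   fuel_minimum_jet(hp, cf3, cf4))

    # Placeholder coefficients purely for illustration
    print(fuel_approach_landing(v_tas=250.0, thrust=50.0, hp=3000.0,
                                cf1=0.7, cf2=900.0, cf3=10.0, cf4=80000.0))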

BPNN data-driven method is considered as an alternative for EAs based fuel estimation

models. A BPNN is defined as a network that does not form a cycle. All the connection signals

exist from the input unit 𝑿 to the output unit 𝒀 only in the forward direction. Note that 𝑿 and

𝒀 are connected through a number of hidden units 𝑯 by a series of connection weights without

any cycle or loop (Hecht-Nielsen, 1989). Figure 3.2 illustrates a simple BPNN with three

layers. The number of hidden units and layers in the BPNN determines its topology structure.

A BPNN with one hidden layer is known as shallow type, whereas a BPNN featuring more

than one hidden layer is known as deep type (Bianchini and Scarselli, 2014; LeCun et al.,

2015). The weights 𝑾 connecting 𝑿 to 𝑯 are known as input connection weights whereas the

weights 𝜷 connecting 𝑯 to 𝒀 are known as output connection weights. A BPNN is initialized

by randomly defined hyperparameters, for instance, learning rate, connection weights, hidden

units, and hidden layers. Training a model involves two steps, i.e., forward propagation and

backward propagation. A BPNN can be considered a type of FNN. In the forward propagation
step, the output Ŷ is estimated by applying ∅ to the product of 𝑯 and 𝜷. Suppose 𝑿 is an m×n matrix, 𝒀


an m×q matrix, 𝑯 an m×r matrix, 𝑾𝒊𝒄𝒘 an n×r matrix, and 𝑾𝒐𝒄𝒘 an r×q matrix. The estimation
of the output Ŷ can be expressed as:

$\hat{Y} = \phi(H W_{ocw})$   (3.24)

where

$H = \phi(X W_{icw} + b)$.   (3.25)

The objective function is the sum-of-squares error (SSE):

$SSE = E = \sum_{i=1}^{m} (\hat{Y}_i - Y_i)^2$   (3.26)

If 𝐸 < 𝜀, where 𝜀 is the predefined target error, the algorithm stops. Otherwise, a

backpropagation (BP) iterative learning algorithm is applied in the backward propagation step

to train the model. The BP learning algorithm works on either first-order or second-order

gradient information. In its simplest form, for a first-order derivative, the gradient error E with

respect to the connection weights can be calculated by the chain rule:

Figure 3.2. BPNN learning algorithm network


$\dfrac{\partial E}{\partial W_i^{ocw}} = \dfrac{\partial E}{\partial out_i}\; \dfrac{\partial out_i}{\partial net_i}\; \dfrac{\partial net_i}{\partial W_i^{ocw}}$   (3.27)

where $\partial E / \partial W_i^{ocw}$ is the partial derivative of the error with respect to $W_i^{ocw}$ for iteration $i$, $out_i$ is the activation output, and $net_i$ is the weighted sum of the inputs of $H$.

Thus, for updating the output connection weight $W_{i+1}^{ocw}$, we have

$W_{i+1}^{ocw} = W_i^{ocw} - \eta\, \dfrac{\partial E}{\partial W_i^{ocw}}$   (3.28)

Similarly, for updating the input connection weight $W_{i+1}^{icw}$, we apply

$W_{i+1}^{icw} = W_i^{icw} - \eta\, \dfrac{\partial E}{\partial W_i^{icw}}$   (3.29)

A similar procedure is followed for second-order derivative learning algorithms. The updated

new connection weights are forward propagated and 𝐸 is recalculated. The connection weights

are trained after each iteration to obtain a minimum gradient error. When E converges or starts

increasing, the algorithm is stopped.
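A compact NumPy sketch of this training loop is given below for a single hidden layer: the forward pass follows Equations (3.24)–(3.26) and the updates follow Equations (3.28)–(3.29), except that a linear output unit and an error gradient scaled by m are used, which are common choices for regression; the layer sizes, learning rate, and random data are illustrative only.

    # Minimal single-hidden-layer BPNN sketch following Eqs. (3.24)-(3.29);
    # sizes, learning rate and data are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, r, q = 64, 17, 10, 1                   # flights, inputs, hidden units, outputs
    X = rng.normal(size=(m, n))
    Y = rng.normal(size=(m, q))

    W_icw = rng.normal(scale=0.1, size=(n, r))   # input connection weights
    W_ocw = rng.normal(scale=0.1, size=(r, q))   # output connection weights
    b = np.ones((1, r))                          # hidden-unit bias (+1)
    eta, eps = 0.01, 1e-3                        # learning rate, target error
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    E = np.inf
    for it in range(5000):
        H = sigmoid(X @ W_icw + b)               # Eq. (3.25)
        Y_hat = H @ W_ocw                        # Eq. (3.24) with a linear output unit
        E = np.sum((Y_hat - Y) ** 2)             # SSE, Eq. (3.26)
        if E < eps:                              # stop when the target error is reached
            break
        dY = 2.0 * (Y_hat - Y) / m               # error gradient, scaled by m for stability
        grad_ocw = H.T @ dY                      # chain rule of Eq. (3.27), output side
        dH = (dY @ W_ocw.T) * H * (1.0 - H)      # backpropagated through the sigmoid
        grad_icw = X.T @ dH
        W_ocw -= eta * grad_ocw                  # Eq. (3.28)
        W_icw -= eta * grad_icw                  # Eq. (3.29)

    print("final SSE:", float(E))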

3.3.2.1 Relationship between mathematical and data-driven methods

Senzig et al. (2009) highlighted that BADA is an effective method that works better for the

cruise phase; however, it loses accuracy for terminal areas such as takeoff and descent. Besides,

the information needed to determine coefficients for mathematical methods is not always

available in a real scenario and it needs to be generated from other sources, which may limit

their applicability (Pagoni and Psaraki-Kalouptsidi, 2017; Trani and Wing-Ho, 1997; Yanto

and Liem, 2018). The estimation of trip fuel by EA usually results in a higher deviation of fuel

from estimation. The usage of more global parameters, with default values, rather than local


parameters, and complex mathematical calculations requiring information about many
coefficients, result in suboptimal estimation of fuel. The BPNN has gained much popularity in
estimating fuel consumption and its usage may overcome EA limitations to some extent;

however, it has certain problems that can influence its generalization performance and learning

speed causing higher fuel deviation (Huang, Zhu, et al., 2006; Srivastava et al., 2014). The

major drawbacks are:

1. being trapped in a local minimum when the global minimum is far away

2. facing the problem of saddle points
3. slow convergence on plateau surfaces
4. network performance being affected by hyperparameter initialization and adjustment
5. needing trial-and-error approaches and expert involvement
6. repeated tuning of connection weights
7. adjustment of hidden units and layers.

In addition to the drawbacks mentioned above, the following human expertise significantly

affects the fuel estimation results of BPNNs:

1. What should be the network size and depth i.e. shallow or deep?

2. How many hidden units should be generated by each hidden layer?

3. How many hidden layers will be sufficient for deep learning?

4. What should be network initial connection weights and learning rate?

5. How should the hyperparameters be adjusted?

6. What should be the size of the dataset during network training?

7. Which learning algorithm should be implemented?

8. Which network topology is more efficient i.e. fixed or cascade?


9. What should be the criteria for increasing or decreasing the global and local

hyperparameters?

10. What type of activation function should be used in hidden units?

The relationship between mathematical and data-driven methods is illustrated in Table 3.2. The

drawbacks and user expertise of both EA and BPNN need to be addressed by proposing an

efficient machine learning algorithm that has the characteristics of better generalization
performance and faster learning speed.

Table 3.2. Relationship between the mathematical method and the data-driven method

Approach
  EA mathematical method: A mathematical formulation of the model.
  BPNN data-driven method: Input operational parameters and output labels for the network.

Information source
  EA mathematical method: A lot of flight testing must be performed to generate information and calculate parameters such as engine constants, airspeed, aircraft mass, atmospheric density, extra thrust needed for ascending, wing area, etc.
  BPNN data-driven method: Ability to extract information from historical data.

Performance
  EA mathematical method: Performance increases by incorporating local parameters rather than global parameters.
  BPNN data-driven method: Prediction performance increases with a high-dimensional dataset.

Expertise
  EA mathematical method: Human experts with deep domain knowledge and understanding of the mathematical formulation are favorable.
  BPNN data-driven method: A lot of human intervention is needed to define hyperparameters and select an optimal fixed topology. Training is a challenging phase.

Suitability
  EA mathematical method: More suitable for estimating fuel for the cruise phase rather than terminal areas such as the take-off and descent phases.
  BPNN data-driven method: Performance depends upon the information in the historical data.

Complexity
  EA mathematical method: The information about various aircraft is stored in a database.
  BPNN data-driven method: Less complex compared to EA; however, it may be constrained by memory requirements. The use of high-dimensional data with second-order derivative learning algorithms requires high computational resources.

Flexibility
  EA mathematical method: Information about some aircraft may not be available or the database may be outdated. Global parameters or outdated data must be incorporated to estimate fuel.
  BPNN data-driven method: The algorithm can learn and predict with the available data.

3.3.2.2 Significance of the proposed method

The contribution of existing works on the application of BPNN to estimate fuel consumption

is noteworthy. However, existing research works are restricted to a small number of operational

parameters, flight instances, hyperparameters, and the adoption of trial-and-error approaches to

determine the optimal number of hidden units and layers. Table 3.3 illustrates the comparison

among existing BPNN based fuel estimation models and the proposed method. The comparison

shows that in existing literature only fixed topology NNs are proposed and none of the previous

work considered constructive topology NN for fuel estimation. Deciding on the architecture of

fixed topology is more dependent on human expertise and domain knowledge. Much human

expertise, as discussed in Section 3.3.2.1, is required to define an architecture for the fixed

topology to generate optimal results. This means that the architecture needs to be verified by

performing a lot of experimental work based on a trial and error approach. Furthermore, the

architecture might not be suitable for another set of historical data that is smaller or larger in
volume than the initial data on which the network was trained. This gives the intuition that with
every set of historical data the fixed topology needs to be verified through trial-and-error

experimental works. The use of constructive topology minimizes human intervention by adding


hidden units at each hidden layer and the process continues until error convergence. This makes

CNN more like deep learning having multiple hidden units and hidden layers in the network.

Unlike fixed topology, the CNN is advantageous in that it is not affected by changes in the
historical dataset. For historical data that are smaller or larger than the initial data, the CNN
network simply grows to a smaller or larger size, respectively.

It should be noted that low-level operational parameters, e.g. altitude, speed, and weight, are

commonly used for estimation in existing BPNN models and many other important operational

parameters that have a significant effect on fuel consumption are ignored. The importance of
operational parameters in estimating fuel consumption cannot be disputed. Machine learning
algorithms depend on historical data, and incorporating operational parameters that contribute
to fuel consumption can improve the generalization performance of the network. The restriction
to low-level operational parameters in existing works stems from the low generalization
performance and slow convergence of the BPNN, together with the expertise needed to define
hyperparameters. To

address this issue, the present work incorporates high-level operational parameters in the

proposed method that have a significant effect on fuel consumption.

Another important point that strengthens this work is that we consider high-dimensional data
covering a large fleet and various sectors to estimate fuel for all phases collectively. The flights
performed by various aircraft in the same or different sectors help to generate data covering
different geographical regions. Studying a single aircraft or a single sector may make it difficult
for NNs to capture the varying behavior of the operational parameters. For instance, if aircraft-A
is used for sector-A, the learned network might be unsuitable for predicting fuel for sector-B
using the same aircraft, since the geographical conditions of sector-A might differ from those of
sector-B. Similarly, if sector-A uses aircraft-A, the learned network might be unsuitable for
aircraft-B, because the performance of aircraft-A might differ from that of aircraft-B; one may
be newer or older than the other.

The existing methods for fuel estimation are proposed under certain constraining conditions.
The constraints include proposing methodologies for only the take-off, climb, cruise, or descent
flight phase, with limited data and aircraft types and inadequate consideration of

different sectors. Taking the cumulative effect of distinct flight phases may result in even

higher fuel deviation. In the proposed model, distinct flight phases are considered as one

complete trip. The fuel is estimated for all-sectors and individual sectors to understand the

effect of operational parameters and their varying behavior on fuel consumption.

Table 3.3. Relationship among existing BPNN-based fuel estimation models and the proposed method

Methodology
  Schilling (1997): Backpropagation Levenberg-Marquardt neural network
  Trani et al. (2004): Backpropagation Levenberg-Marquardt neural network
  Baklacioglu (2016): Genetic-algorithm-optimized neural network
  Proposed work: Novel CNN proposed in Chapter 4

Operational parameters
  Schilling (1997): Altitude, speed
  Trani et al. (2004): Mach number, altitude, weight, temperature
  Baklacioglu (2016): Altitude, true airspeed
  Proposed work: Flight time, weight, wind, temperature deviation at the air, temperature deviation at the ground, altitude (1st to 5th levels), distance, aircraft performance, mean aerodynamic chord at zero fuel weight, mean aerodynamic chord at takeoff weight, mean aerodynamic chord at landing weight, speed, runway direction

Aircraft
  Schilling (1997): Boeing 747-100/767-200, Bombardier Dash-7, DC10-30 and Jetstar
  Trani et al. (2004): Fokker F-100
  Baklacioglu (2016): Boeing 737-800
  Proposed work: Airbus A330-300 (31 aircraft) and Boeing 747-400 (9 aircraft)/747-800 (14 aircraft)/777-300 (53 aircraft)

Flight instances
  Schilling (1997): 600 data points
  Trani et al. (2004): Model trained on 1,610 flights of the cruise phase
  Baklacioglu (2016): 347 for climb, 404 for cruise, 483 for descent (total = 1,234)
  Proposed work: 19,117 flight instances containing details of all flight phases from takeoff to block-on

Sectors (airports in country)
  Schilling (1997): --
  Trani et al. (2004): --
  Baklacioglu (2016): Turkey
  Proposed work: Australia, Dubai, Hong Kong, United Kingdom, United States of America

3.3.3 Trip fuel model formulation

From the selected operational parameters, we consider the fuel consumption for each flight in

the form

𝒀 = 𝑓(𝑿) (3.30)


$Y = f(t, m, w, temp_a, temp_g, hl_1, hl_2, hl_3, hl_4, hl_5, d, perf_{ac}, mac_{zfw}, mac_{tow}, mac_{law}, v, rwy)$   (3.31)

The objective function is

$RMSE = \sqrt{\dfrac{1}{m}\sum_{i=1}^{m}(\hat{Y}_i - Y_i)^2}$   (3.32)

or

$MAPE = \dfrac{100\%}{m}\sum_{i=1}^{m}\left|\dfrac{\hat{Y}_i - Y_i}{Y_i}\right|$   (3.33)

such that

$\hat{Y} = \left(\phi(X W_{icw} + b)\right) W_{ocw} + \varepsilon$   (3.34)

$\phi(z) = \dfrac{1}{1 + e^{-z}}$   (3.35)

where

𝒀: Output target variable representing actual consumed fuel after

flight operation in Kilogram (Kg)

𝑿: Input variables with m-rows representing flight instances and n-

columns representing operational parameters

𝑡: Flight time duration from takeoff to landing in minutes (min)


𝑚: Ramp weight including aircraft, passengers, crew, and usable and

unusable fuel weight in Kilogram (Kg)

𝑤: Wind speed (a negative value means headwind whereas a

positive value means tailwind) in Knots (kt)

𝑡𝑒𝑚𝑝𝑎: Temperature deviation at the air in degrees Celsius (°C)

𝑡𝑒𝑚𝑝𝑔: Temperature deviation at the ground in degrees Celsius (°C)

ℎ𝑙1: Cruise altitude level 1 in hectofeet (hft)

ℎ𝑙2: Cruise altitude level 2 in hectofeet (hft)

ℎ𝑙3: Cruise altitude level 3 in hectofeet (hft)

ℎ𝑙4: Cruise altitude level 4 in hectofeet (hft)

ℎ𝑙5: Cruise altitude level 5 in hectofeet (hft)

𝑑: Travel distance in nautical miles (NM)

𝑝𝑒𝑟𝑓𝑎𝑐: Performance of aircraft engine (% compared to a new model

aircraft)

𝑚𝑎𝑐𝑧𝑓𝑤: Mean aerodynamic chord at zero fuel weight (%)

𝑚𝑎𝑐𝑡𝑜𝑤: Mean aerodynamic chord at takeoff weight (%)

𝑚𝑎𝑐𝑙𝑎𝑤: Mean aerodynamic chord at landing weight (%)

𝑣: Speed of aircraft in cost index (CI)


𝑟𝑤𝑦: Runway direction (a number between 01 and 36, which is

generally the magnetic azimuth of the runway's heading in

decadegrees)

Ŷ: Fuel estimated for flight in Kilogram (Kg)

𝑾𝒊𝒄𝒘: Input connection weights connecting input units (operational

parameters) to hidden units

𝑏: Bias for hidden units (+1 value)

𝑾𝒐𝒄𝒘: Output connection weight connecting hidden units to the output

unit (fuel estimation)

∅: Nonlinear sigmoid activation function

𝜀: Predefined target error

𝑅𝑀𝑆𝐸: Root mean square error between estimated fuel and actual

consumed fuel (Kg)

𝑀𝐴𝑃𝐸: Mean absolute percent error between estimated fuel and actual

consumed fuel (%)

𝑆𝑅𝐻𝑈𝑆𝐼: Sector route from departure airport-one at the United States of

America and arrival airport at Hong Kong

𝑆𝑅𝐻𝐷: Sector route from departure airport at Dubai and arrival airport at

Hong Kong


𝑆𝑅𝑈𝑆1𝐻 : Sector route from departure airport at Hong Kong and arrival

airport-one at the United States of America

𝑆𝑅𝑈𝑆2𝐻 : Sector route from departure airport at Hong Kong and arrival

airport-two at the United States of America

𝑆𝑅𝑈𝐾𝐻 : Sector route from departure airport at Hong Kong and arrival

airport at the United Kingdom

𝑆𝑅𝐻𝑈𝑆2: Sector route from departure airport-two at United States of

America and arrival airport at Hong Kong

𝑆𝑅𝐻𝑈𝐾: Sector route from departure airport at the United Kingdom and

arrival airport at Hong Kong

𝑆𝑅𝐻𝐴: Sector route from departure airport at Australia and arrival airport

at Hong Kong

The objective functions (3.32) or (3.33) correspond to fuel deviation and their purpose is to

achieve the smallest deviation by estimating a value of Ŷ that approximately represents 𝒀. The

RMSE objective function (3.32) measures the performance of the estimation method by

determining the difference between the estimated Ŷ and the actual consumed 𝒀. The

measurement scale is identical to that of 𝒀. The RMSE tends to give more importance to a

higher deviation by computing its square error. Actually, this is a more useful metric for

comparison given that a higher deviation is undesirable. The MAPE objective function (3.33)

measures the performance in percentage accuracy by determining the absolute difference

between the estimated Ŷ and the actual consumed 𝒀. The percentage accuracy helps to make a

more informed decision by comparing the deviation of the estimation methods on an absolute


scale. In the NN, the value of Ŷ, as shown in Equation (3.34), can be estimated by determining

coefficients of connection weights 𝑾𝒊𝒄𝒘 and 𝑾𝒐𝒄𝒘. Hidden units can be generated in-between

connection weights by using a nonlinear sigmoid activation function ∅ , as shown in Equation

(3.35), to truly approximate 𝒀 from 𝑿.
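The two objective functions translate directly into the short sketch below; the estimated and consumed trip fuel values are invented and serve only to show the calculation of Equations (3.32) and (3.33).

    # The two fuel-deviation metrics of Eqs. (3.32)-(3.33); inputs are the
    # estimated and actual trip fuel per flight (e.g. in kg).
    import numpy as np

    def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
        """Root mean square error, Eq. (3.32)."""
        return float(np.sqrt(np.mean((y_hat - y) ** 2)))

    def mape(y_hat: np.ndarray, y: np.ndarray) -> float:
        """Mean absolute percentage error, Eq. (3.33)."""
        return float(100.0 * np.mean(np.abs((y_hat - y) / y)))

    # Example with made-up estimated vs. consumed trip fuel (kg)
    y_hat = np.array([51500.0, 60800.0, 47200.0])
    y = np.array([52000.0, 61000.0, 48000.0])
    print(rmse(y_hat, y), mape(y_hat, y))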

3.4 Summary

In this chapter, we explained the framework of fuel consumption and its deviation from the

estimation, extracted complex high-level operational parameters that contribute to fuel

estimation and explained their relationship, and formulated a model with an objective function

to minimize a fuel deviation. The trip fuel deviation arises from the overestimation or

underestimation of fuel because of less confidence in existing estimation techniques and

ignoring some of the operational parameters that may contribute to fuel consumption.

Overestimation increases the weight of an aircraft and hence more fuel is needed to balance

thrust and drag. Underestimation may create a safety issue by utilizing some of the fuel reserved

in supplementary reservoirs that may be needed for other purposes or emergency landing.

The two methods used to estimate fuel consumption for aircraft, known as a mathematical

method and the data-driven method, are explained. The mathematical methods are helpful for a
physical understanding of the system; however, they are equally challenging to apply in practice.
They are difficult to apply because information about the aircraft is not always available and
expert knowledge is needed. The EA mathematical methods are best suited for the cruise phase
and lose accuracy for the terminal areas. There are separate mathematical equations for
estimating fuel consumption for each distinct flight phase, which require a lot of flight testing to
generate information about parameters. Some of the information about the aircraft, such as
engine design, wing area, aerodynamic properties, and thrust needed, and about the situation at
the arrival or departure airport, such as atmospheric conditions, may not be available in a real
scenario, so global parameters with default values must be applied rather than local parameters.

Compared to the mathematical method, data-driven methods are generally favorable because

of their ability to learn from historical data. Among various machine learning data-driven

methods, NNs are more popular because of their universal approximation ability. The

prediction performance of BPNNs is highly dependent on the dataset and hyperparameter

initialization and adjustment. This involves human intervention by doing a lot of trial and error

experimental work which makes the training phase a challenging task to accomplish. In some

cases, the use of second-order learning algorithms, which are considered better than first-order

learning algorithms, is selected for training. The second-order learning algorithms are
constrained by memory requirements. This demands more computational resources when
high-dimensional data are available.

Existing work on the application of BPNN to estimate fuel consumption for aircraft is limited

in scope. The work covered a small number of aircraft with limited flight data and operational

parameters with a future recommendation of incorporating high-level operational parameters.

The main reason for this is weak generalization performance and slow convergence of BPNN

with the need for human intervention to decide and adjust hyperparameters.

The limitation needs to be addressed more innovatively by proposing an algorithm that has

a cascade architecture and is able to calculate connection weights analytically. The cascade
architecture of the CNN generates hidden units in hidden layers until error convergence.
This makes it work like deep learning with less human intervention needed. The objective of

reaching a destination with less fuel deviation is an important research topic to ensure

profitability and safety. A model is formulated, to overcome existing limitations and

complexity, with the objective to minimize fuel deviation by estimating fuel for each flight that


may truly approximate actual fuel consumed. The model incorporates extracted operational

parameters to estimate the fuel and calculates the difference from the actual fuel consumed to

measure the fuel deviation.


Chapter 4. A Novel Self-Organizing Constructive Neural

Network Algorithm

4.1 Introduction

NNs following the universal approximation theorem may map any complex nonlinear function

more accurately compared to other statistical parametric methods (Ferrari and Stengel, 2005;

Hornik et al., 1989; Kumar et al., 1995). The rapidly growing interest in NNs (Au et al., 2008;

Lin and Vlachos, 2018; Ruiz-Aguilar et al., 2014), and particularly in CNNs (Chung et al.,

2017), motivates the study of their application to the airline sector. CNNs are considered to be

more powerful compared to standard fixed NNs (Hunter et al., 2012; Wilamowski et al., 2008).

The major drawbacks of BPNNs limiting their applicability are their weak generalization

performance and learning speed. The generalization performance is weak in the sense that it

may stop at local minima of the function if global minima are far away. The learning speed

(also known as convergence rate) is slow and dependent on the learning rate and BP learning

algorithms. When the learning rate is small, it converges slowly, and when the learning rate is

large, it becomes unstable. The BP learning can be considered time-consuming because of the

repetitive tuning of the connection weights in forward and backward steps with other

hyperparameters (Huang, Zhu, et al., 2006). This may create a complex co-adaptation among

the parameters and hidden units that demands further attention in regularization (Srivastava et

al., 2014). Moreover, with respect to the drawbacks mentioned above, the human expertise of

defining the structure of the network based on trial and error significantly affects the

generalization performance and learning speed of BPNNs.

The drawbacks and expertise issues need to be solved in a more innovative way to improve

generalization performance and learning speed. To overcome the above limitations, we propose


a self-organizing cascade topology-based CNN capable of analytically calculating connection

weights and requiring less expertise by eliminating the need for complex hyperparameter

initialization and adjustment. The objective is to improve the performance of the estimation

model with less expertise involvement and obtain results with fast learning speed. Inspired

from the gradient-free learning algorithms (Chapter 2, Subsection 2.2.3.2) having the property

of calculating analytically output connection weights and constructive topology FNN (Chapter

2, Subsection 2.2.4.3) have the property of adding hidden units in the hidden layer, this work

proposes new algorithm. The novel algorithm may also assist in overcoming the exiting FNNs

research gaps identified in Chapter 2.

The rest of the Chapter is organized as follows. Section 4.2 discusses CasCor and its extension

with convergence limitations. Section 4.3 briefly explains existing orthogonal linear

transformation and ordinary least squares methodologies. In Section 4.4, we propose a novel

CNN. Section 4.5 is about experimental work to demonstrate the performance of the proposed

CNN on artificial and real-world problems. Section 4.6 summarizes the chapter.

4.2 Convergence limitations of CasCor and its extensions

To address the learning issues of BPNNs, the so-called cascade correlation learning algorithm

(CasCor) was proposed to add hidden units sequentially to the network (Fahlman and Lebiere,

1990). CasCor begins with the minimal network by linearly connecting 𝑿 to 𝒀 through

randomly generated 𝑾𝒐𝒄𝒘. The values of 𝑾𝒐𝒄𝒘 are iteratively tuned (or trained) by the QP

learning algorithm to minimize 𝐸. When training converges and 𝐸 is greater than 𝜀, it adds

𝑯𝒌 (𝑘 = 1,2, … . , 𝑙) one at a time to the network, which receives randomly generated 𝑾𝒊𝒄𝒘

from all 𝑿 and any pre-existing 𝑯𝒌−𝟏. The 𝑯𝒌 is not yet connected to 𝒀. The values of 𝑾𝒊𝒄𝒘

are iteratively tuned to maximize the covariance objective function between 𝑯𝒌 and 𝐸. When


the covariance objective function converges, 𝑯𝒌 is added to 𝒀 by freezing 𝑾𝒊𝒄𝒘, and 𝑾𝒐𝒄𝒘 is

once again iteratively tuned by QP. This process continues and finally stops when 𝐸 converges

or starts increasing. The QP quickly reaches the loss function by taking a much larger step

rather than infinitesimal steps (Fahlman, 1988). The advantage of CasCor over fixed-topology

BPNNs is that it can learn complex tasks more quickly as well as determine its own network

topology. In addition, it is economical and has no underfitting issue. The network of CasCor is

illustrated in Figure 4.1.

Studies of the growing applications of CNN-based algorithms have concluded that the learning speed of CasCor is better than that of BPNNs, but that its generalization performance might not be optimal. Hwang et al. (1996) and Lehtokangas (2000) highlighted that CasCor works better on classification tasks than on regression tasks, which may make it unsuitable for some applications. Liang

and Dai (1998) made an effort to improve the generalization performance of CasCor on

classification problems by using a genetic algorithm, at the additional cost of requiring more

training time. Kovalishyn et al. (1998) studied the application of CasCor on the quantitative

structural activity relationship and did not achieve any dramatic increase in performance

compared to BPNNs. The reasons may be (i) the chaotic behavior and numerical instability of

the iterative QP learning algorithm and (ii) no guarantee of maximum error reduction by the

covariance objective function when a new hidden unit is added.

Firstly, QP during iterative tuning takes a much larger step to move towards the loss function

based on information of past and current gradient. If the current gradient is in the opposite

direction from the past gradient, the QP may cross the minimum of the loss function and move

towards the opposite direction of the valley, from where it needs to come back (Fahlman, 1988).

This may cause the QP to act chaotically across the minimum valley of the loss function.

Banerjee et al. (2011) explained that QP becomes numerically unstable if the current iteration gradient becomes close or equal to the previous iteration gradient. In such a case, the weight difference becomes zero and the QP update remains zero even if the gradient changes.

Secondly, Huang et al. (2012) argued that the CasCor covariance objective function may not

guarantee a maximum error reduction when a new hidden unit is added. In addition, the

repeatedly iterating tuning of 𝑾𝒐𝒄𝒘 can be more time-consuming. They proposed OLSCN by

reformulating the objective function based on ordinary least squares for the training of 𝑾𝒊𝒄𝒘,

which needs to be further optimized by the second-order Newton’s method. Qiao et al. (2016)

explained that updating weights by Newton’s method may cause OLSCN to converge at a local

minimum with an increase in computational burden. They proposed FCNN as a further contribution to CNNs. FCNN works by selecting linearly independent instances of 𝑿 with the Gram–Schmidt orthogonalization method and by selecting a hidden unit from a pool of candidate units with a modified index (MI) method.

Figure 4.1. CasCor learning algorithm network

Theorem 3.1 of FCNN (Qiao et al., 2016) states that one or more candidate units may have no multicollinearity with the existing input or hidden units. The column matrix of hidden units is not necessarily of full rank, owing to the random initialization of the input connection weights. Therefore, the hidden unit with the greatest error reduction capability among the existing pool is selected from the candidate pool by the MI. This ensures that the selected hidden unit has the greatest error reduction capability within the pool; however, the best error reduction for the network cannot be guaranteed. Firstly, if some of the candidate units in the existing pool become linearly dependent, the selection of the best hidden unit by the MI also degrades. Secondly, a hidden unit generated in the existing hidden layer does not guarantee that the next generated hidden unit will contribute further error reduction. These drawbacks may result in the generation of redundant hidden units in the network, causing suboptimal performance. The experimental work in Qiao et al. (2016, Figure 9) shows the same problem of not achieving smooth convergence upon the addition of hidden units; the objective of achieving smooth and better convergence at each subsequent hidden layer is not met. To contribute and to overcome the above-discussed limitations of CasCor and its extensions, we propose to generate linearly independent hidden units and to calculate all connection weights analytically rather than by random generation or gradient-based iterative tuning.

4.3 Orthogonal linear transformation and ordinary least

squares

This section explains two existing methodologies that will assist in calculating analytically the

connection weights for the proposed algorithm. The generated hidden units must be linearly

independent of each other to ensure better convergence.

The orthogonal linear transformation (OLT) method linearly transforms one variable into another, orthogonal, variable (Jolliffe, 2002). This is achieved by the eigendecomposition of a symmetric matrix, calculating the eigenvalues and the corresponding eigenvectors. Suppose 𝑿 is an m×n matrix, 𝒀 is m×q, 𝑯 is m×r, 𝑾𝒊𝒄𝒘 is n×r, and 𝑾𝒐𝒄𝒘 is r×q. Initially, OLT determines the symmetric covariance matrix 𝑪, whose off-diagonal entries represent the covariances among the n features of 𝑿. The eigendecomposition of 𝑪 yields eigenvalues λ; the largest eigenvalues, which explain the maximum variability in the dataset, are selected and their corresponding eigenvectors are calculated. The resulting 𝑾𝒊𝒄𝒘 linearly transforms 𝑿 into a new variable, the hidden units 𝑯. Because 𝑯 is generated from 𝑿, its dimension is always less than or equal to that of 𝑿, i.e., 𝑟 ≤ 𝑛.

Ordinary least squares (OLS) minimizes the error between the actual and predicted variables. The unknown parameter 𝑾𝒐𝒄𝒘 is determined by taking the Moore–Penrose pseudo-inverse of the matrix 𝑯 and multiplying it by 𝒀 (Goldberger, 1964). Finally, Ŷ is estimated as the product of 𝑯 and 𝑾𝒐𝒄𝒘, and the calculated Ŷ is compared with 𝒀 to obtain the error.

Calculating the connection weights 𝑾𝒊𝒄𝒘 and 𝑾𝒐𝒄𝒘 analytically facilitates generating linearly independent hidden units and connecting them to the output unit, ensuring smooth and better convergence. The concepts of OLT and OLS therefore support the analytical calculation of connection weights for the proposed algorithm, and a further improvement of the cascade architecture of the existing CasCor and its extensions may result in better convergence.

4.4 Cascade principal component least squares neural network

Motivated by the cascade architecture of CasCor, we propose a new CNN by improving the

original CasCor and its extensions. The key idea is to analytically determine connection

weights on both sides of the network, rather than using iterative tuning or random generation, within a modified cascade architecture. We propose to determine 𝑾𝒊𝒄𝒘 analytically by the OLT of 𝑿 and 𝑾𝒐𝒄𝒘 by the OLS method. The proposed CNN algorithm is named the cascade principal component least squares (CPCLS) neural network.


In the following subsections, a lemma and a supporting statement with remarks are discussed. The remarks drawn from the lemma and the statement facilitate the design of an efficient algorithm that calculates connection weights analytically to obtain the best least-squares solution from linearly independent hidden units.

4.4.1 Lemma and statement

Lemma 1: Given a standard SLFN with 𝑁 hidden nodes and an activation function 𝑔: ℝ → ℝ that is infinitely differentiable in any interval, for 𝑁 arbitrary distinct samples (𝑥ᵢ, 𝑦ᵢ), where 𝑥ᵢ ∈ ℝⁿ and 𝑦ᵢ ∈ ℝᵐ, and for any input connection weights 𝑤ᵢ and biases 𝑏ᵢ randomly chosen from any intervals of ℝⁿ and ℝ, respectively, according to any continuous probability distribution, then with probability one the hidden layer output matrix 𝑯 of the SLFN is invertible and ||𝑯𝑾𝒐𝒄𝒘 − 𝒀|| = 𝟎 (Huang, Zhu, et al., 2006).

Remark 1: Lemma 1 suggests that the hidden units generated in a layer must be linearly independent of each other (so that 𝑯 is invertible) to obtain better convergence.

Statement 1: The orthogonal linear transformation of variables into linearly independent variables can be achieved by calculating the eigenvalues and eigenvectors from the eigendecomposition of the symmetric matrix (Jolliffe, 2002).

Remark 2: Statement 1 confirms that the newly generated variables will always be linearly independent of each other.

4.4.2 Analytically determining hidden units input connection

weights


The architecture of CPCLS is illustrated in Figure 4.2. Given 𝑿 and 𝒀, CPCLS is initialized by defining 𝜀 and the number 𝑁 of hidden units of 𝑯 such that 𝑟 ≤ 𝑛. To determine the input connection weights 𝑾𝒊𝒄𝒘, it transforms the set of correlated 𝑿 n-features orthogonally and linearly into uncorrelated 𝑯 r-features by eigendecomposition of the square n×n covariance matrix 𝑪 of 𝑿:

𝑪 = (1/(𝑚 − 1)) (𝑿 − 𝑿̄)ᵀ(𝑿 − 𝑿̄) (4.1)

𝑿̄ = (1/𝑚) ∑ᵢ₌₁ᵐ 𝑥ᵢ (4.2)

Each of the largest eigenvalues λ of 𝑪 corresponds to an eigenvector; the 𝑁 selected eigenvectors are taken as the input connection weights, i.e., 𝑾𝒊𝒄𝒘:

|𝑪 − λ𝑰| = 0 (4.3)

(𝑪 − λ𝑰)𝑾𝒊𝒄𝒘 = 0 (4.4)

We calculate 𝑯 by applying the nonlinear activation ∅ to the product of 𝑿 and 𝑾𝒊𝒄𝒘, adding a bias 𝑏:

𝑯 = ∅(𝑿𝑾𝒊𝒄𝒘 + 𝑏) (4.5)

Figure 4.2. CPCLS learning algorithm network

The orthogonal linear transformation by eigendecomposition of 𝑪 generates an 𝑯 that explains the maximum variance in the data, which in turn can help to estimate trip fuel more accurately.
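To make the computation in (4.1)-(4.5) concrete, a minimal NumPy sketch is given below; the function name, the number of hidden units and the bias value are illustrative assumptions rather than part of the implementation used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_weights_and_hidden_units(X, n_hidden, bias=1.0):
    """Sketch of Eqs. (4.1)-(4.5): the eigendecomposition of the covariance of X
    gives the input connection weights, and the hidden units are the nonlinearly
    activated projections of X onto the selected eigenvectors."""
    X_mean = X.mean(axis=0)                                  # Eq. (4.2)
    C = (X - X_mean).T @ (X - X_mean) / (X.shape[0] - 1)     # Eq. (4.1)
    eigvals, eigvecs = np.linalg.eigh(C)                     # symmetric eigendecomposition, Eqs. (4.3)-(4.4)
    order = np.argsort(eigvals)[::-1]                        # largest eigenvalues first
    W_icw = eigvecs[:, order[:n_hidden]]                     # input connection weights
    H = sigmoid(X @ W_icw + bias)                            # hidden units, Eq. (4.5)
    return W_icw, H
```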

4.4.3 Analytically determining hidden units output connection

weights

The maximum-variance 𝑯 has a more linear relationship with 𝒀, and the sum of squared errors can be minimized through OLS by calculating the output connection weights 𝑾𝒐𝒄𝒘:

𝑾𝒐𝒄𝒘 = (𝑯ᵀ𝑯)⁻¹𝑯ᵀ𝒀 (4.6)

where (𝑯ᵀ𝑯)⁻¹𝑯ᵀ is the Moore–Penrose pseudo-inverse of 𝑯. We estimate the trip fuel Ŷ by linearly transforming 𝑯 from (4.5) through 𝑾𝒐𝒄𝒘 from (4.6):

Ŷ = 𝑯𝑾𝒐𝒄𝒘 (4.7)

We calculate the fuel deviation in (3.32) or (3.33) by subtracting the actual consumed fuel 𝒀 from the estimated trip fuel Ŷ.
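A corresponding minimal sketch of (4.6) and (4.7), assuming NumPy arrays H and Y produced as in the step above:

```python
import numpy as np

def output_weights_and_prediction(H, Y):
    """Sketch of Eqs. (4.6)-(4.7): output connection weights by ordinary least
    squares using the Moore-Penrose pseudo-inverse of the hidden-unit matrix."""
    W_ocw = np.linalg.pinv(H) @ Y   # equals (H^T H)^{-1} H^T Y when H has full column rank
    Y_hat = H @ W_ocw               # estimated trip fuel, Eq. (4.7)
    return W_ocw, Y_hat
```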

If (3.32) or (3.33) is less than 𝜀, we stop; otherwise, another hidden layer 𝑯𝒌 with 𝑁′ hidden units is added with respect to the previously added hidden layer 𝑯𝒌−𝟏 having 𝑁 hidden units, until the desired performance is achieved. The newly added hidden layer 𝑯𝒌 receives all the connections from 𝑿 and from any pre-existing hidden layers, i.e., 𝑯𝒌−𝟏. In other words, the previously calculated 𝑯𝒌−𝟏 becomes part of 𝑿, such that

𝑿 = (𝑿, 𝑯𝒌−𝟏) (4.8)

𝑁 = 𝑁 + 𝑁′ (4.9)

The steps from Equations (4.1) to (4.7) are then repeated.

4.4.4 Adding hidden layers

Equation (4.8) implies that only the newly added 𝑯𝒌 is connected to 𝒀, and the previous connections, i.e., 𝑯𝒌−𝟏, 𝑯𝒌−𝟐, etc., are eliminated. This helps to avoid linear dependencies among hidden units across the multiple hidden layers. For better illustration, consider the example in Figure 4.2, where the hidden layers are initialized with 𝑁=2 and 𝑁′=2. This means that the first hidden layer (hidden layer 1) is initialized with 𝑁=2 hidden units and the algorithm is then trained. If the error resulting from training is larger than 𝜀, then another hidden layer (hidden layer 2) is added with 𝑁′=2 hidden units with respect to the hidden units of the previous layer (𝑁 = 𝑁 + 𝑁′ = 2+2 = 4 hidden units in hidden layer 2). Again, the algorithm is trained, and the error is calculated by connecting only the newly added hidden layer (hidden layer 2) to the output unit and by eliminating the previous hidden layer (hidden layer 1) from the output connection weights. These steps continue until the target value of 𝜀 is achieved.

4.4.5 Hyperparameters

The advantage of CPCLS that makes it superior to CasCor and its extensions is the generation of multiple non-redundant, linearly independent hidden units in the layers. The linearly independent hidden units generated by an OLT of the operational parameters help to achieve the best least-squares solution. Similarly, to avoid linear dependency among hidden layers, the cascade architecture is improved by connecting only the newly added hidden layer to the output, with all previous connections being eliminated. Besides, the input units are connected to the output unit only through hidden units to avoid linear dependency of the input units.

Multiple hidden units can be generated in hidden layers to converge faster. These

characteristics help CPCLS to converge faster and ensure maximum error reduction when the

new hidden layer is added. However, the problem of overfitting and slow convergence should

be avoided while generating multiple hidden units. To further elaborate the overfitting and

convergence complexity issues of the algorithm, experimental work in Subsection 4.5.3 is

performed to study the effect of hidden unit generation on network performance. The

achievement of estimation results in the shortest possible computational time along with

minimum hyperparameter initialization, such as 𝑯 and 𝜀, makes CPCLS more favorable. Other

properties include no iterative tuning, no random generation, no trial-and-error to find the best

hyperparameters, and less human intervention.

4.4.6 Convergence theoretical justification

CPCLS proposes to connect only the newly added hidden layer to the output unit and to eliminate the previous connections, to avoid linear dependence among hidden layers. In this subsection, we theoretically justify that connecting only the newly added layer contributes to error convergence. If all hidden layers are connected to the output unit, an extra burden is created on the network; moreover, any linear dependence among hidden units across hidden layers lowers the performance of the network.

Suppose two eigenvectors 𝑤₁ and 𝑤₂ (columns of 𝑾𝒊𝒄𝒘) are generated from the eigendecomposition of 𝑪. The two vectors are orthogonal if and only if their inner product is zero, i.e., 𝑤₁ · 𝑤₂ = 0 or 𝑤₁ᵀ𝑤₂ = 0, where the superscript ᵀ denotes the transpose of a matrix. We have:

𝑪𝑤₁ = 𝜆₁𝑤₁ (4.10)

and

𝑪𝑤₂ = 𝜆₂𝑤₂ (4.11)

To prove that 𝑤₁ and 𝑤₂ are orthogonal:

𝜆₁(𝑤₁ · 𝑤₂) = (𝜆₁𝑤₁) · 𝑤₂
= (𝑪𝑤₁) · 𝑤₂
= (𝑪𝑤₁)ᵀ𝑤₂
= 𝑤₁ᵀ𝑪ᵀ𝑤₂
= 𝑤₁ᵀ𝑪𝑤₂
= 𝑤₁ᵀ𝜆₂𝑤₂
= 𝜆₂(𝑤₁ᵀ𝑤₂)
= 𝜆₂(𝑤₁ · 𝑤₂) (4.12)

where 𝑪 = 𝑪ᵀ because 𝑪 is a symmetric matrix. From the derivation above, we have:

𝜆₁(𝑤₁ · 𝑤₂) = 𝜆₂(𝑤₁ · 𝑤₂) (4.13)

(𝜆₁ − 𝜆₂)(𝑤₁ · 𝑤₂) = 0 (4.14)

Since the two eigenvalues are distinct, 𝜆₁ − 𝜆₂ ≠ 0, and so:

𝑤₁ · 𝑤₂ = 0 (4.15)

Equation (4.15) implies that 𝑤₁ and 𝑤₂ are orthogonal and linearly independent in the eigenmatrix 𝑾𝒊𝒄𝒘. If two hidden units ℎₖ₁ and ℎₖ₂ are generated from 𝑤₁ and 𝑤₂ in the first hidden layer, they will also be orthogonal, i.e., ℎₖ₁ ⊥ ℎₖ₂. This theoretical justification supports Lemma 1: the hidden units generated in a layer are linearly independent (𝑯 is invertible), and hence ||𝑯𝑾𝒐𝒄𝒘 − 𝒀|| = 𝟎, ensuring convergence of the network.

Now suppose more than one hidden layer is added to the network. If ℎₖ₋₁,₁ and ℎₖ₋₁,₂ are generated in 𝑯𝒌−𝟏 by its eigenvectors, and ℎₖ,₁ and ℎₖ,₂ are generated in 𝑯𝒌 by its eigenvectors, then there is an equal chance that the hidden units of the two layers may or may not be linearly independent, i.e., 𝑯𝒌−𝟏 ⊥ 𝑯𝒌 or 𝑯𝒌−𝟏 ⊥̸ 𝑯𝒌. In the case 𝑯𝒌−𝟏 ⊥̸ 𝑯𝒌, the layers are not orthogonal, Lemma 1 is not supported, and hence ||𝑯𝑾𝒐𝒄𝒘 − 𝒀|| ≠ 𝟎.
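The orthogonality argument above can be checked numerically with a few lines of NumPy; the random symmetric matrix below is only an illustration and not data from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = (A + A.T) / 2.0                       # a symmetric matrix playing the role of the covariance C
eigvals, eigvecs = np.linalg.eigh(C)
w1, w2 = eigvecs[:, -1], eigvecs[:, -2]   # eigenvectors of the two largest (distinct) eigenvalues
print(np.dot(w1, w2))                     # ~0 up to floating-point error, matching Eq. (4.15)
```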

Algorithm 4.1. CPCLS algorithm summary

In a training dataset (𝑿, 𝒀), let 𝑿 be the m×n input variable matrix and 𝒀 the m×q output variable matrix.

1. Initialize the network by defining the number 𝑁 of hidden units in the first hidden layer such that the number of hidden units is less than or equal to the number of input variables, i.e., r ≤ n.

2. Learn the network by defining the convergence criterion 𝜀 and comparing 𝜀 with the network estimation error 𝐸:

While 𝐸 > 𝜀

a. Input connection weights 𝑾𝒊𝒄𝒘: Determine the eigenvectors 𝑾𝒊𝒄𝒘 from the orthogonal linear transformation of 𝑿. Calculate the symmetric matrix 𝑪 and perform its eigendecomposition. The extracted eigenvalues and their corresponding eigenvectors are selected as the n×r matrix 𝑾𝒊𝒄𝒘:

𝑪 = (1/(𝑚 − 1)) (𝑿 − 𝑿̄)ᵀ(𝑿 − 𝑿̄)

|𝑪 − λ𝑰| = 0

(𝑪 − λ𝑰)𝑾𝒊𝒄𝒘 = 0

b. Hidden units 𝑯: Generate a linearly independent m×r matrix 𝑯 by passing the net input of 𝑿 and 𝑾𝒊𝒄𝒘 through the activation function ∅:

𝑯 = ∅(𝑿𝑾𝒊𝒄𝒘)

c. Output connection weights 𝑾𝒐𝒄𝒘: Calculate the pseudo-inverse of 𝑯 and take its product with 𝒀 to determine the r×q matrix 𝑾𝒐𝒄𝒘:

𝑾𝒐𝒄𝒘 = (𝑯ᵀ𝑯)⁻¹𝑯ᵀ𝒀

d. Predicted output Ŷ: Take the product of the calculated 𝑯 and 𝑾𝒐𝒄𝒘 to estimate the output:

Ŷ = 𝑯𝑾𝒐𝒄𝒘

e. Network error 𝐸: Take the squared difference between Ŷ and 𝒀:

𝐸 = (Ŷ − 𝒀)²

f. Column-stack 𝑯 with 𝑿:

𝑿 = (𝑿, 𝑯)

g. Calculate 𝑁 for the next 𝑯 by defining 𝑁′ and adding it to the existing 𝑁, such that r ≤ n:

𝑁 = 𝑁 + 𝑁′

end
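A compact Python sketch of Algorithm 4.1 follows. It mirrors steps (a)-(g) with NumPy; the convergence threshold, the cap on the number of layers, and the omission of a separate prediction routine (which would replay the same cascade of transformations on new data) are simplifying assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cpcls_train(X, Y, n_hidden, n_added, eps=1e-3, max_layers=50):
    """Sketch of Algorithm 4.1 (CPCLS). X: m x n inputs, Y: m x q targets, both
    assumed normalized; n_hidden: hidden units in the first layer; n_added:
    hidden units added for each subsequent hidden layer."""
    layers, N, E = [], n_hidden, np.inf
    while E > eps and len(layers) < max_layers:
        # (a) input connection weights from the orthogonal linear transformation of X
        Xc = X - X.mean(axis=0)
        C = Xc.T @ Xc / (X.shape[0] - 1)
        eigvals, eigvecs = np.linalg.eigh(C)
        W_icw = eigvecs[:, np.argsort(eigvals)[::-1][:min(N, X.shape[1])]]
        # (b) linearly independent hidden units
        H = sigmoid(X @ W_icw)
        # (c) output connection weights by ordinary least squares
        W_ocw = np.linalg.pinv(H) @ Y
        # (d) predicted output and (e) network error
        Y_hat = H @ W_ocw
        E = np.mean((Y_hat - Y) ** 2)
        layers.append((W_icw, W_ocw))
        # (f) column-stack the hidden units with the inputs, and (g) grow N
        X = np.hstack([X, H])
        N = N + n_added
    return layers, E
```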

4.5 Experimental work

The generalization performance and learning speed of the proposed algorithm CPCLS are

compared with the traditional algorithms on artificial benchmarking and real-world dataset

problems. The artificial benchmarking problems include the two-spiral classification task and

the function approximation of the SinC regression task. For comparison of real-world

problems, the classification and regression dataset problems are taken from the UCI machine

learning repository. The numerical work was carried out in Netmaker v0.9.5.2 and Anaconda Spyder Python v3.2.6. Netmaker is a simulation package written in C for building neural networks and has a built-in QP CasCor implementation. The CPCLS, BPNN, and SaELM simulations were carried out in Python, whereas the CasCor simulation was carried out in Netmaker. A limitation of this work is that the CasCor code is written in C; generally speaking, simulations run in different programming languages are not directly comparable in learning time, because C code typically executes faster than Python. The dataset inputs and outputs have been normalized to the range [0,1] to make them compatible with the sigmoid nonlinear activation function ∅(𝑧) = 1/(1 + 𝑒⁻ᶻ) used in the hidden units.
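As a small illustration of this preprocessing, a minimal sketch of the [0,1] scaling and the sigmoid activation, assuming NumPy arrays, is:

```python
import numpy as np

def minmax_01(A):
    """Scale each column of A to the range [0, 1]."""
    a_min, a_max = A.min(axis=0), A.max(axis=0)
    return (A - a_min) / (a_max - a_min)

def sigmoid(z):
    """Sigmoid nonlinear activation used in the hidden units."""
    return 1.0 / (1.0 + np.exp(-z))
```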

The work is divided into two parts: artificial benchmarking problems and real-world problems.

In artificial benchmarking problems, CPCLS is compared with CasCor on two spiral

classifications and the SinC function regression task. The comparison on the two-spiral classification problem is important because that task is used in the original CasCor paper. In regression, the complex nonlinear SinC function is selected to gain more insight into the comparison of CPCLS and CasCor. The SinC function is also used to study the effect of connecting the hidden layers to the output layer in CPCLS. The results on the artificial benchmarking problems are further validated by a comparison on real-world problems with traditional algorithms. A comparative study of CPCLS, CasCor, BPNN, SaELM, OLSCN, and FCNN has been performed to demonstrate the effectiveness of the proposed algorithm. The experimental work is further enriched by studying the effect of varying the hidden unit sizes in the hidden layers of CPCLS.

4.5.1 Artificial benchmarking problems

4.5.1.1 Two-Spiral classification task

The generalization performance and learning speed of CPCLS and CasCor are evaluated on the

extreme nonlinear two-spiral classification task (Huang et al., 2012; Hunter et al., 2012; Qiao

et al., 2016; Wilamowski and Yu, 2010). The task consists of 194 training points twisted in

two spirals with three turns for each spiral. It consists of two input units and one output unit with two classes, 0 and 1. The algorithms must classify all data points into the 0 and 1 classes, shown in Figure 4.3 as black and white dots respectively.

Figure 4.3. Two spiral classification task

The generalization performance

of the algorithms is evaluated on the newly generated dense spiral of 770 testing data points,

other than training points (Lang, 1989). All the dataset inputs and outputs are normalized in

the range [0,1]. The training dataset is generated from equations below, as proposed by Lang

and Witbrock in their work on learning to tell two spirals apart (Lang, 1989):

For the first spiral, 𝑖 = 0, 1, 2, 3, …, 96:

𝜃 = 𝑖𝜋/16 (4.16)

𝑟 = 6.5(104 − 𝑖)/104 (4.17)

𝑥 = 𝑟 cos 𝜃 (4.18)

𝑦 = 𝑟 sin 𝜃 (4.19)

where 𝜃 is the angle and 𝑟 is the radius of the spiral, with 𝑥 and 𝑦 as the coordinates of the spiral. The data points of the second spiral are obtained by taking the negatives of (4.18) and (4.19) for 𝑥 and 𝑦, respectively.
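A sketch of this construction in Python (the function and argument names are illustrative):

```python
import numpy as np

def two_spirals(n_points=97, max_radius=6.5):
    """Generate the two-spiral task of Eqs. (4.16)-(4.19): 2 x 97 = 194 points."""
    i = np.arange(n_points)
    theta = i * np.pi / 16.0                         # Eq. (4.16)
    r = max_radius * (104 - i) / 104.0               # Eq. (4.17)
    x, y = r * np.cos(theta), r * np.sin(theta)      # Eqs. (4.18)-(4.19)
    X = np.vstack([np.column_stack([x, y]),          # first spiral
                   np.column_stack([-x, -y])])       # second spiral: negated coordinates
    labels = np.hstack([np.zeros(n_points), np.ones(n_points)])
    return X, labels
```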

Table 4.1 shows the generalization performance and learning speed simulation results of both algorithms, averaged over 25 trials. The number of candidate units in CasCor is set to 8, as proposed in the original work. As shown in Table 4.1, CPCLS spends 5.08 s to train a network with a generalization accuracy of 95.58%, whereas CasCor spends 135.84 s to train a network with a generalization accuracy of 91.00%. The results indicate that CPCLS has better generalization performance and faster learning speed than CasCor.

Table 4.1. Generalization accuracy and learning speed of two spiral classification task (Mean (Stdev), 25 trials)

Task: Two Spiral
  Testing accuracy %:   CPCLS 95.58 (0.52);  CasCor 91.00 (2.80)
  Training time (s):    CPCLS 5.08 (0.58);   CasCor 135.84 (46.38)
  Hidden layers (nos.): CPCLS 70.52 (2.40);  CasCor 20.52 (2.58)

4.5.1.2 SinC function approximation regression task

In regression, the nonlinear SinC function task (Ebadzadeh and Salimi-Badr, 2018; Huang,

Chen, et al., 2006; Huang, Zhu, et al., 2006) is approximated by CPCLS and CasCor to check

generalization performance and learning speed. To make the task more complex and nonlinear,

the data points of 4,000 observations are generated from the SinC function below in the range

(-20,20):

𝑦(𝑥) = sin(𝑥)/𝑥 for 𝑥 ≠ 0; 𝑦(𝑥) = 1 for 𝑥 = 0 (4.20)

Figure 4.4. SinC function regression task

Table 4.2. Generalization performance and learning speed of SinC regression task (Mean (Stdev), 25 trials)

Task: SinC
  Testing RMSE:         CPCLS 0.00081 (0.00014);  CasCor 0.01050 (0.00063)
  Training time (s):    CPCLS 10.07 (2.69);       CasCor 778.45 (151.19)
  Hidden layers (nos.): CPCLS 52.76 (3.91);       CasCor 25.72 (2.41)
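A minimal sketch of generating the SinC task data of Eq. (4.20) in Python (the random seed is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-20.0, 20.0, size=4000)   # 4,000 observations in (-20, 20)
y = np.ones_like(x)                       # y(0) = 1 by definition
nonzero = x != 0
y[nonzero] = np.sin(x[nonzero]) / x[nonzero]
```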

Table 4.2 shows the generalization performance and learning speed simulation results of

CPCLS and CasCor averaged over 25 trials. The dataset is randomly split into training and

testing datasets at a 50:50 ratio during each trial. All the dataset inputs and outputs are

normalized in the range [0,1]. For the algorithm with the longer learning time, the stopping criterion was set to an RMSE of 0.01 to avoid further increasing the network learning time. The number of candidate units in CasCor is set to 5. As observed from Table 4.2, CPCLS spends

10.07s to train a network with a generalization performance of 0.00081 RMSE, whereas

CasCor spends 778.45s to train a network with generalization performance of 0.01050 RMSE.

During the simulation, it was observed that CasCor achieved an average RMSE of 0.02425 in 318 s with 17 hidden units and, because of the nonlinear structure of SinC, took twice as long to properly estimate the right-side lobe of the SinC function. CPCLS has clearly approximated all the data points of SinC, as illustrated in Figure 4.4. In the figure, the abbreviation LHL means connecting only the newly added last hidden layer to the output layer, and AHL means connecting all hidden layers to the output layer in CPCLS. The comparison between LHL and AHL is further explained in Subsection 4.5.4. As discussed earlier, the original CPCLS connects only the LHL to the output; however, the comparison between LHL and AHL is also needed to support the theoretical justification given in Subsection 4.4.6. The figure

also illustrates the generalization ability of CasCor.

4.5.2 Real-world problems

The simulation results on the artificial benchmarking problems demonstrate that the generalization performance and learning speed of CPCLS are better than those of CasCor. To further validate the performance, a comparative study of CPCLS, CasCor, BPNN, SaELM, OLSCN, and FCNN was performed on real-world regression and classification problems extracted from

the UCI machine learning repository (Dua and Taniskidou, 2017). Wang et al. (2016) proposed

SaELM to solve the problem of ELM hidden unit selection. Wang et al. (2016) argue that ELM

is sensitive to the selection of an optimal number of hidden units in the layer and improper

hidden units can lead to suboptimal accuracy. SaELM can get better accuracy and fast learning

speed by finding the best possible number of hidden units for the network. SaELM initializes

by defining the minimum and maximum possible hidden units with its interval, width factor 𝑄

and scale factor 𝐿. The advantage of the self-adaptive mechanism is that it helps SaELM search for the best possible number of hidden units with minimum error.

The dataset that is extracted from the most popular machine learning repository UCI is shown

and described in Table 4.3. The regression problems include the datasets of abalone (Huang,

Chen, et al., 2006; Qiao et al., 2016), airfoil self-noise (Cheng, 2017; Lu et al., 2018), combined

cycle power plant (Bagirov et al., 2018; Fierimonte et al., 2016), and forest fires (Al_Janabi et

al., 2018). The classification problems include datasets of breast cancer (Taherkhani et al.,

2018; Ying, 2016), occupancy detection (Candanedo and Feldheim, 2016), seismic bumps

(Verma et al., 2017), and wine (Huang et al., 2012; Wang et al., 2016). These eight datasets of

varied sizes and attributes are selected to evaluate the proposed algorithm on different data

structures.

Table 4.3. Dataset extracted from UCI

Regression tasks:
  Abalone: 4177 instances; 8 inputs, 1 output
  Airfoil self-noise: 1503 instances; 5 inputs, 1 output
  Combined cycle power plant: 9568 instances; 4 inputs, 1 output
  Forest Fires: 517 instances; 12 inputs, 1 output

Classification tasks:
  Breast Cancer: 699 instances; 10 inputs, 2 classes
  Occupancy Detection: 20560 instances; 6 inputs, 2 classes
  Seismic bumps: 2584 instances; 18 inputs, 2 classes
  Wine: 178 instances; 13 inputs, 3 classes

CPCLS hidden layers generation is based on defining a number of hidden units in the first layer

and the addition of hidden units in the proceeding hidden layers with respect to the previously

added hidden units in the hidden layer. The number of hidden units in hidden layers was set to

(2,2), (4,3), (2,2), (2,1), (2,2), (2,7), (1,1) and (5,5) for abalone, airfoil self-noise, combined

cycle power plant, forest fires, breast cancer, occupancy detection, seismic bumps, and wine

respectively. For instance (4,3) means that the first hidden layer will have 4 hidden units, the

second hidden layer will have 7 (4+3) hidden units, and the third hidden layer will have 10

(7+3) hidden units, and so on. A pool of 3 candidate units was assigned to CasCor. The hidden units for the fixed-topology BPNN were determined by generating hidden units in the range 5-25 using trial and error, and the network with the best convergence is reported. Values of 500, 5, and 10 were assigned to the maximum, minimum, and interval of hidden units for SaELM, respectively, with scale factor 𝐿=4 and width factor 𝑄=2. The stopping criterion for all algorithms was defined as the error converging or starting to increase.

Table 4.4. Algorithms comparison on real-world regression problems (Mean (Stdev) over 25 trials; columns: Abalone, Airfoil self-noise, Combined cycle power plant, Forest Fires)

Testing RMSE:
  CPCLS:  0.0750 (0.0006), 0.0835 (0.0014), 0.0545 (0.0006), 0.0562 (0.0097)
  CasCor: 0.0761 (0.0011), 0.0869 (0.0046), 0.0573 (0.0005), 0.0694 (0.0167)
  BPNN:   0.0776 (0.0010), 0.0882 (0.0041), 0.0577 (0.0006), 0.0574 (0.0097)
  SaELM:  0.0755 (0.0009), 0.0854 (0.0037), 0.0545 (0.0004), 0.0584 (0.0108)

Training time (s):
  CPCLS:  0.36 (0.27), 1.61 (0.61), 2.96 (1.48), 0.005 (0.01)
  CasCor: 34.74 (14.92), 119.63 (48.89), 29.69 (6.70), 1.50 (0.28)
  BPNN:   64.89 (13.14), 209.97 (41.70), 59.54 (14.09), 1.78 (0.42)
  SaELM:  2.86 (0.14), 2.90 (0.16), 7.19 (0.82), 0.56 (0.06)

Hidden layers (nos.):
  CPCLS:  15.60 (4.39), 32.08 (4.51), 25.32 (4.57), 1.00 (0.00)
  CasCor: 7.76 (1.59), 24.08 (4.21), 5.88 (0.33), 5.12 (0.44)

Hidden units (nos.):
  BPNN:   15, 25, 15, 5
  SaELM:  42.92 (13.43), 142.92 (38.68), 215.16 (65.79), 11.16 (5.35)

Tables 4.4 and 4.5 show the average best results of 25 trials obtained by CPCLS, CasCor,

BPNN, and SaELM. The testing RMSE/accuracy and training time represents the

generalization performance and learning speed of the algorithms, while the mean represents

the average and StDev the standard deviation of the results over the 25 trials. The regression problems were evaluated by RMSE and the classification problems by accuracy.

Table 4.5. Algorithms comparison on real-world classification problems (Mean (Stdev) over 25 trials; columns: Breast Cancer, Occupancy Detection, Seismic bumps, Wine)

Testing accuracy %:
  CPCLS:  97.95 (0.34), 99.05 (0.05), 93.83 (0.43), 98.78 (1.98)
  CasCor: 96.16 (1.09), 98.97 (0.07), 92.98 (0.52), 93.09 (5.40)
  BPNN:   96.98 (0.50), 98.98 (0.07), 93.83 (0.43), 95.44 (3.75)
  SaELM:  97.39 (0.53), 99.03 (0.05), 93.44 (0.50), 98.11 (1.55)

Training time (s):
  CPCLS:  0.03 (0.03), 3.95 (2.31), 0.01 (0.01), 0.08 (0.07)
  CasCor: 3.91 (1.14), 31.54 (3.36), 29.86 (15.31), 14.41 (5.10)
  BPNN:   14.67 (6.09), 75.33 (17.40), 1.06 (0.06), 30.81 (10.61)
  SaELM:  0.70 (0.05), 17.64 (1.85), 1.87 (0.06), 0.37 (0.06)

Hidden layers (nos.):
  CPCLS:  6.64 (3.55), 13.28 (3.16), 1.00 (0.00), 10.92 (2.94)
  CasCor: 6.00 (1.08), 5.00 (0.00), 8.12 (2.76), 14.88 (2.42)

Hidden units (nos.):
  BPNN:   10, 10, 5, 10
  SaELM:  20.36 (11.97), 388.84 (92.93), 10.92 (13.58), 129.08 (161.04)

As observed in Tables 4.4 and 4.5, the performance of the proposed CPCLS, in terms of both generalization and learning, is better than that of the other algorithms. The best results in terms of generalization and learning speed are highlighted in bold with underline in Tables 4.4 and 4.5. In Table 4.4, for the combined cycle power plant, the generalization performance of CPCLS and SaELM is similar, but the advantage of CPCLS is that it took 2.96 s compared with 7.19 s for SaELM. In Table 4.5, for seismic bumps classification, the generalization accuracy of CPCLS and BPNN is similar, but the advantage of CPCLS is that it took 0.01 s compared with 1.06 s for BPNN.

To demonstrate and understand the convergence of CPCLS, the abalone dataset convergence

of 25 trials can be visualized from Figure 4.5. The figure shows that the convergence of CPCLS

is stable and smooth during the addition of hidden units and minimum deviation within each

trial was observed.

Figure 4.5. Smooth and stable convergence of CPCLS

The performance comparison of CPCLS with CasCor and other popular machine learning algorithms is clear; however, a comparison of CPCLS with the CasCor extensions OLSCN and FCNN is also needed. The simulation results of OLSCN and FCNN are taken from their primary articles because the original programs are unavailable. To ensure a fair comparison, the CPCLS experiments were performed by adopting the test conditions highlighted in those articles. The hyperbolic tangent activation function ∅(𝑧) = (1 − 𝑒⁻ᶻ)/(1 + 𝑒⁻ᶻ) was used in the hidden units and the dataset input variables were normalized to the range [-1,1]. CPCLS

convergence can be made faster by selecting a maximum combination of hidden units for the

network. For the selected dataset, the maximum combination of hidden units with bias are:

abalone (9,9), housing (14,14), wine (14,14) and glass (10,10). The comparison of the algorithms, along with the dataset descriptions, is shown in Table 4.6. The CPCLS performance, averaged over 25 trials, shows better generalization and faster learning speed compared with OLSCN and FCNN. These findings demonstrate the effectiveness of the proposed CPCLS in both generalization performance and learning speed.

Table 4.6. CPCLS and CasCor extensions comparison

Regression tasks:
  Abalone (4177 instances; 8 inputs, 1 output): Testing RMSE: CPCLS 2.0572, FCNN 2.1584 [1], OLSCN 2.1800 [1,2]; Training time (s): CPCLS 0.0143, FCNN 0.0697 [1], OLSCN 0.2721 [1]
  Housing (506 instances; 13 inputs, 1 output): Testing RMSE: CPCLS 3.1877, FCNN 3.8774 [1], OLSCN 3.3800 [1,2]; Training time (s): CPCLS 0.0150, FCNN 0.0718 [1], OLSCN 0.2738 [1]

Classification tasks:
  Glass (214 instances; 9 inputs, 7 classes): Testing accuracy %: CPCLS 70.05, FCNN 67.52 [1], OLSCN -NA-; Training time (s): CPCLS 0.0138, FCNN 0.0517 [1], OLSCN -NA-
  Wine (178 instances; 13 inputs, 3 classes): Testing accuracy %: CPCLS 98.77, FCNN 97.54 [1], OLSCN 98.54 [2]; Training time (s): CPCLS 0.0117, FCNN 0.0428 [1], OLSCN -NA-

[1] Results reported in the original paper (Qiao et al., 2016). [2] Results reported in the original paper (Huang et al., 2012). -NA-: no information mentioned in the original papers.

4.5.3 Selection of hidden units in the hidden layers

The results of artificial benchmarking problems and real-world problems demonstrate that

proposed CPCLS has better generalization performance and learning speed compared to other

well-known machine learning neural networks. The number of hidden units in the hidden layer is the only hyperparameter that needs to be defined prior to the initialization of the network. Experimental work was performed to understand the effect of varying hidden unit

sizes in the first and proceeding hidden layers on the generalization performance and learning speed

of CPCLS. The example of a popular dataset known as the abalone dataset was considered for

comparison. In total, 9 input variables with bias exist in the abalone dataset. This means that,

for CPCLS hidden layers, the minimum combination can be (1,1) in the first hidden layer and

proceeding hidden layer, and the maximum combination can be (9,9) in the first hidden layer

and proceeding hidden layer. Inclusive of the minimum and maximum combinations, 81 combinations are possible in total, i.e., (1,1), (1,2), …, (1,9), (2,1), (2,2), …, (9,1), (9,2), …, (9,9). As discussed in Section 4.4, the selection of hidden units is based on the eigendecomposition of a symmetric matrix; the highest eigenvalues and corresponding eigenvectors are selected to generate hidden units. This comparison helps to understand which eigenvalues are more favorable for generating hidden units in the first and proceeding layers.

The generalization performance, learning speed and hidden layers generation are shown in

Figures 4.6, 4.7 and 4.8, respectively, for all combinations. In the figures, the number of hidden units in the first layer is represented on the horizontal axis and the number of hidden units added in the proceeding layers is given in the right legend.

Figure 4.6. Hidden units generation effect on generalization performance of CPCLS

The generalization performance in Figure 4.6 shows that results are uniform for many combinations. The minimum and maximum combinations of (1,1) and (9,9) attained RMSEs of 0.0765 and 0.0755, respectively. These results are also consistent with the minimum and maximum RMSEs of 0.0748 and 0.0774 attained by the combinations (5,2) and (4,2), respectively. The eigenvalue study shows that the four combinations (1,1), (4,2), (5,2) and (9,9) explain cumulative percentage variances of (71.26%, 72.92%), (99.19%, 96.94%), (99.65%, 96.94%) and (100%, 100%), respectively. This gives an understanding that eigenvalues with a cumulative percentage greater than 70% are effective for generating hidden units in the layers to achieve better generalization performance.

Figure 4.7. Hidden units generation effect on learning of CPCLS

Figure 4.8. Hidden units generation effect on network of CPCLS


The generalization performance results are stable; however, regarding learning speed, as shown in Figure 4.7, the (1,1) combination took 2.03 s and the (9,9) combination took 0.03 s to train a network. This difference in learning speed arises because the (1,1) combination needed 45 hidden layers to converge, whereas the (9,9) combination needed only 4 hidden layers. Figure 4.8 shows the number of hidden layers needed for each combination.

The experimental work yields an important finding: better generalization performance can be achieved by selecting eigenvalues whose cumulative explained variance exceeds 70%. However, the maximum variance should be limited to about 99%, because the last eigenvalues have limited effect owing to their approximately zero variability, which can cause overfitting. Therefore, it is important to select the best combination of hidden units in the first and proceeding layers to avoid overfitting and slow convergence.
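A small sketch of how this 70-99% cumulative-variance rule could be applied when choosing the number of hidden units; the thresholds are the ones discussed above, and the function name is illustrative.

```python
import numpy as np

def hidden_units_by_variance(X, low=0.70, high=0.99):
    """Return the smallest number of hidden units reaching ~70% cumulative
    explained variance and the number capped at ~99%, per the finding above."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    n_min = int(np.searchsorted(cum, low)) + 1    # smallest N reaching ~70% variance
    n_max = int(np.searchsorted(cum, high)) + 1   # cap N before the ~99% tail
    return n_min, n_max
```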

4.5.4 Role of connecting only newly added hidden layer to the

output layer

For a better convergence rate, Lemma 1 implies that the hidden units need to be linearly

independent to get the best least-squares solution. The linear independence of hidden units can

be achieved by an orthogonal linear transformation of data features as per Statement 1. The

CPCLS generates, within each hidden layer, hidden units that are highly linearly independent because of its orthogonal transformation property. However, the generated hidden units may not be linearly independent of each other across multiple hidden layers: the hidden units in a layer 𝑯𝑘 may become linearly dependent on hidden units in preceding hidden layers, i.e., 𝑯𝑘−1, 𝑯𝑘−2 and so on. In such a case, connecting all hidden layers, with correlated hidden units, to the output layer violates the best least-squares solution assumption. This may cause the algorithm to converge at a suboptimal generalization error and require longer learning time. To ensure that the added hidden units are linearly independent, CPCLS recommends connecting only the newly generated last hidden layer to the output layer.

To examine the above statements, the effect on the generalization performance and learning speed of CPCLS of connecting either only the newly added hidden layer or all hidden layers to the output unit is studied experimentally. Both cases are considered:

1) connecting last hidden layer (LHL)

2) connecting all hidden layers (AHL)

Both cases above were tested using the SinC regression function. For comparison, the test size of the data was changed from 30% to 70% in intervals of 5%, and the shuffling state from 0 to 100 in intervals of 5. This results in 21 shuffled datasets and 9 test-size datasets. The 21 shuffled datasets were trained using a constant test size of 50%; the best result attained among the 21 shuffled datasets was then used as the setting for training the 9 test-size samples. The learning conditions for both cases were kept similar to ensure a fair comparison and conclusion.

Figure 4.9. Shuffled dataset results of LHL and AHL

Table 4.7. Connecting hidden layers to output unit in CPCLS (Mean (Stdev))

Shuffled dataset:
  Testing RMSE:         LHL 0.00089 (0.00014);  AHL 0.03689 (0.02831)
  Training time (s):    LHL 8.27 (1.93);        AHL 38.14 (8.06)
  Hidden layers (nos.): LHL 52.95 (3.88);       AHL 38.14 (8.06)

Different test sizes:
  Testing RMSE:         LHL 0.00088 (0.00007);  AHL 0.02127 (0.02986)
  Training time (s):    LHL 10.18 (2.63);       AHL 28.86 (22.87)
  Hidden layers (nos.): LHL 56.89 (6.51);       AHL 35.89 (9.02)
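A sketch of these two experimental grids in Python, assuming scikit-learn's train_test_split and a SinC dataset generated as in Subsection 4.5.1.2; the training calls are left as placeholder comments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-20.0, 20.0, size=(4000, 1))
y = np.where(X == 0, 1.0, np.sin(X) / np.where(X == 0, 1.0, X))  # SinC targets, y(0) = 1

for state in range(0, 101, 5):                       # 21 shuffled datasets, fixed 50% test size
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=state)
    # ... train the LHL and AHL variants here and record the testing RMSE ...

best_state = 0                                       # placeholder: the best shuffle state found above
for size in [s / 100 for s in range(30, 71, 5)]:     # 9 test sizes from 30% to 70%
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=size, random_state=best_state)
    # ... train and evaluate both variants again with the chosen shuffle state ...
```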

The generalization performance and learning speed of LHL and AHL are shown in Table 4.7.

The results are averages over the 21 shuffled-dataset trials and the 9 test-size trials. Figure 4.9 shows the shuffled-dataset generalization performance and learning speed for both cases, whereas Figure 4.10 shows the test-size-dataset generalization performance and learning speed for both cases. The table and figures illustrate that the generalization performance of LHL is better and more stable than that of AHL. Also, the learning speed of LHL is faster, with more stable results, compared to AHL. The AHL network has to be stopped early when its training time approaches more than five times that of LHL with no further improvement in generalization performance.

The performance of LHL and AHL can also be visualized in Figure 4.4. The figure illustrates that AHL is unable to approximate the SinC function, whereas LHL (the original CPCLS) closely approximates all data points of the SinC function. The experimental work demonstrates that the AHL results may not be optimal in many cases compared with LHL. Furthermore, in all trials it is observed that CPCLS with LHL gives better and more stable results with less deviation.

Figure 4.10. Different test size dataset results of LHL and AHL

4.6 Summary

In this Chapter, experimental work demonstrates that the newly proposed algorithm known as

CPCLS achieves better generalization performance and faster learning speed, with an improved cascade architecture and a tuning-free learning method, compared to similar constructive and traditional algorithms. CPCLS has several noteworthy features:

▪ The proposed algorithm has achieved better generalization performance. Experimental

results show that improvement in the cascade architecture creates less burden on the

network and it converges more efficiently towards the target error. The proposed

algorithm works more analytically by determining the optimal connection weights,

rather than the gradient optimization and random generation techniques. Determining

the input connection weights analytically by OLT helps in generating linearly

independent hidden units, explaining the maximum variance in the dataset with

maximum error reduction capability.

▪ The learning speed is fast because the connection weights are calculated analytically in the forward step. Unlike gradient-based and random-generation methods, which iterate to find optimal connection weights, CPCLS takes only a forward step to reach the target error by analytically calculating the connection weights on both sides of the network.

▪ CasCor needs considerable human expertise to control the learning because of the large QP steps, which can cause the network to behave chaotically. QP can also encounter the problem of a zero derivative difference, which requires extra control to stop the iterations early to prevent overflow problems in the network. The advantage of CPCLS over QP-based CasCor is its tuning-free learning.

▪ The better network generalization performance of CPCLS validates that the generated hidden layers are capable of maximum error reduction.

▪ Like other gradient-free learning algorithms, e.g., SaELM, the proposed algorithm can work directly with differentiable activation functions (e.g., sigmoid, hyperbolic tangent, rectified linear unit and so on) and nondifferentiable activation functions (e.g., threshold). Gradient-based methods cannot handle nondifferentiable activation functions directly because they rely on differentiation.

▪ The proposed algorithm has solved the problem of linearly dependent input features and

hidden units. It reduces the chance of multicollinearity by orthogonally transforming the

set of any correlated input units and any pre-existing hidden units into uncorrelated

hidden units. The newly added last hidden layer with characteristics of linearly

independent hidden units is connected to the output layer for the best least-squares

solution.

▪ The proposed method is easy to apply in many application areas because of the small

number of user-specified hyperparameters, such as hidden units addition and target

error.

The better generalization performance and learning ability of CPCLS on nonlinear benchmark problems and real-world problems verify that it can be efficiently applied to other application tasks.


Chapter 5. Reducing the Negative Role of Global

Parameters with Default Values and Varying Operational

Parameters Effect on the Trip Fuel Estimation

5.1 Introduction

The usage of global parameters with default values rather than local parameters, together with an outdated database, in the complex mathematical formulation may cause EAs to exhibit higher fuel deviation. The low generalization performance and slow learning speed of BPNN may make it inappropriate for extracting information from high-level operational parameters with varying behaviour among different sectors. To estimate trip fuel with less deviation, requiring less

expertise, and achieving faster learning speed, real high-dimensional data were provided by an

airline that historically performed international flights in various regions over two years. The

dataset consists of several operational parameters that may or may not contribute to fuel

consumption. For better estimation, the useful operational parameters were first extracted, as

reported in Table 3.1. In our study, the extracted operational parameters are represented by 𝑿

and the objective is to predict the trip fuel �̂� that truly represents the actual consumed trip fuel

𝒀. Two cases were considered for estimating trip fuel consumption. In the first case, all sectors

in the dataset were considered as a combined model. In the second case, each sector was

considered as an individual model. A sector is defined as an airway route starting from the

departure airport and ending in the arrival airport. A comparison of proposed CPCLS is

performed with AEA and BPNN to demonstrate its effectiveness. The study illustrates that the

AEA is highly affected by the usage of global parameters, and the BPNN generalization

performance and learning speed start decreasing with high dimensional data having high-level

operational parameters.


The rest of the Chapter is organized as follows. Section 5.2 is about the problem context. In

Section 5.3, fuel consumption data analytics is elaborated. In Section 5.4, solution methods and

their performance analysis are performed by considering two models: combined model

(Subsection 5.4.2.1) and individual model (Subsection 5.4.2.2). Moreover, a comparison is also

made, and major findings are discussed. Section 5.5 concludes this chapter by summarizing

major findings.

5.2 Problem context

The usage of global parameters with default values rather than local parameters, an outdated database, complex mathematical formulation, hyperparameter adjustment, human intervention and the varying nature of operational parameters may all play a negative role by causing the estimated fuel to deviate from the fuel actually consumed. The proposed CPCLS, which has the characteristics of better generalization performance and faster learning speed with less human intervention, therefore needs to be applied to estimating aircraft trip fuel to demonstrate its effectiveness as a decision tool for airlines. All the experimental work was carried out in the Anaconda Spyder Python v3.2.6

programming environment. The dataset was normalized in the range [0,1]. One hot encoding was performed on the categorical variables to represent them as binary vectors. The nonlinear sigmoid activation function was used in the hidden layers to make it compatible with the dataset. For both cases, the dataset was split in a 50:50 ratio for training and testing. The stopping criterion for the algorithms was defined as the error converging or starting to increase.
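A minimal preprocessing sketch matching this setup is given below; the column names are illustrative assumptions and not the actual field names of the airline dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative records only; "runway_direction" and "trip_fuel" are assumed column names.
df = pd.DataFrame({
    "runway_direction": ["07L", "25R", "07L"],
    "flight_time": [820.0, 640.0, 905.0],
    "trip_fuel": [61000.0, 48000.0, 66500.0],
})
df = pd.get_dummies(df, columns=["runway_direction"])    # one hot encoding to binary vectors
scaler = MinMaxScaler(feature_range=(0, 1))              # normalization to the range [0, 1]
data = scaler.fit_transform(df.astype(float))
```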

5.3 Data analytics for fuel consumption

The set of real data that we obtained from an international airline operating in Hong Kong

consists of 19,117 international passenger and cargo flights operated among eight sectors

covering different geographical regions. The flights were performed from April 2015 to March

2017 by 107 wide-body aircraft: Airbus A330-300 (31 aircraft), Boeing 747-400 (9 aircraft), Boeing 747-800 (14 aircraft) and Boeing 777-300 (53 aircraft). The data consist of detailed

information on operational parameters related to 1) carrier with aircraft performance and

operations, 2) date/time of departure and arrival per sector with crew allocation, 3) weather and

atmospheric conditions, 4) sector and aircraft local and global conditions, and 5) fuel

requirements at different tanks and consumption levels. Among the collected data, instead of

putting all operational parameters into algorithms, we work in a more novel way by extracting

the relevant high-level operational parameters contributing to fuel consumption, as summarised

in Table 3.1.

5.4 Solution methods and computational results

5.4.1 Proposed method and existing methods

Three estimation methods were used for the comparative study, namely AEA, BPNN, and the

proposed CPCLS. The comparison is important to study the performance of CPCLS compared

to existing AEA and BPNN. For AEA, the estimated and actual consumed fuel information

was retrieved from available data. CPCLS is compared to AEA in terms of fuel deviation,

whereas its comparison with BPNN is in terms of both fuel deviation and learning speed. The

limitation of the study is that the computational time for AEA is not mentioned because the

information is not available in the historical dataset provided by the airline. The CPCLS is

initialized with 5 hidden units in the first layer and 15 added hidden units in the next hidden

layers with respect to previously added hidden units. Among different trial-and-error

approaches, BPNN is selected, with 10 hidden units and 0.5 learning rate as the best

hyperparameters for the network.


5.4.2 Trip fuel estimation methods performance analysis and

discussion

The main objective of our study is to develop a method capable of accurately and efficiently

estimating trip fuel; however, a comparison with existing methods plays an important role in

demonstrating effectiveness. To study the effect of operational parameters, the fuel is estimated

for the following two types of models:

5.4.2.1 All sectors combined for fuel estimation (referred to as a combined model)

5.4.2.2 Sector-wise fuel estimation (referred to as individual models)

5.4.2.1 Combined model

In this model, the trip fuel is estimated by considering all eight sectors of the 19,117 flights as

a combined model. The selection of operational parameters and their data trend plays a major

role in examining and analysing the estimation method. The significant difference in

operational parameter ranges and standard deviation may make estimating trip fuel accurately

a challenging task. The runway direction parameter is a categorical variable with 26 different

runway directions across the eight sectors. A one hot encoding pre-processing technique is applied to the runway direction to represent it as a binary vector. It translates the runway direction feature into 26 binary subfeatures, with 1 representing presence and 0 representing absence. All the flight phases, such as takeoff, climb, cruise, descent, and approach, are

considered as one entire flight trip.

Table 5.1 shows the trip fuel deviation resulting from AEA, BPNN, and CPCLS. The values

of Mean, StDev, Min, Q1, Med, Q3, Max, Ran, and IQR under the column absolute percent

error (APE) respectively refer to the mean, standard deviation, minimum, first quartile, median,


third quartile, maximum, range, and interquartile range in percentage; RMSE refers to root

mean-squares error; time refers to the computational learning time in seconds needed for

estimation. For the sake of readability, the mean of APE is termed as MAPE. Descriptive

statistics are employed to understand and compare the central tendency and the

variability/dispersion of the fuel deviation estimated by the methods. Mean measures the

central tendency by computing an average value and StDev measures the dispersion by

computing how much data is far from the mean. Both measurements are important to compare

the results generated by the studied methods. The method with less mean and StDev is

favourable because it indicates that the model has estimated fuel with less deviation and

dispersion. The disadvantages of mean and Stdev are that they are only useful to measure the

central tendency and dispersion of the whole dataset. It might occur that some portion of data

may exhibit a higher improvement compared to others. Therefore, to make the comparison

reliable for a better conclusion of the findings, improvements in various quartiles (Q1, Med,

Q3, IQR) along with the minimum (min) and maximum (max) fuel deviation for each method

are also studied.

Table 5.1. Trip fuel deviation by AEA, BPNN and CPCLS (combined model); all sectors, APE values in %

  AEA:   Mean 1.53, StDev 1.28, Min 0.00, Q1 0.57, Med 1.21, Q3 2.16, Max 11.66, Ran 11.66, IQR 1.59; RMSE 0.0183; Time (s) ---
  BPNN:  Mean 1.32, StDev 1.44, Min 0.00, Q1 0.42, Med 0.92, Q3 1.68, Max 16.14, Ran 16.14, IQR 1.26; RMSE 0.0135; Time (s) 745.00
  CPCLS: Mean 1.05, StDev 0.95, Min 0.00, Q1 0.37, Med 0.81, Q3 1.44, Max 9.31, Ran 9.31, IQR 1.07; RMSE 0.0115; Time (s) 8.21

Table 5.2. Performance improvement comparison (combined model)

  All sectors: AEA/BPNN 15.91%; AEA/CPCLS 45.71%; BPNN/CPCLS 25.71%

The major findings are:

1 In Table 5.1, the first finding is that CPCLS estimates trip fuel with a fuel deviation of 1.05%, compared to AEA (1.53%) and BPNN (1.32%). A CPCLS StDev of 0.95% implies that the data are more clustered towards the mean compared to AEA (1.28%) and BPNN (1.44%). Regarding convergence rates, CPCLS trains a model in 8.21 s whereas BPNN needs 745 s.

2 The fuel deviation results are split into Q1 (25th percentile), Q2 or Median (50th

percentile), and Q3 (75th percentile) to compare the improvement of estimation methods

in different quartiles. The IQR, which is not affected by outliers and is often considered a better measure of spread than the range, describes the middle 50% of values as the difference between Q1 and Q3. For each quartile and the interquartile range, the second finding is that the

improvement of CPCLS is evident compared to AEA and BPNN. For Q1, median, Q3, and

IQR, the fuel estimated by CPCLS features a mean deviation of 0.37%, 0.81%, 1.44%, and

1.07% compared to AEA (0.57%, 1.21%, 2.16%, and 1.59%, respectively) and BPNN

(0.42%, 0.92%, 1.68%, and 1.26%, respectively).

3 The maximum fuel deviation predicted by CPCLS and AEA is 9.31% and 11.66%

respectively, whereas for BPNN it is 16.14%, which makes it unfavourable for this model.

Although the overall BPNN MAPE is better than that of AEA, which validates the importance and significance of BPNN, the third finding is that its maximum deviation approaches 16.14% compared to 11.66% for AEA. This may create a problem in a real application through a wrong estimation of fuel for certain flights. For ease of readability, the lowest (best) value for each metric is highlighted in bold with an underline in Table 5.1.

4 Table 5.2 shows the performance improvement comparison in percentage in terms of

MAPE. Thus, AEA/BPNN, AEA/CPCLS, and BPNN/CPCLS refer to a gain in the

improvement of BPNN compared to AEA, CPCLS compared to AEA, and CPCLS

compared to BPNN, respectively. The fourth finding is that the CPCLS-based estimation

model shows an improvement of 45.71% and 25.71% compared to both AEA and BPNN,

respectively, whereas BPNN shows an improvement of only 15.91% compared to AEA.


5 Figure 5.1 shows the AEA, BPNN, and CPCLS fuel deviation in APE for each flight. The

horizontal axis represents a number of operated flights whereas the vertical axis is APE for

each flight. The comparison illustrates that the fuel estimated from CPCLS for each flight

exhibits less fuel deviation compared to AEA and BPNN. Similarly, Figure 5.2 shows the

histogram and distribution comparison of AEA, BPNN, and CPCLS. The horizontal axis

represents an APE interval (bins) whereas the vertical axis represents a number of flights

for each interval. As shown in Figure 5.2, the fifth finding is that CPCLS shifts the flight

trend more towards the lower fuel deviation interval. For instance, the histograms show

that CPCLS estimated the fuel deviation of 6,236 flights in the range 0.00-0.50%, which demonstrates an improvement of 47.77% and 11.35% compared to the 4,220 flights from AEA and the 5,600 flights from BPNN, respectively. For the other histogram bins, the improvement shown by CPCLS is even more significant.

Figure 5.1. Fuel deviation (combined model)

Figure 5.2. Fuel deviation interval (combined model)

6 In addition, Figure 5.2 indicates the sixth finding, i.e., that CPCLS has a more clustered distribution (lower StDev) compared to the dispersed distributions of AEA and BPNN.

7 However, Figure 5.1 shows that, for about 2,000 flights, CPCLS and BPNN achieved less improvement in fuel deviation compared to the remaining flights. An in-depth analysis shows our seventh finding, i.e., these flights are operated in the SR_H^D and SR_H^A sectors, which are short-range flights, whereas the remaining flights are operated in the SR_H^US1, SR_US1^H, SR_US2^H, SR_UK^H, SR_H^US2, and SR_H^UK sectors and are long-range flights on long-haul routes. The large portion of the data belongs to the long-range route flights. This forces NN-based algorithms to determine the network connection weights by considering long-range route flights rather than short-range route flights. This is the main reason why it is hard for BPNNs to truly approximate the complex operational parameters and their data instances. It can easily be visualised from Figure 5.1 that the BPNN estimation for the mentioned 2,000 flights is even worse than both AEA and CPCLS. Therefore, to study the effect of short-range and long-range routes on the estimation models, our work is extended by estimating the trip fuel for each sector individually. The purpose of sector-wise fuel estimation (individual models) is to study the effect of the extracted operational parameters on the fuel consumed within each sector.
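To make the quartile statistics and the improvement percentages discussed above concrete, the following minimal Python sketch (illustrative only, not the thesis code; the array names are hypothetical) computes the APE summary reported in Table 5.1 and reproduces Table 5.2 from the reported MAPE values.

# Illustrative sketch: APE statistics and pairwise improvement percentages.
# `actual_fuel` and `estimated_fuel` are hypothetical per-flight trip fuel arrays.
import numpy as np

def ape_summary(actual_fuel, estimated_fuel):
    """Absolute percentage error (APE) statistics for one estimation method."""
    ape = 100.0 * np.abs(actual_fuel - estimated_fuel) / actual_fuel
    q1, med, q3 = np.percentile(ape, [25, 50, 75])
    return {"MAPE": ape.mean(), "StDev": ape.std(ddof=1), "Min": ape.min(),
            "Q1": q1, "Med": med, "Q3": q3, "Max": ape.max(),
            "Ran": ape.max() - ape.min(), "IQR": q3 - q1}

def improvement(mape_baseline, mape_method):
    """Relative MAPE improvement of `method` over `baseline`, in percent."""
    return 100.0 * (mape_baseline - mape_method) / mape_method

# Reproducing Table 5.2 from the MAPE values in Table 5.1:
print(round(improvement(1.53, 1.32), 2))  # AEA/BPNN   -> 15.91
print(round(improvement(1.53, 1.05), 2))  # AEA/CPCLS  -> 45.71
print(round(improvement(1.32, 1.05), 2))  # BPNN/CPCLS -> 25.71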

5.4.2.2 Individual models

The types of operational parameters and their data instances have a significant effect on fuel consumption. For example, an aircraft traveling from airport A to airport B might face headwinds, and during the return it might face tailwinds. The difference between the two conditions may cause different fuel consumption for the same route. The real data consist of 2658, 1252, 3080, 2493, 3199, 2423, 3116, and 896 flights operated in sectors SR_H^US1, SR_H^D, SR_US1^H, SR_US2^H, SR_UK^H, SR_H^US2, SR_H^UK, and SR_H^A, respectively. The aircraft flying in sectors SR_H^US1, SR_UK^H, SR_H^US2, and SR_H^A faced headwinds, whereas sectors SR_H^D, SR_US1^H, and SR_US2^H faced tailwinds. High-performance aircraft are operated in sectors SR_US2^H, SR_UK^H, SR_H^US2, and SR_H^UK. The short-range route flights in sectors SR_H^D and SR_H^A are operated at a lower altitude level compared to long-range route flights. The flight time varies with distance for all sectors, whereas the mean aerodynamic chord and aircraft speed are more dependent on the aircraft weights. The runway directions for SR_H^US1 are 07, 15, 25, and 33; for SR_H^D, 12 and 30; for SR_US1^H, 07 and 25; for SR_US2^H, 07 and 25; for SR_UK^H, 07 and 25; for SR_H^US2, 06, 07, 24 and 25; for SR_H^UK, 09 and 27; and for SR_H^A, 03, 06, 21 and 24. Where more than one runway exists with the same direction, the runways are differentiated with a letter as follows: L for left, C for centre, and R for right. A one-hot encoding pre-processing technique is applied sector-wise to the runway direction categorical variable to convert it into a binary vector. This varying behaviour of the operational parameters may significantly influence the accuracy of machine learning algorithms.
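As a minimal illustration of the sector-wise one-hot encoding step described above (the column names and example rows are hypothetical; only the runway directions themselves come from the text), the categorical runway-direction variable can be converted into binary vectors per sector as follows:

# Minimal sketch of sector-wise one-hot encoding of the runway direction variable.
# The DataFrame layout and example rows are hypothetical.
import pandas as pd

flights = pd.DataFrame({
    "sector": ["SR_H^US1", "SR_H^US1", "SR_H^D", "SR_H^D"],
    "runway_direction": ["07", "25L", "12", "30"],   # L/C/R distinguishes parallel runways
})

encoded_per_sector = {
    sector: pd.get_dummies(group["runway_direction"], prefix="rwy")
    for sector, group in flights.groupby("sector")
}

# Each sector now has its own binary (one-hot) runway-direction vectors,
# e.g. rwy_07 and rwy_25L for SR_H^US1, and rwy_12 and rwy_30 for SR_H^D.
print(encoded_per_sector["SR_H^D"])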

1 Table 5.3 shows the sector-wise trip fuel deviation results for AEA, BPNN, and CPCLS. The first finding is that the fuel estimation by CPCLS produces fuel deviation MAPEs of 1.11%, 1.30%, 0.95%, 0.70%, 0.80%, 0.76%, 0.88%, and 1.03%, compared to the AEA fuel deviation MAPEs of 2.84%, 2.73%, 1.18%, 0.99%, 1.36%, 1.27%, 1.12%, and 1.52%, and those of BPNN, i.e., 1.21%, 1.38%, 1.13%, 0.85%, 0.90%, 0.78%, 0.93%, and 1.08%, for sectors SR_H^US1, SR_H^D, SR_US1^H, SR_US2^H, SR_UK^H, SR_H^US2, SR_H^UK, and SR_H^A, respectively. The StDev measurements show that the CPCLS dispersion from the mean is lower for all sectors compared to AEA and BPNN.

2 Apart from sector SR_H^A, the second finding is that the quartile and interquartile measurements for all other sectors indicate that CPCLS achieves better estimation for all quartiles compared to AEA and BPNN. BPNN achieves a better estimation of 0.39% in Q1 for SR_H^A; however, this is very close to the 0.40% from CPCLS. Moreover, for most sectors the maximum fuel deviation estimated by CPCLS is lower than those of AEA and BPNN. For SR_H^US1, BPNN achieves a better maximum fuel deviation value of 9.50% compared to CPCLS (9.94%) and AEA (9.77%). An in-depth study shows that CPCLS was able to estimate fuel for all SR_H^US1 flights within an APE of 7%, with only one flight showing an APE of 9.94%. The IQR, which is not affected by outliers, explains this issue: the IQR of CPCLS (1.17%) is significantly better than that of BPNN (1.26%).

Table 5.3. Trip fuel deviation by the AEA, BPNN and CPCLS (individual models)
(Mean, StDev, Min, Q1, Med, Q3, Max, Ran and IQR describe the APE distribution, in %)

Sector    | Method | Mean | StDev | Min  | Q1   | Med  | Q3   | Max   | Ran   | IQR  | RMSE   | TIME (s)
SR_H^US1  | AEA    | 2.84 | 1.46  | 0.00 | 1.83 | 2.73 | 3.75 | 9.77  | 9.77  | 1.92 | 0.0362 | ---
SR_H^US1  | BPNN   | 1.21 | 1.03  | 0.00 | 0.45 | 0.95 | 1.71 | 9.50  | 9.50  | 1.26 | 0.0243 | 105.81
SR_H^US1  | CPCLS  | 1.11 | 0.93  | 0.00 | 0.42 | 0.88 | 1.58 | 9.94  | 9.93  | 1.17 | 0.0225 | 1.87
SR_H^D    | AEA    | 2.73 | 1.69  | 0.00 | 1.37 | 2.57 | 3.91 | 10.40 | 10.40 | 2.54 | 0.0864 | ---
SR_H^D    | BPNN   | 1.38 | 1.18  | 0.00 | 0.50 | 1.10 | 1.93 | 9.96  | 9.95  | 1.43 | 0.0326 | 23.83
SR_H^D    | CPCLS  | 1.30 | 1.09  | 0.00 | 0.50 | 1.04 | 1.82 | 9.49  | 9.49  | 1.32 | 0.0303 | 0.91
SR_US1^H  | AEA    | 1.18 | 0.96  | 0.00 | 0.45 | 0.95 | 1.64 | 7.95  | 7.95  | 1.19 | 0.0362 | ---
SR_US1^H  | BPNN   | 1.13 | 1.01  | 0.00 | 0.41 | 0.90 | 1.56 | 9.63  | 9.63  | 1.15 | 0.0314 | 91.17
SR_US1^H  | CPCLS  | 0.95 | 0.82  | 0.00 | 0.35 | 0.75 | 1.33 | 8.88  | 8.88  | 0.98 | 0.0265 | 3.48
SR_US2^H  | AEA    | 0.99 | 0.83  | 0.00 | 0.38 | 0.80 | 1.38 | 7.10  | 7.10  | 1.00 | 0.0365 | ---
SR_US2^H  | BPNN   | 0.85 | 0.72  | 0.00 | 0.32 | 0.67 | 1.18 | 5.34  | 5.34  | 0.86 | 0.0324 | 92.32
SR_US2^H  | CPCLS  | 0.70 | 0.55  | 0.00 | 0.28 | 0.56 | 0.98 | 4.53  | 4.52  | 0.70 | 0.0261 | 2.44
SR_UK^H   | AEA    | 1.36 | 0.98  | 0.00 | 0.59 | 1.19 | 1.96 | 6.09  | 6.09  | 1.37 | 0.0621 | ---
SR_UK^H   | BPNN   | 0.90 | 0.73  | 0.00 | 0.33 | 0.74 | 1.27 | 5.28  | 5.28  | 0.93 | 0.0410 | 176.73
SR_UK^H   | CPCLS  | 0.80 | 0.64  | 0.00 | 0.31 | 0.67 | 1.15 | 5.81  | 5.81  | 0.84 | 0.0364 | 0.83
SR_H^US2  | AEA    | 1.27 | 0.97  | 0.00 | 0.53 | 1.08 | 1.79 | 7.63  | 7.63  | 1.26 | 0.0461 | ---
SR_H^US2  | BPNN   | 0.78 | 0.64  | 0.00 | 0.30 | 0.64 | 1.09 | 5.83  | 5.83  | 0.79 | 0.0319 | 124.18
SR_H^US2  | CPCLS  | 0.76 | 0.63  | 0.00 | 0.29 | 0.61 | 1.06 | 5.24  | 5.24  | 0.77 | 0.0313 | 1.01
SR_H^UK   | AEA    | 1.12 | 0.90  | 0.00 | 0.46 | 0.89 | 1.56 | 6.62  | 6.62  | 1.11 | 0.0458 | ---
SR_H^UK   | BPNN   | 0.93 | 0.77  | 0.00 | 0.35 | 0.76 | 1.32 | 6.60  | 6.60  | 0.98 | 0.0352 | 147.00
SR_H^UK   | CPCLS  | 0.88 | 0.71  | 0.00 | 0.32 | 0.73 | 1.24 | 4.33  | 4.33  | 0.92 | 0.0330 | 0.52
SR_H^A    | AEA    | 1.52 | 1.17  | 0.00 | 0.67 | 1.31 | 2.11 | 11.66 | 11.66 | 1.44 | 0.0600 | ---
SR_H^A    | BPNN   | 1.08 | 0.97  | 0.00 | 0.39 | 0.83 | 1.46 | 10.13 | 10.12 | 1.07 | 0.0440 | 72.48
SR_H^A    | CPCLS  | 1.03 | 0.94  | 0.00 | 0.40 | 0.80 | 1.40 | 9.26  | 9.26  | 1.00 | 0.0423 | 1.16
Average   | AEA    | 1.62 | 1.12  | 0.00 | 0.78 | 1.44 | 2.26 | 8.40  | 8.40  | 1.48 | 0.0512 | ---
Average   | BPNN   | 1.03 | 0.88  | 0.00 | 0.38 | 0.82 | 1.44 | 7.78  | 7.78  | 1.06 | 0.0341 | 104.19
Average   | CPCLS  | 0.94 | 0.79  | 0.00 | 0.36 | 0.75 | 1.32 | 7.18  | 7.18  | 0.96 | 0.0311 | 1.53

Table 5.4. Performance accuracy comparison (individual models)

Sector    | AEA/BPNN (%) | AEA/CPCLS (%) | BPNN/CPCLS (%)
SR_H^US1  | 134.71       | 155.86        | 9.01
SR_H^D    | 97.83        | 110.00        | 6.15
SR_US1^H  | 4.42         | 24.21         | 18.95
SR_US2^H  | 16.47        | 41.43         | 21.43
SR_UK^H   | 51.11        | 70.00         | 12.50
SR_H^US2  | 62.82        | 67.11         | 2.63
SR_H^UK   | 20.43        | 27.27         | 5.68
SR_H^A    | 40.74        | 47.57         | 4.85
Average   | 53.57        | 67.93         | 10.15

3 The average network learning time of CPCLS is 1.53 s, which is about a hundred times faster than the BPNN's 104.19 s. The total learning time of CPCLS and BPNN for the individual models is 12.22 s and 833.52 s, which is comparable to the 8.21 s and 745 s of the combined model, respectively. The third finding is that the slight increase in time comes from discovering trends in the individual datasets that were previously difficult for the NN to capture, because the majority of the data belongs to long-range route flights.

4 Table 5.4 shows the performance improvement comparison in percentage in terms of MAPE. AEA/BPNN, AEA/CPCLS, and BPNN/CPCLS refer to the gain in improvement of BPNN compared to AEA, CPCLS compared to AEA, and CPCLS compared to BPNN, respectively. CPCLS shows a maximum improvement of 155.86% for SR_H^US1 and a minimum improvement of 24.21% for SR_US1^H, with an average improvement of 67.93%, compared to AEA. The maximum and minimum improvements provide an important insight, the fourth finding, about the influence of the operational parameters. In both cases, the flight route is the same: the SR_H^US1 sector covers flights traveling from airport-one located in the United States to Hong Kong, and the SR_US1^H sector covers flights travelling from Hong Kong to airport-one in the United States. This shows that the existing AEA mathematical calculation cannot differentiate among the varying behaviour of the operational parameters. The use of global parameters for both sectors of the same route, rather than local parameters, together with the old database in the mathematical equations, may give a better estimation for SR_US1^H than for SR_H^US1.

Figure 5.3. Fuel deviation (SR_H^US1)

Figure 5.4. Fuel deviation (SR_H^D)

Figure 5.5. Fuel deviation (SR_US1^H)

Figure 5.6. Fuel deviation (SR_US2^H)

Figure 5.7. Fuel deviation (SR_UK^H)

Figure 5.8. Fuel deviation (SR_H^US2)

5 The BPNN leads to, on average, a 53.57% better estimation compared to the AEA but is less accurate than the CPCLS. The better performance of the BPNN in the sector-wise individual fuel estimation models, 53.57% (Table 5.4), compared to the combined fuel estimation model, 15.91% (Table 5.2), suggests that it may perform better with a chunk of data comprising the same behaviour of the operational parameters. Compared to the AEA, the CPCLS shows an improvement from 45.71% (combined model) to 67.93% by performing sector-wise individual fuel estimation. The fifth finding is that the difference in improvement for the CPCLS is 22.22%, compared to 37.66% for the BPNN. This implies that the CPCLS estimation capability with high dimensional data is much better than that of the traditional BPNN.

Figure 5.9. Fuel deviation (SR_H^UK)

Figure 5.10. Fuel deviation (SR_H^A)

Figure 5.11. Fuel deviation interval (SR_H^US1)

Figure 5.12. Fuel deviation interval (SR_H^D)

Figure 5.13. Fuel deviation interval (SR_US1^H)

Figure 5.14. Fuel deviation interval (SR_US2^H)

Figure 5.15. Fuel deviation interval (SR_UK^H)

Figure 5.16. Fuel deviation interval (SR_H^US2)

6 Figures 5.3-5.10 show the fuel deviation in APE for each flight in the sectors by the AEA, BPNN, and CPCLS. The horizontal axis is the number of flights operated and the vertical axis is the APE for each flight. The comparative study in the figures helps to understand the improvement made by the CPCLS compared to the AEA and BPNN. The sixth finding is that the CPCLS and BPNN were able to achieve better estimation for the short-range route flights of SR_H^D and SR_H^A. In the combined model, for the first 2,000 flights related to short-range route flights, the performance of the BPNN was much worse than that of the AEA, and the performance of the CPCLS was better than the AEA but not as good as for the other, long-range route flights. In the current estimation models, Figures 5.4 and 5.10, the BPNN achieves better estimation than the AEA, and the CPCLS achieves better estimation compared to both AEA and BPNN. Similarly, for all other sectors, Figures 5.3 and 5.5-5.9, the APE of each flight shows better results for the CPCLS compared to the AEA and BPNN.

Figure 5.17. Fuel deviation interval (SR_H^UK)

Figure 5.18. Fuel deviation interval (SR_H^A)

7 As mentioned earlier, sectors SR_H^US1 and SR_US1^H follow the same route but travel in opposite directions. A comparative study of Figures 5.3 and 5.5 shows that the APE of the AEA-estimated fuel is much higher for SR_H^US1 (MAPE 2.84%) than for SR_US1^H (MAPE 1.18%). The reason is the use of the same global parameters for both sectors rather than local parameters. Based on historical data, the seventh finding is that the CPCLS minimized this issue and shows almost identical APE for both sectors (Figures 5.3 and 5.5). For other sectors sharing the same route, for instance SR_H^US2, SR_US2^H and SR_UK^H, SR_H^UK, the figures show the same issue of the AEA using global parameters and an old database: SR_H^US2 shows a higher APE compared to SR_US2^H, and SR_UK^H shows a higher APE compared to SR_H^UK. This demonstrates the main limitation, as highlighted in the literature review in Chapter 2, of the EA-based methods.

8 Figures 5.11-5.18 show the histogram and distribution comparison of the AEA, BPNN, and CPCLS. The histograms indicate that the CPCLS fuel deviation is spread over fewer APE bins, with most flights falling into the lower-deviation bins. For most sectors, the eighth finding is that the distribution of CPCLS is more clustered compared to the AEA and BPNN. For instance, in Figures 5.11 and 5.12, the bell shapes for the SR_H^US1 and SR_H^D sectors show that the AEA fuel deviation lies mainly in the 1.50-3.50% bins, whereas the peak of the CPCLS distribution shows that its fuel deviation lies in 0.00-1.00%. In Figures 5.13, 5.16, and 5.18, for SR_US1^H, SR_H^US2, and SR_H^A, the distribution of the BPNN is almost identical to that of the AEA or CPCLS. For all sectors, the histogram and distribution trend of the CPCLS illustrates that it estimates trip fuel for flights with less deviation compared to the AEA and BPNN. The histogram and distribution of the AEA show a much more dispersed fuel deviation. The possible reason for the high AEA fuel deviation is low confidence in the method because of the availability and applicability of global parameters rather than local parameters.

5.5 Summary

The experimental work demonstrates the effectiveness of the CPCLS. The CPCLS estimates fuel with a deviation mean absolute percentage error (MAPE) of 1.05% and 0.94%, an improvement of 45.71% and 67.93%, and the BPNN estimates fuel with a deviation of 1.32% and 1.03%, an improvement of 15.91% and 53.57%, compared to the AEA fuel deviation of 1.53% and 1.62%, for the combined and individual models respectively. In comparison with the BPNN, the CPCLS achieved an improvement of 25.71% and 10.15% for the combined and individual models respectively. Similarly, CPCLS trained the network about a hundred times faster than the traditional BPNN. In the CPCLS, the analytical calculation of connection weights by eigendecomposition of the input covariance matrix aims to generate linearly independent hidden units explaining the maximum variance in the operational parameters. The eigendecomposition eliminates the problem of linear dependence among the operational parameters and the hidden units. These characteristics make it more suitable for accurately estimating trip fuel with less expertise requirement and faster learning speed. The sector-wise (individual) fuel estimation models validate the earlier work on fuel estimation by the BPNN with a limited number of operational parameters and flights, divided among different flight phases. However, during the study, it is observed that the estimation accuracy of the BPNN starts decreasing as the size and structure of the data grow. The varying nature of the operational parameter behaviour made it difficult for the BPNN to create hidden units capable of making the operational parameters linearly separable. The sole purpose of this work is not a comparison of the CPCLS with the AEA and BPNN. The main contribution is to propose and apply a self-organizing CNN algorithm that gives better estimation accuracy with less expertise requirement and faster learning speed. The significant improvement of 67.93% by the proposed CPCLS method over the existing airline fuel estimation (i.e., AEA) will directly benefit in eliminating excess fuel consumption.
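As a minimal, simplified sketch of the principle described above (illustrative only, not the actual CPCLS implementation, which adds hidden units constructively in a cascade as proposed in Chapter 4; all names are hypothetical), the hidden-unit input weights can be taken from the eigendecomposition of the input covariance matrix and the output weights can then be solved analytically by least squares:

# Simplified, illustrative sketch of the eigendecomposition-plus-least-squares idea.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_pc_least_squares(X, y, n_hidden):
    """X: (flights, operational parameters); y: (flights,) actual trip fuel."""
    x_mean = X.mean(axis=0)
    cov = np.cov(X - x_mean, rowvar=False)         # input covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]              # directions of maximum variance first
    W_in = eigvecs[:, order[:n_hidden]]            # linearly independent input weights
    H = sigmoid((X - x_mean) @ W_in)               # hidden-unit activations
    H = np.column_stack([H, np.ones(len(X))])      # bias column
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # analytic output weights (least squares)
    return x_mean, W_in, beta

def predict(X, x_mean, W_in, beta):
    H = sigmoid((X - x_mean) @ W_in)
    return np.column_stack([H, np.ones(len(X))]) @ beta

This sketch only conveys the principle of analytically determined, variance-explaining hidden units with least-squares output weights; the cascade construction and the connection of only the newly added hidden layer to the output, as described in Chapter 4, are omitted.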


Chapter 6. Conclusion and Future Work

6.1 Conclusion

This work addresses airline trip fuel deviation from estimated values, which results in excessive fuel consumption. Accurate fuel estimation is crucial for airlines to ensure safety and profitability. Nowadays, the aviation industry is growing at a significant rate, and improvement in various operations can ensure long-term sustainability. Among the various operating expenses, fuel cost contributes on average 28.2%, which makes it valuable to design a method that can accurately estimate fuel for each flight journey. It has been noticed that fuel consumption demand is increasing each year. Demand was forecast to reach 97 billion gallons in 2019, compared to the 2006 level, because of an increase in the demand for passengers and cargo. These promising economic opportunities may become a threat if airlines are unable to handle the problems of increasing jet fuel prices and the regulations of international authorities to protect against environmental degradation. It was forecast that jet fuel prices would increase by 31.18% in 2019 compared to the 2015 level. Besides, to protect against environmental degradation and avoid ozone depletion, international authorities stress reducing CO2 emissions by 50% by 2050 with respect to 2005 and reducing fuel consumption by 1.5% per year. In the future, international authorities are planning to make it compulsory for airlines to certify their aircraft based on size and weight, according to CO2 certification standards. The growing demand, together with high fuel prices and increasing awareness of environmental protection, is encouraging airlines to adopt competitive strategies in fuel management to control excess fuel consumption for long-term sustainability.


In the existing literature, the efforts made by researchers to address the fuel estimation problem

are noteworthy. However, there are still some research gaps that need considerable attention.

The research gaps are concisely summarized as follows:

1 The most popular fuel estimation methods that are currently in practice by various

airlines are based on energy balance mathematical formulation. It is identified that the

actual performance of aircraft usually deviates from such estimation. The EAs need

information about many coefficients to determine the amount of fuel needed for each

journey. Firstly, the information about coefficients is not always available in a real

scenario, and a lot of flight testing needs to be performed to generate such information.

This makes the method expensive to adopt. Secondly, the unavailability of such

information may lead to the use of global parameters with default values rather than

local parameters which may result in inaccurate estimation. In current literature, an

insufficient attempt has been made to study the effect of using global parameters on the

deviation of fuel from estimation using a practical example.

2 The popularity of the application of machine learning NNs in the aviation sector to

improve various operations is significant. The BPNN based fuel estimation was

proposed to provide an alternative to EAs. A pitfall is that BPNN is a fixed topology

NN and the adjustment of its various hyperparameters is based on human expertise. The BPNN drawbacks of trial-and-error experimental work to determine the network, and of convergence at local minima, mean there is no assurance that the selected network is optimal, causing weak generalization performance and slow convergence. In existing works,

BPNN is applied to estimate fuel without addressing the above key issues.

3 In the existing literature, the operational parameters are selected based on prior experience and knowledge. There may exist some operational parameters that also contribute to fuel consumption but have gone unnoticed. BPNN-based fuel estimation models are trained on low-level operational parameters, with the future recommendation of incorporating more parameters. In the literature, a research gap still exists in incorporating high-level operational parameters that were previously ignored.

4 Like EAs and BPNNs, many other models have been proposed to estimate fuel for distinct flight phases. Accumulating the estimates for the distinct phases to obtain fuel for the complete journey may even lead to higher fuel deviation. Therefore, it is important to

have a model that can efficiently estimate fuel for all phases collectively to ensure

profitability and safety.

5 The trained NN performance may no longer be suitable for learning the same application if the data structure and size change. How to improve machine learning NNs so that they give the same performance regardless of changes in data structure and size is gaining significant consideration.

To address the above research gaps, we conducted our work in three main stages to achieve the three objectives discussed in Chapter 1, Section 1.3. The first stage is discussed in Chapter 3, in which the framework is discussed and the model is formulated to achieve an objective function of minimizing fuel deviation. The significance of the proposed work and its relationship to existing methods is established to understand the purpose of and need for the work. Chapter 3 mainly addresses the first four research gaps. The second stage is discussed in Chapter 4, in which a novel CNN is proposed to avoid trial-and-error experimental work by analytically calculating connection weights with an improved cascade architecture. This chapter mainly addresses the second and fifth research gaps. The third stage is discussed in Chapter 5, in which a comparative study among the proposed CPCLS, AEA, and BPNN is performed. The chapter provides important managerial insight and mainly addresses all research gaps.


In Chapter 3, prior to formulating the model and defining an objective function, a framework of fuel consumption and its deviation from estimation, and the useful operational parameters that can contribute to fuel consumption, are explained. The problem of fuel deviation arises because of the inappropriate selection of estimation methods and operational parameters. Low confidence in existing estimation methods results in adding more fuel in discrepancy reservoirs, in addition to the contingency and supplementary reservoirs, to ensure a smooth journey. This ultimately increases the total weight of the aircraft and results in more fuel consumption. This problem is known as overestimation and concerns profitability. Likewise, loading suboptimal fuel, known as underestimation, raises safety concerns. Therefore, the objective of reaching a destination airport with minimal deviation cannot be achieved. The methods used for estimating fuel consumption are mainly divided into two types: mathematical methods and data-driven methods. Mathematical methods have the advantage of providing a physical understanding of the system; however, according to recent studies, many challenges exist in real practice. Mathematical methods need a lot of information to define the relationships among variables, have difficulty capturing system non-linearities, and need expert involvement. Machine learning data-driven methods are considered a viable tool for discovering hidden patterns or information from historical data. Among many machine learning algorithms, NNs are generally favourable because of their universal approximation capability. However, existing BPNN-based fuel estimation models are limited in scope, considering low-level operational parameters with few aircraft types and recommending the incorporation of high-level operational parameters in the future. The reason for this is the low generalization performance and slow convergence drawbacks of the BPNN. Moreover, it requires much human intervention to decide and adjust the hyperparameters. In addition, selecting operational parameters based on prior experience or knowledge and using global values may result in inaccurate estimates. In this chapter, a trip fuel estimation model is formulated by considering the first four research gaps. The useful operational parameters representing historical real data that can contribute to fuel consumption are extracted and added to the model. The objective function of minimizing fuel deviation is defined. Rather than estimating fuel for distinct phases, the model is based on estimating fuel for all phases collectively.

In Chapter 4, a novel self-organizing CNN is proposed. The objective is to improve the performance of the estimation model with less expertise involvement and to obtain results with fast learning speed. A comprehensive review of FNNs was conducted (Chapter 2, Section 2.2) to understand the merits and limitations, along with the applications, of various algorithms. Based on three decades of review of FNNs, we classified the literature into six categories: gradient learning algorithms, gradient-free learning algorithms, optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology FNNs, and metaheuristic search algorithms. Following the literature review, we identified some research gaps in the existing FNNs: studying the effect of activation functions, designing efficient and compact algorithms with fewer hyperparameters, analytically calculating connection weights rather than random generation and gradient optimization, handling changes in data structure, eliminating the need for data pre-processing and transformation, adopting a hybrid approach for real-world applications, and analytically determining the number of hidden units. Inspired by the gradient-free learning algorithms (Chapter 2, Subsection 2.2.3.2), which have the property of analytically calculating output connection weights, and the constructive topology FNNs (Chapter 2, Subsection 2.2.4.3), which have the property of adding hidden units to the hidden layer, we proposed a new algorithm. Efforts were also made to address two research gaps, "calculating connection weights rather than random generation and gradient optimization" and "change in data structure", in the proposed algorithm. Combining the concepts of the two categories and the two research gaps results in a novel algorithm named the cascade principal component least squares (CPCLS) neural network. The characteristic of CPCLS is that it analytically determines connection weights, rather than relying on gradient optimization or random generation, and adds multiple linearly independent hidden units to the hidden layer to improve convergence and make it faster. Experimental work was conducted on artificial benchmarking and real-world problems to demonstrate the superiority of CPCLS. The work shows that the proposed CPCLS achieved better generalization performance and faster learning speed compared to other known NNs, such as CasCor, BPNN, SaELM, OLSCN, and FCNN. The theoretical justification and experimental work demonstrated that generating hidden units in the hidden layer and connecting only the newly added last hidden layer to the output layer ensures maximum convergence.

In Chapter 5, the novel CPCLS is applied to the fuel estimation model to achieve the objective function of minimal fuel deviation and to reduce the negative effect of global parameters and varying operational parameters on fuel consumption. The purpose was to eliminate the need for a trial-and-error approach and to reduce the number of hyperparameter adjustments and the expert involvement. We consider that insufficient attempts have been reported in the literature concerning the estimation of trip fuel using CNNs along with high-dimensional data for the entire trip flight phases collectively. A comparative study with the existing AEA and a traditional BPNN to estimate trip fuel proves that our work is unique in studying the sectors both in combination and individually by considering all flight phases as a complete flight trip. Unlike AEA and BPNN, the proposed CPCLS does not require complex mathematical formulation, a trial-and-error approach, gradient derivative techniques, or too many hyperparameter initializations. This makes it superior for estimating trip fuel accurately with fewer expertise requirements and faster learning speed. Previously, the entire trip was divided into different flight phases, namely takeoff, climb, cruise, descent, and approach, to propose a fuel estimation model with limited flight data restricted to a small number of aircraft or generated by a simulation planner. In our study, we analyzed high-dimensional data associated with 107 wide-body aircraft, Airbus A330-300 (31 aircraft) and Boeing 747-400 (9 aircraft)/747-800 (14 aircraft)/777-300 (53 aircraft), that operated a total of 19,117 real international passenger and cargo flights in eight sectors over two years. The comparative study results demonstrate that the trip fuel estimation by the proposed CPCLS achieves better results while requiring the shortest time compared to AEA and BPNN. The CPCLS exhibits an average improvement of 45.71% and 25.71% for the combined eight-sector model, and 67.93% and 10.15% for the sector-wise individual models, compared to AEA and BPNN, respectively. The significant improvement in trip fuel estimation creates greater confidence in the proposed CPCLS, given that it may eliminate the need for adding more fuel based solely on experience.

In summary, the experimental work on the varying nature of the operational parameters and the high dimensionality of the data demonstrates the effectiveness of CPCLS, which, with low fuel deviation and more stable results, shows that it can be effectively employed for estimating the trip fuel of each flight. The better and faster estimation results of the self-organizing CPCLS are due to its analytical calculation of the network connection weights explaining the maximum variance in the operational parameters. Thus, CPCLS is a promising machine learning tool for estimating flight trip fuel. In practical applications, the proposed CPCLS will be beneficial to airlines in accurately planning the amount of flight trip fuel. This may avoid the need for extra fuel to be loaded in the aircraft and helps in better management control. Flight planners and pilots will be more confident with the trip fuel estimation and will avoid the requirement for extra fuel in the discrepancy tank. The weight of the aircraft will decrease, which will not only reduce fuel consumption but also increase the lifetime of the engines. The flight planner may suggest and select a suitable airways route by simulating different combinations of operational parameters to reach the destination with less fuel. Furthermore, controlling excess fuel consumption may contribute to less environmental pollution, prevent scarce jet fuel from natural resources from being wasted, and help airlines survive growing fuel prices, with less unplanned aircraft maintenance, fewer fuel surcharges, and more competitiveness by fulfilling the demand for passengers and cargo.

6.2 Future Work

This work has addressed the problem of fuel deviation by formulating a model and proposing

a novel CNN. However, there are still some limitations that need to be addressed in future

work. Several important research directions are summarized as:

1. In this work, our main focus was to minimize the fuel deviation by accurately estimating

trip fuel. The work is about predictive analytics that mainly addresses the fuel

estimation and deviation problem. How to identify the alternative routes having less

fuel requirement shall be addressed in the future by performing prescriptive analytics.

The formulated model in Chapter 3 can be extended to estimate aircraft horizontal and

vertical trajectories along with trip fuel estimation. Initial prediction of trajectories by

machine learning and further optimization by mathematical modelling or reinforcement

learning may help to discover alternative routes having less fuel requirement. Currently,

airlines follow the routes as per international guidelines. Identifying the alternative

routes having less fuel requirement may facilitate airlines in making long-term contingency plans.

2. In Chapter 4, OLT was achieved by the eigendecomposition of the symmetric covariance matrix. Other than the covariance matrix, correlation and singular value decomposition methods can also be applied to achieve OLT. Besides, the experimental work was performed by considering only the popular sigmoid activation function. A future direction is to consider other OLT methods and activation functions and to study their effects on the performance and learning speed of the novel CPCLS (a small numerical sketch of these decompositions is given after this list). Additionally, their consequences for trip fuel estimation need to be explored in the future.

3. In Chapter 5, CPCLS is compared to AEA in terms of fuel deviation, whereas its comparison with BPNN is in terms of both fuel deviation and learning speed. The computational time for AEA was not considered because the information was not available in the historical dataset provided by the airline. How much better CPCLS is than AEA computationally is still unclear. This is the main practical limitation in applying the proposed fuel estimation model in the airline industry. The computational comparison between CPCLS and AEA may be explored in the future to facilitate airline operating companies in understanding the true benefits of the proposed algorithm.
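As a small numerical illustration of the alternatives mentioned in the second direction above (hypothetical data, not the thesis dataset), the orthogonal projection directions can be obtained equivalently from the eigendecomposition of the covariance matrix, from the singular value decomposition of the centred data matrix, or, after standardization, from the correlation matrix:

# Illustrative comparison of three decompositions on hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

# 1) Eigendecomposition of the covariance matrix
cov_eigvals, cov_eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# 2) Singular value decomposition of the centred data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# The covariance eigenvalues equal the squared singular values divided by (n - 1).
print(np.allclose(np.sort(S**2 / (len(X) - 1)), np.sort(cov_eigvals)))  # True

# 3) Eigendecomposition of the correlation matrix (covariance of standardized data)
corr = np.corrcoef(Xc, rowvar=False)
corr_eigvals, corr_eigvecs = np.linalg.eigh(corr)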

In addition to the above research directions, some new research directions were identified in

Chapter 2 Section 2.3 to add a contribution to the existing FNNs literature. The research

directions are summarized as:

1. Activation function: Few attempts have been made to study the effect of using various types of activation functions in FNNs. Karlik and Olgac (2011) compared popular activation functions, including the uni-polar sigmoid, bi-polar sigmoid, tangent hyperbolic, radial basis function, and conic section function, along with gradient-based algorithms for fixed topology FNNs. The limitation of this experimental work was in using 10 hidden units and 40 hidden units with 100 iterations and 500 iterations, respectively. Moreover, the data structure, normalization techniques, learning algorithms, and hyperparameters were not clearly stated in the experimental work. The experimental results demonstrated that the tangent hyperbolic activation function performs better than the others. However, it can be observed in various studies that the sigmoid activation function is applied more commonly than the tangent hyperbolic function. The importance of hidden units and their activation functions is clear, and they are used in every type of FNN. The study of the performance of various activation functions for fixed and constructive topologies, along with gradient and gradient-free algorithms, still needs researchers' attention.

2. Efficient and compact algorithms with fewer hyperparameters: The gradient learning algorithms are favourable because of their compact size, whereas gradient-free learning algorithms are favourable because of gradient-free learning and fast convergence. The network size of gradient-free algorithms becomes much larger, which increases complexity and the chance of overfitting. The best FNN may be considered to be one having the characteristics of a compact architecture with a small number of hidden units and connection weights. It may analytically calculate hidden units and connection weights, need fewer hyperparameters, and reach a global minimum in less training time. There is always a trade-off between network generalization performance and learning speed. Some algorithms have the advantage of being more efficient than others but may be constrained by memory requirements, complex architecture, and/or more learning time. Therefore, more efforts are needed to construct efficient and fast algorithms with fewer hyperparameters that are computationally simple and have a compact size.

3. Eliminating the need for data pre-processing and transformation: Good results from machine learning algorithms are highly dependent on the data pre-processing and transformation steps. In pre-processing, various algorithms/techniques are adopted to clean the data to reduce noise, outliers, and missing values, whereas, in transformation, various algorithms/techniques are adopted to transform the data, by encoding or standardization, into formats and forms that are appropriate for machine learning. Both steps are performed prior to feeding data into the FNN and are adopted in many algorithms. Insufficient knowledge or inappropriate application of pre-processing and transformation techniques may cause the algorithm to reach wrong conclusions. Future algorithms may be designed that are less sensitive to noise, outliers, and missing values, and that do not need a specific form of reducing the data feature magnitudes to a common scale.

4. A hybrid approach for real-world applications: Researchers demonstrate the

effectiveness of the proposed algorithm either on benchmarking or on real-world problems, or on a combination of both. The study of the applications of all six categories (Appendix A) gives a clear indication that researchers' interest in demonstrating proposed algorithms on real-world problems is rapidly increasing. Traditionally, in 2000 and

earlier, experimental work for demonstrating the effectiveness of proposed algorithms

was limited to artificial benchmarking problems. The possible reasons were the

unavailability of database sources and user reluctance in using FNN. In the literature

survey, it is noticed that nowadays researchers most often use real-world data to ensure

consistency and compare their results with other popular published known algorithms.

Frequently using the same real-world applications data may cause specific data to

become a benchmark. The same issue has been observed in all categories and especially

in the second category. It is important to avoid causing specific data to become a

benchmark because that data may become unsuitable for practical application in the

future with the passage of time. The role of high dimensional, nonlinear, noisy and

unbalanced big data is changing at a rapid rate. The best practice may be to use a hybrid

approach of well-known data in the field along with new data applications during the

comparative study of algorithms. This may increase and maintain user interest in FNN

over the passage of time.


5. Hidden units’ analytical calculation: The successful application of learning and

regularization algorithms on complex high dimensional big data application areas,

involving more than 50,000 instances, to improve the convergence of deep neural

networks is noteworthy. The maximum error reduction property of hidden units in deep

neural networks is dependent on connection weights and activation function

determination. More research work is needed to focus on hidden units’ analytical

calculations to avoid the trial and error approach. Calculating the optimal number of

hidden units and connection weights for single or multiple hidden layers in the network,

such that no linear dependence exists may help in achieving better and stable

generalization performance. Similarly, future research work may give clear direction to

users about the application of different activation functions for enhancing business

decisions.

It should be noted that two other research gaps, named "connection weight initialization" and "data structure", were also identified in Chapter 2 (Section 2.3). This work has addressed both research gaps to some extent; however, some further improvements are needed to improve FNNs:

1. Connection weight initialization: In Chapter 2, we explain that existing NNs can be

further strengthened by calculating all connection weights (including input connection

and output connection) analytically to generate hidden units, explaining maximum

variance in the dataset. This may also help to further improve the generalization and

learning speed by compacting the size of the network. From work in Chapter 4, we

concluded that generalization and learning speed have improved; however, the network size is not compact. How to improve FNNs to make the network compact and avoid overfitting needs further attention in the future.


2. Data structure: In Chapter 2, we explain that few attempts have been made to study the

effect of data size and features on FNN algorithms. Future research on designing an

algorithm that can approximate the problem task equally, regardless of data size

(instance) and shape (features), will be a breakthrough achievement. From work in

Chapter 5, we concluded that the difference in improvement for the CPCLS is 22.22%, compared to 37.66% for the BPNN, when changing the data from high-dimensional to chunks. This implies that the CPCLS estimation capability with high-dimensional data is much better than that of the traditional BPNN. The difference in improvement should be on the lower side: a lower difference means that the algorithm performs approximately the same on high-dimensional data and on smaller chunks. The CPCLS difference is lower; however, future research work is needed to further improve FNN algorithms so that they may give a much lower difference.


Appendix A. Neural Networks Learning Algorithms and

Optimization Techniques Applications

A.1 Application of gradient learning algorithms

The gradient learning algorithms have gained much attention from researchers compared to

traditional statistical techniques. Gradient learning algorithms help to make a more informed

decision from the available information. Table A.1 highlights some of the applications of

gradient learning algorithms that researchers used to demonstrate the effectiveness of

algorithms during the comparative study. Before the year 2000, the range of real-world

applications appears to be on the low side. The gradient learning algorithms are among the

early attempts that researchers investigated to build FNNs. In this early phase, the unavailability of public real-world application data sources and limited research interest might have forced researchers to rely heavily on artificial benchmarking data.

The successful application of gradient learning algorithms is dependent on user expertise in

deciding and adjusting the hyperparameters correctly. The major concern of researchers and

users in gradient learning algorithms is to find a method for the network to converge faster. It

is believed that SGD helps to achieve generalization performance many times faster than for

batch learning. Wilson and Martinez (2003) demonstrated that SGD was able to achieve the

required accuracy, on average, 20 times faster compared to batch learning during classifying

real-world problems such as credit card requests, patient diabetes, flower species, beverages

types, religions in the country, crime, voters, and various health diseases. Similarly, while

dealing with problems having more than 1000 instances such as satellite images, shuttle

controls and displaying seven-segment digits, the SGD achieved the required accuracy, on


average, 70 times faster than in batch learning. Increasing the data size further reduces the

speed of batch learning and may take more than 300 times longer compared to SGD for

problems having instances greater than 10,000.
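As a minimal illustration of the difference discussed above (not taken from the cited studies; the data and learning rate are hypothetical), stochastic gradient descent updates the weights after every training sample, whereas batch learning accumulates the gradient over the whole dataset before each update:

# Illustrative contrast of stochastic (per-sample) and batch gradient descent
# for a linear least-squares model; data and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([0.5, -2.0, 1.0]) + 0.1 * rng.normal(size=1000)
lr = 0.01

def sgd_epoch(w, X, y, lr):
    # one weight update per training sample
    for xi, yi in zip(X, y):
        w = w - lr * 2.0 * (xi @ w - yi) * xi
    return w

def batch_epoch(w, X, y, lr):
    # a single weight update from the gradient over the full dataset
    return w - lr * 2.0 * X.T @ (X @ w - y) / len(X)

w_sgd = sgd_epoch(np.zeros(3), X, y, lr)
w_batch = batch_epoch(np.zeros(3), X, y, lr)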

The issue with first-order gradient learning is its slow learning ability, and its usage in a fixed topology neural network can make the task more time-consuming. Deciding on too many hidden units in the network may decrease the learning speed and cause the network to become slow and unstable.

Table A.1. Applications of gradient learning algorithms

Application Description

Hepatitis Predicting whether the patient will survive or die suffering from

hepatitis (Wilson and Martinez, 2003)

Animals Classifying animals into seven classes based on their physical

characteristics (Wilson and Martinez, 2003)

Flowers species Classifying the flowers into different species from available

information on the width and length of petals and sepals (Wilson and

Martinez, 2003)

Beverages Identifying the type of beverages in term of its physical and chemical

characteristics (Wilson and Martinez, 2003)

Country religion Predicting the religion of the countries from the information such as

population size and their flag colours (Wilson and Martinez, 2003)

Object detection Predicting whether object is rock or mine from the signal information

obtained from various sensors (Wilson and Martinez, 2003)

Crime Identifying of glass type used in crime scene based on chemical oxide

content such as sodium, potassium, calcium, iron and many others

(Wilson and Martinez, 2003)

Voters Classifying voters based on their education, crime, immigration, tax

payers and many others (Wilson and Martinez, 2003)

Heart diseases Diagnosing and categorizing the presence of heart diseases in a patient

by studying the previous history of drug addiction, health issues, blood

tests, and many others (Wilson and Martinez, 2003)

Liver disorder Diagnosing alcohol-related liver disorder based on the reports of

various blood tests (Wilson and Martinez, 2003)

Earth atmosphere Determining the strength of ions and free electrons on the layer of earth

atmosphere (Wilson and Martinez, 2003)

Outdoor objects

segmentation

Segmenting the outdoor images into many different classes such as

window, path, sky and many others (Wilson and Martinez, 2003)

Vowel recognition Recognizing vowel of different or same languages in the speech mode

(Wilson and Martinez, 2003)


Using second-order gradient learning algorithms overcomes the first-order limitation but is constrained by the memory requirement. Setiono and Hui (1995) demonstrated that using second-order learning algorithms, such as quasi NM along with a constructive neural network, limits the growth of hidden units and helps achieve the required accuracy in less time. Their work on breast cancer problems improved the prediction accuracy rate by 2.92% to 3.15%.

Similar to GD in popularity, another widely used learning algorithm in many application areas, embedded in many simulation packages for training networks, is LM. The learning of LM is considered to be, on average, 16-136 times faster than another second-order method, CG (Hagan and Menhaj, 1994), but its applicability is limited to the least squares loss function and fixed topology neural networks. Hunter et al. (2012) explained that the second-order NBN can be applied to constructive neural networks as an alternative, with performance identical to LM.

Table A.1 (continued)

Application Description

Breast cancer Diagnosing breast cancer as a malignant or benign based on the feature

extracted from the cell nucleus (Setiono and Hui, 1995; Wilson and

Martinez, 2003)

Credit card Deciding to approve or reject credit card request based on the available

information such as credit score, income level, gender age, sex, and

many others (Wilson and Martinez, 2003)

Diabetes Diagnosing whether the patient has diabetes based on certain

diagnostic measurements (Wilson and Martinez, 2003)

Silhouette vehicle

images

classification

Classifying image into different types of vehicle based on the feature

extracted from the silhouette (Wilson and Martinez, 2003)

Seven segment

display

Predicting number one to nine in seven segmental display (Wilson and

Martinez, 2003)

Mushroom Differentiating poisonous and non-poisonous mushroom based on the

mushroom different physical characteristics (Wilson and Martinez,

2003)

Shuttle Deciding the type of control suitable for the shuttle during an auto

landing rather than manual control (Wilson and Martinez, 2003)

English letters Identifying black and white image as one of the English letters among

twenty-six capital letters (Wilson and Martinez, 2003)


A.2 Application of gradient-free learning algorithms

Many authors have successfully demonstrated the effectiveness of gradient-free learning

algorithms in a wide range of applications, consisting of a variety of applications in the

management, engineering, and health sciences domains. The notable applications include, but are not limited to, supply chains and logistics, financial analysis, marketing

and sales, management information systems, decision support systems, product and process

improvements, manufacturing cost reduction, business improvements, and health services.

Table A.2 illustrates the range of applications of gradient-free learning algorithms. In the

literature, gradient-free learning algorithms are mostly compared with gradient learning

algorithms by considering the application areas shown in Table A.2. The gradient-free learning

algorithms are relatively new compared to the gradient learning algorithms and are

continuously gaining the attention of researchers.

The gradient learning algorithms face the problem of local minima, which can reduce generalization performance, and the iterative tuning of connection weights may cause the learning to be more time-consuming. Gradient-free learning algorithms are considered to have faster convergence with more stable results on many application problems. For instance, the application of a gradient-free learning algorithm (e.g. ELM) to various problems, such as predicting stock prices, house prices, automobile prices, species age, cancer, diabetes, drug compounds, aircraft ailerons and elevators, and adult income, found that generalization performance improves by, on average, 0.12 times, with learning speed, on average, 20 times faster than gradient learning algorithms. Moreover, on large complex problems such as predicting soil types, segmenting objects, and shuttle control, the ELM improved accuracy by, on average, 6% and achieved prediction a thousand times faster than the gradient learning algorithms.
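The basic ELM scheme referred to above can be sketched as follows (an illustrative implementation, not the code used in the cited studies; the function names and hyperparameters are hypothetical): the hidden-layer input weights are generated randomly, and only the output weights are computed analytically by least squares.

# Minimal sketch of a basic extreme learning machine (ELM) regressor:
# random hidden-layer weights, analytic (least-squares) output weights.
import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))     # random input weights
    b = rng.normal(size=n_hidden)                   # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # sigmoid hidden activations
    beta = np.linalg.pinv(H) @ y                    # analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta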


Later, several variants were proposed to improve ELM, and most of them are discussed in Subsection 2.2.3.2. The application of variants such as B-ELM to the problems of measuring telemarketing calls, computer performance, car fuel consumption, concrete strength, and beverage quality

Table A.2. Applications of gradient free learning algorithms

Application Description

Boston house price Estimating the price of houses based on the availability of clean quality

air (Feng et al., 2009; Han et al., 2017; Huang et al., 2012; Huang,

Chen, et al., 2006; Huang and Chen, 2007, 2008; Ying, 2016)

California house

price

Predicting the house prices based on geographical location and

infrastructure of the house (Han et al., 2017; Huang, Chen, et al., 2006;

Huang, Zhu, et al., 2006; Huang and Chen, 2007, 2008; Liang et al.,

2006; Ying, 2016)

Species Determining the age of species from their known physical

measurements (Feng et al., 2009; Han et al., 2017; Huang et al., 2012;

Huang, Chen, et al., 2006; Huang, Zhu, et al., 2006; Huang and Chen,

2007, 2008; Liang et al., 2006; Ying, 2016; Zong et al., 2013)

Aircrafts ailerons Controlling the ailerons of a fighter aircrafts (Han et al., 2017; Huang,

Chen, et al., 2006; Huang, Zhu, et al., 2006; Huang and Chen, 2007,

2008)

Aircrafts elevators Controlling the elevators of a fighter aircrafts (Han et al., 2017; Huang,

Chen, et al., 2006; Huang, Zhu, et al., 2006; Huang and Chen, 2007,

2008)

Computers system

activity

Measuring the portion of time that central processing units is running

in user mode, system mode, waiting mode and idle mode from the

collection of computers systems activity (Huang, Zhu, et al., 2006;

Huang and Chen, 2008)

House prices in

specific region

Determining the median prices of houses based on prices in the region

and demographic information. (Han et al., 2017; Huang, Chen, et al.,

2006; Huang, Zhu, et al., 2006; Huang and Chen, 2007, 2008; Ying,

2016)

Adult income Determining the income of adult based on demographic information

(Zong et al., 2013)

Automobile price Determining the prices of automobile based on various auto

specifications, the degree to which auto is risky than price, and an

average loss per auto per year (Feng et al., 2009; Huang et al., 2012;

Huang, Chen, et al., 2006; Huang, Zhu, et al., 2006; Huang and Chen,

2007, 2008; Ying, 2016)

Cars fuel

consumption

Determining the fuel consumption of cars in terms of engine

specification and car characteristics (Liang et al., 2006; Yang et al.,

2012)


showed an improvement in learning speed of, on average, 34, 4, and 145 times compared to I-ELM, EM-ELM, and EI-ELM, respectively. Similarly, the application of another ELM variant named DAOI-ELM

Table A.2 (continued)

Drug compound Designing modern drug by predicting whether the compound is active

or inactive to the binding target (Huang, Zhu, et al., 2006; Ying, 2016)

Computer machine Estimating the relative performance of a computer central processing

unit considering the memory and channels requirements (Feng et al.,

2009; Han et al., 2017; Huang, Chen, et al., 2006; Huang, Zhu, et al.,

2006; Huang and Chen, 2007, 2008; Yang et al., 2012; Ying, 2016)

Servomechanism

rise time

Estimating servomechanism rise time in term of two choices of

mechanical linkage and two gain setting (Huang, Zhu, et al., 2006;

Huang and Chen, 2008; Ying, 2016)

Breast cancer Diagnosing breast cancer as a malignant or benign based on the feature

extracted from the cell nucleus (Huang, Zhu, et al., 2006; Ying, 2016;

Zong et al., 2013)

Telemarketing Measuring the accomplishment of telemarketing calls for marketing

bank long term deposits (Huang, Zhu, et al., 2006; Huang and Chen,

2008; Yang et al., 2012)

Stock price Discovering the stock price trend of the company based on information

generated by similar competitive companies (Huang, Zhu, et al., 2006;

Ying, 2016)

Diabetes Diagnosing whether the patient has diabetes based on certain

diagnostic measurements (Cao et al., 2016; Huang, Zhu, et al., 2006;

Huang and Chen, 2008; Tang et al., 2016; Zong et al., 2013)

Soil classification Classifying image according to a different type of soil such as grey

soil, vegetation soil, red soil and many others based on a database

consisting of the multi-spectral images (Feng et al., 2009; Huang et al.,

2012; Huang, Zhu, et al., 2006; Liang et al., 2006; Tang et al., 2016;

Ying, 2016; Zong et al., 2013)

Outdoor objects

segmentation

Segmenting the outdoor images into many different classes such as

window, path, sky and many others (Feng et al., 2009; Huang et al.,

2012; Huang, Zhu, et al., 2006; Liang et al., 2006; Ying, 2016)

Shuttle Deciding the type of control suitable for the shuttle during an auto

landing rather than manual control (Huang et al., 2012; Huang, Zhu, et

al., 2006; Zong et al., 2013)

Clustering Clustering the dataset into different classes based on available target

vector (Huang, Zhu, et al., 2006; Zong et al., 2013)

Credit card Deciding to approve or reject credit card request based on the available

information such as credit score, income level, gender age, sex, and

many others (Tang et al., 2016)

Liver disorder Diagnosing alcohol-related liver disorder based on the reports of

various blood tests (Tang et al., 2016)

Cancer Classification of the leukaemia cancer as acute lymphoblast leukaemia

or acute myeloid leukaemia (Tang et al., 2016; Zong et al., 2013)


in studying the stock market, forest burning, concrete strength, and beverage quality achieved more stable and smooth results compared to B-ELM and the fluctuating results of I-ELM and

Table A.2 (continued)

Gene expression

level

Analysing the gene correlation expression level in different tissues of

the tumor colon and normal colon (Tang et al., 2016; Zong et al., 2013)

Object

discrimination

Discriminating stars from galaxy using broadband photometric

information (Huang et al., 2012)

Mushroom Differentiating poisonous and non-poisonous mushroom based on

mushroom different physical characteristics (Tang et al., 2016)

Flowers species Classifying the flowers into different species from available

information on the width and length of petals and sepals (Huang et al.,

2012; Tang et al., 2016; Wang et al., 2016; Zong et al., 2013)

Crime Identifying glass type used in crime scene based on chemical oxide

content such as sodium, potassium, calcium, iron and many others

(Cao et al., 2016; Huang et al., 2012; Tang et al., 2016; Ying, 2016;

Zong et al., 2013)

DNA splicing Recognizing exon/intron and intron/exon boundaries in the DNA

splicing (Huang et al., 2012; Liang et al., 2006; Tang et al., 2016; Zong

et al., 2013)

Industrial strike

volume

Estimating the industrial strike volume for the next fiscal year

considering key factors such as unemployment, inflation and labor

unions (Huang et al., 2012)

Weather

forecasting

Forecasting weather in terms of cloud appearance (Huang et al., 2012)

Dihydrofolate

reductase

inhibition

Predicting the inhibition of dihydrofolate reductase by pyrimidines

(Huang et al., 2012; Huang and Chen, 2008)

Human body fats Determining the percentage of human body fats from key physical

factors such as weight, age, chest size and other body parts

circumference (Huang et al., 2012)

Heart diseases Diagnosing and categorizing the presence of heart diseases in a patient

by studying the previous history of drug addiction, health issues, blood

tests, and many others (Huang et al., 2012)

Mental disorder Testing mental behaviour of the patient from inflated balloons (Huang

et al., 2012)

Earthquake

strength

Forecasting the strength of earthquake given its latitude, longitude and

focal point (Huang et al., 2012)

Presidential

election

Estimating the proportion of voter in the presidential election based on

key factors such as education, age, and income (Huang et al., 2012)

Robot end effector Determining the distance of robot end effector from a target based on

the robot positions and angles (Huang and Chen, 2008)

Concrete strength Determining slump, flow and compressive strength of the concrete

from influencing ingredients such as cement, water, ash, and many

others (Yang et al., 2012; Zou et al., 2018)


Beverages quality Determining quality of same class of the beverages based on relevant

ingredients (Tang et al., 2016; Wang et al., 2016; Yang et al., 2012;

Zou et al., 2018)

Industrial fault

diagnosis

Diagnosing fault of the industrial systems such as Tennessee-Eastman

Process (Zou et al., 2018)

Heating,

ventilation and air-

conditioning

Determining the heating load and cooling load of the residential

building by considering the design layout of the walls, rooms, and

surface (Zou et al., 2018)

Forest burned area Predicting the burned area of the forest considering various

environmental and weather conditions (Zou et al., 2018)

Stock exchange

market

Studying relationship of the 100-index stock exchange market with

other international stock market indices (Zou et al., 2018)

Protein localization Predicting protein localization by studying the cell membranes

characteristics (Zong et al., 2013)

Patient disease Diagnosing whether the patient is suffering from hypothyroidism or

hyperthyroidism (Zong et al., 2013)

Page block

segmentation

Segmenting the type of page block as text, horizontal line, graphics,

vertical line or picture (Zong et al., 2013)

Breast cancer Studying the effect of breast cancer by predicting that the patient will

survive less or more than five years (Zong et al., 2013)

Handwritten

images

classification

Classifying images of the handwritten digits (Kasun et al., 2013; Tang

et al., 2016)

Object

identification

Detecting whether the object is a car or not from its side view (Tang et

al., 2016)

Hand gestures Extracting useful information from the hand gesture (Tang et al., 2016)

Appearance

changes

Modelling and tracking of appearance changes such as pose variation,

shape deformation, illumination change, camera motion, and many

others (Tang et al., 2016)

Energy particles Classifying energy particles either as gamma or hadron (Cao et al.,

2016)

Human faces

gestures

Recognizing human faces gestures such as head pose, facial

expression, eyes state, and many others (Cao et al., 2016)

Beverages Identifying the type of beverages in term of its physical and chemical

characteristics (Huang et al., 2012)

Vowel recognition Recognizing vowel of different or same languages in the speech mode

(Huang et al., 2012; Tang et al., 2016; Zong et al., 2013)

Silhouette vehicle

images

classification

Classifying image into different types of vehicle based on the feature

extracted from the silhouette (Huang et al., 2012; Ying, 2016; Zong et

al., 2013)

English letters Identifying black and white image as one of the English letters among twenty-six capital letters (Feng et al., 2009; Huang et al., 2012; Tang et al., 2016; Ying, 2016)

Handwritten text recognition Recognizing isolated, touching, overlapping and cursive handwritten text from digital images of the city, states, Zip codes, and alphanumeric characters (Huang et al., 2012; Tang et al., 2016; Zong et al., 2013)

Basketball winning Predicting basketball winning team based on players, team formation and actions information (Huang et al., 2012)

In studying the stock market, forest burning, concrete strength, and beverage quality, DAOI-ELM achieved more stable and smoother results compared to the B-ELM and to the fluctuating results of I-ELM and OI-ELM. More work on fault diagnosis of the Tennessee-Eastman Process (TEP) demonstrates that DAOI-ELM improved the accuracy by 1.38%-4.54% for class-2 faults, 1.12%-6.39% for class-4 faults, and 4.36%-5.47% for class-8 faults compared to BP, I-ELM, CI-ELM, and OI-ELM. These DAOI-ELM applications show that it obtains better generalization performance with a compact architecture; however, the learning speed of DAOI-ELM is not clearly illustrated in their work. The ELM and many of its variants are advantageous in that they randomly generate hidden units and analytically calculate the output connection weights, which makes them simple and easy to train. However, randomly generating hidden units may cause the network size to grow very large, which increases the chances of overfitting.
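To make this mechanism concrete, the following is a minimal illustrative sketch of the basic ELM idea in Python (not the implementation used in the cited studies): the hidden-layer weights are drawn at random and only the output connection weights are solved analytically by least squares.

import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    # Hidden-layer weights and biases are generated randomly and never tuned.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                 # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y           # analytic least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: fit a noisy sine curve and report the training error.
X = np.linspace(0, np.pi, 200).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.05 * np.random.default_rng(1).normal(size=200)
W, b, beta = elm_train(X, y)
print(np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2)))

The sketch also illustrates the drawback noted above: the hidden-layer size is chosen by hand, and making it large enough to fit the data can quickly inflate the network and invite overfitting.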

The application of semi-gradient and iterative tuning algorithms such as IFNNRWs demonstrated their effectiveness on classification problems of identifying crime objects, energy particles, human face gestures, and diabetes, but these algorithms cannot approximate regression problems well. Another limitation is the increase in the training time of IFNNRWs due to the repetitive tuning of the output connection weights.

A.3 Application of optimization algorithms for learning rate

The learning algorithms are found to have applications in complex problem domains. High-dimensional data with many features may make a task difficult to solve with an initial guess and manual setting of parameters such as the learning rate. Choosing a large learning rate may make the network unstable and cause poor generalization performance, whereas a small learning rate may reduce the learning speed. The algorithms in this


category help to adjust the learning rate on each iteration and move the gradients faster along the long axis of the valley for better performance. Table A.3 contains some of the applications to complex problem-solving tasks. AdaGrad solved the problem of newspaper article classification by improving the performance of the various categories, on average, by 9%-109%. Similarly, for the subcategory image classification problem, AdaGrad was able to improve the precision by 2.09%-9.8%. The graphical representation of the AdaDelta results showed that it attained the highest accuracy on speech recognition of English data. The application of Adam to multiclass logistic regression, multilayer neural networks, and convolutional neural networks for the classification of handwritten images, object images, and movie reviews graphically showed that Adam achieves better generalization performance with faster convergence per iteration.

Table A.3. Applications of optimization algorithms for learning rate

Application Description

Game decision

rule

Predicting decision rules for the Nine Men’s Morris game (Riedmiller

and Braun, 1993)

Census Predicting whether the individual has income above or below the

average income based on certain demographic and employment-related

information (Duchi et al., 2011)

Newspaper articles Classifying the newspapers articles into four major categories such as

economics, commerce, medical, and government with multiple more

specific categories (Duchi et al., 2011)

Subcategories

image

classification

Classifying thousands of images in each of their individual subcategory

(Duchi et al., 2011)

Handwritten

images

classification

Classifying images of the handwritten digits (Duchi et al., 2011;

Kingma and Ba, 2014; Zeiler, 2012)

English data

recognition

Recognizing speech from several hundred hours of the US English data

collected from voice IME, voice search and read data (Zeiler, 2012)

Movie reviews Classifying review of the movies either positive or negative to know

the sentiment of the reviewers (Kingma and Ba, 2014)

Digital images

classification

Classifying the images into one of a category such as an airplane, deer,

ship, frog, horse, truck, cat, dog, bird and automobile (Kingma and Ba,

2014)


The above problems illustrate that optimization algorithms for the learning rate are most suitable for high-dimensional, complex datasets, where they avoid manual adjustment of the learning rate.
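As an illustration of how such methods remove the manual learning-rate choice, a minimal AdaGrad-style update is sketched below in Python (the variable names and the toy objective are assumptions for this example, not taken from the cited works): each parameter accumulates its squared gradients, and its effective step size shrinks accordingly.

import numpy as np

def adagrad_step(params, grads, cache, base_lr=1.0, eps=1e-8):
    # Accumulate squared gradients; parameters with large past gradients get smaller steps.
    cache += grads ** 2
    params -= base_lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Toy usage: minimise f(w) = (w - 3)^2 without hand-tuning a learning-rate schedule.
w, cache = np.array([0.0]), np.zeros(1)
for _ in range(500):
    grad = 2.0 * (w - 3.0)
    w, cache = adagrad_step(w, grad, cache)
print(w)   # moves steadily towards the minimum at 3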

A.4 Applications of bias and variance minimization algorithms

The algorithms in this category improve the generalization performance of the network and are applied to problems that cannot be solved well by a simple network, where more sophisticated knowledge is needed to avoid overfitting and gain better performance. The regularization, pruning, and ensemble methods are implemented by training many networks simultaneously to obtain the best average results. The limitation of this category is that it requires more learning time than standard networks to converge. This limitation can be reduced by adopting a hybrid approach that combines the learning rate optimization algorithms with the methods in this category. Besides, a limitation of regularization techniques such as dropout, DropConnect, and shakeout is that they are mainly suitable for fully connected neural networks.
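As a simple illustration of the ensemble idea referred to above, the sketch below (Python; the random-feature base learner and the bootstrap scheme are assumptions made for illustration, not the CNNE algorithm discussed later) trains several small networks on resampled data and averages their predictions, which reduces the variance of the combined model.

import numpy as np

rng = np.random.default_rng(0)

def train_member(X, y, n_hidden=20):
    # One small random-feature regressor standing in for any base network.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    beta = np.linalg.pinv(np.tanh(X @ W + b)) @ y
    return lambda Xq: np.tanh(Xq @ W + b) @ beta

def ensemble_predict(members, X):
    # Averaging independently trained members lowers the variance term of the error.
    return np.mean([m(X) for m in members], axis=0)

# Toy usage: each member is trained on a bootstrap resample, then predictions are averaged.
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(6 * X).ravel() + 0.1 * rng.normal(size=100)
members = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample of the training set
    members.append(train_member(X[idx], y[idx]))
print(ensemble_predict(members, X[:3]))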

Table A.4 shows some of the applications in areas such as classification, recognition, segmentation, and forecasting. Shakeout, applied to the classification of high-dimensional digital images and complex handwritten digit images comprising more than 50,000 images, improved the accuracy of fully connected neural networks by about 0.95%-2.85% and 1.63%-4.55%, respectively. For classifying more than 50,000 house number images collected from Google Street View, DropConnect slightly enhanced the accuracy, by 0.02%-0.03%. Similarly, DropConnect applied to 3D object recognition achieved a 0.34% better accuracy. Shakeout and DropConnect show promising results on highly complex data; however, the popularity and success of the dropout technique across many applications is comparatively high. Some applications of dropout include classifying species images into their classes and subclasses, recognizing speech in different dialects, classifying newspaper articles into different categories, detecting human diseases by predicting alternative splicing, and classifying hundreds of thousands of images into different categories. Dropout, because of its property of dropping hidden units to avoid overfitting, achieved on average more than 5% better accuracy for the mentioned applications.
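For clarity, a minimal sketch of the inverted-dropout mask that underlies this behaviour is given below in Python (illustrative only, not the configuration used in the cited experiments): during training each hidden unit is dropped with probability drop_prob and the survivors are rescaled, so no change is needed at test time.

import numpy as np

def dropout_forward(h, drop_prob=0.5, training=True, seed=0):
    # Randomly zero hidden activations during training and rescale the survivors.
    if not training or drop_prob == 0.0:
        return h
    keep_prob = 1.0 - drop_prob
    mask = np.random.default_rng(seed).random(h.shape) < keep_prob
    return h * mask / keep_prob

# Toy usage: roughly half of the units in a batch of activations are silenced.
h = np.ones((2, 6))
print(dropout_forward(h))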


Table A.4. Applications of bias and variance minimization algorithms

Application Description

Handwritten

images

classification

Classifying the images of hand-written numerals with 20% flip rate

(Geman et al., 1992)

Text to speech

conversion

Learning to convert English text to speech (Krogh and Hertz, 1992)

Credit card Deciding to approve or reject credit card request based on the available

information such as credit score, income level, gender age, sex, and

many others (Islam et al., 2003)

Breast cancer Diagnosing breast cancer as a malignant or benign based on the feature

extracted from the cell nucleus (Islam et al., 2003)

Diabetes Diagnosing whether the patient has diabetes based on certain

diagnostic measurements (Islam et al., 2003)

Crime Identifying glass type used in crime scene based on chemical oxide

content such as sodium, potassium, calcium, iron and many others

(Islam et al., 2003)

Heart diseases Diagnosing and categorizing the presence of heart diseases in a patient

by studying the previous history of drug addiction, health issues, blood

tests, and many others (Islam et al., 2003)

English letters Identifying black and white image as one of the English letters among

twenty-six capital letters (Islam et al., 2003)

Soybean defects Determining the type of defect in soybean based on physical

characteristics of the plant (Islam et al., 2003)

Energy

consumption

Predicting the hourly consumption of building electricity and

associated cost based on environmental and weather conditions (Liu et

al., 2008)

Robot arm

acceleration

Estimating the angular acceleration of the robot arm based on a

position, velocity, and torque (Liu et al., 2008)

Semiconductor

manufacturing

Analysing semiconductor manufacturing by examining the number of

dies in a wafer that pass electrical tests (Seni and Elder, 2010)

Credit defaulter Predicting credit defaulters based on the credit score information (Seni

and Elder, 2010)

3D object recognition Recognizing 3D objects by classifying the images into generic categories (Wan et al., 2013)

Google street images classification Classifying the images containing information of house numbers collected by Google Street View (Srivastava et al., 2014; Wan et al., 2013)

Class and superclass Classifying the images into a class (for example, shark) and their superclass (for example, fish) (Srivastava et al., 2014)

Speech recognition Recognizing speech from different dialects of the American language (Srivastava et al., 2014)

Newspaper articles Classifying the newspaper articles into major categories such as finance, crime, and many others (Srivastava et al., 2014)

Ribonucleic acid Understanding human disease by predicting alternative splicing based on ribonucleic acid features (Srivastava et al., 2014)

Subcategories image classification Classifying thousands of images in each of their individual subcategory (Han et al., 2015; Ioffe and Szegedy, 2015; Srivastava et al., 2014)

Handwritten digits classification Classifying images of the handwritten digits (Han et al., 2015; Kang et al., 2017; Srivastava et al., 2014; Wan et al., 2013)

Digital images classification Classifying the images into one of a category such as an airplane, deer, ship, frog, horse, truck, cat, dog, bird and automobile (Kang et al., 2017; Srivastava et al., 2014; Wan et al., 2013)

The application of the ensemble technique CNNE to various problems, such as assessing credit card requests, diagnosing breast cancer, diagnosing diabetes, identifying objects used in conducting crime, categorizing heart diseases, identifying images as English letters, and determining the types of defects in soybean, was on average 1-3 times better in generalization performance compared to other popular ensemble techniques. Some other problems that have gained popularity in this category are predicting energy consumption, estimating the angular acceleration of a robot arm, analysing semiconductor manufacturing by examining the number of dies, and predicting credit defaulters.

A.5 Application of constructive FNN

The benefit of constructive FNNs over fixed-topology FNNs, which makes them more favorable, is the addition of hidden units to each hidden layer until the error converges. This eliminates the problem of finding an optimal network for a fixed-topology FNN through extensive experimental work. The learning speed of constructive algorithms is better than that of fixed-topology networks,


however, the generalization performance is not always guaranteed to be optimal. Table A.5

shows some of the applications of constructive FNN.
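To illustrate the constructive principle described above, the sketch below (Python) grows the hidden layer one randomly generated unit at a time in the spirit of incremental ELM-type algorithms; it is a generic illustration under these assumptions, not the CCOEN, OLSCN, or FCNN procedures discussed in this section.

import numpy as np

def constructive_fit(X, y, max_hidden=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    residual = y.astype(float).copy()
    units = []                                   # each unit is a (w, b, beta) triple
    for _ in range(max_hidden):
        w, b = rng.normal(size=X.shape[1]), rng.normal()
        h = np.tanh(X @ w + b)
        beta = (h @ residual) / (h @ h)          # least-squares fit of this unit to the residual
        new_residual = residual - beta * h
        if np.linalg.norm(new_residual) >= np.linalg.norm(residual):
            continue                             # unit does not reduce the error; try another
        units.append((w, b, beta))
        residual = new_residual
        if np.linalg.norm(residual) / np.sqrt(len(y)) < tol:
            break                                # error has converged; stop adding units
    return units

def constructive_predict(X, units):
    return sum(beta * np.tanh(X @ w + b) for w, b, beta in units)

The network size is therefore decided by the data rather than fixed in advance, which is the property the comparison with fixed-topology FNNs above refers to.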


Table A.5. Applications of constructive algorithms

Application Description

Biological activity Predicting of the biological activity of molecules such as

benzodiazepine derivatives with anti-pentylenetetrazole activity,

antimycin analogues with antifilarial activity and many others from

molecular structure and physicochemical properties (Kovalishyn et

al., 1998)

Communication

channel

Equalizing burst of bits transferred through a communication

channel (Lehtokangas, 2000)

Clinical

electroencephalograms

Classifying artifacts and a normal segment in clinical

electroencephalograms (Schetinin, 2003)

Vowel recognition Recognizing vowel of different or same languages in the speech

mode (Huang, Song, et al., 2012)

Cars fuel consumption Determining the fuel consumption of cars in terms of engine

specification and car characteristics (Huang, Song, et al., 2012)

Beverages quality Determining quality of same class of the beverages based on

relevant ingredients (Huang, Song, et al., 2012; Qiao et al., 2016)

House price Estimating the price of houses based on the availability of clean

quality air (Huang, Song, et al., 2012; Nayyeri et al., 2018)

Species Determining the age of species from their known physical

measurements (Huang, Song, et al., 2012; Nayyeri et al., 2018)

Soil classification Classifying image according to a different type of soil such as grey

soil, vegetation soil, red soil and many others based on a database

consisting of the multi-spectral images (Qiao et al., 2016)

English letters Identifying black and white image as one of the English letters

among twenty-six capital letters (Qiao et al., 2016)

Crime Identifying glass type used in crime scene based on chemical oxide

content such as sodium, potassium, calcium, iron and many others

(Qiao et al., 2016)

Silhouette vehicle

images classification

Classifying image into different types of vehicle based on the

feature extracted from the silhouette (Qiao et al., 2016)

Outdoor objects

segmentation

Segmenting the outdoor images into many different classes such as

window, path, sky and many others (Huang, Song, et al., 2012;

Qiao et al., 2016)

Human body fats Determining the percentage of human body fats from key physical factors such as weight, age, chest size and other body parts circumference (Nayyeri et al., 2018)

Automobile prices Determining the prices of automobiles based on various auto specifications, the degree to which the auto is riskier than its price indicates, and an average loss per auto per year (Nayyeri et al., 2018)

Presidential election Estimating the proportion of voters in the presidential election based on key factors such as education, age, and income (Nayyeri et al., 2018)

Weather forecasting Forecasting weather in terms of cloud appearance (Nayyeri et al., 2018)

Basketball winning Predicting basketball winning team based on players, team formation and actions information (Nayyeri et al., 2018)

Industrial strike volume Estimating the industrial strike volume for the next fiscal year considering key factors such as unemployment, inflation and labor unions (Nayyeri et al., 2018)

Earthquake strength Forecasting the strength of earthquake given its latitude, longitude and focal point (Nayyeri et al., 2018)

Heart diseases Diagnosing and categorizing the presence of heart diseases in a patient by studying the previous history of drug addiction, health issues, blood tests, and many others (Nayyeri et al., 2018)

The application of CCOEN to regression prediction of human body fat, automobile prices, voters, weather, winning teams, strike volumes, earthquake strength, heart diseases, house prices, and species age showed that the algorithm gives generalization performance, on average, four times better in most cases. CCOEN theoretically guarantees that the solution will be the global minimum; however, the improvement in learning speed is not clear. The prediction accuracy and learning speed of FCNN are considered to be 5% better and more than 20 times faster, respectively, on predicting beverage quality, classifying soil into different types, identifying black and white images as English letters, identifying objects used in crime, classifying images into different types of vehicles, and segmenting outdoor images. The problems of vowel recognition and car fuel consumption are considered to be better solved by OLSCN: vowel recognition accuracy improved by 8.61%, and car fuel consumption was estimated with a 0.15-times improvement in performance. Some other problems demonstrating the applications of this category include predicting molecular biological activities, equalizing bursts of bits, and classifying artifacts and normal segments in clinical electroencephalograms.


A.6 Application of metaheuristic search algorithms

The basic purpose of metaheuristic search algorithms is to be used in applications with incomplete gradient information. However, their successful application to many problems has attracted the attention of researchers, and at present their use is not limited to incomplete gradient information but extends to other problems of searching for the best hyperparameters that can converge to a global minimum. Table A.6 shows some of the applications researchers have used to demonstrate the effectiveness of metaheuristic search algorithms. Mohamad et al. (2017) recommended GANN as a reliable technique for solving complex problems in the field of excavatability. Their work on predicting ripping production, from experimentally collected inputs such as the weather zone, joint spacing, point load strength index, and sonic velocity, shows that GANN achieved generalization performance 0.42 times better than the FNN.

Table A.6. Applications of metaheuristic search algorithms

Application Description

Ripping

production

Predicting ripping production, used as an alternative to blasting for

ground loosening and breaking in mining and civil engineering, from

experimental collected input such as weather zone, joint spacing, point

load strength index and sonic velocity (Mohamad et al., 2017)

Crime Identifying glass type used in crime scene based on chemical oxide

content such as sodium, potassium, calcium, iron and many others

(Ding et al., 2011)

Flowers species Classifying the flowers into different species from available

information on the width and length of petals and sepals (Ding et al.,

2011)

River width Estimating the width of a river, to minimize erosion and deposition,

from fluid discharge rate, bed sediments of the river, shields parameter

and many other (Shaghaghi et al., 2017)

Beverages Identifying the type of beverages in term of its physical and chemical

characteristics (Ding et al., 2011)

Silhouette vehicle

images

classification

Classifying image into different types of vehicle based on the feature

extracted from the silhouette (Ding et al., 2011)


Shaghaghi et al. (2017) applied a GA-optimized neural network to estimate the width of a river and found a 0.40 times better generalization performance. Besides, it was reported that neural networks optimized by GA are more efficient than those optimized by PSO. Ding et al. (2011), working on predicting flower species and beverage types, identifying objects used in crime, and classifying images, showed that the hybrid approach of GA and BP achieved a prediction accuracy on average 2%-3% better than GA or BP alone. The study further explained that the accuracy of BP on the above problems is better than that of GA. Furthermore, it was concluded that GA needs more learning time than BP, and this learning speed can be improved to some extent by using the hybrid approach. The hybrid approach achieved a learning speed 0.05 times better than GA but was still slower than BP.
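For illustration, a minimal genetic-algorithm hyperparameter search is sketched below in Python. The candidate encoding (hidden-layer size and learning rate), the toy fitness function, and all parameter values are assumptions made for this example; they are not the GANN or GA-optimized settings used in the cited studies.

import random

def ga_search(fitness, bounds, pop_size=20, generations=30, mut_rate=0.2):
    # Candidates are (hidden_units, learning_rate) pairs; selection, crossover and
    # mutation search for the pair that maximises the supplied fitness function.
    (h_lo, h_hi), (lr_lo, lr_hi) = bounds
    pop = [(random.randint(h_lo, h_hi), random.uniform(lr_lo, lr_hi)) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]   # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = [a[0], b[1]]                                            # one-point crossover
            if random.random() < mut_rate:
                child[0] = random.randint(h_lo, h_hi)                       # mutate hidden size
            if random.random() < mut_rate:
                child[1] = random.uniform(lr_lo, lr_hi)                     # mutate learning rate
            children.append(tuple(child))
        pop = parents + children
    return max(pop, key=fitness)

# Toy usage: the fitness stands in for the validation accuracy of a trained network.
best = ga_search(lambda c: -(c[0] - 32) ** 2 - 100 * (c[1] - 0.01) ** 2,
                 bounds=[(4, 128), (1e-4, 1e-1)])
print(best)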


References

Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G. and Yu, D. (2014),

“Convolutional neural networks for speech recognition”, IEEE/ACM Transactions on

Audio, Speech, and Language Processing, Vol. 22 No. 10, pp. 1533–1545.

Abdelghany, K., Abdelghany, A. and Raina, S. (2005), “A model for the airlines’ fuel

management strategies”, Journal of Air Transport Management, Vol. 11 No. 4, pp. 199–

206.

Al_Janabi, S., Al_Shourbaji, I. and Salman, M.A. (2018), “Assessing the suitability of soft

computing approaches for forest fires prediction”, Applied Computing and Informatics,

Elsevier, Vol. 14 No. 2, pp. 214–224.

Au, K.F., Choi, T.M. and Yu, Y. (2008), “Fashion retail forecasting by evolutionary neural

networks”, International Journal of Production Economics, Elsevier, Vol. 114 No. 2, pp.

615–630.

Babaee, M., Dinh, D.T. and Rigoll, G. (2018), “A deep convolutional neural network for video

sequence background subtraction”, Pattern Recognition, Elsevier Ltd, Vol. 76, pp. 635–

649.

Bagirov, A., Taheri, S. and Asadi, S. (2019), “A difference of convex optimization algorithm

for piecewise linear regression”, Journal of Industrial & Management Optimization, Vol.

15 No. 2, pp. 909–932.

Baklacioglu, T. (2016), “Modeling the fuel flow-rate of transport aircraft during flight phases

using genetic algorithm-optimized neural networks”, Aerospace Science and Technology,

Elsevier Masson SAS, Vol. 49, pp. 52–62.

Banerjee, P., Singh, V.S., Chatttopadhyay, K., Chandra, P.C. and Singh, B. (2011), “Artificial

neural network model as a potential alternative for groundwater salinity forecasting”,

Journal of Hydrology, Elsevier, Vol. 398 No. 3–4, pp. 212–220.

Bianchini, M. and Scarselli, F. (2014), “On the complexity of neural network classifiers: A

comparison between shallow and deep architectures”, IEEE Transactions on Neural

Networks and Learning Systems, IEEE, Vol. 25 No. 8, pp. 1553–1565.

Bottani, E., Centobelli, P., Gallo, M., Kaviani, M.A., Jain, V. and Murino, T. (2019),

“Modelling wholesale distribution operations: an artificial intelligence framework”,

Industrial Management & Data Systems, Emerald Publishing Limited, Vol. 119 No. 4,

pp. 698–718.

Candanedo, L.M. and Feldheim, V. (2016), “Accurate occupancy detection of an office room

from light, temperature, humidity and CO2 measurements using statistical learning

models”, Energy and Buildings, Elsevier, Vol. 112, pp. 28–39.


Cao, F., Wang, D., Zhu, H. and Wang, Y. (2016), “An iterative learning algorithm for

feedforward neural networks with random weights”, Information Sciences, Elsevier Inc.,

Vol. 328, pp. 546–557.

Cao, W., Wang, X., Ming, Z. and Gao, J. (2018), “A review on neural networks with random

weights”, Neurocomputing, Elsevier, Vol. 275, pp. 278–287.

Chati, Y.S. and Balakrishnan, H. (2017), “Statistical modeling of aircraft engine fuel”, Twelfth

USA/Europe Air Traffic Management Research and Development Seminar (ATM 2017).

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A.L. (2018), “DeepLab:

Semantic image segmentation with deep convolutional nets, atrous convolution, and fully

connected CRFs”, IEEE Transactions on Pattern Analysis and Machine Intelligence,

IEEE, Vol. 40 No. 4, pp. 834–848.

Cheng, Y. (2017), “Backpropagation for fully connected cascade networks”, Neural

Processing Letters, Springer, Vol. 46 No. 1, pp. 293–311.

Choi, T.-M., Cai, Y.-J. and Shen, B. (2018), “Sustainable fashion supply chain management:

A system of systems analysis”, IEEE Transactions on Engineering Management, IEEE,

Vol. 66 No. 4, pp. 730–745.

Choi, T.M., Chiu, C.H. and Chan, H.K. (2016), “Risk management of logistics systems”,

Transportation Research Part E: Logistics and Transportation Review, Elsevier, Vol. 90,

pp. 1–6.

Choi, T.M., Wallace, S.W. and Wang, Y. (2018), “Big data analytics in operations

management”, Production and Operations Management, Wiley Online Library, Vol. 27

No. 10, pp. 1868–1883.

Chung, S.H., Ma, H.L. and Chan, H.K. (2017), “Cascading delay risk of airline workforce

deployments with crew pairing and schedule optimization”, Risk Analysis, Wiley Online

Library, Vol. 37 No. 8, pp. 1443–1458.

Chung, S.H., Tse, Y.K. and Choi, T.M. (2015), “Managing disruption risk in express logistics

via proactive planning”, Industrial Management & Data Systems, Emerald Group

Publishing Limited, Vol. 115 No. 8, pp. 1481–1509.

Collins, B.P. (1982), “Estimation of aircraft fuel consumption”, Journal of Aircraft, Vol. 19

No. 11, pp. 969–975.

Cui, Q. and Li, Y. (2017), “Airline efficiency measures under CNG2020 strategy: An

application of a Dynamic By-production model”, Transportation Research Part A: Policy

and Practice, Elsevier, Vol. 106, pp. 130–143.

Dancila, B.D., Botez, R. and Labour, D. (2013), “Fuel burn prediction algorithm for cruise,

constant speed and level flight segments”, Aeronautical Journal, Vol. 117 No. 1191, pp.


491–504.

Deng, C., Miao, J., Ma, Y., Wei, B. and Feng, Y. (2019), “Reliability analysis of chatter

stability for milling process system with uncertainties based on neural network and fourth

moment method”, International Journal of Production Research, Taylor & Francis, pp.

1–19.

Diao, X. and Chen, C.-H. (2018), “A sequence model for air traffic flow management rerouting

problem”, Transportation Research Part E: Logistics and Transportation Review,

Elsevier, Vol. 110, pp. 15–30.

Ding, S., Su, C. and Yu, J. (2011), “An optimizing BP neural network algorithm based on

genetic algorithm”, Artificial Intelligence Review, Vol. 36 No. 2, pp. 153–162.

Dong, C., Loy, C.C., He, K. and Tang, X. (2016), “Image super-resolution Using deep

convolutional networks”, IEEE Transactions on Pattern Analysis and Machine

Intelligence, Vol. 38 No. 2, pp. 295–307.

Dua, D. and Taniskidou, E.K. (2017), “UCI Machine Learning Repository”, available at:

http://archive.ics.uci.edu/ml (accessed 28 March 2018).

Duchi, J., Hazan, E. and Singer, Y. (2011), “Adaptive subgradient methods for online learning

and stochastic optimization”, Journal of Machine Learning Research, Vol. 12, pp. 2121–

2159.

Ebadzadeh, M.M. and Salimi-Badr, A. (2018), “IC-FNN: A novel fuzzy neural network with

interpretable, intuitive, and correlated-contours fuzzy rules for function approximation”,

IEEE Transactions on Fuzzy Systems, IEEE, Vol. 26 No. 3, pp. 1288–1302.

Eberhart, R. and Kennedy, J. (1995), “A new optimizer using particle swarm theory”,

Proceedings of the Sixth International Symposium on Micro Machine and Human Science

MHS’95., pp. 39–43.

Edwards, H.A., Dixon-Hardy, D. and Wadud, Z. (2016), “Aircraft cost index and the future of

carbon emissions from air travel”, Applied Energy, Elsevier, Vol. 164, pp. 553–562.

Fahlman, S.E. (1988), An Empirical Study of Learning Speed in Back-Propagation Networks,

School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213.

Fahlman, S.E. and Lebiere, C. (1990), “The cascade-correlation learning architecture”, in

Lippmann, R.P., Moody, J.E. and Touretzky, D.S. (Eds.), Advances in Neural Information

Processing Systems, Morgan Kaufmann, Denver, pp. 524–532.

Farlow, S.J. (1981), “The GMDH algorithm of Ivakhnenko”, The American Statistician, Vol.

35 No. 4, pp. 210–215.

Feng, G., Huang, G.-B., Lin, Q. and Gay, R.K.L. (2009), “Error minimized extreme learning


machine with growth of hidden nodes and incremental learning”, IEEE Trans. Neural

Networks, Vol. 20 No. 8, pp. 1352–1357.

Ferrari, S. and Stengel, R.F. (2005), “Smooth function approximation using neural networks”,

IEEE Transactions on Neural Networks, Vol. 16 No. 1, pp. 24–38.

Fierimonte, R., Barbato, M., Rosato, A. and Panella, M. (2016), “Distributed learning of

random weights fuzzy neural networks”, International Conference on Fuzzy Systems,

IEEE, pp. 2287–2294.

Geman, S., Doursat, R. and Bienenstock, E. (1992), “Neural networks and the bias/variance

dilemma”, Neural Computation, Vol. 4 No. 1, pp. 1–58.

Goldberger, A.S. (1964), “Classical linear regression”, Econometric Theory, New York: John

Wiley & Sons., pp. 156–212.

Gori, M. and Tesi, A. (1992), “On the problem of local minima in backpropagation”, IEEE

Transactions on Pattern Analysis & Machine Intelligence, IEEE, Vol. 14 No. 1, pp. 76–

86.

Guo, X., Grushka-Cockayne, Y. and De Reyck, B. (2018), “Forecasting airport transfer

passenger flow using real-time data and machine learning”, SSRN Electronic Journal,

available at:https://doi.org/10.2139/ssrn.3245609.

Hagan, M.T. and Menhaj, M.B. (1994), “Training feedforward networks with the Marquardt

algorithm”, IEEE Transactions on Neural Networks, Vol. 5 No. 6, pp. 989–993.

Han, F., Zhao, M.-R., Zhang, J.-M. and Ling, Q.-H. (2017), “An improved incremental

constructive single-hidden-layer feedforward networks for extreme learning machine

based on particle swarm optimization”, Neurocomputing, Elsevier, Vol. 228, pp. 133–142.

Han, S., Pool, J., Tran, J. and Dally, W.J. (2015), “Learning both weights and connections for

efficient neural networks”, in Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M. and

Garnett, R. (Eds.), Advances in Neural Information Processing Systems, MIT press,

Montreal, pp. 1135–1143.

Hansen, L.K. and Salamon, P. (1990), “Neural network ensembles”, IEEE Transactions on

Pattern Analysis and Machine Intelligence, Vol. 12 No. 10, pp. 993–1001.

Hayashi, Y., Hsieh, M.-H. and Setiono, R. (2010), “Understanding consumer heterogeneity: A

business intelligence application of neural networks”, Knowledge-Based Systems,

Elsevier, Vol. 23 No. 8, pp. 856–863.

Hecht-Nielsen, R. (1989), “Theory of the backpropagation neural network”, International Joint

Conference on Neural Networks, Vol. 1, IEEE, Washington DC, pp. 593–605 vol.1.

Hinton, G.E., Srivastava, N. and Swersky, K. (2012), “Lecture 6a- overview of mini-batch


gradient descent”, COURSERA: Neural Networks for Machine Learning, available at:

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf (accessed 19

July 2018).

Hornik, K., Stinchcombe, M. and White, H. (1989), “Multilayer feedforward networks are

universal approximators”, Neural Networks, Vol. 2 No. 5, pp. 359–366.

Huang, C., Xu, Y. and Johnson, M.E. (2017), “Statistical modeling of the fuel flow rate of GA

piston engine aircraft using flight operational data”, Transportation Research Part D:

Transport and Environment, Elsevier, Vol. 53, pp. 50–62.

Huang, G.-B. and Chen, L. (2007), “Convex incremental extreme learning machine”,

Neurocomputing, Elsevier, Vol. 70 No. 16–18, pp. 3056–3062.

Huang, G.-B. and Chen, L. (2008), “Enhanced random search based incremental extreme

learning machine”, Neurocomputing, Vol. 71 No. 16–18, pp. 3460–3468.

Huang, G.-B., Chen, L. and Siew, C.K. (2006), “Universal approximation using incremental

constructive feedforward networks with random hidden nodes”, IEEE Transactions on

Neural Networks, Vol. 17 No. 4, pp. 879–892.

Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012), “Extreme learning machine for

regression and multiclass classification”, IEEE Transactions on Systems, Man, and

Cybernetics. Part B, Cybernetics, Vol. 42 No. 2, pp. 513–29.

Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006), “Extreme learning machine: theory and

applications”, Neurocomputing, Vol. 70 No. 1–3, pp. 489–501.

Huang, G., Huang, G.-B., Song, S. and You, K. (2015), “Trends in extreme learning machines:

A review”, Neural Networks, Elsevier, Vol. 61, pp. 32–48.

Huang, G., Song, S. and Wu, C. (2012), “Orthogonal least squares algorithm for training

cascade neural networks”, IEEE Transactions on Circuits and Systems I: Regular Papers,

Vol. 59 No. 11, pp. 2629–2637.

Hunter, D., Yu, H., Pukish III, M.S., Kolbusz, J. and Wilamowski, B.M. (2012), “Selection of

proper neural network sizes and architectures—A comparative study”, IEEE Transactions

on Industrial Informatics, Vol. 8 No. 2, pp. 228–240.

Hwang, J.-N., You, S.-S., Lay, S.-R. and Jou, I.-C. (1996), “The cascade-correlation learning:

A projection pursuit learning perspective”, IEEE Transactions on Neural Networks, Vol.

7 No. 2, pp. 278–289.

IATA. (2018), “Fact Sheet Climate Change & CORSIA”, Online, available at:

https://www.iata.org/pressroom/facts_figures/fact_sheets/Documents/fact-sheet-climate-

change.pdf (accessed 7 May 2018).


IATA. (2019), “Industry facts and statistics”, Online, available at:

https://www.iata.org/pressroom/facts_figures/fact_sheets/Pages/index.aspx (accessed 5

August 2019).

Ijjina, E.P. and Chalavadi, K.M. (2016), “Human action recognition using genetic algorithms

and convolutional neural networks”, Pattern Recognition, Elsevier, Vol. 59, pp. 199–212.

Ioffe, S. and Szegedy, C. (2015), “Batch Normalization: Accelerating deep network training

by reducing internal covariate shift”, available at:https://doi.org/10.1007/s13398-014-

0173-7.2.

Irrgang, M.E., Kaul, C.E., Hall, A.E., Klerk, A.D., Elham and Boozarjomehri. (2015), “Aircraft

fuel optimization analytics”.

Islam, M., Yao, X. and Murase, K. (2003), “A constructive algorithm for training cooperative

neural network ensembles”, IEEE Transactions on Neural Networks, Vol. 14 No. 4, pp.

820–834.

Jensen, L., Hansman, R.J., Venuti, J.C. and Reynolds, T. (2013), “Commercial airline speed

optimization strategies for reduced cruise fuel consumption”, 2013 Aviation Technology,

Integration, and Operations Conference, pp. 1026–1038.

Jolliffe, I.T. (2002), Principal Component Analysis, edited by 2, Springer-Verlag, New York.

Kang, G., Li, J. and Tao, D. (2017), “Shakeout: A new approach to regularized deep neural

network training”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.

40 No. 5, pp. 1245–1258.

Kapanova, K.G., Dimov, I. and Sellier, J.M. (2018), “A genetic approach to automatic neural

network architecture optimization”, Neural Computing and Applications, Springer, Vol.

29 No. 5, pp. 1481–1492.

Karlik, B. and Olgac, V. (2011), “Performance analysis of various activation functions in

generalized MLP architectures of neural networks”, International Journal of Artificial

Intelligence And Expert Systems (IJAE), Vol. 1 No. 4, pp. 111–122.

Karnin, E.D. (1990), “A simple procedure for pruning back-propagation trained neural

networks”, IEEE Transactions on Neural Networks, Vol. 1 No. 2, pp. 239–242.

Kastrati, Z., Imran, A.S. and Yayilgan, S.Y. (2019), “The impact of deep learning on document

classification using semantically rich representations”, Information Processing &

Management, Elsevier, Vol. 56 No. 5, pp. 1618–1632.

Kasun, L.L.C., Zhou, H., Huang, G. and Vong, C. (2013), “Representational learning with

extreme learning machine for big data”, IEEE Intelligent System, Vol. 28 No. 6, pp. 31–

34.


Khanmohammadi, S., Tutun, S. and Kucuk, Y. (2016), “A new multilevel input layer artificial

neural network for predicting flight delays at JFK airport”, Procedia Computer Science,

Elsevier, Vol. 95, pp. 237–244.

Kim, Y.-S., Rim, H.-C. and Lee, D.-G. (2019), “Business environmental analysis for textual

data using data mining and sentence-level classification”, Industrial Management & Data

Systems, Emerald Publishing Limited, Vol. 119 No. 1, pp. 69–88.

Kingma, D.P. and Ba, J. (2014), “Adam: A Method for Stochastic Optimization”, pp. 1–15.

Kovalishyn, V. V, Tetko, I. V, Luik, A.I., Kholodovych, V. V, Villa, A.E.P. and Livingstone,

D.J. (1998), “Neural network studies. 3. variable selection in the cascade-correlation

learning architecture”, Journal of Chemical Information and Computer Sciences, Vol. 38

No. 4, pp. 651–659.

Krogh, A. and Hertz, J.A. (1992), “A simple weight decay can improve generalization”, in

Hanson, S.J., Cowan, J.D. and Giles, C.L. (Eds.), Advances in Neural Information

Processing Systems, Morgan Kaufmann, Denver, pp. 950–957.

Krogh, A. and Vedelsby, J. (1995), “Neural Network Ensembles, Cross Validation, and Active

Learning”, in Touretzky, D.S., Mozer, M. and Hasselmo, M.E. (Eds.), Advances in Neural

Information Processing Systems, MIT press, Denver, pp. 231–238.

Kumar, A., Rao, V.R. and Soni, H. (1995), “An empirical comparison of neural network and

logistic regression models”, Marketing Letters, Vol. 6 No. 4, pp. 251–263.

Kummong, R. and Supratid, S. (2016), “Thailand tourism forecasting based on a hybrid of

discrete wavelet decomposition and NARX neural network”, Industrial Management and

Data Systems, Vol. 116 No. 6, pp. 1242–1258.

Kuo, Y.-H. and Kusiak, A. (2019), “From data to big data in production research: the past and

future trends”, International Journal of Production Research, Taylor & Francis, Vol. 57

No. 15–16, pp. 4828–4853.

Kwok, T.-Y. and Yeung, D.-Y. (1997), “Constructive algorithms for structure learning in

feedforward neural networks for regression problems”, IEEE Transactions on Neural

Networks, Vol. 8 No. 3, pp. 630–645.

Lam, H.Y., Ho, G.T.S., Wu, C.-H. and Choy, K.L. (2014), “Customer relationship mining

system for effective strategies formulation”, Industrial Management & Data Systems,

Emerald Group Publishing Limited, Vol. 114 No. 5, pp. 711–733.

Lang, K.J. (1989), “Learning to tell two spirals apart”, Proceedings of the 1988 Connectionist

Models Summer School, Morgan Kaufmann, pp. 52–59.

LeCun, Y., Bengio, Y. and Hinton, G. (2015), “Deep learning”, Nature, Nature Publishing

Group, Vol. 521 No. 7553, pp. 436–444.


Lehtokangas, M. (2000), “Modified cascade-correlation learning for classification”, IEEE

Transactions on Neural Networks, Vol. 11 No. 3, pp. 795–798.

Lewis, A.S. and Overton, M.L. (2013), “Nonsmooth optimization via quasi-Newton methods”,

Mathematical Programming, Vol. 141 No. 1–2, pp. 135–163.

Li, M., Ch’ng, E., Chong, A.Y.L. and See, S. (2018), “Multi-class Twitter sentiment

classification with emojis”, Industrial Management & Data Systems, Emerald Publishing

Limited, Vol. 118 No. 9, pp. 1804–1820.

Liang, H. and Dai, G. (1998), “Improvement of cascade correlation learning”, Information

Sciences, Vol. 112 No. 1–4, pp. 1–6.

Liang, N.-Y., Huang, G.-B., Saratchandran, P. and Sundararajan, N. (2006), “A fast and

accurate online sequential learning algorithm for feedforward networks”, IEEE

Transactions on Neural Networks, Vol. 17 No. 6, pp. 1411–1423.

Liew, S.S., Khalil-Hani, M. and Bakhteri, R. (2016), “An optimized second order stochastic

learning algorithm for neural network training”, Neurocomputing, Elsevier, Vol. 186, pp.

74–89.

Lin, Z. and Vlachos, I. (2018), “An advanced analytical framework for improving customer

satisfaction: A case of air passengers”, Transportation Research Part E: Logistics and

Transportation Review, Elsevier, Vol. 114, pp. 185–195.

Ling, Y., Zhou, Y. and Luo, Q. (2017), “Lévy flight trajectory-based whale optimization

algorithm for global optimization.”, IEEE Access, Vol. 5 No. 99, pp. 6168–6186.

Liu, Y., Starzyk, J.A. and Zhu, Z. (2008), “Optimized approximation algorithm in neural

networks without overfitting”, IEEE Transactions on Neural Networks, Vol. 19 No. 6, pp.

983–995.

Lu, X., Ming, L., Liu, W. and Li, H.-X. (2018), “Probabilistic regularized extreme learning

machine for robust modeling of noise data”, IEEE Transactions on Cybernetics, IEEE,

Vol. 48 No. 8, pp. 2368–2377.

Lucas, J.M. and Saccucci, M.S. (1990), “Exponentially weighted moving average control

schemes: Properties and enhancements”, Technometrics, IEEE, Vol. 32 No. 1, pp. 1–12.

Merkert, R. and Swidan, H. (2019), “Flying with(out) a safety net: Financial hedging in the

airline industry”, Transportation Research Part E: Logistics and Transportation Review,

Elsevier, Vol. 127, pp. 206–219.

Mirjalili, S. and Lewis, A. (2016), “The whale optimization algorithm”, Advances in

Engineering Software, Elsevier Ltd, Vol. 95, pp. 51–67.

Mitchell, M. (1998), “Genetic Algorithms: An Overview”, in Watson, T., Robbins, C., Belew,


R.K., Wilson, S.W., Holland, J.H., Waltz, D.L. and Koza, J.R. (Eds.), An Introduction to

Genetic Algorithms, MIT press, London, England, pp. 1–15.

Mohamad, E.T., Faradonbeh, R.S., Armaghani, D.J., Monjezi, M. and Majid, M.Z.A. (2017),

“An optimized ANN model based on genetic algorithm for predicting ripping production”,

Neural Computing and Applications, Springer, Vol. 28 No. 1, pp. 393–406.

Mohamed Shakeel, P., Tobely, T.E. El, Al-Feel, H., Manogaran, G. and Baskar, S. (2019),

“Neural network based brain tumor detection using wireless infrared imaging sensor”,

IEEE Access, IEEE, Vol. 7, pp. 5577–5588.

Mori, J., Kajikawa, Y., Kashima, H. and Sakata, I. (2012), “Machine learning approach for

finding business partners and building reciprocal relationships”, Expert Systems with

Applications, Elsevier, Vol. 39 No. 12, pp. 10402–10407.

Nasir, M., South-Winter, C., Ragothaman, S. and Dag, A. (2019), “A comparative data analytic

approach to construct a risk trade-off for cardiac patients’ re-admissions”, Industrial

Management & Data Systems, Emerald Publishing Limited, Vol. 119 No. 1, pp. 189–209.

Nayyeri, M., Yazdi, H.S., Maskooki, A. and Rouhani, M. (2018), “Universal approximation by

using the correntropy objective function”, IEEE Transactions on Neural Networks and

Learning Systems, IEEE, Vol. 29 No. 9, pp. 4515–4521.

Nguyen, D. and Widrow, B. (1990), “Improving the learning speed of 2-layer neural networks

by choosing initial values of the adaptive weights”, International Joint Conference on

Neural Networks, Vol. 3, IEEE, San Diego CA, pp. 21–26.

Nuic, A. (2014), User Manual for the Base of Aircraft Data (BADA) Revision 3.12, European

Organisation for the Safety of Air Navigation, available at:

https://www.eurocontrol.int/sites/default/files/field_tabs/content/documents/sesar/user-

manual-bada-3-12.pdf.

Pagoni, I. and Psaraki-Kalouptsidi, V. (2017), “Calculation of aircraft fuel consumption and

CO2 emissions based on path profile estimation by clustering and registration”,

Transportation Research Part D: Transport and Environment, Elsevier, Vol. 54, pp. 172–

190.

Qian, N. (1999), “On the momentum term in gradient descent learning algorithms”, Neural

Networks, Vol. 12 No. 1, pp. 145–151.

Qiao, J., Li, F., Han, H. and Li, W. (2016), “Constructive algorithm for fully connected cascade

feedforward neural networks”, Neurocomputing, Elsevier, Vol. 182, pp. 154–164.

Reed, R. (1993), “Pruning algorithms-a survey”, IEEE Transactions on Neural Networks, Vol.

4 No. 5, pp. 740–747.

Riedmiller, M. and Braun, H. (1993), “A direct adaptive method for faster backpropagation


learning: the RPROP algorithm”, IEEE International Conference on Neural Networks,

Vol. 1993, IEEE, pp. 586–591.

Ruiz-Aguilar, J.J., Turias, I.J. and Jiménez-Come, M.J. (2014), “Hybrid approaches based on

SARIMA and artificial neural networks for inspection time series forecasting”,

Transportation Research Part E: Logistics and Transportation Review, Elsevier, Vol. 67,

pp. 1–13.

Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986), “Learning representations by back-

propagating errors”, Nature, Vol. 323 No. 6088, pp. 533–536.

Ryerson, M.S., Hansen, M. and Bonn, J. (2011), “Fuel consumption and operational

performance”, 9th USA/Europe Air Traffic Management Research and Development

Seminar, pp. 1–10.

Schetinin, V. (2003), “A learning algorithm for evolving cascade neural networks”, Neural

Processing Letters, Springer, Vol. 17 No. 1, pp. 21–31.

Schilling, G.D. (1997), Modeling Aircraft Fuel Consumption with a Neural Network, Doctoral

Dissertation, Virginia Tech.

Seni, G. and Elder, J. (2010), “Model complexity, model selection and regularization”, in

Grossman, R. (Ed.), Ensemble Methods in Data Mining: Improving Accuracy Through

Combining Predictions, Morgan & Claypool Publishers, Chicago, pp. 21–40.

Senzig, D.A., Fleming, G.G. and Iovinelli, R.J. (2009), “Modeling of terminal-area airplane

fuel consumption”, Journal of Aircraft, Vol. 46 No. 4, pp. 1089–1093.

Setiono, R. and Hui, L.C.K. (1995), “Use of a quasi-Newton method in a feedforward neural

network construction algorithm”, IEEE Transactions on Neural Networks, Vol. 6 No. 1,

pp. 273–277.

Seufert, J.H., Arjomandi, A. and Dakpo, K.H. (2017), “Evaluating airline operational

performance: A Luenberger-Hicks-Moorsteen productivity indicator”, Transportation

Research Part E: Logistics and Transportation Review, Elsevier, Vol. 104, pp. 52–68.

Shaghaghi, S., Bonakdari, H., Gholami, A., Ebtehaj, I. and Zeinolabedini, M. (2017),

“Comparative analysis of GMDH neural network based on genetic algorithm and particle

swarm optimization in stable channel design”, Applied Mathematics and Computation,

Elsevier, Vol. 313, pp. 271–286.

Shanno, D.F. (1970), “Conditioning of quasi-Newton methods for function minimization”,

Mathematics of Computation, Vol. 24 No. 111, pp. 647–647.

Shen, B. and Chan, H.-L. (2017), “Forecast information sharing for managing supply chains in

the big data era: Recent development and future research”, Asia-Pacific Journal of

Operational Research, World Scientific, Vol. 34 No. 01, pp. 1740001-1-1740001–26.


Shen, B., Choi, T.-M. and Chan, H.-L. (2019), “Selling green first or not? A Bayesian analysis

with service levels and environmental impact considerations in the Big Data Era”,

Technological Forecasting and Social Change, Elsevier, Vol. 144, pp. 412–420.

Shen, B., Choi, T.-M. and Minner, S. (2019), “A review on supply chain contracting with

information considerations: information updating and information asymmetry”,

International Journal of Production Research, Taylor & Francis, Vol. 57 No. 15–16, pp.

4898–4936.

Sheng, D., Li, Z.-C. and Fu, X. (2019), “Modeling the effects of airline slot hoarding behavior

under the grandfather rights with use-it-or-lose-it rule”, Transportation Research Part E:

Logistics and Transportation Review, Elsevier, Vol. 122, pp. 48–61.

Shi, Y. and Eberhart, R. (1998), “A modified particle swarm optimizer”, 1998 IEEE

International Conference on Evolutionary Computation Proceedings. IEEE World

Congress on Computational Intelligence (Cat. No.98TH8360), IEEE, pp. 69–73.

Sibdari, S., Mohammadian, I. and Pyke, D.F. (2018), “On the impact of jet fuel cost on airlines’

capacity choice: Evidence from the U.S. domestic markets”, Transportation Research

Part E: Logistics and Transportation Review, Elsevier, Vol. 111, pp. 1–17.

Singh, V. and Sharma, S.K. (2015), “Fuel consumption optimization in air transport: a review,

classification, critique, simple meta-analysis, and future research implications”, European

Transport Research Review, Springer, Vol. 7 No. 2, p. 12.

Specht, D.F. (1990), “Probabilistic neural networks”, Neural Networks, Vol. 3 No. 1, pp. 109–

118.

Specht, D.F. (1991), “A general regression neural network”, IEEE Transactions on Neural

Networks, Vol. 2 No. 6, pp. 568–576.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014),

“Dropout: A simple way to prevent neural networks from overfitting”, Journal of Machine

Learning Research, Vol. 15 No. 1, pp. 1929–1958.

Taherkhani, A., Belatreche, A., Li, Y. and Maguire, L.P. (2018), “A supervised learning

algorithm for learning precise timing of multiple spikes in multilayer spiking neural

networks”, IEEE Transactions on Neural Networks and Learning Systems, IEEE, Vol. 29

No. 11, pp. 5394–5407.

Tang, J., Deng, C. and Huang, G.-B. (2016), “Extreme learning machine for multilayer

perceptron”, IEEE Transactions on Neural Networks and Learning Systems, Vol. 27 No.

4, pp. 809–821.

Teo, A.-C., Tan, G.W.-H., Ooi, K.-B., Hew, T.-S. and Yew, K.-T. (2015), “The effects of

convenience and speed in m-payment”, Industrial Management & Data Systems, Vol. 115


No. 2, pp. 311–331.

Tkáč, M. and Verner, R. (2016), “Artificial neural networks in business: Two decades of

research”, Applied Soft Computing, Vol. 38, pp. 788–804.

Trani, A., Wing-Ho, F., Schilling, G., Baik, H. and Seshadri, A. (2004), “A neural network

model to estimate aircraft fuel consumption”, AIAA 4th Aviation Technology, Integration

and Operations (ATIO) Forum, American Institute of Aeronautics and Astronautics,

Reston, Virigina, p. 6401.

Trani, A.A. and Wing-Ho, F. (1997), Enhancements to SIMMOD : A Neural Network Post-

Processor to Estimate Aircraft Fuel Consumption Phase I Final Report.

Tu, J. V. (1996), “Advantages and disadvantages of using artificial neural networks versus

logistic regression for predicting medical outcomes”, Journal of Clinical Epidemiology,

Vol. 49 No. 11, pp. 1225–1231.

Turgut, E.T., Cavcar, M., Usanmaz, O., Canarslanlar, A.O., Dogeroglu, T., Armutlu, K. and

Yay, O.D. (2014), “Fuel flow analysis for the cruise phase of commercial aircraft on

domestic routes”, Aerospace Science and Technology, Vol. 37, pp. 1–9.

Turgut, E.T. and Rosen, M.A. (2012), “Relationship between fuel consumption and altitude for

commercial aircraft during descent: Preliminary assessment with a genetic algorithm”,

Aerospace Science and Technology, Vol. 17 No. 1, pp. 65–73.

Verma, L.K., Kishore, N. and Jharia, D.C. (2017), “Predicting dangerous seismic events in

active coal mines through data mining”, International Journal of Applied Engineering

Research, Vol. 12 No. 5, pp. 567–571.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y. and Fergus, R. (2013), “Regularization of neural

networks using dropconnect”, International Conference on Machine Learning, pp. 109–

111.

Wang, G.-G., Lu, M., Dong, Y.-Q. and Zhao, X.-J. (2016), “Self-adaptive extreme learning

machine”, Neural Computing and Applications, Springer, Vol. 27 No. 2, pp. 291–303.

Wang, J., Wu, X. and Zhang, C. (2005), “Support vector machines based on K-means

clustering for real-time business intelligence systems”, International Journal of Business

Intelligence and Data Mining, Citeseer, Vol. 1 No. 1, pp. 54–64.

Wang, L., Yang, Y., Min, R. and Chakradhar, S. (2017), “Accelerating deep neural network

training with inconsistent stochastic gradient descent”, Neural Networks, Elsevier, Vol.

93, pp. 219–229.

Widrow, B., Greenblatt, A., Kim, Y. and Park, D. (2013), “The No-Prop algorithm: A new

learning algorithm for multilayer neural networks”, Neural Networks, Elsevier Ltd, Vol.

37, pp. 182–188.


Wilamowski, B.M., Cotton, N.J., Kaynak, O. and Dundar, G. (2008), “Computing gradient

vector and Jacobian matrix in arbitrarily connected neural networks”, IEEE Transactions

on Industrial Electronics, Vol. 55 No. 10, pp. 3784–3790.

Wilamowski, B.M. and Yu, H. (2010), “Neural network learning without backpropagation”,

IEEE Transactions on Neural Networks, Vol. 21 No. 11, pp. 1793–1803.

Wilson, D.R. and Martinez, T.R. (2003), “The general inefficiency of batch training for

gradient descent learning”, Neural Networks, Elsevier, Vol. 16 No. 10, pp. 1429–1451.

Wong, T.C., Haddoud, M.Y., Kwok, Y.K. and He, H. (2018), “Examining the key determinants

towards online pro-brand and anti-brand community citizenship behaviours: a two-stage

approach”, Industrial Management & Data Systems, Emerald Publishing Limited, Vol.

118 No. 4, pp. 850–872.

Yang, Y., Wang, Y. and Yuan, X. (2012), “Bidirectional extreme learning machine for

regression problem and its learning effectiveness”, IEEE Transactions on Neural

Networks and Learning Systems, IEEE, Vol. 23 No. 9, pp. 1498–1505.

Yanto, J. and Liem, R.P. (2018), “Aircraft fuel burn performance study: A data-enhanced

modeling approach”, Transportation Research Part D: Transport and Environment,

Elsevier, Vol. 65, pp. 574–595.

Yeung, D.S., Ng, W.W.Y., Wang, D., Tsang, E.C.C. and Wang, X.-Z. (2007), “Localized

Generalization Error Model and Its Application to Architecture Selection for Radial Basis

Function Neural Network”, IEEE Transactions on Neural Networks, IEEE, Vol. 18 No.

5, pp. 1294–1305.

Yin, X. and Liu, X. (2018), “Multi-task convolutional neural network for pose-invariant face

recognition”, IEEE Transactions on Image Processing, Vol. 27 No. 2, pp. 964–975.

Ying, L. (2016), “Orthogonal incremental extreme learning machine for regression and

multiclass classification”, Neural Computing and Applications, Springer, Vol. 27 No. 1,

pp. 111–120.

Ypma, T.J. (1995), “Historical development of the Newton–Raphson method”, SIAM Review,

Society for Industrial and Applied Mathematics, Vol. 37 No. 4, pp. 531–551.

Zaghloul, W., Lee, S.M. and Trimi, S. (2009), “Text classification: neural networks vs support

vector machines”, Industrial Management & Data Systems, Vol. 109 No. 5, pp. 708–717.

Zeiler, M.D. (2012), “ADADELTA: An adaptive learning rate method”, available

at:https://doi.org/http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503.

Zhang, G.P. (2000), “Neural networks for classification: a survey”, IEEE Transactions on

Systems, Man, and Cybernetics, Part C (Applications and Reviews), IEEE, Vol. 30 No. 4,

pp. 451–462.


Zhang, J.R., Zhang, J., Lok, T.M. and Lyu, M.R. (2007), “A hybrid particle swarm

optimization-back-propagation algorithm for feedforward neural network training”,

Applied Mathematics and Computation, Vol. 185 No. 2, pp. 1026–1037.

Zong, W., Huang, G.-B. and Chen, Y. (2013), “Weighted extreme learning machine for

imbalance learning”, Neurocomputing, Elsevier, Vol. 101, pp. 229–242.

Zou, W., Xia, Y. and Li, H. (2018), “Fault diagnosis of Tennessee-Eastman process using

orthogonal incremental extreme learning machine based on driving amount”, IEEE

Transactions on Cybernetics, IEEE, Vol. 48 No. 12, pp. 3403–3410.