Evaluation of machine learning interpolation techniques for prediction of physical properties

Computational Materials Science 98 (2015) 170–177
http://dx.doi.org/10.1016/j.commatsci.2014.10.032
0927-0256/© 2014 Elsevier B.V. All rights reserved.

* Corresponding author.
E-mail addresses: [email protected] (E. Bélisle), [email protected] (Z. Huang), [email protected] (S. Le Digabel), [email protected] (A.E. Gheribi).

Eve Bélisle a, Zi Huang a, Sébastien Le Digabel c, Aïmen E. Gheribi b,*
a School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane QLD 4072, Australia
b CRCT Center for Research in Computational Thermochemistry, Department of Chemical Eng., École Polytechnique de Montréal (Campus of Université de Montréal), Box 6079, Station Downtown, Montréal, Québec H3C 3A7, Canada
c GERAD and Department of Mathematics and Industrial Eng., École Polytechnique de Montréal, Succ. Centre-ville, Montréal, Québec H3C 3A7, Canada


Article history: Received 25 June 2014; Received in revised form 14 October 2014; Accepted 17 October 2014.

Keywords: Superalloys; Database; Gaussian process; Neural network; Quadratic regression; Physical properties; Computational dependence.

Abstract: A knowledge of the physical properties of materials as a function of temperature, composition, applied external stresses, etc. is an important consideration in materials and process design. For new systems, such properties may be unknown and hard to measure or estimate from numerical simulations such as molecular dynamics. Engineers therefore rely on machine learning to employ existing data in order to predict properties for new systems. Several techniques are currently used for such purposes. These include neural networks, polynomial interpolation and Gaussian processes, as well as the more recent dynamic trees and scalable Gaussian processes. In this paper we compare these approaches on three sets of materials science data: molar volume, electrical conductivity and Martensite start temperature. We make recommendations depending on the nature of the data. We demonstrate that a thorough knowledge of the problem beforehand is critical in selecting the most successful machine learning technique. Our findings show that the Gaussian process regression technique gives very good predictions for all three sets of tested data. However, the Gaussian process is typically very slow, with a computational complexity of $n^3$ where $n$ is the number of data points. In this paper, we found that the scalable Gaussian process approach was able to maintain the high accuracy of the predictions while improving speed considerably, making on-line learning possible.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Through the years, various machine learning techniques have been employed to fit sets of known data associated with certain properties in order to predict these properties on sets of unknown data. The "No Free Lunch" theorem was introduced in 1997 by Wolpert and Macready [1]: no single algorithm is best for every optimization problem. For a given problem for which an approach works well, there exists another problem for which the same method fails miserably. This paper aims at comparing different machine learning techniques for predicting properties of different types of data. Our focus is on materials science data of molten oxide systems collected from the literature. It is important for materials science engineers to know the physical properties of such systems in order to design new materials, or improve current processes.

Currently, there are a variety of machine learning techniques for predicting a function $f(x)$ given $x$. Polynomial interpolation was one of the first to be developed [2], and is still a very popular method in fields such as digital photography and image re-sampling, as well as for scientific data. Gaussian processes (GPs) were introduced in the 1940s [3], but it is only in 1978 that they were employed to define prior distributions over functions [4]. More recently, with the introduction and increasing popularity of neural networks with back propagation, Gaussian processes started to be used for supervised machine learning [5] and for regression problems [6]. In the last few years, various attempts have been made to improve known approaches, in particular by the group of Robert B. Gramacy at the University of Chicago, with the introduction of treed Gaussian processes [7] and dynamic trees [8]. In 1996, Radford Neal showed that a Bayesian neural network with a Gaussian prior on individual weights and an infinite number of hidden nodes converges to a GP [9].

In this work, we perform a comparative study of the predictive power of six of the most popular and emerging machine learning techniques. The different techniques are tested on datasets from the materials science industry: molar volume (MV), electrical conductivity (EC) and Martensite start temperature (Ms).


Nomenclature

δ      Kronecker delta
μ      mean
κ      electrical conductivity
σ      variance
D      number of dimensions
GP     Gaussian process
Ms     Martensite start temperature
MV     molar volume
n      number of training (experimental) points
NG     Gaussian noise
NS     Nash–Sutcliffe model efficiency
RMSE   root mean square error
T      temperature
w      width of a Gaussian kernel


These datasets respectively consist of smooth, nonsmooth and noisy (several local minima in a small domain) data. We consider the data on molar volume to be smooth because theory tells us that it should vary almost linearly and also because the experimental datasets are in good agreement at equal composition and temperature. For the electrical conductivity, the data are very scattered and that is why we consider them nonsmooth. As for Ms, we consider the data noisy because we omit certain influential parameters, such as the fine austenite grain size [10], and consider only the initial composition. We wish to demonstrate how a thorough knowledge of the system as well as machine-human interactions can improve the quality of the predictions.

Stry et al. compared quadratic and linear interpolation applied to the numerical simulation of crystal growth [11]. They found that a custom quadratic approach gave more accurate results with smaller computational time. Ghosh and Rudy found an improvement in the relative error of reconstructed versus measured epicardial potentials in electrocardiographic imaging when using a quadratic interpolation instead of a linear one [12]. Skinner and Broughton published their work on neural networks applied to materials science, and compared different methods for finding the weights of feed-forward neural networks [13]. In the present paper we have added comparisons with more recent techniques: linear and quadratic interpolation, neural network, Gaussian processes (GP), and dynamic trees. We also include a comparison with a new strategy, the scalable Gaussian process regression (SGP) [14], which was developed to speed up Gaussian process regression while maintaining an acceptable prediction error.

This work was motivated by the idea of introducing physical properties as one of the possible parameters inside the FactOptimal module of the FactSage software. FactSage is a software system that was created for treating thermodynamic properties and calculations in chemical metallurgy [15]. It is used today all over the world by more than 400 universities and companies in the domain of material chemistry. It contains various modules allowing users to perform a wide variety of thermochemical calculations [16]. The FactOptimal module [17–19] allows one to find the best set of conditions given constraints while optimizing chosen properties. The program uses the NOMAD derivative-free solver [20] to find the best parameters. For example, given a chemical system (e.g. x1C + x2Mn + x3Si + x4Cr), one may wish to find the values of the chemical compositions (xi) that give an equilibrium temperature of around 275 °C. To do so, NOMAD tries different combinations of compositions (xi), obtaining the corresponding value of temperature from FactSage until, hopefully, an optimal solution is found. The idea of introducing material properties as possible constraints or as values to be optimized requires the use of a machine learning tool to predict these properties. Because a large number of predictions are performed during a FactOptimal run, the computational time to make these predictions is of great importance. Furthermore, we wish to make on-line learning possible, as it may be the case that new experimental data is fed dynamically into the learning database.

Making predictions of the Martensite start temperature is not a new domain. Some authors use a neural network model with good results [21,22]. A thermodynamic framework [23] or a purely empirical approach [24,25] have also been studied. Sourmail et al. in 2006 [26] compared these methods and concluded that although the thermodynamic approach provides satisfying results, there is a strict limitation on the query points, based on the fundamental assumptions upon which the model was built. They observed that the neural network approach performs just as well as the others but with a higher number of outliers or wild predictions; they therefore recommended the use of a Bayesian framework. Using a Bayesian GP model, very accurate predictions were obtained for the prediction of austenite formation (Martensite is formed in carbon steels when cooling austenite) [27].

An empirical model [28] and a combined model using quantum chemical molecular dynamics and the kinetic Monte Carlo method [29] were applied to predict electrical conductivity. Both models were developed specifically for electrical conductivity and would require extensive work to be adapted to predict other physical properties. To the best of our knowledge, all published material on Ms and electrical conductivity prediction discusses results in terms of prediction accuracy, and no report is given on the computational time.

In the following sections we first provide a description of the databases that were employed for this research, then briefly describe each interpolation technique. We then present results in terms of computational time and accuracy, discuss them, and make recommendations on the use of each method depending on the type of problem.

2. Materials data

For this work, we have access to three databases of experimental points collected from the literature. The database employed for molar volume predictions has 2700 data points (n = 2700), with various compositions in mole percent over 10 dimensions (D = 10), temperature in Kelvin and an associated molar volume value in cubic centimeters per mole. The electrical conductivity database consists of approximately 9300 data points with compositions in mole percent over 10 dimensions (D = 10), temperature (T) in Kelvin and an associated electrical conductivity (EC) value in Siemens per meter. For both the MV and EC databases, the materials are insulating oxides, therefore EC refers to the ionic conductivity. The Martensite start temperature (Ms) database consists of approximately 1100 data points with composition values in weight percent over 15 dimensions (D = 15) and an associated Ms value in Kelvin. The main element, Fe, is not used in the regressions. Table 1 gives the range of compositions of each database.

Some physical properties can be measured with reasonable accuracy, therefore there is very little discrepancy between the different data sources. Moreover, certain properties have a quasi-linear dependence on the constituents' chemical compositions, while others may have a more complex dependence on composition and can vary exponentially with temperature (singularities and local extrema).


Table 1. Ranges of the databases.

Element   Ms (wt.%)          MV (Mol.%)          EC (Mol.%)
          Min       Max      Min       Max       Min       Max
C         0         2.25
Mn        0         10.24
Si        0         3.8
Cr        0         18
Ni        0         31.54
Mo        0         8
V         0         4.55
Co        0         16.08
Al        0         3.01
W         0         18.6
Cu        0         3.04
Nb        0         1.98
Ti        0         2.52
B         0         0.006
N         0         2.65
Fe        65.09     99.83
SiO2                         0         90        0         100
Al2O3                        0         90        0         100
MgO                          0         85.51     0         64.08
CaO                          0         87.91     0         74.79
Na2O                         0         60.1      0         62.9277
K2O                          0         50        0         45.8
Li2O                         0         65        0         59.4
MnO                          0         77.17782  0         77.21
PbO                          0         95        0         100
T (K)                        713       3273      398       3223


Measuring the molar volume of liquid oxides at high temperature can lead to a relatively large level of uncertainty and discordance between existing data sources. Despite this fact, we consider the molar volume as smooth, since most of the dataset shows little discrepancy. Electrical conductivity is also measured at high temperature, leading to a lower level of confidence. This, combined with the fact that it has a complex dependence on composition and obeys Arrhenius laws (see Section 4.2), leads us to consider EC as nonsmooth. MV and EC are properties dependent on the same variables describing the Gibbs free energy under a certain atmospheric pressure. On the other hand, Ms is influenced by kinetic factors such as the cooling rate. These factors are not considered in our dataset, which is one reason why Ms is considered noisy; another is its dependence not only on composition but also on the different phases present within a given steel.

The Ms database is available for download on the Thomas Sourmail website [30]. The other two databases are given as supplementary material for this work.

3. Theoretical methods

In this section we very briefly introduce each technique. For more details, refer to the cited authors.

3.1. Linear interpolation

Linear interpolation is no doubt one of the simplest methods one can employ to fit experimental data. One assigns parameters $b \in \mathbb{R}$ and $c \in \mathbb{R}$ such that $f(x)$ can be predicted using a linear model of the form

$$b\,x + c \qquad (1)$$

For multidimensional problems, normalized areas bounded by known data are used in order to interpolate unknown data points [31]. This method has the advantage of being easy to understand, fast and straightforward to implement, but it is an approach specific to a given problem since it is parametric. As a consequence, during on-line learning the parameters have to be recalculated each time new data is added to the learning set. While this adds to the computational cost, the most important limitation of the linear interpolation model is that it is a simplistic approach that may be inappropriate for complex problems. Linear interpolation has been used successfully on many varied problems, including pricing and the stock market [32,33], medical science [34] and digital imaging [35].
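As a concrete illustration, a minimal sketch of fitting the model of Eq. (1) by ordinary least squares is given below. Python with NumPy is assumed, the function names are illustrative, and this is not the implementation used in the paper:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of the linear model b.x + c of Eq. (1).

    X : (n, D) array of compositions (and temperature), y : (n,) property values.
    Returns the coefficient vector b and the intercept c."""
    # Append a column of ones so the intercept c is estimated jointly with b.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:-1], coeffs[-1]

def predict_linear(X, b, c):
    # Evaluate the fitted linear model at new query points.
    return X @ b + c
```

As noted above, the parameters (b, c) must be re-estimated whenever new training data arrive, which is the main cost of this otherwise very cheap model.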

3.2. Quadratic interpolation

Both the linear and quadratic interpolation techniques belong to the polynomial interpolation family. Linear interpolation is limited to a model of the first order while quadratic interpolation is of the second order. Similarly to the linear interpolation approach, the objective is to find parameters $a$, $b$ and $c$ such that $f(x)$ can be predicted using a quadratic function of the form

$$\tfrac{1}{2}\,x^{\mathsf T} a\, x + b\,x + c \qquad (2)$$

The data is thus represented by a quadratic. As with linear interpolation, this is a parametric approach, with the same disadvantages. However, it is also a simple method to implement and predictions do not require a lot of computational time. It has been successfully used in image reconstruction and sampling [36] as well as in astronomy [37].
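In the same spirit as the linear case, a hedged sketch of fitting Eq. (2) is shown below: the inputs are expanded with squared and cross terms so that the coefficients can again be obtained by linear least squares. This is equivalent to Eq. (2) up to a re-parametrization of the coefficients; the helper names are illustrative, not the authors' code:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(X):
    """Expand each row x into [1, x_i, x_i*x_j] so that the quadratic model of
    Eq. (2) becomes a linear least-squares problem in the unknown coefficients."""
    n, D = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(D)]
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(D), 2)]
    return np.column_stack(cols)

def fit_quadratic(X, y):
    Phi = quadratic_features(X)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def predict_quadratic(X, theta):
    return quadratic_features(X) @ theta
```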

3.3. Neural network

The neural network approach has been extensively employed in recent years in applications such as pattern recognition [38] and materials science [39]. Inspired by the nervous system, neural networks are composed of highly interconnected elements working together to make predictions. It is a very good approach when working with nonlinear functions [40] as it can detect complex relationships between independent variables. However, disadvantages include a large computational time, its empirical nature and a tendency to overfit [41]. As with polynomial interpolation, model parameters have to be carefully chosen and are specific to a problem. For this work we used the Tiberius data mining software [42], version 7.0.7.

3.4. Gaussian process regression

A Gaussian process (GP) is a generalized Gaussian probability distribution [3]. A Gaussian process regression computes the posterior distribution based on training data, or prior distribution. It has the advantage of being a non-parametric approach and adaptable to various situations, especially for high dimensional space problems [3]. However, when computing a Gaussian process regression, one has to deal with matrix inversions, which leads to a typical computational complexity of $n^3$ where $n$ is the number of training data points. Consequently, this model may be very slow and not suitable for on-line applications. The Gaussian process regression technique applied in this work is based on the earlier work of Gibbs and MacKay [43].

Let $f = (f_1, f_2, \ldots, f_n)$ be observed responses for one of the blackbox outputs at inputs $X = (x_1, x_2, \ldots, x_n)$, which can be considered as a set of training points in an $n$-dimensional space $\mathbb{R}^n$. The objective is to learn a function $C(X)$ transforming the input vector into a target function $f(X) = C(X) + N_G(\mu, \sigma)$, where $N_G$ is a Gaussian noise whose mean $\mu$ is assumed to be zero everywhere and whose variance is $\sigma_n^2$. In this case the covariance function $K$ relates one function value to another.


Fig. 1. Final selection of the training points for each batch of predictions (query batch): the closest points to the geometrical mean are chosen first (left), then the training set is expanded to include the closest points to each point (right).


In this work we consider the Gaussian kernel to define the covariance matrix, as in the previous work of Gibbs and MacKay [43]:

$$K(X, X') = \sigma_f^2 \exp\left\{-\frac{1}{2}\sum_{j=1}^{D}\frac{(x_j - x'_j)^2}{w_j}\right\} + \sigma_n^2\,\delta(X, X') \qquad (3)$$

where $\delta$ is the Kronecker delta function, $\sigma_f^2$ denotes the overall variance of the process and $w$ represents the width of the Gaussian kernel; it governs the rate of decay of the spatial correlation in each input direction, in other words it is a characteristic Euclidean distance above which two points will be uncorrelated. The joint distribution of the observed and predicted function for a given point (i.e. a composition in our work) is given by

$$f(X_*) = K_*^{\mathsf T}\,(K + \sigma_n^2 I_w)^{-1} f \qquad (4)$$

with $K_*^{\mathsf T} = K(X, X_*)$. $\sigma_n^2$ and $I_w$ are a set of free parameters for a flexible customization of the GP to take into account the specificity of the problem. These two adjustable parameters are called hyperparameters. They are usually optimized automatically using quasi-Newton methods [44] by maximizing the log marginal likelihood of the model given the data. The choice of the covariance function and the two hyperparameters is the first step of the GP process, often denoted "model selection".

After the model selection, the second step of a GP consists of performing a regression upon the input functions (training), typically all or part of the available experimental data. The variance of the predicted function resulting from the regression step is then given by

$$V_f(X_*) = K_{**} - K_*^{\mathsf T}\,(K + \sigma_n^2 I_w)^{-1} K_* \qquad (5)$$

with $K_{**} = K(X_*, X_*)$. From the above equations, one can see that a GP requires operations on a covariance matrix $K$, built from covariance functions evaluated on all possible pairs of training data points. All training data points also have to be processed in order to perform the regression and compute $K_*$. From this we can conclude that the computational cost of a GP is heavily dependent on the size of the training data and grows rapidly with it (typically as $n^3$). For this reason, GPs are not practical for real-time applications.
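A minimal NumPy sketch of Eqs. (3)-(5) is given below, with fixed hyperparameters $\sigma_f$, $w$ and $\sigma_n$ passed in directly; the hyperparameter optimization by maximizing the log marginal likelihood mentioned above is omitted, and this is a simplified illustration rather than the Gibbs and MacKay implementation used in the paper:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma_f, w):
    """Covariance of Eq. (3) without the noise term.
    X1 : (n1, D), X2 : (n2, D), w : (D,) vector of per-dimension widths."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 / w).sum(axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2)

def gp_predict(X, f, X_star, sigma_f, w, sigma_n):
    """Posterior mean (Eq. (4)) and variance (Eq. (5)) at the query points X_star."""
    K = gaussian_kernel(X, X, sigma_f, w) + sigma_n ** 2 * np.eye(len(X))
    K_s = gaussian_kernel(X, X_star, sigma_f, w)     # K(X, X*)
    K_ss = gaussian_kernel(X_star, X_star, sigma_f, w)
    K_inv = np.linalg.inv(K)                         # the O(n^3) step discussed above
    mean = K_s.T @ K_inv @ f
    var = np.diag(K_ss - K_s.T @ K_inv @ K_s)
    return mean, var
```

The matrix inversion on the full training covariance is exactly the operation whose cost motivates the scalable variant described next.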

3.5. Scalable Gaussian process

The scalable Gaussian process (SGP) is a recent technique allowing the use of Gaussian process regression for on-line learning. The method was introduced in a previous work by the same authors; see [14] for the details. In this work we are using the online web application that has been developed by the authors in order to perform the calculations [45]. The technique is divided into four steps. First, the prediction points are clustered into groups with similar attributes. Then, the entire training data is condensed to avoid redundant information, grouping similar data together. In the next step, for each group created in the first step, a subset of the condensed training database is selected in order to perform the regression in the last step. The final points in the training set are chosen using a distance function, the closest points to the center of each cluster created in the first step being the most relevant. In this work, we added the extra step of adding the $K'$ nearest neighbors ($K'$-NN) from the training data for each target batch query point (Fig. 1), using a normal Euclidean distance (Eq. (6)):

$$\sqrt{\sum_{i=1}^{D}(y_i - x_i)^2} \qquad (6)$$

where $x$ and $y$ are two points in a Euclidean space of dimension $D$. For data where the training set is very large, this method can be very effective as there is a higher chance of having data close to the query points in the training set. If the set of training points is smaller, we have found that this extra step is not necessary. However, if there is a concern about the data being concentrated in certain areas, as illustrated in Fig. 1, it may be necessary to include this step to ensure that relevant points are included in the training set. In this work, it has been done in every calculation in order to make comparisons possible.
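A hedged sketch of this extra $K'$-NN selection step (Eq. (6)) is shown below. The value of $K'$ and the pooling of indices over the query batch are illustrative assumptions; the clustering and condensation steps of the full SGP pipeline [14] are not reproduced here:

```python
import numpy as np

def local_training_subset(X_train, query_batch, k_prime=50):
    """For each query point in the batch, keep its k_prime nearest training
    points (Euclidean distance, Eq. (6)) and pool the selected indices."""
    selected = set()
    for q in query_batch:
        d = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        selected.update(np.argsort(d)[:k_prime])
    return np.fromiter(selected, dtype=int)

# A GP trained only on X_train[idx] and f[idx] then replaces the full-data GP,
# which is what keeps the cost low enough for on-line use.
```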

3.6. Dynamic trees

The idea with dynamic regression trees, or dynaTree as implemented in the R software package dynaTree [46], is to partition the space with several tree models, where each tree corresponds to one partitioning scheme and each leaf of each tree corresponds to a region. Once these trees are determined, predictions are achieved by averaging model values over all trees. The main advantage of such an approach is the use of simple models within each partition [8]. It is a non-parametric approach, and particle learning algorithms make on-line learning possible. Because of the partitioning approach, it may be well suited to real-world applications where variables can be of totally different natures. However, one of the disadvantages of such an approach is that the generated trees may be very large and complex. Also, as with any partitioning approach, there is always the risk of too much data approximation. In this work we use two versions of dynamic trees: the constant model (dynaTree CST) and the linear model (dynaTree LIN). The difference lies in the space partitioning. Both make use of a full binary tree, the constant model with a fixed number of leaf data points, three, and the linear model with 2 + D, D being the dimension of the covariate space.

4. Results

In this section we present the prediction accuracy obtained when training the chosen models, followed by a general discussion. The computational time is discussed in Section 5. For each type of dataset treated in this paper, we measure the quality of the techniques in terms of the root mean square of the relative error (RMSE), given by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{x=1}^{X}\left(Q_o^x - Q_p^x\right)^2}{X}} \qquad (7)$$

where $Q_o^x$ is the observed value and $Q_p^x$ is the predicted value for a query point $x$, and $X$ is the total number of query points. We also employ two other predictive accuracy measures: the Nash–Sutcliffe model efficiency coefficient (NS) [47] and a proposed Order efficiency coefficient. The NS coefficient is calculated as follows:

$$\mathrm{NS} = 1 - \frac{\sum_{x=1}^{X}\left(Q_o^x - Q_p^x\right)^2}{\sum_{x=1}^{X}\left(Q_o^x - \overline{Q_o}\right)^2} \qquad (8)$$


Fig. 2. Comparison of the RMSE for molar volume predictions.

Table 3. NS, Order and RMSE for molar volume (50% training points).

Technique       NS       Order    RMSE
Linear          0.9301   0.9104   8.4144
Quad            0.9501   0.9291   7.4706
dynaTree CST    0.9512   0.9129   6.0979
dynaTree LIN    0.9376   0.9217   8.3299
GP              0.9784   0.9509   4.5036
SGP             0.9783   0.9584   3.2712
NN              0.9166   0.9119   8.6115


It gives an indication of how good the predictions are compared to the mean of the observed values. The Order coefficient is determined by taking all possible pairs of predictions compared to the actual values. For each pair $(i, j)$, if $(Q_o^i < Q_o^j)$ and $(Q_p^i < Q_p^j)$, or $(Q_o^i > Q_o^j)$ and $(Q_p^i > Q_p^j)$, then it is considered a good prediction and a counter $O$ is incremented by one. The Order coefficient is then calculated as follows:

$$\mathrm{Order} = \frac{O}{X(X-1)/2} \qquad (9)$$

where $X$ is the total number of predictions. The Order coefficient gives an indication of how accurate the model is at comparing two points. The NS coefficient ranges from $-\infty$ to 1, and the Order coefficient from 0 to 1. In both cases, the closer to 1, the more accurate the predictions. If $\mathrm{NS} \simeq 0$, it is an indication that the predictions are only as accurate as the mean of the observed data ($\overline{Q_o}$), while $\mathrm{NS} < 0$ indicates that the observed mean is a better predictor than the model.
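These three measures translate directly into code. The sketch below (NumPy assumed) follows Eqs. (7)-(9) literally, with the Order coefficient computed by a naive pairwise loop in $O(X^2)$; whether absolute or relative errors (the paper reports relative values in percent) are fed to the RMSE is left to the caller:

```python
import numpy as np

def rmse(q_obs, q_pred):
    # Eq. (7)
    return np.sqrt(np.mean((q_obs - q_pred) ** 2))

def nash_sutcliffe(q_obs, q_pred):
    # Eq. (8): 1 means perfect, 0 means "as good as the observed mean".
    return 1.0 - np.sum((q_obs - q_pred) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def order_coefficient(q_obs, q_pred):
    # Eq. (9): fraction of point pairs ranked the same way by model and data.
    X = len(q_obs)
    good = 0
    for i in range(X):
        for j in range(i + 1, X):
            if (q_obs[i] - q_obs[j]) * (q_pred[i] - q_pred[j]) > 0:
                good += 1
    return good / (X * (X - 1) / 2)
```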

For all three databases, we employ data collected from the literature in the form of tables giving the chemical composition (for Martensite temperature), plus the temperature (for electrical conductivity and molar volume), with the number of input dimensions ranging from 10 to 15. See Table 2 for a simple example training set and query point on molar volume data. Each set of compositions (row) has an associated property that we are predicting. For each data set and technique, we randomly select the training data points from the database and use the remaining data for evaluating the predictions. In the following subsections we present results using from 50% to 90% of training data.

For each technique, outliers (wild predictions) are excluded from the average RMSE, as we believe that including a few very large numbers would not give an accurate representation and thus the comparison would be distorted. Predictions with an error greater than 200% are considered as outliers. In Section 5 we discuss in more detail the percentages of outliers obtained for each tested technique.
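A minimal sketch of this evaluation protocol is given below; the `predict_fn` callable, its signature and the returned quantities are illustrative placeholders, not the authors' code:

```python
import numpy as np

def evaluate(X, y, predict_fn, train_fraction=0.5, seed=0):
    """Random split, prediction, and outlier-filtered relative RMSE (in %).

    predict_fn(X_train, y_train, X_test) -> predictions at X_test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    tr, te = idx[:n_train], idx[n_train:]

    y_pred = predict_fn(X[tr], y[tr], X[te])
    rel_err = np.abs(y_pred - y[te]) / np.abs(y[te]) * 100.0

    kept = rel_err <= 200.0            # exclude wild predictions (error > 200%)
    rel_rmse = np.sqrt(np.mean(rel_err[kept] ** 2))
    outlier_pct = 100.0 * (1.0 - kept.mean())
    return rel_rmse, outlier_pct
```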

4.1. Molar volume data

The performance of each technique is illustrated in Fig. 2 and Table 3. All six techniques performed relatively well, maintaining an average RMSE below 10%. However, the GP was the clear winner with an average RMSE below 5% for every test. The linear, quadratic and dynaTree LIN models give very similar results, with an average RMSE of 7–9%. As expected, there is a general tendency for improved accuracy as the proportion of training points increases: the more training data is available, the better the chances of covering the entire space. The dynaTree CST technique gives very good results for 4 datasets out of 5. This behavior is confirmed by the NS and Order coefficients.

4.2. Electrical conductivity data

We test the linear, Quad, dynaTree CST and dynaTree LIN techniques with the actual electrical conductivity values cond as well as with ln(cond) and ln(T · cond).

Table 2. Example of training and prediction query (p) for molar volume (MV) data. The input compositions are in mole percent and the molar volume in cm3/mol.

             SiO2    Al2O3   MgO     CaO    Na2O   K2O   Li2O   MnO   PbO   T (K)   MV
Training     53      0       0       5.1    41.9   0     0      0     0     1573    26.61
             56      0       0       0      0      0     0      0     44    1323    25.55
             78.56   0       0       0      14.3   0     7.14   0     0     1773    26.5
             47.6    5.61    21.29   25.5   0      0     0      0     0     1773    22.93
Prediction   60.56   5.08    28.57   0      3.53   2.3   0      0     0     1053    p

As mentioned earlier, electrical conductivity here refers to the ionic conductivity. Table 4 shows that using ln(T · cond) gives the best predictions, therefore we compare the RMSE making predictions on this value. This can be explained because, in general, the temperature dependence of the electrical conductivity ($\kappa$) obeys an Arrhenius law, that is: $\ln(\kappa) = a + b/T$, where $b$ is the activation energy and $a$ is a value of electrical conductivity at a reference temperature. However, for silicate systems, there is a deviation from this law [48]. Consequently, we decided to test all three cases mentioned above to evaluate how prior knowledge of the problem influences prediction quality. In this case there is a clear improvement in the NS coefficient (27%) while the Order coefficient has only a slight increase (2%). Fig. 3 shows that the GP technique gave the lowest average RMSE for all the testing sets. The NN and linear interpolation techniques performed quite poorly, especially with only 50% of training data, giving average RMSE of 47% and 28% respectively, both with an NS coefficient of 0.79 compared to 0.98 for GP.
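A small sketch of this target transformation (assuming NumPy; illustrative only): the regression is carried out on ln(T · cond) and the prediction is mapped back to a conductivity afterwards.

```python
import numpy as np

def to_log_target(T, cond):
    # Quantity actually regressed on, motivated by the Arrhenius-like behavior.
    return np.log(T * cond)

def from_log_target(T, y_log):
    # Invert the transform to recover a conductivity prediction.
    return np.exp(y_log) / T
```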

4.3. Martensite temperature data

Once again, as illustrated by Fig. 4 and Table 5, GP gave the best predictions, maintaining an average RMSE of 5.85%. Linear interpolation performed remarkably well overall with an average RMSE of 13.6%. Quadratic regression and dynaTree LIN gave good results with a large training set, but performed very poorly with a smaller training set. For the previous two problems, properties are measured within one chemical phase; therefore, measured values depend only on chemical composition and temperature. However, the value of Ms is dependent on multiple chemical phases, and is influenced by operating factors such as the cooling rate, hence the noisy nature of the data.



Table 4. NS, Order and RMSE for electrical conductivity (50% training points). N/A signifies that no data was available for this particular type.

                cond              ln(cond)          ln(T · cond)
Technique       NS       Order    NS       Order    NS       Order    RMSE
Linear          0.5960   0.8419   0.7661   0.8509   0.7853   0.8604   27.8245
Quad            0.7610   0.8675   0.8981   0.8961   0.9117   0.8990   22.6727
dynaTree CST    0.7265   0.8599   0.8847   0.8399   0.8818   0.8442   21.0682
dynaTree LIN    0.6585   0.8397   0.8771   0.8741   0.8909   0.8800   22.4429
GP              N/A      N/A      N/A      N/A      0.9610   0.9193   8.9757
SGP             N/A      N/A      N/A      N/A      0.9872   0.9446   9.4010
NN              N/A      N/A      N/A      N/A      0.7897   0.8630   47.4660

Fig. 3. Comparison of the RMSE for electrical conductivity predictions.

Fig. 4. Comparison of the RMSE for Martensite temperature predictions.

Table 5. NS, Order and RMSE for Martensite temperature (50% training points).

Technique       NS       Order    RMSE
Linear          0.8505   0.9000   13.2032
Quad            0.4988   0.8896   35.4405
dynaTree CST    0.7997   0.8517   13.7197
dynaTree LIN    0.5570   0.7917   26.9888
GP              0.8987   0.9120   8.1254
SGP             0.7949   0.8738   11.8719
NN              0.7533   0.8933   14.0708

Table 6. Training RMSE for Martensite temperature (50% training points).

Technique       Training RMSE
Linear          14.3434
Quad            12.9457
dynaTree CST    14.4358
dynaTree LIN    26.3521
GP              3.5598
SGP             6.8557
NN              10.0290

Table 7. Overall average time per prediction in seconds.

Technique       Time (s)
Linear          4E-6
Quad            3.5E-5
dynaTree CST    0.088
dynaTree LIN    4.22
GP              18.85
SGP             0.94
NN              7.73


For this specific problem, we also added an additional measure: the RMSE on the training data at 50% training, in order to show the quality of the regression methods on noisy data. The results are presented in Table 6. GP presents the smallest training error with 3.56% while the worst performer is dynaTree LIN with 23.81%.

4.4. Computational time

We had an average time of 18.8 s per prediction point when running a GP regression, on a desktop computer with an Intel i7 3.4 GHz processor and 16 GB of RAM. The NN was the second slowest with an average of 7.7 s per prediction, while dynaTree LIN came in third with 4.2 s per prediction. The SGP technique produced an average time per prediction of 0.94 s. The other three techniques performed under 0.1 s, as shown in Table 7. The times include both training and prediction.

5. Discussion

The main preoccupation of an engineer when attempting to model new data is the reliability of the prediction. In terms of prediction accuracy, for all three types of data the GP and SGP (Fig. 5) are the clear winners in our evaluations. Overall, the GP offers slightly better prediction accuracy, but this technique is by far the slowest to run and can be impractical with very large datasets. If time is not a factor, GP seems to be the best choice. However, for on-line applications or any application where computational time is an important factor to consider, one may wish to use SGP, which offers a slight setback in accuracy but greatly improves the computational time.

Smooth data. Within the three faster techniques, dynaTree CST gave the best performance. Nevertheless, since all models gave acceptable results, one may consider using strictly linear interpolation, as the excess (or deviation from linearity) has proven to be very low and the computational time exceptionally fast.

Nonsmooth data. The quadratic interpolation model represents the best choice for this type of data within the faster techniques. With the electrical conductivity example, we show that using ln(T · cond) leads to better predictions.


Fig. 5. Average RMSE obtained for each set of data.

Fig. 6. Percentage of excluded outliers (RMSE > 200%) per set of data.


Therefore, this clearly demonstrates that a thorough knowledge of the problem is an important factor influencing the quality of predictive models.

Noisy data. For this type of data, prediction accuracy clearly improves as the training set gets larger. As we can see in Fig. 4, with a large training set (90%), all techniques give acceptable results. Consequently, if the training set is complete enough, the polynomial interpolation models seem to be an interesting choice because of their low computational cost. Some authors have suggested that Ms can be a linear function [49,50]. However, if this were the case, the linear regression model would give the best predictions. Since the Gaussian approaches are clear winners over the linear approach, it seems apparent that Ms is a much more complex function. Here, using parameters other than the chemical composition as part of the model could significantly improve the prediction accuracy.

There is a clear difference in magnitude in the general average relative error obtained by all techniques on the three sets of data. For molar volume, the average error was under 10%, while for electrical conductivity and Martensite temperature the average was closer to 15–20%. This can be explained by the fact that the molar volume is easier to measure than the other two properties, thus minimizing the intrinsic error.

One can argue that the real power of machine learning techniques lies in predictions made with a minimum set of training data. In the real world, it is often the case that engineers have limited experimental points and still wish to make predictions based on this data set. In this light, if we compare the results with only 50% of training points (Table 8), one should avoid the neural network for nonsmooth data, and quadratic interpolation for noisy data. With an average RMSE of over 35% on the prediction of Ms, Quad clearly overestimates the non-linearity of the function, while this is not the case for the two other types of data. Once again, the most reliable technique is the Gaussian process regression for two of the three cases. SGP and dynaTree CST are good alternatives to GP to reduce the computational time. The training RMSE at 50% training points (Table 6) is representative of the results obtained on testing points, with the exception of Quad, which has a training error of 12.95% and a prediction error of 35.44%. However, as mentioned at the beginning of Section 4, outliers were excluded from the average RMSE, and the same treatment has been applied when calculating the training error.

Table 8. Relative RMSE in percent at 50% training. On each row, the lowest RMSE is represented in italics and the highest in bold.

                          Linear   Quad    dynaTree CST   dynaTree LIN   GP     SGP     NN
Molar volume              8.41     7.47    6.10           8.33           4.50   3.27    8.61
Electrical conductivity   27.82    22.67   21.07          22.44          8.98   9.41    47.47
Martensite temperature    13.20    35.44   13.72          26.99          8.13   11.87   14.07

While most techniques produced practically no outliers on training data, 3.3% of outliers were excluded for the Quad technique. The neural network models the training data well despite giving somewhat erratic results on testing points.

Fig. 6 illustrates the total percentage of excluded data per technique. From this figure, we can conclude that a smooth data set leads to very few wild predictions; however, for nonsmooth data, human validation is required in order to make sure that these predictions are not considered. Quadratic interpolation gave very few outliers for smooth and nonsmooth data, however it ended up having almost 2% of rejected data for a noisy set of data. In general, SGP was the most reliable technique with less than 0.05% of outliers for each training set, while the neural network model was unreliable especially for nonsmooth data, giving more than 2% wild predictions, and performing erratically for noisy data. The remarkably small number of outliers for SGP is interesting and can be explained by the fact that this technique partitions the data into very small clusters. Rather than approximating a Gaussian function over the entire training set, it is done in very small areas, and this reduces the error in areas where data would be sparse over the entire space. This is a very important advantage of SGP, and it also explains why in some cases, especially at 50% training data, SGP had a smaller average RMSE than a conventional GP.

6. Conclusion

A material engineer wishing to make predictions on specific sets of data must study the nature of the data in order to make an informed decision. Assisted by computer scientists, one can make the best choice to achieve the most accurate predictions whilst minimizing the computational time. This study demonstrated that knowing the behavior of electrical conductivity data led to more accurate results by using a logarithmic value. Overall, within the tested techniques, the Gaussian process regression gave the best prediction accuracy, but was by far the slowest technique. For applications where computational time is an important factor, such as real-time applications, we recommend using a modified version of GPs such as the scalable GP proposed in the current manuscript. The constant model of dynaTree could also be a good alternative. This paper demonstrates how computer science can be coupled with material engineering, in order to improve material and alloy design [51].




Acknowledgement

This work is partially supported by ARC grants FT130101530 and DP140103171 and by NSERC grant 418250.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.commatsci.2014.10.032.

References

[1] D. Wolpert, W. Macready, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82, http://dx.doi.org/10.1109/4235.585893.
[2] E. Meijering, Proc. IEEE 90 (3) (2002) 319–342, http://dx.doi.org/10.1109/5.993400.
[3] C.E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006.
[4] A. O'Hagan, J.F.C. Kingman, J. Roy. Stat. Soc. Ser. B (Methodol.) 40 (1) (1978) 1–42. <http://www.jstor.org/stable/2984861>.
[5] D. Rummelhart, G.E. Hinton, R.J. Williams, Nature 323 (9) (1986) 533–535.
[6] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006. ISBN 026218253X.
[7] R.B. Gramacy, H.K.H. Lee, J. Am. Stat. Assoc. 103 (483) (2008) 1119–1130, http://dx.doi.org/10.1198/016214508000000689.
[8] M.A. Taddy, R.B. Gramacy, N.G. Polson, J. Am. Stat. Assoc. 106 (493) (2011) 109–123, http://dx.doi.org/10.1198/jasa.2011.ap09769.
[9] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Springer, New York, 1996. <http://books.google.com.au/books?id=myE0nwEACAAJ>.
[10] A. Garcia-Junceda, C. Capdevila, F. Caballero, C.G. de Andres, Scr. Mater. 58 (2) (2008) 134–137, http://dx.doi.org/10.1016/j.scriptamat.2007.09.017.
[11] Y. Stry, M. Hainke, T. Jung, Int. J. Numer. Methods Heat Fluid Flow 12 (2002) 1009–1031.
[12] S. Ghosh, Y. Rudy, Ann. Biomed. Eng. 33 (9) (2005) 1187–1201, http://dx.doi.org/10.1007/s10439-005-5537-x.
[13] A.J. Skinner, J.Q. Broughton, Modell. Simul. Mater. Sci. Eng. 3 (3) (1995) 371. <http://stacks.iop.org/0965-0393/3/i=3/a=006>.
[14] E. Bélisle, Z. Huang, A. Gheribi, in: Databases Theory and Applications, Springer, 2014, pp. 38–49.
[15] C.W. Bale, P. Chartrand, S.A. Degterov, G. Eriksson, K. Hack, R.B. Mahfoud, J. Melançon, A.D. Pelton, S. Petersen, Calphad-Comput. Coupling Phase Diagrams Thermochem. 26 (2002) 189–228, http://dx.doi.org/10.1016/S0364-5916(02)00035-4.
[16] C. Bale, E. Bélisle, P. Chartrand, S. Decterov, G. Eriksson, K. Hack, I.-H. Jung, Y.-B. Kang, J. Melançon, A. Pelton, C. Robelin, S. Petersen, Calphad 33 (2) (2009) 295–311, http://dx.doi.org/10.1016/j.calphad.2008.09.009.
[17] A. Gheribi, C. Audet, S. Le Digabel, E. Bélisle, C. Bale, A. Pelton, Calphad 36 (2012) 135–143, http://dx.doi.org/10.1016/j.calphad.2011.06.003.
[18] A.E. Gheribi, C. Robelin, S. Le Digabel, C. Audet, A.D. Pelton, J. Chem. Thermodyn. 43 (9) (2011) 1323–1330, http://dx.doi.org/10.1016/j.jct.2011.03.021.
[19] A.E. Gheribi, S. Le Digabel, C. Audet, P. Chartrand, Thermochim. Acta 559 (2013) 107–110, http://dx.doi.org/10.1016/j.tca.2013.02.004.
[20] S. Le Digabel, Algorithm 909: NOMAD: Nonlinear Optimization with the MADS Algorithm, ACM Trans. Math. Softw. 37 (4) (2011), http://dx.doi.org/10.1145/1916461.1916468.
[21] W.G. Vermeulen, P.F. Morris, A. de Weijer, S. van der Zwaag, Prediction of martensite start temperature using artificial neural network, Ironmaking Steelmaking 23 (5) (1996).
[22] T. Sourmail, C. Garcia-Mateo, Comput. Mater. Sci. 34 (2) (2005) 213–218, http://dx.doi.org/10.1016/j.commatsci.2005.01.001.
[23] A. Stormvinter, A. Borgenstam, J. Ågren, Metall. Mater. Trans. A 43A (10) (2012) 3870–3879.
[24] S.-J. Lee, K.-S. Park, Metall. Mater. Trans. A 44 (8) (2013) 3423–3427, http://dx.doi.org/10.1007/s11661-013-1798-4.
[25] P. Payson, C. Savage, Trans. ASM 33 (1944) 261–275.
[26] T. Sourmail, C. Garcia-Mateo, Comput. Mater. Sci. 34 (4) (2005) 323–334, http://dx.doi.org/10.1016/j.commatsci.2005.01.002.
[27] C. Bailer-Jones, H. Bhadeshia, D. MacKay, Gaussian process modelling of austenite formation in steel, Mater. Sci. Technol. 15 (3) (1999).
[28] Y. Mualem, S.P. Friedman, Water Resour. Res. 27 (10) (1991) 2771–2777, http://dx.doi.org/10.1029/91WR01095.
[29] H. Tsuboi, A. Chutia, C. Lv, Z. Zhu, H. Onuma, R. Miura, A. Suzuki, R. Sahnoun, M. Koyama, N. Hatakeyama, A. Endou, H. Takaba, C.A.D. Carpio, R.C. Deka, M. Kubo, A. Miyamoto, J. Mol. Struct.: THEOCHEM 903 (1–3) (2009) 11–22, http://dx.doi.org/10.1016/j.theochem.2008.11.040.
[30] T. Sourmail, Predicting the Martensite Start Temperature (Ms) of Steels @ONLINE, 2014. <http://www.thomas-sourmail.net/martensite.html>.
[31] R. Wagner, Multi-linear Interpolation, 2013. <http://bmia.bmt.tue.nl/people/BRomeny/Courses/8C080/Interpolation.pdf>.
[32] J.C. Hull, A.D. White, J. Derivatives 1 (1) (1993) 21–31.
[33] T.-S. Dai, J.-Y. Wang, H.-S. Wei, in: M.-Y. Kao, X.-Y. Li (Eds.), Algorithmic Aspects in Information and Management, Lecture Notes in Computer Science, vol. 4508, Springer, Berlin Heidelberg, 2007, pp. 262–272, http://dx.doi.org/10.1007/978-3-540-72870-2_25.
[34] D. Cockcroft, K. Murdock, J. Mink, CHEST J. 84 (4) (1983) 505–506.
[35] H. Malvar, L.-W. He, R. Cutler, in: Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04). IEEE International Conference on, vol. 3, 2004, pp. iii-485–8, http://dx.doi.org/10.1109/ICASSP.2004.1326587.
[36] N. Dodgson, IEEE Trans. Image Process. 6 (9) (1997) 1322–1326, http://dx.doi.org/10.1109/83.623195.
[37] G. Schaller, D. Schaerer, G. Meynet, A. Maeder, Astron. Astrophys. Suppl. Ser. 96 (1992) 269–331.
[38] H.A. Rowley, S. Baluja, T. Kanade, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 23–38.
[39] A. Skinner, J. Broughton, Modell. Simul. Mater. Sci. Eng. 3 (3) (1995) 371.
[40] D.F. Specht, IEEE Trans. Neural Networks 2 (6) (1991) 568–576, http://dx.doi.org/10.1109/72.97934.
[41] J.V. Tu, J. Clin. Epidemiol. 49 (11) (1996) 1225–1231, http://dx.doi.org/10.1016/S0895-4356(96)00002-9.
[42] Tiberius Data Mining, Tiberius Data Mining Predictive Modelling Software @ONLINE, 2014. <http://www.tiberius.biz/>.
[43] M.N. Gibbs, D.J.C. MacKay, Stat. Comput., submitted for publication.
[44] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[45] E. Belisle, Scalable Gaussian Process @ONLINE, 2014. <http://www.crct.polymtl.ca/SGP/run_gp.php>.
[46] R. Gramacy, M. Taddy, dynaTree: An R Package Implementing Dynamic Trees for Learning and Design, 2010. <http://CRAN.R-project.org/package=dynaTree>.
[47] J. Nash, J. Sutcliffe, J. Hydrol. 10 (3) (1970) 282–290, http://dx.doi.org/10.1016/0022-1694(70)90255-6.
[48] P. Richet, Geochim. Cosmochim. Acta 48 (3) (1984) 471–483, http://dx.doi.org/10.1016/0016-7037(84)90275-8.
[49] G. Ghosh, G. Olson, Acta Metall. Mater. 42 (10) (1994) 3361–3370, http://dx.doi.org/10.1016/0956-7151(94)90468-5.
[50] G. Ghosh, G. Olson, Acta Metall. Mater. 42 (10) (1994) 3371–3379, http://dx.doi.org/10.1016/0956-7151(94)90469-3.
[51] J.-P. Harvey, A. Gheribi, Process simulation and control optimization of a blast furnace using classical thermodynamics combined to a direct search algorithm, Metall. Mater. Trans. B 45 (1) (2014) 307–327, http://dx.doi.org/10.1007/s11663-013-0004-9.