

EDIC RESEARCH PROPOSAL

Machine Learning techniques for predicting molecular properties

Viviana Petrescu, viviana.petrescu@epfl.ch

School of Computer and Communication Systems, EPFL

Abstract—The high computational cost of quantum chemistry calculations has prompted the use of less expensive machine learning methods for predicting molecular properties in chemical compound space. Finding good feature representations for molecules is hard, in part because the graph-like structure of a molecule must be encoded as a high-dimensional vector. Another difficulty in applying machine learning to molecular data is choosing and training the right models across large labeled datasets. This write-up proposes suggestions for overcoming some of these issues.

Index Terms—molecular descriptors, GP-UCB, unlabeled data

I. INTRODUCTION

The discovery of new molecular materials in chemistry has the potential to solve many of the problems we face today. A system that predicts the properties of new materials both accurately and at a small computational cost is highly desirable, with applications such as novel drug discovery, water purification and the design of efficient materials for high-energy transmission and storage [2].

Proposal submitted to committee: June 29th, 2015; Candidacy exam date: July 6th, 2015; Candidacy exam committee: Exam president, thesis director, co-examiner.

This research plan has been approved:

Date: ————————————

Doctoral candidate: ————————————(name and signature)

Thesis director: ————————————(name and signature)

Thesis co-director: ————————————(if applicable) (name and signature)

Doct. prog. director:————————————(B. Falsafi) (signature)

EDIC-ru/05.05.2009

In the next section I summarize three papers. The first one [5] directly addresses the idea of using machine learning for predicting molecular properties. Although the last two focus on different domains (computer vision, natural language processing, temperature data), their ideas can easily be extended to our domain.

First, in [5] the authors propose a new molecular descriptor which gave a three-fold accuracy improvement over the previously used approaches.

The next paper [8] introduces the idea of using unsupervised sparse coding techniques for training models when labeled data is scarce. Obtaining labeled molecular data is a time-consuming process, and making use of large unlabeled datasets has the potential to increase the current prediction accuracy across different domains.

A large dataset usually requires a complex model, and with it comes the difficulty of tuning its parameters. Instead of a grid-search approach, parameter selection can be done using Bayesian optimization for black-box functions, where the function to be optimized is the validation accuracy. One such possibility is to use Gaussian Processes in the bandit setting. In [13] the authors prove sublinear convergence rates for the Gaussian Process Upper Confidence Bound algorithm (GP-UCB). Their experimental results show the ability of GP-UCB to efficiently optimize a black-box function with as few evaluations as possible.

The last section outlines the directions for future work based on the presented papers.

II. BACKGROUND WORK

A. Learning Invariant Representations of Molecules for Atomization Energy Prediction

While domain-specific descriptors for molecules exist [14], recent work [9] has proposed to predict properties of a molecule of size N only from the 3D positions of its atoms R_i, i ∈ 1..N, and their nuclear charges Z_i, i ∈ 1..N. This prompted the introduction of the Coulomb matrix descriptor, whose individual entries appear in the Schrödinger equation. It is defined as an N × N matrix with entries given by:

M(i, j) = \begin{cases} 0.5\, Z_i^{2.4} & \text{if } i = j \\ Z_i Z_j / \lVert R_i - R_j \rVert & \text{otherwise} \end{cases}   (1)

where ||R_i − R_j|| represents the distance between atoms i and j.
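To make the definition concrete, the following is a minimal NumPy sketch (not the reference implementation from [5] or [9]) of how the Coulomb matrix of eq. (1) could be computed from nuclear charges Z and 3D positions R; the optional zero-padding to a fixed maximum size anticipates the trick discussed in the next paragraph.

import numpy as np

def coulomb_matrix(Z, R, max_atoms=None):
    # Z: (n,) nuclear charges, R: (n, 3) atom positions in 3D.
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = 0.5 * Z[i] ** 2.4              # diagonal term of eq. (1)
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    if max_atoms is not None and max_atoms > n:
        padded = np.zeros((max_atoms, max_atoms))        # zero-padding to a fixed size
        padded[:n, :n] = M
        M = padded
    return M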

Page 2: EDIC RESEARCH PROPOSAL 1 Machine Learning … exam... · Machine Learning techniques for predicting molecular properties ... people often employ Amazon Mechanical Turk for ... introduce

EDIC RESEARCH PROPOSAL 2

The dimensionality of the Coulomb matrix is determined by the number of atoms in a molecule. Since that number varies across the molecules in a dataset, one common trick is to pad the matrices of small molecules with zeros until they reach the maximum molecule size in the dataset. This limits the size of the molecules that can be used, since the descriptor scales as O(N²), where N is the number of atoms.

Due to the graph-like structure of molecules, finding a fixed-size representation is difficult. The desired properties of a molecular descriptor are invariance to translation and rotation of the molecule and invariance to the indexing of the atoms.

While the Coulomb matrix representation solves rotation and translation invariance through the use of distances between atoms ||R_i − R_j||, invariance to atom indexing still needs to be tackled, since any permutation of the atom indices results in a valid Coulomb descriptor. Two variations of the descriptor are proposed in [5]. The first one, called Sorted Coulomb, uses the permutation of the atoms given by sorting the row norms of a valid Coulomb matrix; any molecule then has a unique representation. The second representation is based on the observation that the row norms can have very similar values in practice, so the sorting, and therefore also the atom indexing, is subject to small noise. The new descriptor, Randomly Sorted Coulomb matrices, is a collection of Sorted Coulomb matrices: for every molecule, approximately 10 matrices are drawn, each obtained by sorting the rows according to their norms after adding small Gaussian noise. New predictions are made by averaging the prediction results over the 10 such instantiations.
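A minimal sketch of the two variants, assuming a Coulomb matrix M as computed above; the noise scale in the random variant is an illustrative assumption, not the value used in [5].

import numpy as np

def sorted_coulomb(M):
    # Permute rows and columns by decreasing row norm (Sorted Coulomb).
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]

def random_coulomb(M, n_samples=10, noise=1.0, seed=0):
    # Randomly Sorted Coulomb: sort by row norms perturbed with Gaussian noise.
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        noisy_norms = np.linalg.norm(M, axis=1) + noise * rng.standard_normal(len(M))
        order = np.argsort(-noisy_norms)
        samples.append(M[np.ix_(order, order)])
    return samples

Predictions for a molecule would then be averaged over the n_samples permuted copies.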

The performance of the descriptors was evaluated on predicting the atomization energy on a dataset of 7,165 molecules with at most 23 atoms per molecule; the target values range from −800 to −2000 kcal/mol. The splitting of the data into 5 folds for cross-validation was done using stratified sampling: the molecules were grouped into buckets of 5 with similar atomization energy levels, and the folds were created by randomly selecting one molecule from each bucket.
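The bucketing described above can be read in more than one way; the sketch below implements one plausible interpretation (sort by energy, form buckets of 5 consecutive molecules, send one molecule per bucket to each fold), which should be checked against [5] before reuse.

import numpy as np

def stratified_folds(energies, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(energies)                      # sort molecules by atomization energy
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(order), n_folds):
        bucket = list(order[start:start + n_folds])   # molecules with similar energies
        rng.shuffle(bucket)
        for fold, idx in zip(folds, bucket):
            fold.append(int(idx))                     # one molecule of the bucket per fold
    return folds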

For the experimental evaluation, both kernel ridge regression and multilayer feed-forward neural networks were tried. The best test performance, a Mean Absolute Error (MAE) of 3.1 kcal/mol, was obtained using a neural network with Randomly Sorted Coulomb matrices. The improvement was three-fold with respect to the previous state of the art, which gave a MAE of 9 kcal/mol; a level of 1 kcal/mol is desired to reach chemical accuracy. Later work [3] improved upon this result, reducing the state-of-the-art error to 1.5 kcal/mol using a bag-of-bonds descriptor with a Laplacian kernel.

While Neural Networks (especially deep networks) have recently achieved state-of-the-art results in fields ranging from computer vision and natural language processing to speech recognition, one of the main challenges they pose is the difficulty of training them without domain expertise. In order to obtain the competitive performance of 3.1 kcal/mol using multilayer feed-forward neural networks, multiple tricks inspired from [6] were used. First of all, taking the raw entries of the Coulomb matrix as input to the neural net proved to perform poorly. To overcome this, a binarization step was performed which adds an extra dimension to the data. Specifically, every x ∈ R^D representing a flattened Coulomb matrix is mapped to a new binarized descriptor y ∈ R^{D×M} such that each y_i ∈ R^{1×M} is computed by passing shifted versions of the entry x_i through a sigmoidal function:

y_i = [\ldots,\ \tanh\!\big(\tfrac{x_i - \theta}{\theta}\big),\ \tanh\!\big(\tfrac{x_i}{\theta}\big),\ \tanh\!\big(\tfrac{x_i + \theta}{\theta}\big),\ \ldots]   (2)

Depending on the range of x_i, the effective size of y_i can vary across dimensions if we ignore the saturated values. The other parameters, selected using cross-validation, are the number of hidden units per layer, the number of layers and the learning rate.
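A minimal sketch of the expansion in eq. (2); the step size theta and the number of shifts below are illustrative assumptions, not the values used in [5].

import numpy as np

def binarize(x, theta=1.0, n_shifts=5):
    # x: (D,) flattened Coulomb matrix. Each entry x_i is expanded into
    # M = 2*n_shifts + 1 values tanh((x_i + k*theta) / theta), k = -n_shifts..n_shifts,
    # giving a (D, M) "binarized" descriptor as in eq. (2).
    x = np.asarray(x, dtype=float)
    shifts = np.arange(-n_shifts, n_shifts + 1) * theta
    return np.tanh((x[:, None] + shifts[None, :]) / theta)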

B. Self-Taught Learning: Transfer Learning from Unlabeled Data

One of the challenges in supervised learning tasks is to obtain labeled data. Usually, the higher the dimensionality of the input representation, the more labeled data is required for good generalization performance. Since obtaining large amounts of labeled data often requires tedious manual work, people frequently employ Amazon Mechanical Turk for labeling. To reduce the labeling effort, the authors introduce another machine learning framework called self-taught learning, whose core idea is that unlabeled data is much easier to obtain. In this setup, both labeled and unlabeled data are used for training a classifier: the labeled data is transformed into a higher-level representation learned from a large amount of unlabeled data, and the new representation is used to train a classifier on the labeled data. This setup is expected to have the biggest advantage when the available labeled data is scarce.

Depending on the type of data available (labeled and/or unlabeled), a number of training/testing setups are currently used:
• Supervised classification - training labels are provided.
• Semi-supervised classification - additional unlabeled training data is available and is assumed to come from the same class distribution as the labeled data.
• Transfer learning - training labels are provided, but the test data is assumed to come from a different distribution.
• Self-taught learning - additional unlabeled training data is available, but it is not assumed to come from the same distribution as the labeled data.

An intuitive example is shown in Fig. 1, where the orange bounded boxes represent labeled data.

1) Problem formulation: Self-taught learning can be expressed as a two-step optimization problem.

In the first step, the unlabeled data x_u^{(i)} ∈ R^n, i = 1..k, is used to find a set of bases b_j ∈ R^n, j = 1..s, and sparse activations a^{(i)} ∈ R^s such that each x_u^{(i)} can be reconstructed from the s basis vectors and its corresponding activation weights:

\min_{a, b} \sum_{i=1}^{k} \lVert x_u^{(i)} - \sum_j a_j^{(i)} b_j \rVert_2^2 + \beta \lVert a^{(i)} \rVert_1 \quad \text{s.t. } \lVert b_j \rVert_2 \le 1,\ j = 1..s   (3)

Page 3: EDIC RESEARCH PROPOSAL 1 Machine Learning … exam... · Machine Learning techniques for predicting molecular properties ... people often employ Amazon Mechanical Turk for ... introduce

EDIC RESEARCH PROPOSAL 3

Fig. 1. Different learning setups

The equation above can be optimized using the algorithm from [4] by alternating the optimization over a and b, i.e., fixing one variable and optimizing the other at every step. The problem is equivalent to an L1-regularized least-squares problem in the variable a and a constrained least-squares problem in the variable b. The L1 regularization enforces sparse representations; it was preferred to other forms of regularization because experimentally it appeared to perform better across different datasets. We note that PCA could also be used for finding a new representation of the input, but it leads to a different kind of solution in which the activations are linear combinations of the input; moreover, PCA cannot generate more bases than the input dimension.

In the second step, the bases b learned in the previous step are used to find a higher-level representation of the labeled input data x_l^{(i)}, i = 1..m, with corresponding target values y^{(i)}, i = 1..m. The new representation, given by the set of activation vectors a(x_l^{(i)}), i = 1..m, is obtained by solving another L1-regularized optimization problem:

a(x_l^{(i)}) = \arg\min_{a^{(i)}} \lVert x_l^{(i)} - \sum_j a_j^{(i)} b_j \rVert_2^2 + \beta \lVert a^{(i)} \rVert_1.   (4)

As a last step, a classifier is trained on the pairs {a(x_l^{(i)}), y^{(i)}}, i = 1..m, and compared with a classifier of the same type (e.g. an SVM) trained on the initial pairs {x_l^{(i)}, y^{(i)}}, i = 1..m.

2) Experimental Results: Self-taught learning was extensively tested across a range of applications such as computer vision with raw pixel intensities as features, natural language processing with bag-of-words features and speech recognition with frequency histograms. Using an SVM classifier with the sparse activation features as input instead of the initial raw features performed better in all but one of the tasks. This suggests that such a setup can be employed on datasets belonging to domains beyond the ones presented here.
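As an illustration of the two-step setup of eqs. (3)-(4), the following sketch uses scikit-learn's dictionary learning in place of the exact solver of [4]; the number of bases, the sparsity weight and the linear SVM are assumptions for the example, not the settings from [8].

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

def self_taught(X_unlabeled, X_labeled, y_labeled, n_bases=256, beta=1.0):
    # Step 1 (eq. 3): learn sparse bases b_j from the unlabeled data.
    dico = MiniBatchDictionaryLearning(n_components=n_bases, alpha=beta,
                                       transform_algorithm="lasso_lars",
                                       transform_alpha=beta, random_state=0)
    dico.fit(X_unlabeled)
    # Step 2 (eq. 4): encode the labeled data as sparse activations a(x_l).
    A = dico.transform(X_labeled)
    # Step 3: train a standard classifier on the new representation.
    clf = LinearSVC().fit(A, y_labeled)
    return dico, clf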

C. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting

Many applications require the optimization of a black-box function f : D → R that is expensive to evaluate. Among them we mention active user modelling, hierarchical reinforcement learning [1] and, more recently, hyperparameter optimization of machine learning models [11]. This problem can be posed as a multi-armed bandit problem where the function to be estimated is sampled from a Gaussian Process.

The optimization is done in a sequential manner: at round t a point x_t in the domain is chosen according to an acquisition function and the value f(x_t) is evaluated. The goal is to maximize f with as few samples as possible. Among the commonly used acquisition functions we note the Probability of Improvement, the Expected Improvement and the Upper Confidence Bound (GP-UCB, the focus of this paper), with the latter two experimentally shown to perform better in practice.

The effectiveness of a sampling scheme in finding the optimum x* = arg max_{x∈D} f(x) (which may not be unique) is evaluated using the instantaneous and the cumulative regret. The instantaneous regret r_t = f(x*) − f(x_t) is the error made at round t for choosing the point x_t instead of x*. The cumulative regret is the error accumulated up to the current round, R_T = Σ_{t=1}^{T} r_t.

Despite the practical success of GP optimization, little was known about its theoretical convergence properties for finite or infinite-dimensional input spaces. The main contribution of the paper is to show sublinear convergence rates of the cumulative regret of GP-UCB in T, the number of rounds. In other words, as T goes to infinity the average regret vanishes: lim_{T→∞} R_T / T = 0.

The bound on the cumulative regret R_T is obtained in two steps: first R_T is bounded using the information gain, then the information gain is bounded using the empirical spectrum and the kernel operator spectrum.

1) Gaussian Process basics: A Gaussian Process (GP) defines a distribution over functions f, f ∼ GP(u(x), k(x, x′)), in a similar way in which probability densities define distributions over random variables. It is fully specified by its mean function u(x) and its covariance (kernel) function k(x, x′). In a Bayesian setting, we assume our function f is sampled from a GP prior GP(0, k(x, x′)). The most common kernel types are the linear, squared exponential and Matérn kernels.

For T sampled points in the presence of noise we observe y_t = f(x_t) + ε_t, where ε_t ∼ N(0, σ²). The posterior over f, conditioned on y_{1:T} and x_{1:T}, is Gaussian, P(f(x) | y_{1:T}, x_{1:T}) = N(u_T(x), σ_T²(x)), where:
• K_T is the positive definite kernel matrix with entries k(x, x′), x, x′ ∈ A_T
• k_T(x) = [k(x_1, x), ..., k(x_T, x)]^T
• k_T(x, x′) = k(x, x′) − k_T(x)^T (K_T + σ²I)^{-1} k_T(x′)
• u_T(x) = k_T(x)^T (K_T + σ²I)^{-1} y_T
• σ_T²(x) = k_T(x, x)
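The posterior formulas above translate directly into a few lines of NumPy; the RBF kernel and its length-scale below are illustrative choices, not prescribed by [13].

import numpy as np

def rbf(A, B, length_scale=1.0):
    # Squared exponential kernel k(x, x') with k(x, x) = 1.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_query, sigma=0.1, length_scale=1.0):
    K = rbf(X, X, length_scale)                               # K_T
    k_star = rbf(X, X_query, length_scale)                    # k_T(x) for every query point
    K_inv = np.linalg.inv(K + sigma ** 2 * np.eye(len(X)))    # (K_T + sigma^2 I)^{-1}
    mu = k_star.T @ K_inv @ y                                 # u_T(x)
    var = 1.0 - np.einsum('ij,ji->i', k_star.T @ K_inv, k_star)   # sigma_T^2(x) = k_T(x, x)
    return mu, np.maximum(var, 0.0)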

2) Information Gain and Experimental Design: Unlike function optimization, where the goal is to find the maximum of a function, in experimental design the goal is to find a good approximation of the function globally with as few samples as possible. We note that the same algorithm cannot be employed for both tasks, since in function optimization we might want to avoid sampling in regions where we are confident that the function has small values.

For a sampled set A ⊆ D, define y_A = f_A + ε_A, where f_A = [f(x)]_{x∈A} and ε_A ∼ N(0, σ²I). The mutual information gain is

I(y_A; f) = H(y_A) − H(y_A | f)   (5)

where H(y_A) and H(y_A | f) represent the entropy, or uncertainty, in y_A and the uncertainty in y_A once f is known. In other words, the information gain expresses the reduction in uncertainty about f obtained by observing y_A.

Using the entropy formula for a Gaussian, H(N(u, Σ)) = ½ log|2πeΣ|, we can rewrite F(A) = I(y_A; f) = I(y_A; f_A) = ½ log|I + σ^{-2} K_A|. Finding the subset A of the domain D with |A| ≤ T that maximizes I(y_A; f) is NP-hard. In practice, greedy approximations are used that find a near-optimal solution by choosing, at every step, the point with maximum posterior variance. Thus, the following equivalence holds at round t, for t ∈ 1..T:

x_t = \arg\max_{x \in D} F(A_{t-1} \cup \{x\}) \iff x_t = \arg\max_{x \in D} \sigma_{t-1}(x)   (6)

In [7] it is shown that for any submodular function F(A), the set A_T obtained using the greedy approach above satisfies

F(A_T) \ge \Big(1 - \frac{1}{e}\Big) \max_{|A| \le T} F(A).   (7)

Thus, greedy construction of A_T leads to a solution close to the optimum.
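As a concrete illustration, the sketch below greedily selects the candidate that maximizes F(A) = ½ log|I + σ^{-2} K_A| at every step (equivalently, the point of maximum posterior variance); the kernel and noise level are illustrative assumptions.

import numpy as np

def rbf(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def info_gain(K_A, sigma=0.1):
    # F(A) = 1/2 log |I + sigma^-2 K_A|
    sign, logdet = np.linalg.slogdet(np.eye(len(K_A)) + K_A / sigma ** 2)
    return 0.5 * logdet

def greedy_design(candidates, T=10, sigma=0.1, length_scale=1.0):
    chosen = []
    for _ in range(T):
        gains = []
        for i in range(len(candidates)):
            if i in chosen:
                gains.append(-np.inf)
                continue
            A = candidates[chosen + [i]]
            gains.append(info_gain(rbf(A, A, length_scale), sigma))
        chosen.append(int(np.argmax(gains)))      # near-optimal by submodularity, eq. (7)
    return chosen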

3) GP-UCB: For the function optimization problem, a good sampling sequence x_t is one that trades off exploration and exploitation of the search space. By choosing points with high variance (exploration) we try to reduce the overall uncertainty; by choosing points with high mean (exploitation) we concentrate samples in a region we are confident is close to the optimum. This idea is expressed by choosing, at every round t, a point x_t according to

x_t = arg max_{x∈D} u_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

with β_t depending on the kernel type. The function being maximized is called the acquisition (or utility) function and corresponds to the Upper Confidence Bound criterion.

To sum up, at every round the GP-UCB algorithm selects a point according to the above equation, evaluates y_t = f(x_t) + ε_t and updates u_t and σ_t using Bayes' rule.

An important property of the UCB defined as above is that it never chooses points x_t whose upper confidence value is smaller than the maximum lower confidence value found so far; this can prune large subareas of the search space.
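Below is a self-contained sketch of the GP-UCB loop on a finite candidate set, following the selection rule above; the RBF kernel, the constant beta and the noise level are illustrative assumptions rather than the theoretically prescribed choices of [13].

import numpy as np

def rbf(A, B, length_scale=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_ucb(f, candidates, rounds=30, sigma=0.1, beta=4.0, length_scale=0.5, seed=0):
    # f: noisy black-box function, candidates: (n, d) finite domain D.
    rng = np.random.default_rng(seed)
    X, y = [], []
    for t in range(rounds):
        if not X:
            idx = int(rng.integers(len(candidates)))          # first point: random
        else:
            Xa, ya = np.array(X), np.array(y)
            K_inv = np.linalg.inv(rbf(Xa, Xa, length_scale) + sigma ** 2 * np.eye(len(Xa)))
            k_star = rbf(Xa, candidates, length_scale)
            mu = k_star.T @ K_inv @ ya                        # posterior mean u_{t-1}(x)
            var = np.maximum(1.0 - np.einsum('ij,ji->i', k_star.T @ K_inv, k_star), 0.0)
            idx = int(np.argmax(mu + np.sqrt(beta) * np.sqrt(var)))   # UCB rule
        x_t = candidates[idx]
        X.append(x_t)
        y.append(f(x_t) + sigma * rng.standard_normal())      # noisy evaluation y_t
    best = X[int(np.argmax(y))]
    return best, X, y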

4) Regret Bounds: This paper is the first to prove sublinear convergence rates for the cumulative regret R_T for both finite and infinite input domains D, under smoothness assumptions on the kernel. The authors show that in the limit GP-UCB has no regret, lim_{T→∞} R_T / T = 0, and the convergence results for the most popular kernels, summarized in Table I, hold up to polylog factors. We notice that the dependence on the dimensionality d of the input is weak: for the linear kernel the dependency is linear, and for the squared exponential kernel d appears only as the exponent of log T.

TABLE I. Regret bounds (up to polylog factors)
Kernel     Regret R_T
Linear     d √T
RBF        √T (log T)^{d+1}
Matérn     T^{(ν + d(d+1)) / (2ν + d(d+1))}

5) Bounding the Regret using the Information Gain: The first step of the proof is to bound R_T by the maximum information gain γ_T = max_{A⊆D, |A|=T} I(y_A; f_A) after T rounds. In particular, it is shown that with high probability:
• Pr(R_T ≤ √(C_1 T β_T γ_T)) ≥ 1 − δ for finite D
• Pr(R_T ≤ √(C_2 T β_T γ_T) + 2) ≥ 1 − δ for D compact and convex
where δ ∈ (0, 1) and C_1, C_2, β_t are appropriate constants. The second inequality requires an additional smoothness assumption on the function, e.g. that the probability of its partial derivatives taking large values is small.

6) Bounding the Information Gain using the empirical spectrum: The case of an infinite input domain is treated similarly to the finite case by assuming the existence of discretizations D_T ⊂ D, T ∈ N, with nearest-neighbor distance O(1), dense in the limit. As stated previously, the value of γ_T = max_{A⊂D_T, |A|=T} I(y_A; f_A) can be bounded using the greedy maximization value, γ_T ≤ (1 / (1 − e^{-1})) F(A_T), since F is a submodular function.

Choosing a point x_t is equivalent to selecting a vector v_t ∈ R^{|D_T|} that is zero everywhere except at the entry corresponding to x_t, so that the observation is distributed as N(v_t^T f, σ²). By relaxing the constraint to ||v_t|| = 1, it is shown that the v_t's can be taken among the eigenvectors of the kernel matrix K_{D_T} = [k(x, x′)]_{x, x′ ∈ D_T}. The maximum information gain is then bounded by an expression involving the eigenvalues λ_t of K_{D_T}:

\gamma_T \le \frac{0.5}{1 - e^{-1}} \max_{m_1, \ldots, m_{|D_T|}} \sum_{t=1}^{|D_T|} \log(1 + \sigma^{-2} m_t \lambda_t)   (8)

where m_t ∈ N, Σ_t m_t = T and λ_1 ≥ λ_2 ≥ λ_3 ≥ ... Here a worst-case scenario is considered for assigning eigenvectors of K_{D_T} to the v_t. In the next step, the information gain is bounded using the tail of the empirical spectrum of K_{D_T}, B(T_*) = Σ_{t > T_*} λ_t, together with n_T = Σ_t λ_t, for any T_* = 1..T:

\gamma_T \le O\big(\sigma^{-2} [B(T_*)\, T + T_* \log(n_T T)]\big)   (9)

From the equation above we conclude that if the tail spectrum B(T_*) is small, or in other words, if the eigenspectrum of K_{D_T} decays fast, the bound on γ_T is tighter.

trum: For a kernel k with∫k(x, x)u(x) = 1 where u(x)

is uniformly distributed on D, Mercer’s theorem assures theexistence of a discrete kernel eigenspectrum {λs(x), φs(x)}with k(x, x′) =

∑λsφs(x)φs(x

′).


Another contribution of this paper is to bound the empirical tail spectrum B(T_*) by the tail of the kernel operator eigenspectrum, B_k(T_*) = Σ_{s > T_*} λ_s. Analytical bounds on B_k(T_*) for the most common kernels are given by Seeger [10].

8) Conclusions: The contribution of the paper is twofold. First, a novel connection was made between experimental design and GP optimization by bounding the cumulative regret by the maximum information gain. Second, sublinear regret bounds for GP-UCB optimization were shown for the first time, with weak dependence on the dimensionality of the input. The bound is derived in two steps: the cumulative regret is bounded by the maximum information gain, which in turn is bounded via the tails of the empirical eigenspectrum and of the kernel operator spectrum. The more rapidly the kernel spectrum decays, the tighter the bound.

Although convergence rates are proven for infinite input spaces as well, the GP update incorporates the inversion of the covariance matrix, making the algorithm cubic in the number of samples T; this often makes it impractical for high-dimensional inputs, where many samples are needed.

III. RESEARCH PROPOSAL

I propose a new descriptor that is more robust to atom indexing, called the Bag of Bonds Histogram (BoBH). Its size does not depend on the number of atoms in a molecule but on the number of distinct atom types. Every bag in a BoBH descriptor is a histogram of quantized distances which encodes information for all pairs of atoms of the same composition. The size of a histogram bag is given by the maximum distance between a pair of atoms of that bag type. The quantization level of the distances can vary between bag types; we experimentally found that a quantization level of 0.2 or 0.25 performs well in practice, although this can vary across datasets.

Thus, for a given molecule, the BoBH descriptor encodes:
• for every atom type present in the dataset, how many times it appears in the molecule;
• for every pair of atom types, how many times the distance between two such atoms falls in a given quantization interval.

An example is given in Fig. 2 for the molecule C2H4O with a quantization level q = 1. In this example, first the bags containing H are concatenated, then the ones containing C and finally the ones containing O, resulting in a descriptor of size 21. In general, the size of the descriptor is

N_A + \frac{N_A (N_A + 1)}{2} \cdot \frac{D_{max}}{q},

where N_A is the number of atom types and D_max is the maximum distance between a pair of atoms.
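Since the BoBH descriptor is only sketched in prose above, the following is a minimal, hypothetical implementation of my reading of it (per-type counts followed by a distance histogram for every unordered pair of atom types); the binning details are my assumptions, not a fixed specification.

import numpy as np
from itertools import combinations, combinations_with_replacement

def bobh(symbols, R, atom_types, d_max, q=0.25):
    # symbols: list of element symbols per atom, R: (n, 3) positions,
    # atom_types: element symbols present in the dataset, d_max: maximum
    # pairwise distance in the dataset, q: quantization level.
    n_bins = int(np.ceil(d_max / q))
    counts = [float(symbols.count(t)) for t in atom_types]        # per-type atom counts
    bags = {pair: np.zeros(n_bins)
            for pair in combinations_with_replacement(sorted(atom_types), 2)}
    for i, j in combinations(range(len(symbols)), 2):
        pair = tuple(sorted((symbols[i], symbols[j])))
        d = np.linalg.norm(np.asarray(R[i]) - np.asarray(R[j]))
        bags[pair][min(int(d // q), n_bins - 1)] += 1             # quantized distance bin
    return np.concatenate([np.array(counts)] + [bags[p] for p in sorted(bags)])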

Fig. 2. Bag of Bonds Histogram. Sample computation of the Bag of Bonds Histogram for C2H4O in a dataset which contains only atoms of type O, C and H. If the maximum distance between any two atoms in the dataset is 3, the size of the BoBH descriptor with quantization level 1 is 21, as shown above. The first entry encodes the number of H atoms, 4. The following 3 elements encode the number of (H, H) distances that fall in the intervals [0,1), [1,2) and [2,3).

To sum up, molecular descriptors achieve translation and rotation invariance through the use of distances between atoms instead of the actual 3D positions. Sorted Coulomb and Randomly Sorted Coulomb achieve invariance to index permutation by sorting the rows of the matrix according to their norms. The Bag of Bonds Histogram places atom pairs with the same composition in the same bag and quantizes the distances into bins, thereby bypassing the sorting step. Preliminary results on the same dataset as in [5] are promising, with a MAE of 2.6 kcal/mol.

Another issue we would like to address is that preprocessing molecular data for a machine learning setting requires extensive domain-specific knowledge. As an example, generating the folds for cross-validation using stratified sampling is a good strategy for obtaining good generalization performance, but in practice we should not expect to know the range of the test data. If we test the performance on a new molecule whose atomization energy is not within the bounds present in the training dataset, it is likely that the prediction will not be very accurate. Moreover, if we have a molecule with more atoms than any molecule present in the dataset, we would need to retrain our model with a descriptor of different size.

The same problem appears across chemical compound space if we try to predict the atomization energy of a molecule with the same number of atoms but with different atom types in its composition. In general, we have no guarantee that a molecule with similar atom composition and similar atom types has a target value in the same range as our training set.

The labeled datasets¹ currently used contain on the order of 7k or 140k samples and are relatively small compared with dataset sizes in other domains. Specifically, they contain at most 6 atom types and at most 29 atoms per molecule. In chemoinformatics, the datasets cannot easily be augmented using external annotation tools, since labeling requires domain knowledge and implies time-intensive computations.

¹All datasets can be downloaded from www.quantum-mechanics.org

One proposed solution to the above issues is to use large amounts of unlabeled data as in [8]. Although self-taught learning in [8] was tested only on classification tasks and predicting properties of molecules is a regression task, employing large amounts of unlabeled molecular data for obtaining higher-level representations of molecules has the potential of increasing performance across datasets while avoiding the labeling effort.

Our problem can be formulated as a semi-supervised, transfer-learning or self-taught learning problem by making it generalize across chemical compound space and learn new embeddings of the molecules. As an example, we could learn a set of basis features from one dataset containing {H, C, O, S, N} and use this representation for a new set containing {H, C, O, S, Cl}.

Another problem we would like to investigate is the difficulty of training neural networks. The authors of [5] make use of various tricks [6] to obtain state-of-the-art performance on the molecular datasets. A grid-search approach for choosing the hyperparameters of a model leads to exponential time complexity in the number of parameters. Recent work [11], [12] suggests the use of Bayesian optimization techniques (such as GP-UCB or Bayesian neural networks) for hyperparameter tuning. In this setup, the function to be maximized is the accuracy on the validation set, while the input is the concatenation of the hyperparameters (learning rate, type of activation function, number of units per hidden layer, etc.). The speedup in training time can be quite significant, since techniques such as GP-UCB employ an exploration-exploitation trade-off that prunes large areas of the search space. In [12] the authors experimentally demonstrate on an NLP task that a simple model tuned with Bayesian optimization can outperform a more complex model (such as a Recurrent Neural Network). Such techniques, which were successfully applied to fine-tuning NLP and vision models, can be employed for molecular models as well. One further direction is to try to reach the desired chemical accuracy of ∼1 kcal/mol MAE for atomization energy by fine-tuning existing models with Bayesian optimization. This would allow us to search over a larger space of models with reduced computational time compared to grid search, and potentially obtain state-of-the-art performance by combining existing setups and descriptors.
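As an illustration of this direction, the sketch below tunes two hypothetical hyperparameters with the scikit-optimize library (an assumption; neither the library nor the train_and_validate helper appear in the cited work), using the lower-confidence-bound acquisition, which is the minimization counterpart of GP-UCB.

from skopt import gp_minimize
from skopt.space import Real, Integer

def objective(params):
    learning_rate, n_hidden = params
    # train_and_validate is a hypothetical helper that trains the network with
    # the given hyperparameters and returns the validation accuracy.
    acc = train_and_validate(learning_rate=learning_rate, n_hidden=n_hidden)
    return -acc                      # gp_minimize minimizes, so negate the accuracy

space = [Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
         Integer(32, 512, name="n_hidden")]

result = gp_minimize(objective, space, n_calls=50, acq_func="LCB", random_state=0)
print("best hyperparameters:", result.x, "validation accuracy:", -result.fun)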

REFERENCES

[1] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. 2009.

[2] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik. The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.

[3] K. Hansen, F. Biegler, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko. Interaction potentials in molecules and non-local information in chemical space. Phys. Rev. Lett., March 2014.

[4] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficientsparse coding algorithms. In Advances in Neural Information ProcessingSystems 19, pages 801–808. 2007.

[5] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 440–448, 2012.

[6] Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors. Neural Networks: Tricks of the Trade - Second Edition, volume 7700 of Lecture Notes in Computer Science. Springer, 2012.

[7] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.

[8] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and An-drew Y. Ng. Self-taught learning: Transfer learning from unlabeleddata. In Proceedings of the 24th International Conference on MachineLearning, ICML ’07, pages 759–766, New York, NY, USA, 2007. ACM.

[9] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108:058301, Jan 2012.

[10] MW. Seeger, SM. Kakade, and DP. Foster. Information consistencyof nonparametric gaussian process methods. IEEE Transactions onInformation Theory, 54(5):2376–2382, May 2008.

[11] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[12] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. February 2015.

[13] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger.Information-theoretic regret bounds for gaussian process optimizationin the bandit setting. IEEE Transactions on Information Theory,58(5):3250–3265, May 2012.

[14] Roberto Todeschini and Viviana Consonni. Handbook of moleculardescriptors, volume 11. Wiley-VCH, 2000.