arXiv:1510.07512v3 [physics.chem-ph] 12 May 2016
Machine Learning, Quantum Mechanics, and Chemical Compound Space

Raghunathan Ramakrishnan 1,* and O. Anatole von Lilienfeld 1,†

1 Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland

We review recent studies dealing with the generation of machine learning models of molecular and materials properties. The models are trained and validated using standard quantum chemistry results obtained for organic molecules and materials selected from chemical space at random.

INTRODUCTION

Over the last couple of years a number of machine learning (ML) studies have appeared which all have in common that quantum mechanical properties are predicted using regression models defined in chemical compound space (CCS). CCS refers to the combinatorial set of all compounds that could possibly be isolated and constructed from all possible combinations and configurations of atoms; it is unfathomably huge. For example, in a 2004 Nature Insight issue, the number of small organic molecules expected to be stable was estimated to exceed 10^60 [1–3]. By contrast, current records held by the Chemical Abstracts Service of the American Chemical Society account for only ∼100 million compounds characterized so far. Conventionally, CCS has been a common theme in the experimental branches of chemistry, as well as in chemoinformatics and bioinformatics. A review of the quantum mechanical first-principles, or ab initio, view on CCS has recently been published by one of us [4]. The quantum mechanical framework is crucial for the unbiased exploration of CCS since it enables, at least in principle, the free variation of nuclear charges, atomic weights, atomic configurations, and electron number. These variables feature explicitly in the system's Hamiltonian. For most molecules in their relaxed geometry the Born-Oppenheimer approximation is applicable, rendering the calculation of solutions to the electronic Schrödinger equation the most time-consuming aspect. Any ab initio approximation, be it density functional theory (DFT), post-Hartree-Fock (post-HF), or quantum Monte Carlo (QMC) based, can be used to that end [5]. By virtue of the correspondence principle in quantum mechanics (QM) [6], and through the appropriate application of statistical mechanics (SM) [7], it should be clear that any observable can be calculated, or at least estimated, for any chemical species in CCS—as long as sufficient computational resources are available. For select observables, including energies, forces, and a few other properties, semi-empirical quantum chemistry methods or universal or reactive force fields (FFs) [8–11] can also be used instead of QM, provided that they are sufficiently accurate and transferable.

Due to all the possible combinations of assembling many and various atoms, the size of CCS formally scales with compound size as O(Z_max^{N_I}), i.e. exponentially. Here, Z refers to the nuclear charge of an atom (Z is a natural number), Z_max corresponds to the maximal permissible nuclear charge in Mendeleev's table, typically Z_max < 100 (theoretical estimates exist though for Z < 173 [12]), and N_I corresponds to the number of nuclear sites. N_I is a natural number that defines the extension of a "system", and it can reach Avogadro-number scale for assemblies of cells, or amorphous and disordered matter. More rigorously defined within a Born-Oppenheimer view on QM, CCS is the set of combinations for which the potential energy E is minimal in real atomic positions {R_I}, for given nuclear charges {Z_I} and positive electron number. Within this framework, we typically do not account for temperature effects, isotopes, grand-canonical ensembles, environmental perturbations, external fields, relativistic effects, or nuclear quantum effects. At higher temperatures, for example, the kinetic energy of the atoms can be too large to allow for detectable lifetimes of molecular structures corresponding to shallow potential energy minima. For a more in-depth discussion of stability in chemistry versus physics see also Ref. [13]. The dimensionality of CCS then corresponds to 4N_I + 3: for each atom there are three spatial degrees of freedom in Cartesian space and one degree of freedom corresponding to the nuclear charge. An additional degree of freedom specifies the system's handedness. Finally, there are two electronic degrees of freedom: one representing the number of electrons (N_e), the other the electronic state (molecular term symbol). Rotational and translational degrees of freedom can always be subtracted. While certainly huge and high-dimensional, it is important to note that CCS is by no means infinite. In fact, the number of possible nuclear charges is quite finite, and due to near-sightedness long-range contributions rarely exceed the range of 1 nm in neutral systems [14]. Also, molecular charges are typically not far from zero, i.e. N_e ∼ Σ_I Z_I. Furthermore, it should also be obvious that not all chemical elements "pair" in terms of nearest neighbors. As such, we consider it self-evident that there is a host of exclusion rules and redundancies which result in a dramatic lowering of the effective dimensionality of CCS. Hence, seen as a QM property space spanned by combinations of atomic configurations and identities, it is natural to ask how interpolative regression methods perform when used to infer QM observables, rather than solving the conventional quantum chemistry approximations to Schrödinger's equation (SE) [15].

Within materials informatics and chemoinformatics, ML and inductive reasoning are known for their use in so-called quantitative structure-property (or activity) relationships (QSPR/QSAR). Despite a long tradition of ML methods in pharmaceutical applications [16–20], and many success stories for using them as filters in order to rapidly condense large molecular libraries [21], their overall usefulness for molecular design has been questioned [22]. In our opinion, these models fail to systematically generalize to larger and more diverse sets of molecules, compounds, or properties because they lack a rigorous link to the underlying laws of QM and SM. By contrast, within the realm of theoretical physical chemistry, the fitting of functions to model physical relationships is well established and hardly controversial. The use of regression methods has a long-standing track record of success in theoretical chemistry for rather obvious reasons: the cost of solving the SE has been and still is substantial (due to polynomial scaling with the number of electrons in the molecule), and the dimensionality of the movements of atoms is high, i.e. directly proportional to the number of atoms in the system. The issue becomes particularly stringent in the context of accurately fitting potential energy surfaces (PES) to rapidly calculate spectroscopic properties, perform molecular dynamics (MD), or calculate phonons in solids. Fitting PES is a classic problem in molecular quantum chemistry, and has triggered substantial research ever since Herzberg's seminal contributions. The studies of Wagner, Schatz, and Bowman introduced the modern computing perspective in the 80s [23, 24]. Neural network fits to PES started to appear with Sumpter and Noid in 1992 [25], followed by Lorenz, Gross, and Scheffler in 2004 [26], Manzhos and Carrington in 2006 [27], Behler and Parrinello in 2007 [28], and others [29]. The latter contributions mostly use the resulting models for extended MD calculations. Kernel-based ML methods have also been used for similar potential-fitting problems, already in 1996 [30] and more recently in 2010 [31]. For more information on fitting force fields using flexible and highly parametric functions, rather than models inspired by physical models of interatomic interactions with minimal parameter sets, the reader is referred to the recent reviews in Refs. [32–34].

ML models trained across chemical space are significantly less common in physical chemistry. Notable exceptions, apart from the papers we review herewithin, include (in chronological order) attempts to use ML for the modeling of enthalpies of formation [35], density functionals [36], melting points [37], basis-set effects [38], reorganization energies [39], chemical reactivity [40], solubility [41], polymer properties [42], crystal properties [43, 44], or frontier orbital eigenvalues [45]. Automatic generation of ML models, trained across chemical space, of the expectation value of the quantum mechanical Hamiltonian of a realistic molecule, i.e. the observable energy, has only been accomplished within the rigorous realm of physical chemistry [46]. More specifically, this seminal contribution demonstrated that one can also use ML instead of solving the SE within the Born-Oppenheimer approximation in order to find the potential energy of a system as the eigenvalue of the electronic Hamiltonian, H({Z_I, R_I}) ↦ E via the wavefunction Ψ, for a given set of nuclear charges {Z_I} and atomic positions {R_I}. In other words, it is possible to generate a regression model which (i) directly maps nuclear charges and positions to the potential energy, {Z_I, R_I} ↦ E, and (ii) can be improved systematically through the mere addition of training instances. The electronic Hamiltonian H for solving the non-relativistic time-independent SE within the Born-Oppenheimer approximation and without external electromagnetic fields, HΨ = EΨ, of any compound with a given charge, Q = N_p − N_e, is uniquely determined by its external potential, v(r) = Σ_I Z_I/|r − R_I|, i.e. by {R_I, Z_I}. As such, relying on the SE, as opposed to alternative interatomic or coarse-grained ad hoc models, has the inherent advantage of universality and transferability—an aspect which is conserved in our ML models. This aspect is crucial for the usefulness of the ML models generated in the context of explorative studies of CCS, e.g. when devising virtual compound design strategies. Throughout our work we strive to maintain and apply the same fundamental QM-based rigor that also underlies the success of quantum chemistry for modeling molecules, liquids, and solids.

Paradigm. Our Ansatz to chemistry relies on combining conventional QM-based "Big Data" generation with supervised learning. The paradigm rests squarely on inductive reasoning, applied in the rigorous context of supervised learning and QM. Generally speaking, we believe that this ML approach can be applied successfully to the modeling of physical properties of chemicals if

(i) there is a rigorous cause-and-effect relationship connecting system to property, such as the quantum mechanical expectation value of a ground-state property being coupled to the system's Hamiltonian via its wavefunction;

(ii) the representation accounts for all relevant degrees of freedom necessary to reconstruct the system's Hamiltonian;

(iii) the query scenario is interpolative in nature, meaning that for any query the relevant degrees of freedom assume values in between (or very close to) training instances;

(iv) sufficient training data is available.

Here, we refer to ML as an inductive supervised-learning approach that does not require any a priori knowledge about the functional relationship that is to be modeled, and that improves as more training data is added. This approach corresponds to a mathematically rigorous implementation of inductive reasoning in the spirit of Refs. [47–50]: given sufficient examples of input and corresponding output variables, any function relating the two can be modeled through statistics, and subsequently be used to infer solutions for new input variables—as long as they fall into the interpolating regime. The amount of training data necessary to be considered "sufficient" depends on (a) the desired accuracy of the ML model, (b) its domain of applicability, and (c) its efficiency. Overfitting and transferability can be controlled through careful regularization and cross-validation procedures. As such, ML offers a mathematically rigorous way to circumvent the need to establish an assumed approximate physical model in terms of input parameters or constraints. In chemistry, the historical importance of inductive reasoning is considerable. Examples include the discovery of Mendeleev's table, Hammett's equation [51, 52], Pettifor's structure maps [53], and also the Evans-Polanyi, Hammond, and Sabatier principles, currently popular as volcano plots used for computational heterogeneous catalyst design [54]. Philosophically speaking, even Newton's laws and the postulates of quantum mechanics rely on inductive reasoning. For a rigorous QM-based ML approach, however, all inherent heuristic aspects should be rooted out in order to arrive at truly transferable model functions whose accuracy and applicability can be improved systematically, i.e. through the mere addition of more data.

We reiterate that in any data-driven model generation process, a large number of training examples is required to develop converged, well-performing models. The details of the data's origin are irrelevant to the model. If, for example, thousands of Lennard-Jones (LJ) potential energies are provided, the ML model will attempt to model the LJ function. Since there is no noise in this function, the noise level in the ML model will be negligible. If, by contrast, through some randomized process, sometimes an LJ energy and sometimes a Morse-potential energy is provided as training data, the ML model must cope with the resulting noise. Note that training data could also be of experimental origin. While it is possible, of course, to train ML models on known functions, there is little merit in such models since no computational advantage is gained. In our opinion, investing in an ML model becomes worthwhile only if (a) the overall problem is combinatorially hard and costly evaluations are repeatedly required while relatively few input variables change by small amounts (for example, screening solvent mixtures using ab initio MD calculations); and (b) the generation of training instances is less costly than solving the entire problem from first principles.

In this chapter, we review some of the most recent contributions in the field. More specifically, we first give a brief tutorial summary of the employed ML model in the Kernel Ridge Regression section. In section Representation we then discuss the various representations (descriptors) used to encode molecular species, in particular the molecular Coulomb matrix (CM), sorted or via its eigenvalues, as introduced in [46] and as it can also be used in the condensed phase [55]; its randomized permuted sets [56]; a pair-wise bag-of-bonds variant used to study the non-local many-body nature of exponential kernels [57]; and a molecular fingerprint descriptor based on a Fourier series of atomic radial distribution functions [58]. We review quantum chemistry data of 134 k molecules [62] in section Data, and a recipe for obtaining many properties from a single kernel [63] in section Kernel. Subsequently, we discuss ML models of properties of electrons, in terms of transmission coefficients in transport [59] and electronic hybridization functions [60, 61], in section Electrons, as well as chemically accurate (< 1 kcal/mol) ML models of thermochemical properties, electron correlation energies, reaction energies [64], and first and second electronic excitation energies [65] in section ∆-Machine Learning. In section Atoms in Molecules we review local, linearly scaling ML models for atomic properties such as forces on atoms, nuclear magnetic resonance (NMR) shifts, core-electron ionization energies [66], as well as atomic charges, dipole moments, and quadrupole moments for force-field predictions [67]. ML models of cohesive energies of solids [55, 68] are reviewed in section Crystals. Finally, Conclusions are drawn at the end.

KERNEL RIDGE REGRESSION

As described previously [46, 63, 65], within the ML approach reviewed herewithin, a generic but highly flexible model kernel function is trained across chemical space with as many free parameters as there are training molecules. Other supervised learning models, based on neural networks or random forests, could be used just as well. We find kernel ridge regression (KRR) appealing because it is simple, robust, and easy to interpret. In the following, our notation follows Ref. [63]. We denote matrices by upper-case bold and vectors by lower-case bold symbols, respectively. As long as all variables in the SE are accounted for in the molecular representation d, arbitrary numerical accuracy can be achieved by the resulting model of some property p—provided that sufficient and diverse data is made available through training records. The overall ML setup is schematically presented in Fig. 1.

This approach models some property p of a query molecule q as a linear combination of similarities to N training molecules t. Similarities are quantified as kernel functions k whose arguments are the distances D_qt between the representations d_q and d_t of q and t, respectively,

p^target(d_q) ≈ Σ_{t=1}^{N} c_t k(D_qt).   (1)

[Figure 1: flow chart. Training (vertical): K_ij = exp(−|d_i − d_j|/σ); c = (K + λI)^{-1} p_train. Prediction (horizontal): p_new = Σ_j c_j exp(−|d_new − d_j|/σ).]

FIG. 1. Flow chart showing the basic machinery of a kernel-ridge-regression ML setup. Vertical flow corresponds to training the model, i.e. calculating regression coefficients through inversion of the symmetric square kernel matrix whose dimensionality is the training set size N. Horizontal flow corresponds to out-of-sample predictions of properties of query compounds, using the regression coefficients to scale the distance-dependent contribution of each training instance. Input compounds are denoted by R, their ML representations, a.k.a. descriptors, by d, and properties by P. σ and λ correspond to hyperparameters that control the kernel width and the noise level, respectively. Their optimization is discussed in the Kernel section.
Using the sorted CM descriptor [46] results in accurate ML models of atomization energies when combined with a Laplacian kernel function, k(D_qt) = exp(−D_qt/σ), with D_qt being the L1 (Manhattan) norm between the two descriptors, |d_q − d_t|, and σ being the kernel width [70]. The regression coefficients c_t, one per training molecule, are obtained through KRR over the training data, resulting in

(K + λI) c = p_train.   (2)

We briefly sketch the derivation of the above equation using the least-squares error measure with Tikhonov regularization [71, 72], using a Tikhonov matrix K^m, where m is usually 0 or 1. Let us denote the reference property values of the training molecules as the column vector p_ref = x. The KRR Ansatz for the estimated property values of the training molecules is p_est = Kc. The L2 norm of the residual vector, penalized by regularization of the regression coefficients, is the Lagrangian

L = ||p_ref − p_est||_2^2 + λ c^T K^m c = (x − Kc)^T (x − Kc) + λ c^T K^m c,   (3)

where (·)^T denotes the transpose operation, and m is a real number. To minimize the Lagrangian, we equate its derivative with respect to the regression coefficient vector c to zero,

dL/dc = −2 x^T K + 2 c^T K^2 + 2 λ c^T K^m = 0.   (4)

Here we have used the fact that the kernel matrix K is symmetric, i.e. K^T = K, along with the identity (d/dc) d^T c = (d/dc) c^T d = d^T, where c is a column vector and d^T is a row vector. After rearranging Eq. (4) and dividing by 2,

(K^2 c + λ K^m c − K x)^T = 0.   (5)

Taking the transpose and subsequently multiplying from the left with K^{-1} results in

(K + λ K^{m−1}) c = x.   (6)

For the values m = 1 and m = 0 we arrive at the equations employed in most KRR-based ML studies, employing the penalty functions λc^T Kc and λc^T c, respectively. When m = 0, the penalty rightly favors a regression coefficient vector c with small norm, decreasing the sensitivity of the model to noisy input. On the other hand, this choice of m increases the computational overhead by requiring the matrix K^2; note that Eq. (5) for m = 0 reads

(K^2 + λI) c = K x.   (7)

For this reason, we typically employ m = 1 in Eq. (6), which leads to the regression problem

(K + λI) c = x,   (8)

also stated above in Eq. (2).
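As a concrete illustration, the following minimal sketch implements the Laplacian-kernel KRR machinery of Eqs. (1) and (8) in Python with NumPy. The descriptors, property values, and hyperparameter values are synthetic placeholders (in practice d would be, e.g., a sorted CM and p a vector of atomization energies); this is a sketch of the regression step, not the trained models of Refs. [46, 63].

import numpy as np

def laplacian_kernel(A, B, sigma):
    # K_ij = exp(-|a_i - b_j|_1 / sigma), pairwise over the rows of A and B.
    D = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-D / sigma)

def krr_train(X, p, sigma, lam):
    # Solve (K + lambda*I) c = p for the regression coefficients, Eq. (8).
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), p)

def krr_predict(X_query, X_train, c, sigma):
    # p(q) = sum_t c_t k(D_qt), Eq. (1).
    return laplacian_kernel(X_query, X_train, sigma) @ c

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # 100 toy "molecules", 10-dimensional descriptors
p = np.sin(X.sum(axis=1))         # synthetic property values
c = krr_train(X, p, sigma=10.0, lam=1e-10)
print(np.abs(krr_predict(X[:5], X, c, sigma=10.0) - p[:5]))   # near-zero residuals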

REPRESENTATION

In general, the choice of the representation d of the relevant input variables is crucial, as reflected by the learning rate, i.e. by the slope of the out-of-sample error versus training set size curve. The more appropriate the representation, the fewer training instances will be required to reach a desirable threshold in terms of predictive accuracy. In our original contribution [46], we proposed to encode molecules through the "Coulomb" matrix (CM). The genesis of the CM is due to the aim to account for the atomic environment of each atom in a molecule in a way that is consistent with QM, i.e. by using only the information that also enters the molecular Hamiltonian, and by encoding it in such a way that the molecular Hamiltonian can always be reconstructed. A simple way to do so is to use the mutual interactions of each atom I with all other atoms in the molecule. Within QM, interatomic energy contributions result from (a) solving the full electronic many-body problem (typically within the BO approximation and some approximation to the electron correlation) and (b) adding the nuclear Coulomb repulsion energy Z_I Z_J / |R_I − R_J|. As a first step we considered only the latter; doing this for atom I in the molecule results in an array of interaction terms, which becomes a complete atom-by-atom matrix when done for the entire molecule. While such an N_I × N_I matrix with unspecified diagonal elements, containing just Coulomb repulsion terms as off-diagonal elements, does uniquely represent a molecule, it turns out to be beneficial for the learning machine to explicitly encode the atomic identity of each atom via the nuclear charge Z_I, or, in order to also have energy units for the diagonal elements, via the potential energy estimate of the free atom, E(Z_I) ≈ 0.5 Z_I^{2.4}. As noted later [65], these diagonal terms are similar to the total potential energy of a neutral atom within Thomas-Fermi theory, E_TF = −0.77 Z^{7/3}, or its modifications with a Z-dependent prefactor in the range 0.4–0.7 [73].

Note that the CM is a variant of the adjacency matrix, i.e. it contains the complete atom-by-atom graph of the molecular geometry, with off-diagonal elements scaling with the inverse interatomic distances. Inverse interatomic distance matrices had already been proposed as descriptors of molecular shape [74]. As the molecules become larger, or when using off-diagonal elements with higher powers of the inverse distance, such representations turn into sparse band matrices. Periodic adaptations of the CM have also been shown to work for ML models of condensed-phase properties, such as cohesive energies [55]. Generation of the CM only requires interatomic distances, implying invariance with respect to translational and rotational degrees of freedom. The CM is also differentiable with respect to nuclear charge or atomic position, and symmetric atoms have rows and columns with the same elements. Consequently, while the full spatial and compositional information of any molecule is accounted for, the global chirality is lost: left- and right-handed amino acids have the same CM. Diastereomers, however, can be distinguished. The indexing of the atoms in the CM is arbitrary; hence one can introduce invariance with respect to indexing by using the N_I eigenvalues of the CM, as was originally done in Ref. [46]. The eigenvalues, however, do not uniquely represent a system, as illustrated in Ref. [75]. Albeit never encountered in the molecular chemical space libraries we have employed so far, this issue can easily be resolved by using a CM with atom indices sorted by the L2 and L1 norms of each row: first, we sort the ordering based on the L2 norm [76]; second, whenever the L2 norms of two rows are degenerate by coincidence, they can subsequently be ordered based on their L1 norms. This choice is justified by the fact that only symmetric atoms will be degenerate in both norms. While the CM itself is differentiable, this sorting can lead to distances between CMs which are not differentiable. The Frobenius norm between two CMs, for example, will no longer be differentiable whenever two non-symmetric atoms (differing L1 norms of their CM rows) in one CM reach degeneracy in the L2 norm of their rows. In other words, for this to happen, two non-symmetric atoms in the same molecule would have to have the same (or very similar) atomic radial distribution functions (including all combinations of atom types). We have not yet encountered such a case in any of the molecular chemical space libraries we have employed so far. However, while molecules corresponding to compositional and constitutional isomers appear to be sufficiently separated in chemical space, this issue of differentiability might well limit the applicability of Frobenius-based similarities that rely on CM representations, or on any other adjacency-matrix-based descriptor for that matter. Dealing with conformational isomers, or reaction barriers, could therefore require additional constraints in order to meet all the criteria of the aforementioned paradigm.
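A minimal sketch of the sorted CM construction described above (diagonal elements 0.5 Z_I^{2.4}, off-diagonal elements Z_I Z_J / |R_I − R_J|, rows and columns ordered by row norm). Coordinates should be in atomic units (Bohr) for consistency with the distance values quoted in the Kernel section; the water geometry below is merely illustrative. The L1 tie-breaking step is omitted for brevity, since degeneracies in the L2 norm are rare in practice.

import numpy as np

def sorted_coulomb_matrix(Z, R):
    # Off-diagonal: nuclear Coulomb repulsion Z_I*Z_J/|R_I - R_J|;
    # diagonal: free-atom potential energy estimate 0.5*Z_I**2.4.
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=2)
    np.fill_diagonal(D, 1.0)               # avoid division by zero on the diagonal
    M = np.outer(Z, Z) / D
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    order = np.argsort(-np.linalg.norm(M, axis=1))   # sort by L2 row norm
    return M[np.ix_(order, order)]

# Water as a toy example (Z = O, H, H; coordinates in Bohr, illustrative):
print(sorted_coulomb_matrix([8, 1, 1],
                            [[0.0, 0.0, 0.22], [0.0, 1.43, -0.89], [0.0, -1.43, -0.89]]))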

A simple brute-force remedy consists of using the sets of all Coulomb matrices one can obtain by permuting atoms of the same nuclear charge. Unfortunately, for larger molecules this solution rapidly becomes prohibitive. For medium-sized molecules, however, randomized permuted subsets have yielded good results [56].

Recently, based on the Fourier transform of the external potential, a new fingerprint descriptor has been proposed which meets all the mathematically desirable requirements: it uniquely encodes molecules while conserving differentiability as well as invariance with respect to rotation, translation, and atom indexing [58]. This descriptor corresponds to a Fourier series of atomic radial distribution functions RDF_I on atom I,

d ∝ Σ_I Z_I^2 cos[RDF_I],   (9)

where linearly independent atomic contributions are obtained (non-vanishing Wronskian). Alas, while this representation does result in decent ML models, its predictive accuracy is inferior to the atom-sorted CM for the datasets of organic molecules tested so far. A more recently proposed scattering-transform-based descriptor also succeeds in uniting all the desirable formal properties of a good descriptor, but so far it has only been shown to work for planar molecules [77].

If we are willing to compromise on the uniqueness property, a pair-wise variant of the CM, called bag-of-bonds (BOB) and corresponding simply to the set of vectors with all off-diagonal elements for all atom-pair combinations, can be constructed [57]. Unfortunately, non-unique descriptors can lead to absurd results, as demonstrated in Ref. [58]. In the case of BOB, arbitrarily many geometries can be found (any homometric pairs) for which a BOB kernel will yield degenerate property predictions. Since homometric molecules differ in energy by all interatomic many-body terms, this can lead to errors substantially larger than chemical accuracy. The predictive accuracy of ML models for properties other than energies will also suffer from representations which cannot distinguish homometric molecules. For example, consider the following homometric case of two clusters consisting of four identical atoms. The geometry of one cluster corresponds to an equilateral triangle with edge length A, with the fourth atom in the center at interatomic distance B. The geometry of the second cluster corresponds to a distorted tetrahedron with one side spanned by an equilateral triangle with edge length B, and the fourth atom located above the triangle at interatomic distance A to the other three atoms. These two clusters will have very different multipole moments, which will be degenerate by construction within any ML model that relies on lists of pair-wise atom-atom distances without additional connectivity information. It is important to note, therefore, that non-unique representations can lead to ML models which do not improve through the addition of more data—defying the original purpose of applying statistical learning to chemistry. Nevertheless, in the case of organic molecules drawn from GDB [78], BOB has shown slightly superior performance for atomization energies, frontier orbital eigenvalues, as well as polarizabilities. The fact that it can even effectively account for many-body energy contributions through non-linear kernels, thereby outperforming any explicit effective pair-wise potential model, has also been highlighted [57]. For extended or periodic systems, however, we expect the likelihood of encountering homometric (or near-homometric) configurations to increase, and therefore the implications of lacking uniqueness to become more important.
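A sketch of the BOB construction: Coulomb-repulsion terms are grouped into "bags" by atom-pair type, sorted within each bag, and zero-padded to dataset-wide bag sizes so that all molecules map to vectors of equal length. The bag ordering and bag sizes below are illustrative choices, not values from Ref. [57].

import numpy as np
from collections import defaultdict

def bag_of_bonds(Z, R, bond_types, bag_sizes):
    # Collect Z_I*Z_J/|R_I - R_J| for every atom pair, keyed by pair type.
    Z, R = np.asarray(Z, float), np.asarray(R, float)
    bags = defaultdict(list)
    for i in range(len(Z)):
        for j in range(i + 1, len(Z)):
            key = tuple(sorted((int(Z[i]), int(Z[j]))))   # e.g. (1, 8) for H-O
            bags[key].append(Z[i] * Z[j] / np.linalg.norm(R[i] - R[j]))
    vec = []
    for key in bond_types:        # fixed bag ordering shared across the dataset
        bag = sorted(bags.get(key, []), reverse=True)
        vec += bag + [0.0] * (bag_sizes[key] - len(bag))  # pad to dataset maximum
    return np.array(vec)

# Water example: H-H and H-O bags, padded to assumed dataset-wide maxima.
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.22], [0.0, 1.43, -0.89], [0.0, -1.43, -0.89]]
print(bag_of_bonds(Z, R, bond_types=[(1, 1), (1, 8)],
                   bag_sizes={(1, 1): 3, (1, 8): 4}))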

DATA

With the growing interest in developing new and efficient ML strategies, all aiming to improve the predictive power in order to eventually reach standard quantum chemistry accuracy and transferability at negligible computational cost, it has become inevitable to employ large molecular training sets. To provide a standard benchmark dataset for such quantitative comparative studies, we have produced one of the largest quantum chemistry libraries available, dubbed QM9, containing structures and properties obtained from quantum chemistry calculations for 133,885 (134 k) small organic molecules containing up to 9 CONF atoms [62]. The 134 k QM9 set is a subset of the GDB-17 database published by Reymond et al. [78], which contains a list of molecular graphs of 166,443,860,262 (166 G) synthetically feasible molecules. The chemical diversity of QM9 is illustrated in Fig. 2. The possibility to generate quantum chemistry Big Data such as the 134 k dataset has also stimulated efforts towards designing efficient formats for the archival and retrieval of large molecular datasets [83, 84].

The starting point for generating the QM9 data published in Ref. [62] was the original set of SMILES descriptors of the smallest 134 k molecules in GDB-17. Initial Cartesian coordinates were obtained using the program CORINA [85]. The semi-empirical quantum chemistry method PM7 [86] and density functional theory [87], using the B3LYP [88] exchange-correlation potential with the basis set 6-31G(2df,p), were subsequently employed to relax the CORINA structures into local energy minima, using the software programs MOPAC [89] and Gaussian09 (G09) [90], respectively. For all molecules, this dataset contains multiple properties computed at the B3LYP/6-31G(2df,p) level, including principal moments of inertia, dipole moment, isotropic polarizability, energies of frontier molecular orbitals, radial expectation value, harmonic vibrational wavenumbers, and the thermochemistry quantities heat capacity (at 298.15 K), internal energy (at 0 K and 298.15 K), enthalpy (at 298.15 K), and free energy (at 298.15 K). The dataset also includes reference energetics for the atoms H, C, N, O, and F to enable the computation of chemically more relevant energetics such as atomization energies and atomization heats (enthalpies). For 6095 (6 k) constitutional isomers with the stoichiometry C7H10O2, thermochemical properties have also been computed at the more accurate G4MP2 level [91]. B3LYP atomization heats deviate from G4MP2 with a mean absolute deviation of ∼5.0 kcal/mol.

For recent ML studies [57, 58, 63–66, 92, 93], the QM9 134 k dataset has served as a standard benchmark providing diverse organic molecules and properties. It has been crucial in order to convincingly demonstrate the transferability of ML models' performance to thousands of out-of-sample molecules that were not part of training. While these ML studies have used the structures and properties from QM9 without modifications, it has sometimes been necessary to extend the dataset by including additional quantities or structures. For the ∆-ML study discussed below, 9868 (10 k) additional diastereomers of the stoichiometry C7H10O2 have been generated for additional transferability tests [64]. For the 22 k molecules constituting the QM8 subset (all molecules with exactly 8 CONF atoms taken out of the QM9 134 k set), low-lying electronic spectra (first two singlet-singlet excitation energies and oscillator strengths) have been computed with the code TURBOMOLE at the levels of linear-response time-dependent DFT (LR-TDDFT) [94] and the linear-response approximate second-order coupled cluster method (CC2) [95] [65]. At both levels, Ref. [65] presented results with the large basis set def2TZVP [96], and the DFT results included predictions based on the exchange-correlation potentials PBE [97], PBE0 [98, 99], and CAM-B3LYP [100]. The CC2 calculations employed the resolution-of-the-identity (RI) approximation for two-electron repulsion integrals.

KERNEL

Conventionally, one of the more repetitive aspects of KRR-based studies for small training sets consists of the optimization of the hyperparameters (kernel width σ and regularization strength, or noise level, λ). This is typically accomplished via time-consuming, 5- or 10-fold, cross-validation (CV) procedures [101]. Albeit easily parallelizable, it is a task which can also conveniently be circumvented using simple heuristics: within various studies training ML models on quantum chemistry results it has been noticed that CV procedures result in very small, if not negligible, λ values. Setting λ directly to zero typically does not deteriorate learning rates. Since λ, also known as the "noise level", is supposed to account for noise in the training data, we do not consider this observation surprising: observables calculated using numerical solutions to given approximations to Schrödinger's equation do not suffer from any noise other than the numerical precision—multiple orders of magnitude smaller than the properties of interest.

It is also straightforward to choose the exponential kernel width: a decent choice is the median of all descriptor distances in the training set [49], i.e. σ = c D_ij^median, where c = O(1). Snyder et al. [102] used this choice for modeling density functionals with a Gaussian kernel. This choice of σ can be problematic when the descriptor distances exhibit a non-unimodal density distribution. In Ref. [63], we interpreted the role of σ as a coordinate scaling factor that renders K well-conditioned. This suggestion is based on the fact that for σ ≈ 0, the off-diagonal elements of the kernel matrix, K_ij^Laplace = exp(−D_ij/σ) or K_ij^Gauss = exp(−D_ij^2/(2σ^2)), vanish, resulting in a unit kernel matrix, K = I. On the other hand, for σ ≫ 1, a kernel matrix of ones is obtained, which would be singular. Instead, one can select a σ value such that the lower bound of the kernel elements is 0.5 [63]. For Laplacian and Gaussian kernels, this implies constraining the smallest kernel matrix element, which corresponds to the two most distant training molecules, to be 1/2, i.e. min(K_ij) = f(max(D_ij), σ) = 1/2, and solving for σ. For the Laplacian kernel, σ_opt^Laplace = max(D_ij)/log(2), and for the Gaussian kernel, σ_opt^Gauss = max(D_ij)/√(2 log(2)). For the B3LYP/6-31G(2df,p) molecular structures in the 134 k set with CHONF stoichiometries, max(D_ij) = 1189.0 a.u. and max(D_ij) = 250.4 a.u. when using the L1 (Manhattan) and L2 (Euclidean) distance metrics for sorted CM descriptors, respectively. As one would expect for this dataset and this descriptor, the maximal distances are found for the molecular pair methane and hexafluoropropane. These distances correspond to σ_opt^Laplace = 1715 a.u. (with the L1 distance metric) and σ_opt^Gauss = 213 a.u. (with the L2 distance metric). These values are of the same order of magnitude as for a molecular dataset similar to the 134 k set, i.e. σ_opt^Laplace ≈ 1000 a.u. employed in [63], and 50 < σ_opt^Gauss < 500 a.u., as also noted in [4].
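A sketch of this heuristic: the kernel width follows directly from the maximal descriptor distance in the training set, and the condition number of the resulting kernel can be compared against the κ ≥ 1 + Nq/(1 − q) bound discussed next. The descriptors are random placeholders.

import numpy as np

def sigma_from_max_distance(X, kernel="laplacian"):
    # Constrain the smallest kernel element, k(max(D_ij)), to 1/2 and solve for sigma.
    if kernel == "laplacian":
        D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)      # L1 distances
        return D, D.max() / np.log(2.0)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # L2 distances
    return D, D.max() / np.sqrt(2.0 * np.log(2.0))

X = np.random.default_rng(1).normal(size=(200, 8))
D, sigma = sigma_from_max_distance(X, "laplacian")
K = np.exp(-D / sigma)
N, q = len(K), K.min()
# Actual condition number versus the lower bound 1 + N*q/(1 - q):
print(np.linalg.cond(K), 1 + N * q / (1 - q))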

The numerical robustness of choosing σ exclusively from the distance distribution can be rationalized using precision arguments. When the lower bound of the elements in K is limited to q, i.e. q ≤ K_ij ≤ 1, the lower bound of the condition number κ of K follows as κ ≥ 1 + Nq/(1 − q), where N is the training set size (number of rows/columns of K). This condition is valid for any dataset. κ is the ratio between the maximal and minimal eigenvalues of K. Typically, when κ = 10^t, one loses t digits of accuracy in the ML model's prediction, even when employing exact direct solvers for regression. When using the single-kernel Ansatz [63] with q = 1/2, κ can be as small as 1 + N, i.e. one can expect at best a numerical precision of ∼log(N + 1) digits. Depending on the desired precision, consideration of κ to dial in the appropriate q can be important.

Using hyperparameters which are derived from a representation that uses exclusively chemical composition and structure, rather than other properties, has multiple advantages. Most importantly, the global kernel matrix becomes property-independent! In other words, a kernel matrix inverted once, K^{-1}, can be used to generate regression coefficients for arbitrarily varying properties, e.g. functions of external fields [59–61]. Furthermore, ML models for arbitrarily different molecular properties, resulting from expectation values of different operators using the same wavefunction, can be generated in parallel using the same inverted kernel. More specifically,

[c_{p1} c_{p2} ... c_{pn}] = K^{-1} [p_1 p_2 ... p_n],   (10)

where p_1, p_2, ..., p_n are n property vectors of the training molecules. The inverse of the property-independent single kernel, once obtained, can also be stored as described in Ref. [63] to enable reuse. Alternatively, the kernel matrix can be factorized once into lower and upper triangles by, say, Cholesky decomposition, K = LU, and the factors can be stored. The regression coefficient vectors can then be calculated through a forward substitution followed by a backward substitution,

L [y_{p1} y_{p2} ... y_{pn}] = [p_1 p_2 ... p_n]
U [c_{p1} c_{p2} ... c_{pn}] = [y_{p1} y_{p2} ... y_{pn}].   (11)
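A sketch of this single-kernel reuse: the kernel is factorized once by Cholesky decomposition, and the stored factors serve all property vectors via forward/backward substitution, Eqs. (10) and (11). The 13 property columns here are random placeholders standing in for the 13 QM9-derived properties of Ref. [63].

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                      # placeholder descriptors
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
K = np.exp(-D / (D.max() / np.log(2.0)))            # property-independent Laplacian kernel

factor = cho_factor(K + 1e-10 * np.eye(len(K)))     # factorize once (K = L L^T) and store
P = rng.normal(size=(200, 13))                      # 13 property vectors, one per column
C = cho_solve(factor, P)                            # all coefficient vectors at once
print(C.shape)                                      # (200, 13)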

The single-kernel idea has been used (with σ = 1000 a.u. and λ = 0) to generate 13 different molecular property models, for all of which out-of-sample prediction errors have been shown to decay systematically with training set size [63]. These learning curves are on display in Fig. 3, and have been obtained using the QM9 dataset of 134 k organic molecules for training and testing [62]. In order to compare errors across properties, relative mean absolute errors (RMAE) with respect to desirable quantum chemistry accuracy thresholds have been reported.

FIG. 2. Distribution of molecular size (in terms of the number of occupied orbitals, i.e. electron pairs, N_ep). The height of each black box denotes the number of constitutional isomers for one out of the 621 stoichiometries present in the 134 k molecules. The two left-hand insets correspond to zoom-ins for smaller compounds. The right-hand inset zooms in on the predominant stoichiometry, C7H10O2, and features a scatter plot of G4MP2 relative (w.r.t. the global minimum) potential energies of atomization E versus the molecular radius of gyration, Rg, as well as joint projected distributions. (Reprinted from Ref. [62]. Copyright (2014), Nature Publishing Group.)

For instance, the target accuracy for all energetic properties is 1 kcal/mol; thresholds for other properties, as well as further technical details, can be found in Ref. [63].

Note the systematic decay in RMAEs for all molecular properties in Fig. 3, independent of the choice of descriptor. We consider this numerical evidence that accurate ML models for arbitrarily many QM properties can be built using an exclusively structure-based single kernel matrix—as long as the latter spans a chemical space which is representative of the query compound. While none of the properties were predicted within the desirable RMAE of 1, not even for the largest training set size N = 40 k, we note that the linear nature of the learning curves suggests that such accuracy thresholds can be achieved by simply increasing the training set size.

Another advantage of the single-kernel Ansatz is that it enables numerical comparison of ML models of different properties, as well as of different representations. Most interestingly, some properties are easier to learn than others. The electronic spread ⟨R²⟩ exhibits a noticeably steeper learning rate than all other properties, starting off with the worst out-of-sample predictions for the 1 k ML models, and yielding the best performance for the 40 k ML models. The magnitude of the molecular dipole moment, µ, was found to exhibit the smallest learning rate, with the CM-based kernel yielding slightly better results than the BOB descriptor. For all other properties, however, the BOB descriptor yields slightly better performance. As already pointed out above, though, BOB only accounts for all pair-wise distances and therefore suffers from a lack of uniqueness, and any non-unique representation can lead to absurd ML results [58]. This shortcoming appears to show up for the BOB-based 40 k ML models of energetic properties: the learning rate levels off, and it is not clear if any further improvement in accuracy can be obtained by merely adding more data.

Conceptually, assigning properties to the regression coefficients and chemical space to the kernel is akin to the well-established distinctions made within the postulates of quantum mechanics, namely between the Hamiltonian encoding structure and composition (and implicitly also the wavefunction) and some observable's operator (not necessarily Hermitian, resulting from the correspondence principle—or not). The former leads to the wavefunction via Ritz's variational principle, which is also the eigenfunction of the operators enabling the calculation of property observables as eigenvalues. As such, Ref. [63] underscores the possibility to establish an inductive alternative to quantum mechanics, given that sufficient observable/compound pairs have been provided. One advantage of this alternative is that it sidesteps the need for many of the explicit manifestations of the postulates of quantum mechanics, such as Schrödinger's equation. The disadvantage is that in order to obtain models of properties, experimental training values would have to be provided for all molecules represented in the kernel.

ELECTRONS

In Ref. [59], ML models of electron transport properties have been introduced.

FIG. 3. Learning curves for single-kernel ML models (with σ = 1000 a.u. and λ = 0) for thirteen molecular properties, including thermochemical (top) as well as electronic (bottom) properties of molecules. For each training set size, a single kernel has been inverted, trained, and tested on thirteen properties. Left and right panels refer to the use of the aforementioned descriptors Coulomb matrix (CM) and bag-of-bonds (BOB), respectively. Out-of-sample relative mean absolute errors (RMAE), i.e. MAE relative to standard quantum chemistry accuracy thresholds, are plotted for test and training sets drawn from the 112 k GDB-17 subset of molecules with exactly 9 CONF atoms [78]. See Ref. [63] for further details of the quantum chemistry accuracy thresholds. See Refs. [57, 63] for a color version.

More specifically, transmission coefficients are modeled for a hexagonal conduction channel consisting of atoms with one orbital. Such systems are representative of carbon nano-ribbons, and might have potential applications as molecular wires. Incoming electrons of variable kinetic energy are injected through two ballistic leads, coupled to each side of the wire without reflection. The wire corresponds in length to four times an armchair graphene nanoribbon unit cell, and in width to fourteen atoms. Depending on the composition and geometry of five randomly distributed impurities in the nano-ribbon, the transmission probability of electrons (being either transmitted or scattered back) can vary. Reference values are obtained for multiple energy values and thousands of compounds using a tight-binding Hamiltonian, the Landauer-Büttiker approach, and advanced/retarded Green's functions [79–81] to calculate the transmission probability [82]. ML models have been trained for three experiments: (i) at fixed geometry, the impurities only differ through their coupling (hopping rates) to adjacent sites (off-diagonal elements γ_ij in the Hamiltonian); (ii) at fixed geometry, impurity elements in the Hamiltonian also differ in on-site energies (ε_i); (iii) as (ii) but with randomized positions. The nano-ribbon impurity representation d is derived from the Hamiltonian and inspired by the CM: it is a matrix containing the on-site energies along the diagonal, and hopping elements or inverse distances as off-diagonal elements. It is sorted by the rows' norm in order to introduce index invariance (enabling the removal of symmetrically redundant systems). The kernel function is a Laplacian, and its width is set to a constant characteristic of the length scale of training instance distances. The regularizer (noise level) λ is set to zero. For all three experiments, training sets with 10,000 (10 k) randomized examples have been calculated. In each case the prediction error of ML models of transmission coefficients is shown to decrease systematically for out-of-sample impurity combinations as the training set size is increased, up to N = 8 k instances. Furthermore, as one would expect, the prediction error increases with the complexity of the function that is being modeled, i.e. in experiment (i) it is smaller than in (ii), for which it is smaller than in (iii).

Another interesting aspect of said study is the modeling of T not only as a function of the impurities, represented in d, but also as a function of an external parameter, namely the kinetic energy of the incoming electron, E.

For each of the three experiments, T was discretized on an E grid, and ML models were obtained for each E value. Since the kernel is independent of E, this implies that the role of learning a property as a function of some external parameter is assumed by the regression coefficients, while the role of varying impurities is assumed by the kernel. More specifically, the effective model was

T^est(d_new, E) = Σ_j c_j(E) exp(−|d_new − d_j|/σ).

The resulting coefficients {c_j(E)} are functions of E for any of the training compounds. Exemplary {c_j(E)} are on display in Fig. 4 for the two nano-ribbon training compounds for which the coefficients vary the most, as well as for the two nano-ribbons for which they vary the least. Overall, the variation of the coefficients is rather smooth, suggesting a "well-tempered" ML model within which the role of each training instance responds reasonably to external tuning of the target function. However, when it comes to energy values in the vicinity of the step-wise variation of the transmission coefficient (energy indices 12 to 15), strong fluctuations are observed for all four parameters. We note that the standard deviation (corresponding to the distributions shown as insets in Fig. (2) in Ref. [59]) also coincides with this movement. In Refs. [60, 61] similar ideas have been applied to the modeling of hybridization functions, relevant to dynamical mean-field theory.
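A sketch of this construction: the kernel over impurity descriptors is built and solved once, while a separate coefficient vector c(E_k) is obtained for each point of the energy grid. The descriptors and transmission values here are synthetic placeholders rather than tight-binding reference data.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 16))                  # impurity descriptors d_j
E_grid = np.linspace(0.0, 2.0, 25)              # kinetic-energy grid
T = np.cos(np.outer(X.sum(axis=1), E_grid))     # fake T(d_j, E_k) reference values

D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
sigma = D.max() / np.log(2.0)
K = np.exp(-D / sigma)                          # E-independent Laplacian kernel, lambda = 0
C = np.linalg.solve(K, T)                       # column k holds the coefficients {c_j(E_k)}

# Out-of-sample prediction of the full T(E) curve for a new impurity pattern:
d_new = rng.normal(size=(1, 16))
k_new = np.exp(-np.abs(d_new - X).sum(axis=1) / sigma)
T_new = k_new @ C                               # T^est(d_new, E_k) for all grid points
print(T_new.shape)                              # (25,)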

∆-MACHINE LEARNING

An efficient strategy to improve the predictive accuracy of ML models for computational chemistry applications is to augment predictions from inexpensive baseline theories with ML models that are trained on the error of the baseline model's prediction with respect to a more accurate targetline prediction. This approach, introduced as "∆-ML" [64], assumes that baseline and targetline predictions have good rank-ordering correlation. A flow chart illustrating the idea is featured in Fig. 5. The choice of baseline model can tremendously benefit from expert knowledge, such as DFT approximations following the hierarchy of "Jacob's ladder" [103], or the fact that the accuracy of correlated many-body methods typically increases with the scaling of their computational cost, e.g. CCSD being more accurate/expensive than MP2, which in turn is more accurate/expensive than HF. Also the first-order approximation to the lowest excitation energy using just the HOMO-LUMO gap from single-reference theories has shown promise. However, we state at the outset that for large-scale predictions, the overhead associated with computing more sophisticated baseline properties can become non-negligible.

Within ∆-ML, a property p_1 corresponding to some targetline level of theory is modeled as the sum of a closely related property p_2 (an observable which does not necessarily result from the same operator as the one used to calculate p_1) corresponding to an inexpensive baseline theory, and the usual ML contribution, i.e. a linear combination of kernel functions,

p_1^target(d_q^target) ≈ p_2^base(d_q^base) + Σ_{t=1}^{N} c_t k(|d_q^base − d_t^base|).   (12)

Note that for predictions of targetline properties of new compounds, molecular structures computed at the baseline level of theory suffice. As such, the ML contribution accounts not only for the change in level of theory (i.e. vertical changes going from baseline to targetline in Fig. 5) or the change in the operator used to calculate the property, but also for all changes due to structural differences between baseline and targetline structures. This approach is likely to be successful for energies when such changes are small. Fortunately, molecular geometries are often already very well accounted for by inexpensive semi-empirical quantum chemistry methods.

The regression problem within the ∆-ML Ansatz corresponds simply to

(K + \lambda I) c = p_1^{\rm target} - p_2^{\rm base} = \Delta p_{\rm base}^{\rm target}.    (13)

Note that for a constant baseline, p_2^{\rm base} = c, ∆p reduces to p_1^{\rm target} (shifted by c), i.e. we recover the ordinary ML problem of directly modeling a targetline property.
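Eqs. (12) and (13) translate into a few lines of kernel ridge regression. A minimal sketch; the Laplacian kernel choice and all names here are illustrative, not the published implementation of Ref. [64]:

```python
import numpy as np

def laplacian_kernel(D1, D2, sigma):
    """k(d, d') = exp(-||d - d'||_1 / sigma)."""
    l1 = np.abs(D1[:, None, :] - D2[None, :, :]).sum(axis=-1)
    return np.exp(-l1 / sigma)

def train_delta(D_base, p_target, p_base, sigma, lam):
    """Eq. (13): solve (K + lam*I) c = p_target - p_base.
    All descriptors D_base derive from baseline geometries."""
    K = laplacian_kernel(D_base, D_base, sigma)
    delta_p = p_target - p_base
    return np.linalg.solve(K + lam * np.eye(len(delta_p)), delta_p)

def predict_delta(D_query, p_base_query, D_base, c, sigma):
    """Eq. (12): baseline property plus the learned ML correction."""
    K_q = laplacian_kernel(D_query, D_base, sigma)
    return p_base_query + K_q @ c
```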

The first ∆-ML study, Ref. [64], reports results for several baseline/targetline combinations and demonstrates the feasibility of modeling various molecular properties, ranging from electronic energies, thermochemistry energetics, and molecular entropies to electron correlation energies. Using a training-set size of 1 k, the direct ML model, trained to reproduce B3LYP heats of atomization, leads to an MAE of 13.5 kcal/mol for 133 k out-of-sample predictions for molecules from QM9. When relying on the widely used semi-empirical quantum chemistry model PM7 as a baseline, the MAE decreases to 7.2 kcal/mol. Combining the two approaches into a 1 k ∆^{B3LYP}_{PM7}-ML model reduces the MAE to 4.8 kcal/mol. When increasing the data available to the ML model to 10 k training examples, the MAE of the resulting 10 k ∆^{B3LYP}_{PM7}-ML model drops to 3 kcal/mol for the remaining 122 k out-of-sample molecules. Such accuracy, combined with the execution speed of the PM7 baseline for geometry relaxation as well as for the calculation of atomization enthalpies, has enabled the screening of atomization enthalpies for 124 k out-of-sample molecules at a level of accuracy typical for B3LYP, within roughly 2 CPU weeks. Note that obtaining the corresponding B3LYP atomization enthalpies brute-force for the exact same molecules consumed roughly 15 CPU years of modern compute hardware.

In Ref. [64] the ∆-ML approach has also been shown to reach chemical accuracy, i.e. 1 kcal/mol for atomization energies. More specifically, the performance of a 1 k ∆^{G4MP2}_{B3LYP}-ML model has been investigated for predicting the stability ordering among 10 k out-of-sample diastereomers generated from all the constitutional isomers of C7H10O2 present in the 134 k dataset.

FIG. 4. Training coefficients in ML models of transmission coefficients plotted as a function of the kinetic energy index of the incoming electron, as published in Ref. [59]. Shown are the coefficients of four (out of 8000) nano-ribbon compounds used for training (NR 1-4). The materials were selected because their coefficients exhibit the strongest (NR 2 and NR 3) or weakest (NR 1 and NR 4) fluctuations in E. The standard deviation (SD) over all 8000 training compounds is shown as well.

FIG. 5. Left: Flow chart for ∆-machine learning. The input to the model typically consists of structures at the baseline level, R(base), and properties are modeled using training data corresponding to properties and geometries consistent with the targetline level, R(target). Right: Baseline and targetline property surfaces; the ∆-ML model (arrow) accounts for both the difference in property and the difference in geometry due to the difference in level of theory.

The 10 isomers closest in energy to the lowest-lying isomer, predicted by ∆^{G4MP2}_{B3LYP}-ML, are shown together with validating G4MP2 and B3LYP predictions in Fig. 6. We note that the ∆^{G4MP2}_{B3LYP}-ML predictions typically agree with G4MP2 to within 1 kcal/mol, a substantial improvement over the B3LYP predictions. Furthermore, the ML correction also rectifies the incorrect B3LYP ordering of the isomers (B3LYP predicts the eighth isomer to be lower in energy than isomers 3-7). This latter point is crucial for studying competing structures such as would occur under reactive conditions in chemical reactions of more complex molecules.

Table I gives an overview of mean absolute out-of-sample prediction errors for 1 k ∆-ML predictions of energetics in various datasets, as well as for various baseline and targetline combinations. There is a general trend: the more sophisticated the baseline, the smaller the prediction error. For predicting G4MP2 atomization heats, the pure density functional PBE baseline results in an MAE below 1 kcal/mol.

[Figure panel: bar chart of reaction enthalpies, 1.0-9.0 kcal/mol, for isomers 1-10 by method (G4MP2, B3LYP, ML); inset molecule labeled H = -1933.5 kcal/mol.]

FIG. 6. Calculated reaction enthalpies at 298.15 K between the most stable molecule with C7H10O2 stoichiometry (inset) and its ten energetically closest isomers in increasing order, according to the targetline method G4MP2 (black). 1 k ∆^{G4MP2}_{B3LYP}-ML model predictions are blue, B3LYP is red. (Reprinted with permission from [64], Copyright (2015), ACS Publishing.)

The hybrid functional B3LYP lowers the error further by 0.2 kcal/mol, but at the cost of the added overhead associated with the evaluation of exact-exchange terms. Such high accuracy might still be worthwhile for reliably forecasting the stability ranking of competing isomerization reaction products, which often requires quantitative accuracy beyond that of DFT.

In a more recent study, the applicability of the ∆-ML idea to the modeling of lowest electronic excitation energies at the CC2 level, using TDDFT baseline values for 22 k GDB-8 molecules, has also been investigated [65]. Results from that study are reported in Table I. Comparing the performance of 1 k models for the lowest two excitation energies, E1 and E2, one notes that the prediction is slightly worse for E2. Both ∆-ML studies [64, 65] found the distribution of errors in the ML predictions to be unimodal, irrespective of the nature of the distribution of the deviation between targetline and baseline. This point is illustrated in Fig. 7 for the modeling of E1 and E2 for the 22 k GDB-8 set. Ref. [65] also showed how systematic shifts in the target property can be included explicitly in the baseline to improve ML prediction accuracy.

ATOMS IN MOLECULES

A number of molecular observables manifest themselves as properties of atoms perturbed by the local chemical environment in molecules. Common examples include NMR chemical shifts and core-level ionization energies. Some interesting atomic properties that are more difficult to measure experimentally include partial charges on atoms, forces on atoms for MD simulations, components of tensor properties such as dipole and quadrupole moments, atomic polarizabilities, etc. Directly modeling atomic forces, as opposed to first modeling the energies over suitably sampled molecular structures with subsequent differentiation, has the advantage that the oscillations between training data points, inherent to any statistical machine learning model, are not amplified through differentiation. Also, the generation of force training data does not imply additional investment in terms of ab initio calculations, because analytic derivatives are available by virtue of the Hellmann-Feynman theorem [104]: The evaluation of atomic forces for all atoms in a molecule typically incurs computational effort similar to calculating the total potential energy of just one molecule, at least within mean-field theories such as DFT. Even at the level of correlated wave-function methods such as CCSD and QCISD(T), the cost of computing analytic derivatives is only 4-5 times higher than that of computing a single-point energy in the case of small molecules [105].

One of the challenges in the modeling of properties of atoms in molecules consists in finding a suitable local chemical descriptor. For vector-valued atomic properties, an additional challenge is the choice of an appropriate reference frame. Once these issues are resolved, the predictive accuracy of ML models trained on sufficiently large data sets is transferable to similar atomic environments, even when placed in larger systems. In Ref. [66], such ML models have been introduced for NMR chemical shifts of carbon and hydrogen (13C σ, 1H σ), for core-level ionization energies of C atoms, and for atomic forces. All atomic properties used for training and testing were calculated at the PBE0/def2TZVP level for many thousands of organic GDB molecules [78]. For the forces, a principal-axis system centered on the query atom was chosen as reference frame, and three independent ML models were trained, one for each component of the atomic force projected into this local coordinate system.

TABLE I. Comparison of mean absolute errors (MAEs, in kcal/mol) for ∆-ML-predicted energetics (E) across datasets and baseline/targetline combinations for a fixed training-set size of 1 k: H is the enthalpy of atomization; E0 is the energy of the electronic ground state; Ej is the jth singlet-singlet transition energy; gap refers to the difference of the HOMO and LUMO energies; shifted refers to constant shifts of -0.31 eV and 0.19 eV that have been added to TDPBE/def2SVP-predicted E1 for molecules with saturated σ- and unsaturated π-chromophores, respectively. A zero baseline corresponds to directly modeling the targetline property. For datasets with N molecules, MAEs in predictions for N − 1 k out-of-sample molecules are reported. The last column indicates the reference from which these numbers have been taken.

dataset        E    baseline                      targetline            MAE [kcal/mol]   Ref.

134 k GDB-9    H    0                             B3LYP/6-31G(2df,p)    13.5             [64]
               H    PM7                           B3LYP/6-31G(2df,p)     4.8             [64]

6 k C7H10O2    H    0                             G4MP2                  5.1             [64]
               H    PM7                           G4MP2                  3.7             [64]
               H    PBE/6-31G(2df,p)              G4MP2                  0.9             [64]
               H    B3LYP/6-31G(2df,p)            G4MP2                  0.7             [64]

6 k C7H10O2    E0   HF/6-31G(d)                   CCSD(T)/6-31G(d)       2.9             [64]
               E0   MP2/6-31G(d)                  CCSD(T)/6-31G(d)       1.0             [64]
               E0   CCSD/6-31G(d)                 CCSD(T)/6-31G(d)       0.4             [64]

22 k GDB-8     E1   0                             CC2/def2TZVP          12.1             [65]
               E1   gap, TDPBE/def2SVP            CC2/def2TZVP           9.2             [65]
               E1   E1, TDPBE/def2SVP             CC2/def2TZVP           3.8             [65]
               E1   E1, TDPBE/def2SVP, shifted    CC2/def2TZVP           3.0             [65]

22 k GDB-8     E2   E2, TDPBE/def2SVP             CC2/def2TZVP           5.5             [65]

FIG. 7. Density distributions of electronic excitation energies (top) and prediction errors (bottom). Top: Densities of first (left) and second (right) singlet-singlet transition energies of 17 k organic molecules with up to eight CONF atoms, at the CC2/def2TZVP targetline and TDPBE0/def2SVP baseline levels of theory. Bottom: Densities of errors in predicting the values of E1 and E2 using ∆^{CC2}_{TDPBE0}-ML models based on 1 k (orange) and 5 k (red) training molecules. (Reprinted with permission from [65], Copyright (2015), AIP Publishing LLC.)

While in many quantum chemistry codes the molecular geometry is oriented along the principal-axis system by default (popularly known as standard orientation), for the purpose of learning atomic forces reference frames have to be found for each atom independently. Cartesian coordinates of the query atom, as well as of its neighboring atoms within a certain cut-off radius sorted by their distances to the query atom, were also oriented according to the same reference frame. A localized atomic variant of the CM was used, with the first row/column corresponding to the query atom and with the indices of all other atoms sorted by their distance to the query atom. Furthermore, a cut-off radius was optimized for each atomic property; atoms that lie beyond this cut-off do not feature in the atomic CM. Non-equilibrium geometries, necessary to compute non-zero training and testing forces, were sampled via stochastic displacements along normal modes. For computing NMR chemical shifts, tetramethylsilane (TMS) was chosen as the reference compound, and shifts in C (1s) core-level energies were referenced to methane.
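A minimal sketch of such a localized atomic CM follows; the cut-off handling and the standard CM matrix elements (Z_i Z_j / r_ij off-diagonal, 0.5 Z_i^2.4 on the diagonal) are shown for illustration, with naming conventions of our own:

```python
import numpy as np

def atomic_coulomb_matrix(Z, R, query, r_cut):
    """Z: (n,) nuclear charges; R: (n, 3) Cartesian coordinates.
    Builds the CM over atoms within r_cut of atom `query`, with the
    query atom in the first row/column and all other atoms sorted
    by their distance to it."""
    d = np.linalg.norm(R - R[query], axis=1)
    order = np.argsort(d)              # query atom comes first, since d[query] = 0
    order = order[d[order] < r_cut]    # drop atoms beyond the cut-off radius
    Zs, Rs = Z[order], R[order]
    n = len(order)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Zs[i] ** 2.4   # diagonal as in the molecular CM
            else:
                M[i, j] = Zs[i] * Zs[j] / np.linalg.norm(Rs[i] - Rs[j])
    return M[np.triu_indices(n)]
```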

All aforementioned local atomic properties were modeled using a Laplacian kernel with the Manhattan distance. When trained on 9 k randomly sampled atoms present in 9 k random molecules from the 134 k QM9 set, the resulting models were validated with out-of-sample predictions of the same properties for 1 k atoms drawn at random from the remaining 125 k molecules not part of training. The accuracy achieved is on the order of the accuracy of the reference DFT method itself. Specifically, a root mean square error (RMSE) of 4 units was obtained for the properties of C atoms (13C σ (ppm), C 1s σ (ppm), FC (m.a.u.)), while for the force on H atoms, FH, an RMSE of 0.9 was obtained. For 1H σ, with a property range of only 10 ppm, an RMSE of 0.1 ppm has been reported. Very similar errors were noted when training on short polymers made from GDB-9 molecules and testing on longer saturated polymers with lengths of up to 36 nm. The applicability of the local ML model across CCS was demonstrated by accurate predictions of 13C σ values for all 847 k C atoms present in the 134 k molecular set.

Refs. [106-108] also describe ML models of atomic forces which can be used to speed up geometry relaxations or molecular dynamics calculations. Note that these forces are always modeled within the same chemical system. In other words, they can neither be trained nor applied universally throughout chemical space, and they inherently lack the universality of the ML models reviewed herein. In Ref. [67], atomic ML models are introduced which can be used as parameters in universal FFs, i.e. throughout chemical space. More specifically, transferable ML models of atomic charges, dipole moments, and multipole moments were trained and used as dynamically and chemically adaptive parameters in otherwise fixed parametrized FF calculations. The intermolecular interactions accounted for through the use of such dynamic electrostatics, effectively including induced moments (or polarizable charges), in combination with simple estimates of many-body dispersion [109], resulted in much improved predictions of properties of new molecules. In Ref. [110], an atomic parameter used within semi-empirical quantum chemistry has been improved through the use of ML models trained in chemical space.

The inherent capability of atomic property models to scale to larger molecules provides another argument for why building an ML model of atomic forces is advantageous over building an ML model of energies with subsequent differentiation: the assumptions about how to partition the system are more severe in the latter case. As such, while an atomic ML model will not be able to extrapolate to entirely new chemical environments that were not part of its training set, it is well capable of being used for the prediction of an arbitrarily large diversity of macro- and supra-molecular structures, such as polymers or complex crystal phases, as long as the local chemical environments are similar. This point has been investigated and confirmed in Ref. [66]: for up to 35 nm (over 2 k electrons), the error is shown to be independent of system size. The computational cost remains negligible for such systems, while conventional DFT-based calculations of these properties require multiple CPU days.

CRYSTALS

Various attempts have already been made to apply ML to the prediction of crystal properties. In Ref. [43], a radial distribution function based descriptor is proposed that leads to ML models which interpolate between pre-defined atom pairs. Ref. [44] exploits chem-informatics based concepts for the design of crystalline descriptors, which require knowledge of previously calculated QM properties, such as the band gap. Ghiringhelli et al. report on the design of new descriptors for binary crystals in Ref. [111]. Through genetic programming they solve the inverse question of which atomic properties should be combined, and how, in order to define a good descriptor. A periodic adaptation of the CM, as well as other variants, while exhibiting learning rates that can systematically be improved upon, did lead to reasonable ML models for large training sets [55]. Very recently, however, a superior descriptor for ML models of properties of pristine crystals has been proposed [68]. This descriptor encodes the atomic population on each Wyckoff-symmetry site for a given crystal structure. As such, the resulting ML model is restricted to all those crystals which are part of the same symmetry group. All chemistries within that group, however, can be accounted for. At this point one should note that already for binary crystals the number of possible chemistries dramatically exceeds the number of possible crystal symmetries, not mentioning ternaries or solid mixtures with more components. The applicability of the crystal ML model has been probed by screening DFT formation energies of all the 2 M Elpasolite (ABC2D6) crystals one can possibly think of using all main-group elements up to Bi. Using an ML model trained on only 10 k DFT energies, the estimated mean absolute deviation of the out-of-sample ML predictions from DFT formation energies is less than 0.1 eV/atom. When compared to experiment or quantum Monte Carlo results, such a level of predictive accuracy corresponds roughly to what one would also expect for solids from generalized gradient approximated DFT estimates.
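For a fixed symmetry group such as the elpasolites, the descriptor reduces to "which element sits on which Wyckoff site". A hedged sketch follows, where encoding each site by the element's period and group is our illustrative assumption rather than the published encoding of Ref. [68]:

```python
import numpy as np

# Illustrative per-element features: (period, group) in the periodic table.
FEATURES = {"K": (4, 1), "Na": (3, 1), "Al": (3, 13), "F": (2, 17)}

def elpasolite_descriptor(site_elements):
    """site_elements: elements occupying the four inequivalent Wyckoff
    sites of an ABC2D6 elpasolite. Because every crystal in the class
    shares the same symmetry group, sites map one-to-one across
    compounds and the descriptor has a fixed length."""
    return np.array([f for el in site_elements for f in FEATURES[el]],
                    dtype=float)

d = elpasolite_descriptor(("Al", "Na", "K", "F"))  # e.g. the K2NaAlF6 prototype
```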

CONCLUSIONS AND OUTLOOK

We have reviewed recent contributions which combine statistical learning with quantum mechanical concepts for the automatic generation of ML models of materials and molecular properties. All studies reviewed have in common that, after training on a given data set, ML models are obtained that are transferable in the sense that they can infer solutions for arbitrarily many out-of-sample instances (CCS is unfathomably large as molecular size increases) with an accuracy similar to that of the deductive reference method used to generate the training set, at a very small, if not negligible, computational overhead. Note that the genesis of the training data is irrelevant for this Ansatz: while all studies rely on pre-calculated results obtained from legacy quantum chemistry approximations, which typically require substantial CPU-time investments, training data could just as well have come from experimental measurements. Depending on the reliability and uncertainty of the experimental results, however, we would expect the noise parameter λ (see Eq. 2) to become significant in such a set-up. Training data obtained from computational simulation do not suffer from such uncertainties (apart from the noise possibly incurred due to lack of numerical precision), resulting in practically noise-free data sets.

Comparing performance for molecules, nano-ribbon models, atoms, or crystals, one typically observes satisfying semi-quantitative accuracy for ML models trained on at least one thousand training instances. Accuracies competitive with current state-of-the-art models, which can reach or exceed experimental accuracy, are typically found only after training on many thousands of examples. The main appeal of the ML approach reviewed is its simplicity and computational efficiency: after training, the evaluation of the ML model for a new query instance according to Eq. (1) typically consumes only fractions of a CPU second, while solving the SE in order to evaluate a quantum mechanical observable easily requires multiple CPU hours. As such, this approach represents a promising complementary assistance when it comes to brute-force, high-throughput, quantum-mechanics-based screening of new molecules and materials.

In summary, the reviewed studies, as well as previously published Refs. [46, 56, 70, 112], report on successful applications of supervised learning concepts to the automatic generation of analytic materials and molecular property models using large numbers of parameters and flexible basis functions. Quantum properties considered so far include

- atomization energies and thermochemical properties such as atomization enthalpies and entropies, with chemical accuracy [64];

- molecular electronic properties such as electron correlation energies, frontier orbital eigenvalues (HOMO and LUMO), electronic spread, dipole moments, polarizabilities, and excitation energies [57, 63-65];

- atomic properties such as charges, dipole moments, quadrupole moments, nuclear chemical shifts, core-electron excitations, and atomic forces [66, 67];

- formation energies of crystals [55, 68];

- electron transport as well as hybridization functions [59-61].

Performance is always measured out-of-sample, i.e., for "unseen" molecules not part of training, and at negligible computational cost. As such, we note that there is overwhelming numerical evidence suggesting that such ML approaches can serve as surrogate models for predicting important quantum mechanical properties with sufficient accuracy, yet unprecedented computational efficiency, provided sufficient training data is available. For all studies, data requisition relied on well-established and standardized quantum chemical atomistic simulation protocols on modern supercomputers. This has also been exemplified by the generation of a consistent set of quantum mechanical structures and properties of 134 k organic molecules [62] drawn from the GDB-17 dataset [78]. The more conceptual role of representations and kernels, studied in Refs. [57, 58, 63], has been reviewed as well.

The limitations of ML models lie in their inherent lack of transferability to new regimes. Inaccurate out-of-sample predictions for properties of molecules or materials which differ substantially from the training set should therefore be expected. Fortunately, however, most relevant properties and chemistries are quite limited in terms of stoichiometries (at most ∼100 different atom types) and geometries (typically, for semi-conductors, interatomic distances larger than 15 Å will hardly contribute to partitioned properties such as atomic charges or interatomic bonding, while interatomic distances smaller than 0.5 Å are exotic). Furthermore, "extrapolation" is not a verdict but can sometimes be circumvented. As discussed above, due to the localized nature of some observables, atomic ML models can be predictive even if the atom is part of a substantially larger molecule. It has been demonstrated that these property models are transferable to larger molecules containing similar local chemical environments, and that they scale with system size, as long as the underlying property is sufficiently well localized.

Future work will show if ML models can also be constructed for other challenging many-body problems, including vibrational frequencies, reaction barriers and rate constants, hyperpolarizabilities, solvation, free energy surfaces, electronic spectra, Rydberg states, etc. Open methodological questions include optimal representations and kernel functions, as well as the selection bias present in most (if not all) databases.

The writing of this review was funded by the Swiss National Science Foundation (No. PP00P2 138932).

* [email protected]
† [email protected]

[1] P. Kirkpatrick and C. Ellis, Nature 432, 823 (2004). Chemical space.

[2] C. Lipinski and A. Hopkins, Nature 432, 855 (2004). Navigating chemical space for biology and medicine.

[3] M. Dobson, Nature 432, 824 (2004). Chemical space and biology.

[4] O. A. von Lilienfeld, Int. J. Quantum Chem. 113, 1676 (2013). First principles view on chemical compound space: Gaining rigorous atomistic control of molecular properties.

[5] K. Burke, "Any ab initio method must either be void of empirical parameters, or at least have parameters that do not depend on the system being studied." Oral communication, IPAM, UCLA (2011).

[6] T. Helgaker, P. Jørgensen, and J. Olsen. Molecular Electronic-Structure Theory (John Wiley & Sons, 2000).

[7] M. E. Tuckerman. Statistical Mechanics: Theory and Molecular Simulation (Oxford University Press, 2010).

[8] A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard III, and W. M. Skiff, J. Am. Chem. Soc. 114, 10024 (1992). UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations.

[9] A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard III, J. Phys. Chem. A 105, 9396 (2001). ReaxFF: A reactive force field for hydrocarbons.

[10] S. Grimme, J. Chem. Theory Comput. 10, 4497 (2014). A general quantum mechanically derived force field (QMDFF) for molecules and condensed phase simulations.

[11] P. L. A. Popelier, Int. J. Quantum Chem. 115, 1005 (2015). QCTFF: On the construction of a novel protein force field.

[12] P. Pyykkö, Phys. Chem. Chem. Phys. 13, 161 (2011). A suggested periodic table up to Z ≤ 172, based on Dirac-Fock calculations on atoms and ions.

[13] R. Hoffmann, P. Schleyer, and H. Schaefer, Angew. Chem. Int. Ed. 47, 7164 (2008). Predicting molecules - more realism, please!

[14] E. Prodan and W. Kohn, Proc. Natl. Acad. Sci. USA 102, 11635 (2005). Nearsightedness of electronic matter.

[15] F. Jensen. Introduction to Computational Chemistry (John Wiley, West Sussex, England, 2007).

[16] H. Kubinyi. 3D QSAR in Drug Design. Theory, Methods and Applications (ESCOM, Science Publishers B.V., Leiden, 1993).

[17] J.-L. Faulon, D. P. Visco, Jr., and R. S. Pophale, J. Chem. Inf. Comp. Sci. 43, 707 (2003). The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies.

[18] A. Golbraikh, M. Shen, Y. D. Xiao, K. H. Lee, and A. Tropsha, J. Comput. Aided Mol. Des. 17, 241 (2003). Rational selection of training and test sets for the development of validated QSAR models.

[19] O. Ivanciuc, J. Chem. Inf. Comp. Sci. 40, 1412 (2000). QSAR comparative study of Wiener descriptors for weighted molecular graphs.

[20] P. Baldi, K.-R. Müller, and G. Schneider, Molecular Informatics 30, 751 (2011). Editorial: Charting chemical space: Challenges and opportunities for artificial intelligence and machine learning.

[21] M. N. Drwal and R. Griffith, Drug Discovery Today: Technologies 10, e395 (2013). Combination of ligand- and structure-based methods in virtual screening.

[22] G. Schneider, Nature Reviews 9, 273 (2010). Virtual screening: An endless staircase?

[23] A. F. Wagner, G. C. Schatz, and J. M. Bowman, J. Chem. Phys. 74, 4960 (1981). The evaluation of fitting functions for the representation of an O(3P)+H2 potential energy surface. I.

[24] G. C. Schatz, Rev. Mod. Phys. 61, 669 (1989). The analytical representation of electronic potential-energy surfaces.

[25] B. G. Sumpter and D. W. Noid, Chem. Phys. Lett. 192, 455 (1992). Potential energy surfaces for macromolecules. A neural network technique.

[26] S. Lorenz, A. Gross, and M. Scheffler, Chem. Phys. Lett. 395, 210 (2004). Representing high-dimensional potential-energy surfaces for reactions at surfaces by neural networks.

[27] S. Manzhos and T. Carrington, Jr., J. Chem. Phys. 125, 084109 (2006). A random-sampling high dimensional model representation neural network for building potential energy surfaces.

[28] J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007). Generalized neural-network representation of high-dimensional potential-energy surfaces.

[29] S. A. Ghasemi, A. Hofstetter, S. Saha, and S. Goedecker, Phys. Rev. B 92, 045131 (2015). Interatomic potentials for ionic systems with density functional accuracy based on charge densities obtained by a neural network.

[30] T. Ho and H. Rabitz, J. Chem. Phys. 104, 2584 (1996). A general method for constructing multidimensional molecular potential energy surfaces from ab initio calculations.

[31] A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, Phys. Rev. Lett. 104, 136403 (2010). Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons.

[32] J. Behler, Int. J. Quantum Chem. 115, 1032 (2015). Constructing high-dimensional neural network potentials: A tutorial review.

[33] S. Manzhos, R. Dawes, and T. Carrington, Int. J. Quantum Chem. 115, 1012 (2015). Neural network-based approaches for building high dimensional and quantum dynamics-friendly potential energy surfaces.

[34] V. Botu and R. Ramprasad, Int. J. Quantum Chem. 115, 1074 (2015). Adaptive machine learning framework to accelerate ab initio molecular dynamics.

[35] L. Hu, X. Wang, L. Wong, and G. Chen, J. Chem. Phys. 119, 11501 (2003). Combined first-principles calculation and neural-network correction approach for heat of formation.

[36] X. Zheng, L. Hu, X. Wang, and G. Chen, Chem. Phys. Lett. 390, 186 (2004). A generalized exchange-correlation functional: The neural-networks approach.

[37] M. Karthikeyan, R. C. Glen, and A. Bender, J. Chem. Inf. Model. 45, 581 (2005). General melting point prediction based on a diverse compound data set and artificial neural networks.

[38] R. M. Balabin and E. I. Lomakina, Phys. Chem. Chem. Phys. 13, 11710 (2011). Support vector machine regression (LS-SVM) - an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data.

[39] M. Misra, D. Andrienko, B. Baumeier, J.-L. Faulon, and O. A. von Lilienfeld, J. Chem. Theory Comput. 7, 2549 (2011). Toward quantitative structure-property relationships for charge transfer rates of polycyclic aromatic hydrocarbons.

[40] M. A. Kayala and P. Baldi, J. Chem. Inf. Model. 52, 2526 (2012). ReactionPredictor: Prediction of complex chemical reactions at the mechanistic level using machine learning.

[41] A. Lusci, G. Pollastri, and P. Baldi, J. Chem. Inf. Model. 53, 1563 (2013). Deep architectures and deep learning in chemoinformatics: The prediction of aqueous solubility for drug-like molecules.

[42] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, Scientific Reports 3 (2013). Accelerating materials property predictions using machine learning.

[43] K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, Phys. Rev. B 89, 205118 (2014). How to represent crystal structures for machine learning: Towards fast prediction of electronic properties.

[44] B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. W. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, Phys. Rev. B 89, 094104 (2014). Combinatorial screening for new materials in unconstrained composition space with machine learning.

[45] E. O. Pyzer-Knapp, K. Li, and A. Aspuru-Guzik, Advanced Functional Materials (2015). Learning from the Harvard Clean Energy Project: The use of neural networks to accelerate materials discovery.

[46] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012). Fast and accurate modeling of molecular atomization energies with machine learning.

[47] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, IEEE Transactions on Neural Networks 12, 181 (2001). An introduction to kernel-based learning algorithms.

[48] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics (Springer, New York, N.Y., 2001).

[49] B. Schölkopf and A. J. Smola. Learning with Kernels (MIT Press, Cambridge, 2002).

[50] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning, www.GaussianProcess.org (MIT Press, Cambridge, 2006).

[51] L. P. Hammett, Chem. Rev. 17, 125 (1935). Some relations between reaction rates and equilibrium constants.

[52] L. P. Hammett, J. Am. Chem. Soc. 59, 96 (1937). The effect of structure upon the reactions of organic compounds. Benzene derivatives.

[53] D. G. Pettifor, J. Phys. C: Solid State Phys. 19, 285 (1986). The structures of binary compounds: I. Phenomenological structure maps.

[54] J. K. Nørskov, T. Bligaard, J. Rossmeisl, and C. H. Christensen, Nature Chemistry 1, 37 (2009). Towards the computational design of solid catalysts.

[55] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Int. J. Quantum Chem. 115, 1094 (2015). Crystal structure representations for machine learning models of formation energies.

[56] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, New Journal of Physics 15, 095003 (2013). Machine learning of molecular electronic properties in chemical compound space.

[57] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015). Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space.

[58] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, Int. J. Quantum Chem. 115, 1084 (2015). Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties.

[59] A. Lopez-Bezanilla and O. A. von Lilienfeld, Phys. Rev. B 89, 235411 (2014). Modeling electronic quantum transport with machine learning.

[60] L.-F. Arsenault, A. Lopez-Bezanilla, O. A. von Lilienfeld, and A. J. Millis, Phys. Rev. B 90, 155136 (2014). Machine learning for many-body physics: The case of the Anderson impurity model.

[61] L.-F. Arsenault, O. A. von Lilienfeld, and A. J. Millis, Phys. Rev. Lett. (2015). Machine learning for many-body physics: Efficient solution of dynamical mean-field theory.

[62] R. Ramakrishnan, P. Dral, M. Rupp, and O. A. von Lilienfeld, Scientific Data 1, 140022 (2014). Quantum chemistry structures and properties of 134 kilo molecules.

[63] R. Ramakrishnan and O. A. von Lilienfeld, CHIMIA 69, 182 (2015). Many molecular properties from one kernel in chemical space.

[64] R. Ramakrishnan, P. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 2087 (2015). Big data meets quantum chemistry approximations: The ∆-machine learning approach.

[65] R. Ramakrishnan, M. Hartmann, E. Tapavicza, and O. A. von Lilienfeld, J. Chem. Phys. 143, 084111 (2015). Electronic spectra from TDDFT and machine learning in chemical space.

[66] M. Rupp, R. Ramakrishnan, and O. A. von Lilienfeld, J. Phys. Chem. Lett. 6, 3309 (2015). Machine learning for quantum mechanical properties of atoms in molecules.

[67] T. Bereau, D. Andrienko, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 3225 (2015). Transferable atomic multipole machine learning models for small organic molecules.

[68] F. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett. (2015). Machine learning energies for 2 M Elpasolite (ABC2D6) crystals.

[69] R. Ramakrishnan, G. Montavon, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Quantum-Machine-v2015.01: A parallel framework for training molecular properties in chemical space with quantum chemical accuracy and transferability.

[70] K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, and K.-R. Müller, J. Chem. Theory Comput. 9, 3404 (2013). Assessment and validation of machine learning methods for predicting molecular atomization energies.

[71] A. N. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems (V. H. Winston, 1977).

[72] P. C. Hansen. Discrete Inverse Problems: Insight and Algorithms, Vol. 7 (SIAM, 2010).

[73] R. G. Parr and W. Yang. Density Functional Theory of Atoms and Molecules (Oxford Science Publications, 1989).

[74] R. Todeschini and V. Consonni. Handbook of Molecular Descriptors (Wiley-VCH, Weinheim, 2009).

[75] J. E. Moussa, Phys. Rev. Lett. 109, 059801 (2012). Comment on "Fast and accurate modeling of molecular energies with machine learning".

[76] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 109, 059802 (2012). Reply to comment on "Fast and accurate modeling of molecular energies with machine learning".

[77] M. Hirn, N. Poilvert, and S. Mallat (2015). Quantum energy regression using scattering transforms.

[78] L. Ruddigkeit, R. van Deursen, L. Blum, and J.-L. Reymond, J. Chem. Inf. Model. 52, 2684 (2012). Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.

[79] R. Landauer, IBM Journal of Research and Development 1, 223 (1957). Spatial variation of currents and fields due to localized scatterers in metallic conduction.

[80] R. Landauer, Philosophical Magazine 21, 863 (1970). Electrical resistance of disordered one-dimensional lattices.

[81] M. Büttiker, Phys. Rev. Lett. 57, 1761 (1986). Four-terminal phase-coherent conductance.

[82] D. S. Fisher and P. A. Lee, Phys. Rev. B 23, 6851 (1981). Relation between conductivity and transmission matrix.

[83] M. Alvarez-Moreno, C. de Graaf, N. Lopez, F. Maseras, J. M. Poblet, and C. Bo, J. Chem. Inf. Model. 55, 95 (2014). Managing the computational chemistry big data problem: The ioChem-BD platform.

[84] M. J. Harvey, N. J. Mason, A. McLean, P. Murray-Rust, H. S. Rzepa, and J. J. Stewart, J. Cheminform. 7, 1 (2015). Standards-based curation of a decade-old digital repository dataset of molecular information.

[85] J. Sadowski, J. Gasteiger, and G. Klebe, J. Chem. Inf. Model. 34, 1000 (1994). Comparison of automatic three-dimensional model builders using 639 X-ray structures.

[86] J. J. P. Stewart, J. Mol. Modeling 19, 1 (2013). Optimization of parameters for semiempirical methods VI: More modifications to the NDDO approximations and re-optimization of parameters.

[87] W. Kohn and L. J. Sham, Phys. Rev. 140, A1133 (1965). Self-consistent equations including exchange and correlation effects.

[88] P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch, J. Phys. Chem. 98, 11623 (1994).

[89] MOPAC2009, James J. P. Stewart, Stewart Computational Chemistry, Colorado Springs, CO, USA, http://openmopac.net (2008).

[90] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, et al. Gaussian 09 Revision D.01.

[91] L. A. Curtiss, P. C. Redfern, and K. Raghavachari, J. Chem. Phys. 127, 124105 (2007). Gaussian-4 theory using reduced order perturbation theory.

[92] T. D. Huan, A. Mannodi-Kanakkithodi, and R. Ramprasad, Phys. Rev. B 92, 014106 (2015). Accelerated materials property predictions and design using motif-based fingerprints.

[93] P. O. Dral, O. A. von Lilienfeld, and W. Thiel, J. Chem. Theory Comput. 11, 2120 (2015). Machine learning of parameters for accurate semiempirical quantum chemical calculations.

[94] F. Furche and R. Ahlrichs, J. Chem. Phys. 117, 7433 (2002). Adiabatic time-dependent density functional methods for excited state properties.

[95] C. Hättig and F. Weigend, J. Chem. Phys. 113, 5154 (2000). CC2 excitation energy calculations on large molecules using the resolution of the identity approximation.

[96] F. Weigend and R. Ahlrichs, Phys. Chem. Chem. Phys. 7, 3297 (2005). Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy.

[97] J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996). Generalized gradient approximation made simple.

[98] J. P. Perdew, M. Ernzerhof, and K. Burke, J. Chem. Phys. 105, 9982 (1996). Rationale for mixing exact exchange with density functional approximations.

[99] C. Adamo and V. Barone, J. Chem. Phys. 110, 6158 (1999). Toward reliable density functional methods without adjustable parameters: The PBE0 model.

[100] T. Yanai, D. P. Tew, and N. C. Handy, Chem. Phys. Lett. 393, 51 (2004). A new hybrid exchange-correlation functional using the Coulomb-attenuating method (CAM-B3LYP).

[101] M. Rupp, Int. J. Quantum Chem. 115, 1058 (2015). Machine learning for quantum mechanics in a nutshell.

[102] J. C. Snyder, M. Rupp, K. Hansen, L. Blooston, K.-R. Müller, and K. Burke, J. Chem. Phys. 139, 224104 (2013). Orbital-free bond breaking via machine learning.

[103] J. P. Perdew and K. Schmidt, in AIP Conference Proceedings (IOP Institute of Physics Publishing Ltd., 2001), pp. 1-20.

[104] R. P. Feynman, Phys. Rev. 56, 340 (1939). Forces in molecules.

[105] R. Ramakrishnan and G. Rauhut, J. Chem. Phys. 142, 154118 (2015). Semi-quartic force fields retrieved from multi-mode expansions: Accuracy, scaling behavior and approximations.

[106] Z. Li, J. R. Kermode, and A. De Vita, Phys. Rev. Lett. 114, 096405 (2015). Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces.

[107] M. Caccin, Z. Li, J. R. Kermode, and A. De Vita, Int. J. Quantum Chem. 115, 1129 (2015). A framework for machine-learning-augmented multiscale atomistic simulations on parallel supercomputers.

[108] V. Botu and R. Ramprasad, Phys. Rev. B 92, 094306 (2015). Learning scheme to predict atomic forces and accelerate materials simulations.

[109] T. Bereau and O. A. von Lilienfeld, J. Chem. Phys. 141, 034101 (2014). Toward transferable interatomic van der Waals interactions without electrons: The role of multipole electrostatics and many-body dispersion.

[110] P. O. Dral, O. A. von Lilienfeld, and W. Thiel, J. Chem. Theory Comput. 11, 2120 (2015). Machine learning of parameters for accurate semiempirical quantum chemical calculations.

[111] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Phys. Rev. Lett. 114, 105503 (2015). Big data of materials science: Critical role of the descriptor.

[112] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, O. A. von Lilienfeld, and K.-R. Müller, NIPS Proceedings 25, 449 (2012). Learning invariant representations of molecules for atomization energy prediction.