Computer-aided molecular design using Tabu search

Computers and Chemical Engineering 29 (2005) 337–347

B. Lin a,1, S. Chavali b, K. Camarda b, D.C. Miller a,*

a Department of Chemical Engineering, Rose-Hulman Institute of Technology, 5500 Wabash Avenue, Terre Haute, IN 47803, USA
b Department of Chemical and Petroleum Engineering, University of Kansas, 1530 W 15th, 4132 Learned Hall, Lawrence, KS 66045, USA

* Corresponding author. Tel.: +1 812 877 8506; fax: +1 812 877 8992. E-mail addresses: [email protected] (B. Lin), [email protected] (D.C. Miller).
1 Present address: CAPEC, Department of Chemical Engineering, Technical University of Denmark, Lyngby, DK-2800, Denmark.

Received 11 December 2003; received in revised form 25 October 2004; accepted 26 October 2004

0098-1354/$ – see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compchemeng.2004.10.008

Abstract

A detailed implementation of the Tabu search (TS) algorithm for computer-aided molecular design (CAMD) of transition metal catalysts is presented in this paper. Previous CAMD research has applied deterministic methods or genetic algorithms to the solution of the optimization problems which arise from the search for a molecule satisfying a set of property targets. In this work, properties are estimated using correlations based on connectivity indices, which allows the TS algorithm to use several novel operators to generate neighbors, such as swap and move, which would have no effect with a traditional group contribution-based approach. In addition, the formulation of the neighbor generation process guarantees that molecular valency and connectivity constraints are met, resulting in a complete molecular structure. Results on two case studies using TS are compared with a deterministic approach and show that TS is able to provide a list of good candidate molecules while using a much smaller amount of computation time.

Keywords: Tabu search; Computer-aided molecular design; Computation time; Transition metal catalysts

1. Introduction

Computer-aided molecular design (CAMD) has the potential to greatly decrease the time and effort required to develop new molecular entities by reducing the need for costly and time-consuming trial-and-error experiments. Instead, a slate of candidate molecules which are predicted to have the desired properties can be used as a starting point for more focused experimental synthesis. This general method is applicable to such varied products as catalysts, polymers, solvents, and detergents. Hairston (1998) recently reported that a computational algorithm has been successfully used to design a new cancer-fighting pharmaceutical.

The CAMD methodology consists of solving both a forward and a backward problem (Venkatasubramanian, Chan, & Caruthers, 1994). The forward problem predicts properties based on molecule structure; the backward step identifies a structure to obtain a molecule with a given set of target properties. Property prediction is usually based on either group contribution methods or topological indices. The group contribution approach has been most widely reported (Gani, Nielsen, & Fredenslund, 1991; Harper & Gani, 2000; Harper, Hostrup, & Gani, 2003; Sahinidis & Tawarmalani, 2000; Sahinidis, Tawarmalani, & Yu, 2003; Vaidyanathan & El-Halwagi, 1996; Venkatasubramanian et al., 1994). Constantinou and Gani (1994) and Constantinou, Gani, and O'Connell (1995) described a two-level group contribution method which utilizes molecular structure information to estimate the physical and thermodynamic properties of pure components. A three-level group contribution method proposed by Marrero and Gani (2001) exhibits improved accuracy and applicability to deal with bio-chemically and environmentally-related compounds. Because most group contribution methods cannot adequately account for steric effects (Wang & Milne, 1994), several researchers have begun using topological




Nomenclature

$f_{i,j,k}$: a partitioned binary adjacency matrix showing when basic groups i and j are bonded with a kth-multiplicity bond
$x'$: best new neighbor solution
$x^*$: the best solution ever found
$z_i$: an existence vector showing whether the ith group exists in the molecule
Elec: sum of the electronegativities of all the groups
Elec(i): electronegativity of group i in electronvolts (eV)
LC50: lethal concentration killing 50% of the test population, a measure of toxicity
$N_{Mo}$: the number of rows allotted to all the molybdenum groups
$N_{Tot}$: total number of basic groups
$N_{Bond}$: total number of types of bonds of the basic groups
$N_{max,i}$: maximum number of basic groups allowed in the molecule
Oxid: sum of the oxidation states of all the Mo groups
Oxid(i): the oxidation state of the ith group
$P_m$: the estimated value of property m
$P_m^{scale}$: a scale factor used to weight the importance of one property relative to another
$P_m^{target}$: the target value for property m
$R$: set of all targeted properties

Greek letters

$^0\chi$: zero-order molecular connectivity indices
$^0\chi^v$: zero-order molecular valence connectivity indices
$^1\chi$: first-order molecular connectivity indices
$^1\chi^v$: first-order molecular valence connectivity indices
$^2\chi$: second-order molecular connectivity indices
$^2\chi^v$: second-order molecular valence connectivity indices
$\delta$: simple atomic connectivity indices that refer to the number of bonds which can be formed with other groups
$\delta^v$: atomic valence connectivity indices that describe the electronic structure of each basic group

indices in an effort to obtain more accurate property prediction (Camarda & Maranas, 1999; Kier & Hall, 1976; Raman & Maranas, 1998; Siddhaye, Camarda, Topp, & Southard, 2000). Whichever approach is chosen, existing structure-property information is regressed to form an empirical model relating the molecule structure to the properties of interest. The inverse problem is essentially a mixed integer optimization problem, whether it is solved explicitly and deterministically as in Maranas (1996), stochastically via a genetic algorithm-based approach as in Venkatasubramanian et al. (1994), or via a generate and test approach such as those of Friedler, Fan, Kalotai, and Dallos (1998); Gordeeva, Molchanova, and Zefirov (1990); Harper and Gani (2000) and Harper et al. (2003).

Many researchers have reported solutions to the backward problem to determine the molecular structure. Combinatorial and heuristic-based enumeration approaches have been reported (Gani & Brignole, 1983; Joback & Stephanopoulos, 1989). Kier, Lowell, and Frazer (1993) used a graph reconstruction approach to determine feasible molecular structures with bounded physical property values, while Vaidyanathan and El-Halwagi (1996) described an interval analysis approach for the computer-aided synthesis of polymers and blends. CAMD problems have recently been formulated as mixed-integer linear/non-linear programming (MILP/MINLP) problems. Maranas (1996) transformed the non-convex MINLP formulation into a tractable MILP model by expressing integer variables as a linear combination of binary variables and replacing the products of continuous and binary variables with linear inequality constraints. Using connectivity indices, Camarda and Maranas (1999) described a convex MINLP representation for solving several polymer design problems. Churi and Achenie (1996) solved a refrigerant design problem with an augmented penalty-outer approximation (AP/OA) algorithm. A reduced dimension branch-and-bound algorithm was presented by Ostrovsky, Achenie, and Sinha (2003) to design optimal solvents used as cleaning agents in the printing industry. Sahinidis and Tawarmalani (2000) and Sahinidis et al. (2003) reported a branch-and-reduce algorithm for identifying a replacement of Freon. Stochastic optimization approaches have also been developed as alternate strategies for rigorous deterministic methods. Venkatasubramanian et al. (1994) employed a genetic algorithm for polymer design in which properties are estimated via group contribution methods. Marcoulaki and Kokossis (1998) described a simulated annealing (SA) approach for the design of refrigerants and liquid–liquid extraction solvents. Wang and Achenie (2002) presented a hybrid global optimization approach that combines the OA and SA algorithms for several solvent design problems.

Tabu search (TS) is a heuristic approach for solving combinatorial optimization problems by using a guided, local search procedure to explore the entire solution space without becoming easily trapped in local optima. It differs from other stochastic optimization techniques by maintaining lists of previous solutions (usually termed "memory") that help guide the search process. These lists are useful for CAMD since they provide a direct method to track near-optimal solutions. The ability of TS to efficiently find a set of near-optimal solutions is particularly useful since all property prediction algorithms have limited accuracy and problem formulations normally do not include all relevant properties. Thus, the determination of the global optimum is not as critical as finding a set of near-optimal solutions. Deterministic approaches can generate such a list by multiple MINLP solves and application of integer cuts, but for large problems this becomes prohibitively computationally expensive. Other stochastic approaches, as well as generate and test strategies, can also generate such a near-optimal candidate list. By identifying a range of potential target molecules, TS avoids missing potentially useful molecules and allows the use of other criteria (such as ease of synthesis) to perform a final ranking of the candidates.

Transition metal catalysts are an important class of molecule for creating other molecules efficiently with minimal impact on the environment; however, the catalysts themselves are often extremely toxic and harmful to the environment. For example, most propylene oxide is produced using homogeneous catalysts containing molybdenum, which are effective but also harmful to the environment. Since in many cases a significant amount of such catalyst is lost to the environment (Allen & Shonnard, 2002), new materials which show improved catalytic activity and less toxicity are highly desired. This paper describes a framework for using TS to design transition-metal catalysts.

2. Property prediction via connectivity indices

In order for a molecular design algorithm to be successful, physical properties of interest, such as density, toxicity, solubility, or reactivity within a given system must be estimated with a reasonable accuracy using only a very small computational effort. Once the properties of a transition metal catalyst can be predicted from structure, the problem of designing a new catalyst can be formulated as an optimization problem. We employ connectivity indices, which are numerical values that describe the electronic structure of a molecule, to characterize the molecule and to correlate its internal structure with physical properties of interest. These indices connect the molecular structure to the properties of interest by providing a mathematical description of the molecule as a whole.

Molecular connectivity indices were first introduced by Randic (1975). Kier and Hall (1976) reported correlations between connectivity indices and many key properties of organic compounds, such as density, solubility, and toxicity. Furthermore, these indices can be computed with a minimum of computational effort. Many of the applications of connectivity indices have been reviewed by Trinajstic (1983). Kier et al. (1993) and Gordeeva et al. (1990) reported the use of connectivity indices within a molecular design context, while Raman and Maranas (1998) first incorporated connectivity indices into an optimization framework. Camarda and Maranas (1999) used connectivity indices as property descriptors to design polymers that have pre-specified property values. Siddhaye et al. (2000) employed connectivity indices to predict the physical properties for the design of novel pharmaceutical products via combinatorial optimization.

In earlier molecular design work, group contribution methods were used to estimate the values of physical properties, as in Gani et al. (1991), Venkatasubramanian et al. (1994), and Maranas (1996). The works of Constantinou and Gani (1994) and Marrero and Gani (2001) extend the basic group contribution approach by considering combinations of functional groups and thus take into account second-order effects when used to predict physical properties. Connectivity indices, however, take into account the entire molecular structure of a compound. By using higher-order connectivity indices, third-order and higher structural effects can be included in structure property relations. Harper and Gani (2000) and Harper et al. (2003) have combined group contribution with connectivity indices to include these higher-order effects. Raman and Maranas (1998) and Harper et al. (2003) have shown that these higher order effects are able to give more accurate property descriptions. Furthermore, when a molecular design problem is solved using these indices, a complete molecular structure is obtained, and no secondary problem must be solved to recover the final molecular structure. Harper and Gani (2000) and Meniai, Newsham and Khalfaoui (1998) have also reported the design of complete molecular structures. Connectivity indices also provide a method to compute the properties of mixtures, since the connectivity indices of individual compounds can be combined to estimate mixture properties in certain cases (Johnson, Lin, Miller, & Camarda, 2002; Kim, Min, Lee, Kim, & Jeong, 1992).

Computational property estimation algorithms first decompose a molecule into smaller units. Then, a set of basic groups that can potentially be part of the candidate molecules are defined. A basic group is defined as a single non-hydrogen atom in a given valence state, bonded to some number of hydrogen atoms. Atomic connectivity indices are defined over each basic group.

Basic groups of the molecule CHCl2F are shown in Table 1. The $\delta$ values are the simple atomic connectivity indices that refer to the number of bonds which can be formed with other groups. The $\delta^v$ values are atomic valence connectivity indices that describe the electronic structure of each basic group, including lone-pair electrons and electronegativity. For transition metals, which can assume multiple valence states, the definition of $\delta^v$ is based on the number of electrons participating in the bonding, instead of those present in the outer shell. Once the atomic connectivity indices of basic groups are defined, the molecular connectivity indices can be computed for the entire molecule.

Table 1. Atomic connectivity values of groups in CHCl2F

No.           1     2     3
Group         CH    F     Cl
$\delta$      3     1     1
$\delta^v$    3     7     0.77778

Molecule structure is expressed with a hydrogen-suppressed graph (as shown in Fig. 1). The zero-order molecular connectivity indices $^0\chi$ and $^0\chi^v$ are the sum of each basic group (the sum over all vertices), which describe the identity of the groups in a given molecule (see Table 2).

Fig. 1. Hydrogen-suppressed graph of CHCl2F.

Table 2. Calculation of molecular connectivity indices

Zero order:    $^0\chi = \sum_{i \in V} \delta_i^{-1/2}$,    $^0\chi^v = \sum_{i \in V} (\delta_i^v)^{-1/2}$,    V: the set of all vertices
First order:   $^1\chi = \sum_{(i,j) \in E} (\delta_i \delta_j)^{-1/2}$,    $^1\chi^v = \sum_{(i,j) \in E} (\delta_i^v \delta_j^v)^{-1/2}$,    E: the set of all edges
Second order:  $^2\chi = \sum_{(i,j,k) \in T} (\delta_i \delta_j \delta_k)^{-1/2}$,    $^2\chi^v = \sum_{(i,j,k) \in T} (\delta_i^v \delta_j^v \delta_k^v)^{-1/2}$,    T: the set of all triplets
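As an illustration of the formulas in Table 2, the following minimal Python sketch evaluates the zero-, first- and second-order indices for CHCl2F from the atomic values in Table 1. The function name `connectivity_indices`, the vertex numbering and the edge-list representation are assumptions made for this example; they are not taken from the paper.

```python
from itertools import combinations

def connectivity_indices(delta, edges):
    """Return (zero-, first-, second-order) indices of a hydrogen-suppressed graph.

    delta: atomic index of each vertex (simple or valence values).
    edges: list of (i, j) vertex pairs, one per bond.
    """
    # Zero order: sum over vertices of delta_i^(-1/2).
    chi0 = sum(d ** -0.5 for d in delta)
    # First order: sum over edges of (delta_i * delta_j)^(-1/2).
    chi1 = sum((delta[i] * delta[j]) ** -0.5 for i, j in edges)
    # Second order: sum over triplets i-j-k, i.e. pairs of neighbors of a center j.
    neighbors = {v: [] for v in range(len(delta))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    chi2 = sum((delta[i] * delta[j] * delta[k]) ** -0.5
               for j, adjacent in neighbors.items()
               for i, k in combinations(adjacent, 2))
    return chi0, chi1, chi2

# CHCl2F: vertex 0 is CH, bonded to F (vertex 1) and two Cl (vertices 2 and 3).
edges = [(0, 1), (0, 2), (0, 3)]
simple = connectivity_indices([3, 1, 1, 1], edges)
valence = connectivity_indices([3, 7, 0.77778, 0.77778], edges)
print("simple  (0chi, 1chi, 2chi):", simple)
print("valence (0chi, 1chi, 2chi):", valence)
```

Because the first- and second-order sums run over bonded pairs and triplets, the same set of groups arranged differently gives different index values, which is what lets structure-changing moves affect the predicted properties.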

Higher order connectivity indices can also be defined and have been used to give a more precise description of molecular structure (Kier and Hall, 1976). In this work, we have used the second-order connectivity indices, $^2\chi$ and $^2\chi^v$, to give added accuracy to correlations. These two indices are sums over each of the triplets in the molecule, that is, over each possible combination of three bonded groups (see Fig. 1).

Once the equations defining the (molecular) connectivity indices are in place, these indices can be used in empirical correlations to predict the physical properties of novel transition-metal catalysts. Table 3 shows the basic groups employed in the design of a molybdenum catalyst for an epoxidation reaction. (To limit the search space, we set an upper bound on the maximum number of groups in a molecule.) The correlation derived in this work for density is:

$$\rho = -55351 + 75800\,{}^0\chi - 7663\,{}^0\chi^v + 40901\,{}^1\chi + 1784\,{}^1\chi^v - 72046\,{}^2\chi - 607\,{}^2\chi^v - 24695\,({}^0\chi)^2 + 649\,{}^0\chi\,{}^0\chi^v - 12271\,({}^1\chi)^2 - 65.4\,({}^1\chi^v)^2 - 1793\,({}^2\chi)^2 + 8.9\,({}^2\chi^v)^2 + 72323\,{}^1\chi\,{}^2\chi \qquad (1)$$

This correlation was developed from regressing 23 Mo-centered complexes and resulted in a correlation coefficient of 0.962 and a sum of squared error of 0.944. Using this correlation, an optimization problem has been formulated whose optimal solutions are molecules which most closely match a set of target property values.

Table 3. Basic groups and atomic connectivity values of case 1

No.            1         2         3     4         5     6     7     8
Group          Mo        Mo        OH    Cl        O     CH3   CH2   NH2
$\delta$       5         6         1     1         2     1     2     1
$\delta^v$     0.13889   0.17143   5     0.77778   6     1     2     3
$N_{max,i}$    2         2         3     3         3     3     3     3

Groups 1 and 2 are the molybdenum groups, forming five and six bonds, respectively.

3. Problem formulation

When zeroth-, first- and second-order connectivity indices are employed for property estimation, a molecule is represented mathematically using sets of binary variables. First, a vector of binary variables, $z_i$, is defined. An element of this vector equals one if the ith group exists in the molecule; otherwise, it equals zero. Second, a partitioned adjacency matrix with elements $f_{i,j,k}$ is determined. When basic groups i and j are bonded with a kth-multiplicity bond, $f_{i,j,k} = 1$; otherwise, $f_{i,j,k} = 0$. These sets of variables provide sufficient information to compute the connectivity indices and, thus, estimate molecular properties. Along with these definitions, property correlations using the connectivity indices are included in the overall formulation.

The objective is to determine the molecule that minimizes the difference between the estimated and target property values. This can be written as:

$$\min\ \mathrm{Obj} = \sum_{m \in R} \frac{1}{P_m^{scale}} \left| P_m - P_m^{target} \right| \qquad (2)$$

where $R$ is the set of all targeted properties, $P_m$ the estimated value of property m, $P_m^{scale}$ a scale factor used to weight the importance of one property relative to another, and $P_m^{target}$ the target value for property m.

Structural constraints are added to ensure the obtained molecules are fully connected and satisfy valency by being connected with the appropriate types of bonds. Other constraints include bounds on the variables and property values. The resulting formulation is a large, non-convex MINLP model for CAMD.

Table 3 shows the set of 8 basic group types used for designing a molybdenum catalyst for an epoxidation reaction. The total number of available groups is $\sum_{i=1}^{8} N_{max,i} = 22$. Since only single bonds are involved, the size of the adjacency matrix f is 22 · 22 · 1 = 484 elements. Each element, $f_{i,j,k}$, can be either 0 or 1, and the total number of independent variables is $\sum_{i=1}^{22} (i - 1) + 22 = 253$. In order to apply TS effectively, the original MINLP model of Siddhaye et al. (2000) has been altered to handle basic groups and the connectivity between them rather than dealing with the elements of the adjacency matrix directly.
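The objective of Eq. (2) and the variable counts quoted above can be checked with a short Python sketch. The function names, the dictionary-based interface and the property values in the usage lines are hypothetical and chosen only for illustration; the count of independent variables follows the single-bond case stated in the text.

```python
def objective(estimated, target, scale):
    """Scaled absolute deviation from the property targets, as in Eq. (2)."""
    return sum(abs(estimated[m] - target[m]) / scale[m] for m in target)

def count_variables(n_groups, n_bond_types=1):
    """Size of the adjacency matrix f and number of independent binary variables.

    The independent count uses the single-bond formula from the text:
    sum_{i=1..n} (i - 1) adjacency elements plus n existence variables z_i.
    """
    adjacency_elements = n_groups * n_groups * n_bond_types
    independent = sum(i - 1 for i in range(1, n_groups + 1)) + n_groups
    return adjacency_elements, independent

# Illustrative (made-up) property values for two targeted properties.
estimated = {"density": 3.0, "LC50": 10.0}
target = {"density": 2.0, "LC50": 8.0}
scale = {"density": 1.0, "LC50": 4.0}
print(objective(estimated, target, scale))   # 1.0 + 0.5
print(count_variables(22))                   # adjacency size and independent variables
```

For the 22 available groups of case 1, `count_variables(22)` reproduces the 484 adjacency elements and 253 independent variables given above.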

4. TS implementation

TS begins by determining an initial solution. Additional solutions (termed neighbors) are generated by modifying the existing solution through a sequence of moves. The best new neighbor ($x'$) is used as the starting point for the next iteration unless it is on a Tabu list. Thus, even if no neighbor solutions are better than the initial solution, the best one is still chosen as the starting point for the next iteration. A record of the best solution ever found ($x^*$) is separately maintained. In addition, the Tabu lists provide an adaptive memory that guides the search by taking advantage of historical information. This memory enables TS to make strategic choices and achieve responsive exploration.
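The iteration described above can be sketched generically in Python. This is a minimal illustration on a toy integer problem, not the authors' implementation: the function `tabu_search`, its parameters, and the simplified rule of marking every accepted solution as Tabu for a fixed tenure are all assumptions made for the example.

```python
def tabu_search(initial, neighbors, cost, tenure=5, iterations=50):
    """Generic Tabu search loop: move to the best non-Tabu neighbor each iteration."""
    current = best = initial
    tabu = {}  # solution -> iterations remaining on the Tabu list
    for _ in range(iterations):
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        # The best new neighbor becomes the next starting point, even if it is worse.
        current = min(candidates, key=cost)
        # The best solution ever found is tracked separately.
        if cost(current) < cost(best):
            best = current
        # Age the Tabu list, then forbid revisiting the accepted solution (simplified rule).
        tabu = {s: t - 1 for s, t in tabu.items() if t > 1}
        tabu[current] = tenure
    return best

# Toy problem (hypothetical): minimize (x - 7)^2 over the integers, stepping by +/- 1.
result = tabu_search(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2)
print(result)
```

Accepting a worse neighbor is what lets the search walk out of a local optimum, while the separately stored best solution ensures nothing is lost by doing so.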

The standard TS algorithm (Glover & Laguna, 1997; Lin, Chavali, Camarda & Miller, 2003; Lin & Miller, 2004b) is adapted to represent the molecule efficiently. Each solution is a molecule consisting of a series of fully connected groups. Therefore, the initial solution is constructed by connecting basic groups together. In generating neighbor solutions, operators are also designed to handle basic groups. The procedure of building an initial solution is described using the basic groups listed in Table 3. The progression of the building process is shown in Table 4. In the CAMD framework, a mainchain is defined as the list within a molecule with the largest number of groups. It is determined during the process of constructing a molecule. The length of sidechains (list of groups connecting to the mainchain) is constrained to be less than or equal to that of the mainchain. For all the catalyst design test cases, at least one Mo group must be present.

Table 4. Procedures for building an initial solution (columns: Step, Solution). Step 1: NH2–. [Structure drawings for steps 2–6.]

4.1. Building an initial solution

• Step 1. One of the basic groups is selected as a root group. For simplicity, a group with only one bond, NH2, is chosen.

• Step 2. Another group is added to the root group. If the second group has only one bond, e.g., OH, there are no empty bonds in the molecule. Since the molecule is fully connected, the initial solution, hydroxylamine (H2N–OH), would be obtained; however, this violates the constraint requiring at least one Mo atom. Thus, another group must be chosen. In this example, this step adds basic group 1 from Table 3, resulting in the molecule fragment for step 2 as shown in Table 4.

• Step 3. Now, the molecule consists of two groups. Additional groups are needed to connect to the empty bonds. Since the length of the mainchain is currently 2, the length of the sidechains can be either 1 or 2. When the length is 1, another group with a single bond, OH, is attached as shown in Table 4. As sidechains are added to the molecule, the mainchain identity and length are updated so that it is always the longest chain in the molecule.

• Step 4. The length of the second sidechain is 2, resulting in a branch that consists of two groups. Thus, –O–Cl is added.

• Step 5. Another sidechain of length 1, CH3, is added. One free bond still exists in the molecule.

• Step 6. If a group with only one bond is selected, e.g., OH, the molecule is fully connected, and the initial solution is obtained as shown in Table 4. If the selected group has more than one bond, steps 2–5 are repeated until the molecule is fully connected.

A set of candidate solutions will be generated and evaluated at each iteration. These candidates (or neighbor solutions) will be constructed by modifying the current molecule.
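The building steps above can be sketched as a randomized construction in Python. This is a simplified illustration under stated assumptions, not the paper's procedure: the labels `Mo5` and `Mo6` for groups 1 and 2 of Table 3 are hypothetical, the sidechain-length rule is not enforced, and the guard that keeps enough single-bond groups in reserve to close every open bond is an added device to guarantee termination.

```python
import random

# (number of bonds, N_max) per basic group, from Table 3; "Mo5"/"Mo6" are hypothetical labels.
GROUPS = {"Mo5": (5, 2), "Mo6": (6, 2), "OH": (1, 3), "Cl": (1, 3),
          "O": (2, 3), "CH3": (1, 3), "CH2": (2, 3), "NH2": (1, 3)}

def build_initial(rng):
    """Grow a fully connected molecule; returns (group names, list of bonded index pairs)."""
    remaining = {g: n_max for g, (_, n_max) in GROUPS.items()}
    nodes, bonds, open_bonds = [], [], []

    def attach(group, parent):
        remaining[group] -= 1
        nodes.append(group)
        index = len(nodes) - 1
        free = GROUPS[group][0]
        if parent is not None:
            bonds.append((parent, index))
            free -= 1
        open_bonds.extend([index] * free)   # one entry per bond still empty

    attach("NH2", None)                     # Step 1: root group with a single bond
    attach("Mo5", open_bonds.pop())         # Step 2: guarantees at least one Mo group
    while open_bonds:                       # Steps 3-6: fill every empty bond
        parent = open_bonds.pop()
        caps = sum(n for g, n in remaining.items() if GROUPS[g][0] == 1)
        choices = []
        for g, n in remaining.items():
            n_bonds = GROUPS[g][0]
            opened = len(open_bonds) + n_bonds - 1
            caps_left = caps - 1 if n_bonds == 1 else caps
            # Only allow a group if the bonds left open can still all be capped.
            if n > 0 and opened <= caps_left:
                choices.append(g)
        attach(rng.choice(choices), parent)
    return nodes, bonds

nodes, bonds = build_initial(random.Random(0))
print(nodes)
print(bonds)
```

Every molecule returned this way satisfies valency for each group, respects the $N_{max,i}$ limits, and contains a Mo group, so it is a feasible starting point for the search.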

4.2. Neighbor generation operators

Two sets of operators are defined: local and global. Local operators maintain the main backbone of the current molecule, and are implemented for the purpose of localized search; global operators generate neighbor solutions that are significantly different from the current one in order to diversify the solutions at each iteration.

Table 5. Neighbor generation operators (columns: Step, Solution). Rows: Replace, Insert, Delete, Swap, Move. [Structure drawings.]

There are six local operators:

1. Replace
This operator replaces a group in the current molecule with another available basic group. If OH is chosen, it can be replaced by another group with a single bond, such as Cl or CH3. If the group O is replaced by CH2, a new molecule will be obtained (see Table 5).

2. Insert
A basic group that is available will be inserted in front of the selected group in the current molecule. Suppose the group CH3 is chosen; a group with two single bonds can then be inserted on the sidechain. The new molecule obtained by inserting CH2 is shown in Table 5.

3. Delete
This operator deletes the selected group in the current solution. Therefore, the operator cannot be applied to a group with only a single bond, since that would result in an infeasible molecule. In this example, only the molybdenum or the oxygen group can be deleted. The molecule in Table 5 shows the result of the deletion of O.

4. Swap
Two groups of the current molecule are exchanged by applying the swap operator. Suppose the group CH3 is selected; another group with only a single bond will then be chosen. The molecule after swapping Cl with CH3 is shown in Table 5. Since molecules obtained from swapping two groups connecting to the same "parent group" are equivalent, groups to be swapped are required to be different and not connected to the same "parent".

5. Move
This operator moves the selected group within the existing molecule. Suppose the group O is selected to be moved after the group NH2. A new molecule is obtained as shown in Table 5.

6. Combination
This operator combines the 5 operators to form a new one. A random number between 1 and 5 is first generated. If the number is 3, for example, then 3 of the 5 operators, such as Move, Insert, and Swap, will be applied sequentially to the current molecule to generate a neighbor solution.

The following three global operators help TS perform a diversified search to investigate the whole solution space sufficiently:

7. SideChainRebuild
If the selected group is located on a sidechain of the molecule, the sidechain will be replaced by a newly-generated one.

8. MainChainRebuild
If a mainchain group is selected, all groups that are located after it will be deleted and the rest of the molecule is reconstructed.

9. Total Rebuild
If the first or the last group of the molecule is selected, the whole molecule will be discarded and replaced by a new molecule that is built following the steps of generating an initial solution.

The two sets of operators are selected based on the stage of locating a solution. Global operators are more frequently selected at the starting stage to favor a diversified search and local operators are selected more frequently at the end of the search process to locate the final solution precisely.

After obtaining a molecule, all elements of the adjacency matrix can be determined according to the connectivity among the groups. Then, with the provided simple atomic indices and valence indices, the molecular connectivity indices can be calculated, and the properties of this molecule are estimated with the property-structure correlations. At each iteration, certain solutions are classified as Tabu (forbidden) and added to Tabu lists. At the same time, the Tabu property of other solutions will expire, and they will be removed from the Tabu lists. In this way, Tabu lists are updated continuously and adapt to the current state of the search. The use of adaptive memory enables TS to exhibit learning and creates a more flexible and effective search.
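The five basic local operators can be sketched in Python on a deliberately simplified representation: an unbranched chain stored as a list of group names, with single-bond groups at the ends. The function names, the chain representation and the feasibility checks are assumptions for this example (the paper's operators act on branched molecules with Mo groups); the $\delta^v$ values and bond counts are taken from Table 3.

```python
# Valence indices and bond counts for six of the Table 3 groups (Mo groups omitted).
DELTA_V = {"OH": 5.0, "Cl": 0.77778, "O": 6.0, "CH3": 1.0, "CH2": 2.0, "NH2": 3.0}
BONDS = {"OH": 1, "Cl": 1, "O": 2, "CH3": 1, "CH2": 2, "NH2": 1}

def chi1_valence(chain):
    """First-order valence connectivity index of an unbranched chain."""
    return sum((DELTA_V[a] * DELTA_V[b]) ** -0.5 for a, b in zip(chain, chain[1:]))

def replace(chain, pos, group):
    if BONDS[group] != BONDS[chain[pos]]:
        raise ValueError("replacement must form the same number of bonds")
    return chain[:pos] + [group] + chain[pos + 1:]

def insert(chain, pos, group):
    """Insert a two-bond group in front of the group at position pos."""
    if BONDS[group] != 2 or not 0 < pos < len(chain):
        raise ValueError("only two-bond groups can be inserted inside the chain")
    return chain[:pos] + [group] + chain[pos:]

def delete(chain, pos):
    if BONDS[chain[pos]] != 2:
        raise ValueError("deleting a single-bond group leaves an empty bond")
    return chain[:pos] + chain[pos + 1:]

def swap(chain, i, j):
    if BONDS[chain[i]] != BONDS[chain[j]] or chain[i] == chain[j]:
        raise ValueError("swap needs two different groups with equal bond counts")
    out = list(chain)
    out[i], out[j] = out[j], out[i]
    return out

def move(chain, src, dst):
    """Move a two-bond group to interior position dst of the remaining chain."""
    if BONDS[chain[src]] != 2:
        raise ValueError("only two-bond groups can be moved")
    rest = chain[:src] + chain[src + 1:]
    if not 0 < dst < len(rest):
        raise ValueError("destination must be inside the chain")
    return rest[:dst] + [chain[src]] + rest[dst:]

chain = ["CH3", "CH2", "O", "Cl"]
moved = move(chain, 2, 1)
print(moved, chi1_valence(chain), chi1_valence(moved))
```

The moved chain contains exactly the same groups as the original, yet its first-order valence index differs. A pure group-count property model would see no change, which is why swap and move are only meaningful with connectivity-index correlations.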

4.3. Tabu lists

In the TS algorithm, each molecule consists of a set of fully connected groups, which are classified into mainchain and sidechain groups. Fig. 2 shows a solution located by TS based on basic groups listed in Table 3; the mainchain is identified as Cl–CH2–Mo–Cl, with 3 sidechains that are all of length 1, the group OH. If both the mainchain and sidechain groups, as well as connectivity relations, are included in Tabu lists, the update and maintenance of the lists will be time-consuming. Therefore, both recency-based and frequency-based Tabu lists only keep track of mainchain groups.

Fig. 2. A sample molecule for case 1 located by TS.

Table 6. Recency-based Tabu list (the Group list column contains structure drawings)

No.    Tabu tenure    Objective function
1      22             0.0036
2      20             0.0699
3      15             0.5752
4      14             0.5666
5      10             0.5041
6      9              0.0094

The recency-based Tabu list records recently visited solutions, and is called short-term memory. At each iteration, if the best neighbor is not better than the best solution found so far (x*), it is classified as Tabu and added to the recency-based Tabu list. The Tabu property remains active throughout the Tabu tenure, which is empirically set equal to the total number of basic groups available for building the molecule (22 for this case).
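A minimal sketch of such a short-term memory is given below, assuming the list is keyed on the mainchain and that smaller objective values are better. The class and function names, and the choice to age existing entries before adding a new one, are illustrative assumptions rather than details taken from the paper.

```python
class RecencyTabuList:
    """Short-term memory keyed on the mainchain (a sequence of group names)."""

    def __init__(self, tenure):
        self.tenure = tenure
        self._remaining = {}  # mainchain tuple -> iterations left on the list

    def add(self, mainchain):
        self._remaining[tuple(mainchain)] = self.tenure

    def is_tabu(self, mainchain):
        return tuple(mainchain) in self._remaining

    def tick(self):
        """Advance one iteration and release entries whose tenure has expired."""
        self._remaining = {
            chain: left - 1
            for chain, left in self._remaining.items()
            if left > 1
        }

def record_iteration(tabu, best_neighbor_chain, best_neighbor_obj, best_obj):
    """Age the list, then mark the best neighbor Tabu if it does not improve on x*."""
    tabu.tick()
    if best_neighbor_obj >= best_obj:
        tabu.add(best_neighbor_chain)

# Tenure of 22 as in case 1; the chain is the mainchain quoted for Fig. 2.
tabu = RecencyTabuList(tenure=22)
tabu.add(["Cl", "CH2", "Mo", "Cl"])
```

With this bookkeeping an entry added with tenure 22 is released on the 22nd subsequent call to `tick`.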

A segment of the recency-based Tabu list is shown in Table 6. The Tabu tenure column shows the number of iterations during which a solution will be kept on the recency-based Tabu list. The first solution on the recency-based Tabu list, with an objective function value of 0.0036, has the following mainchain:

[mainchain structure graphic]

It will be released from the list after 22 iterations, unless a best neighbor with the same mainchain is located within this period. In this case, the Tabu property is overridden by the improved-best aspiration criterion (Glover and Laguna, 1997).

The frequency-based Tabu list provides long-term memory and keeps track of the solutions that have been visited most frequently. The frequency Tabu property is denoted with an index. When a previous solution is revisited, its frequency index will be incremented; otherwise, the frequency index decreases. This allows TS to determine whether the search process has become trapped in a specific area for a long time.

Table 7 shows six solutions on the frequency-based Tabu list. The 1st solution, with an objective function value of 0.4848, has the lowest frequency index. This solution will be replaced by new solutions if the frequency-based Tabu list is full. The 4th solution, with the following mainchain structure:

[mainchain structure graphic]

has been visited most frequently, since its frequency index (12.75) is much larger than that of the other solutions.

Table 7
Frequency-based Tabu list

No.   Frequency index   Objective function   Group list
1     0.025             0.4848               [structure graphic]
2     0.103             0.1405               [structure graphic]
3     0.085             0.0648               [structure graphic]
4     12.75             0.0094               [structure graphic]
5     0.590             0.4879               [structure graphic]
6     3.600             0.5666               [structure graphic]

Because the Tabu lists track the occurrence of mainchain groups, the danger of auto isomorphism negatively impacting the algorithm is reduced. Even without considering auto isomorphism, it is possible that some neighbor solutions will be the same as others. To avoid becoming stuck with a best solution always being a rearranged version of the same molecule, the Tabu list will recognize the mainchain and force the selection of a new starting point for the next generation of neighbors.

4.4. Intensification and diversification

Based on knowledge of the current search status as provided through the Tabu lists, intensification and diversification strategies are used to control the search area. Intensification is carried out by generating neighbor solutions employing the local operators. Global operators increase diversification and are implemented based on frequency-based Tabu lists. If the maximum frequency index of solutions on the frequency-based Tabu list is larger than a predefined threshold (in this case, the threshold is empirically set to 22), TS is deemed to have been searching around a specific solution too often. Thus, the current solution will be replaced by a new randomly generated solution by applying the global operators, and the search process will be restarted. This diversification strategy is similar to the restart mechanism used in other stochastic optimization approaches; however, the restart is not purely random. Instead, it is guided by historical information.
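The frequency-based memory and its restart trigger can be sketched as follows. The text states only that the index is incremented on a revisit and decreases otherwise, so the unit increment and the multiplicative decay used here are assumptions, as are the class and method names; the threshold of 22 is the value quoted above.

```python
class FrequencyTabuList:
    """Long-term memory: one frequency index per mainchain."""

    def __init__(self, threshold=22.0, gain=1.0, decay=0.9):
        self.threshold = threshold
        self.gain = gain      # assumed increment applied on each visit
        self.decay = decay    # assumed decay applied to all other entries
        self.index = {}       # mainchain tuple -> frequency index

    def visit(self, mainchain):
        """Record a visit: raise this entry's index, lower all the others."""
        key = tuple(mainchain)
        for other in self.index:
            if other != key:
                self.index[other] *= self.decay
        self.index[key] = self.index.get(key, 0.0) + self.gain

    def needs_restart(self):
        """True when some solution has been revisited too often."""
        return bool(self.index) and max(self.index.values()) > self.threshold

memory = FrequencyTabuList(threshold=2.0)
for _ in range(3):
    memory.visit(["Cl", "CH2", "Mo", "Cl"])
restart = memory.needs_restart()
```

When `needs_restart` returns true, the search would replace the current solution with one generated by the global operators, as described above.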


In some cases, TS cannot locate improved solutions after the restart operation, and additional procedures are required. For example, if the current best solution has an objective function value of 0.0094 when TS initiates a restart, and the best neighbor after several iterations is still not less than 0.0094, another intensification strategy pulls the search back to the formerly promising area by resetting the current solution to the best neighbor recorded before the restart operation.

4.5. Aspiration criterion

An aspiration criterion based on the sigmoid function is employed to invalidate the Tabu property in certain cases, which helps to maintain an appropriate balance between diversification and intensification (Lin and Miller, 2004a, 2004b). Since an intensified search is favored at the end and a broadly diversified search at the beginning, the aspiration criterion allows the Tabu property to be overridden more often towards the end of the search. This helps to ensure that the Tabu lists properly encourage a broad search early on.
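One way to picture a sigmoid-based aspiration criterion is as an override probability that rises with search progress. The exact functional form is given in Lin and Miller (2004a, 2004b) and is not reproduced here; the logistic form, the steepness, the midpoint, and the function names below are assumptions chosen only to show the qualitative behavior.

```python
import math
import random

def override_probability(iteration, max_iterations, steepness=10.0, midpoint=0.5):
    """Logistic curve in search progress: near 0 early, near 1 late (assumed form)."""
    progress = iteration / max_iterations
    return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))

def tabu_overridden(iteration, max_iterations, rng=random):
    """Decide stochastically whether a Tabu solution may be accepted anyway."""
    return rng.random() < override_probability(iteration, max_iterations)

# Override probability at the start, middle and end of a 200-iteration run.
early = override_probability(0, 200)
middle = override_probability(100, 200)
late = override_probability(200, 200)
```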

4.6. Constraint handling

In this application, constraints are handled in a way that guarantees the feasibility of the molecule. Since TS operates on basic groups directly, both valence and connectivity constraints are handled during the process of neighbor generation as previously described. For example, both local and global operators ensure the proper type of bond (single, double or triple) to be connected. In addition, the number of empty bonds is checked to guarantee that valency will be satisfied in the current solution. Groups that would cause a violation of either connectivity or valency are not allowed. Otherwise, all groups can interconnect.

5. Case studies

The effectiveness of the TS algorithm for CAMD is demonstrated with the following two case studies. The parameters for each case are as follows:

1. the number of neighbor solutions generated at each iteration is equal to the product of the number of basic groups and the maximum number of groups in a molecule;

2. the number of iterations is 200;

3. the length of the Tabu list is defined to be the same as the maximum number of groups in a molecule.

The TS algorithm is compiled using the gcc compiler on a Pentium III 1.0 GHz CPU, 1024 MB memory PC running Redhat Linux 7.1.

5.1. Case 1

The first case is the design of a molybdenum catalyst for an epoxidation reaction. Eight basic groups (see Table 3) are used to define the search space. The purpose of this case is to evaluate the effectiveness of the formulation and to compare TS with a deterministic solution method. Thus, only a single target property, density, is used. While density is an important property of a homogeneous catalyst, it does not define the catalytic activity or any other critical properties and, thus, cannot alone define an effective catalyst. However, density is closely related to solubility for these systems, and solubility defines the amount of potential uptake by humans. Therefore, a combination of density and toxicity would be needed to assess exposure risk. The target density value is 4172 kg/m3, which is the density of a commonly used homogeneous transition-metal catalyst. The correlation for this property is given by Eq. (1).

The number of neighbor solutions is 8 × 22 = 176. The 5 best molecules and the probability that a molecule is found are obtained from 100 trials. Sub-optimal solutions are either the final output of a trial or an intermediate result of a certain iteration. If the sub-optimal solution is the final result of a trial, the probability is incremented by 1%. Since intermediate solutions of some trials may be even better than the final result of other trials, the probability of such sub-optimal solutions is denoted as <1% (see Table 8) to indicate that it was located, but was never a final solution. The best solution, with the objective function value of 0.000111, corresponds to a density of 4172.46 kg/m3 and is found only once in 100 trials. Solution 3 (0.000238) has 7 basic groups and is found 80% of the time. The corresponding density value of 4172.99 kg/m3 is very close to the best solution. Since the difference between the best solution and the 5th best solution is less than 0.2%, this shows that TS can successfully determine several promising catalyst molecules for further experimental verification.

It is especially useful to identify and record near-optimal solutions since the density correlation has an error of approximately 4%. Thus, near-optimal solutions are almost as likely to be strong candidates for synthesis as the optimal one. Furthermore, many other factors, such as ease of synthesis, have not been taken into account within the optimization formulation. Thus, a user would like to have multiple options to choose from.

TS provided the results in Table 8 in 90 s on a PIII personal PC. In comparison, using Outer Approximation via the DICOPT solver in GAMS, only an integer feasible solution was found after 20 min on a Sun Ultra 10 workstation. The structure (see Fig. 3) resulted in an objective function of 5.67.
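The case 1 numbers can be checked with a few lines of arithmetic. The sketch below assumes that the single-property objective is the relative deviation from the target density; this form is an inference that agrees with the reported objective values to within the rounding of the quoted densities, and it is not stated explicitly in this section. The function names are illustrative.

```python
def neighbors_per_iteration(n_basic_groups, max_groups):
    """Parameter 1 of the case studies: basic groups x maximum groups per molecule."""
    return n_basic_groups * max_groups

def relative_deviation(value, target):
    """Assumed objective form for a single target property."""
    return abs(value - target) / target

TARGET_DENSITY = 4172.0  # kg/m3, target value for case 1

n_neighbors = neighbors_per_iteration(8, 22)             # 8 x 22 = 176
obj_best = relative_deviation(4172.46, TARGET_DENSITY)   # reported as 0.000111
obj_third = relative_deviation(4172.99, TARGET_DENSITY)  # reported as 0.000238
print(n_neighbors, obj_best, obj_third)
```

Under the same assumption, a candidate at the edge of the 4% correlation error would have an objective value of about 0.04, far larger than the spread among the listed solutions, which is why the near-optimal entries are essentially indistinguishable in practice.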

Table 8
Results of case 1 with TS approach

Structure   N    Objective value   Probability (%)
0           18   0.000111          1
1           13   0.000115          <1
2           7    0.000238          80
3           20   0.000571          <1
4           16   0.000659          <1
5           7    0.001361          19

Fig. 3. Integer solution for case 1 found with OA algorithm.

5.2. Case 2

The second case study involves 10 basic groups (see Table 9) and forces the number of Mo groups to be either 1 or 2. The properties to be optimized are electronegativity, oxidation state and toxicity. The electronegativity is calculated as follows:

\[
\mathrm{Elec} = \sum_{i=1}^{N_{Mo}} \sum_{j>i}^{N_{Tot}} \sum_{k=1}^{N_{Bond}} f_{i,j,k}\,\bigl|\mathrm{Elec}(i) - \mathrm{Elec}(j)\bigr| \qquad (3)
\]

where Elec(i) is the electronegativity of group i, f_{i,j,k} is an element of the adjacency matrix, N_Mo is the number of rows allotted to all the molybdenum groups, N_Tot is the total number of basic groups (20 in this example), and N_Bond is the number of bond types of the basic groups (2 in this example, single and double). The lower bound is 0 and the upper bound is 3.84 eV. Oxidation state is the sum of the oxidation states of all the molybdenum groups:

\[
\mathrm{Oxid} = \sum_{i=1}^{N_{Mo}} z_i\,\mathrm{Oxid}(i) \qquad (4)
\]

where Oxid(i) is the oxidation state of the ith group and z_i is an element of the existence vector. The target value of the oxidation state is the set {6, 7, 8}. Toxicity is determined with the following correlation for the LC50:

\[
\begin{aligned}
\log(\mathrm{LC}_{50}) ={}& -27.8 + 5.49\,{}^{0}\chi + 10.6\,{}^{0}\chi^{v} - 6.65\,{}^{1}\chi - 2.93\,{}^{1}\chi^{v} - 4.40\,{}^{2}\chi^{v} + 0.368\,{}^{2}\chi^{v} \\
& - 0.710\,({}^{0}\chi^{v})^{2} + 5.59\,({}^{1}\chi)^{2} - 0.0761\,({}^{1}\chi^{v})^{2} \\
& - 2.10\,{}^{0}\chi^{v}\,{}^{1}\chi + 0.516\,{}^{0}\chi^{v}\,{}^{1}\chi^{v} - 0.0295\,{}^{0}\chi^{v}\,{}^{2}\chi^{v}
\end{aligned}
\qquad (5)
\]

The LC50 correlation was developed using 34 data points obtained from Syracuse Research Corporation (2000). The resulting correlation has a correlation coefficient of 0.953 and a sum of squared errors of 2.083.

Table 9
Basic groups for case 2

No.: 1 2 3 4 5 6 7 8 9 10
Group: Mo S O O NH2 NH CH2 Cl
δ: 2 3 3 1 1 2 1 2 2 1
δv: 0.05128 0.07895 0.10811 0.66667 6 6 3 4 2 0.77778
Elec. (eV): 2.16 2.19 2.24 2.58 3.44 3.04 3.04 2.55 2.55 3.16
Oxid.: 2 3 4 – – – – – – –
Nmax,i: 2 2 2 2 2 2 2 2 2 2
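Eqs. (3) and (4) translate directly into nested sums. The sketch below assumes a dense nested-list layout f[i][j][k] for the adjacency elements, with the molybdenum rows placed first; the function names and the three-group example are hypothetical, and only the electronegativity values 2.16, 3.44 and 3.16 are taken from the Elec. row of Table 9.

```python
def electronegativity_term(f, elec, n_mo):
    """Eq. (3): sum over Mo rows i, groups j > i and bond types k of
    f[i][j][k] * |Elec(i) - Elec(j)|."""
    n_tot = len(elec)
    total = 0.0
    for i in range(n_mo):
        for j in range(i + 1, n_tot):
            for k in range(len(f[i][j])):
                total += f[i][j][k] * abs(elec[i] - elec[j])
    return total

def oxidation_term(z, oxid):
    """Eq. (4): existence-weighted sum of the Mo oxidation states."""
    return sum(z_i * ox_i for z_i, ox_i in zip(z, oxid))

# Hypothetical fragment: group 0 (Mo, 2.16 eV) double-bonded to group 1 (3.44 eV)
# and single-bonded to group 2 (3.16 eV). Index k = 0 is single, k = 1 is double.
elec = [2.16, 3.44, 3.16]
f = [[[0, 0], [0, 1], [1, 0]],
     [[0, 1], [0, 0], [0, 0]],
     [[1, 0], [0, 0], [0, 0]]]
elec_value = electronegativity_term(f, elec, n_mo=1)

# Existence vector selecting the Mo rows with oxidation states 2 and 4.
oxid_value = oxidation_term([1, 0, 1], [2, 3, 4])
```

Restricting the outer loop to the Mo rows means that only bonds involving a molybdenum group contribute, which matches the summation limits of Eq. (3).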

In this example, the objective function is the weighted summation of these properties:

\[
\mathrm{Obj} = 0.6\,\mathrm{LC}_{50} + 0.3\,\mathrm{Elec} + 0.1\,\mathrm{Oxid} \qquad (6)
\]
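A weighted sum of this kind is straightforward to parameterize so that the weights can be changed without touching the search code. In the sketch below the function name and the dictionary layout are assumptions, and the property terms are placeholders, since this section does not state how each property is scaled before weighting.

```python
def weighted_objective(terms, weights):
    """Eq. (6): weighted sum of the property terms named in `weights`."""
    return sum(weights[name] * terms[name] for name in weights)

# Weights from Eq. (6).
WEIGHTS = {"LC50": 0.6, "Elec": 0.3, "Oxid": 0.1}

# Placeholder property terms, for illustration only.
example_terms = {"LC50": 1.5, "Elec": 1.0, "Oxid": 0.0}
example_obj = weighted_objective(example_terms, WEIGHTS)
```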

The weights on the properties are freely adjustable and could be based on the preference of the designer. For this case study these weights are simply examples and do not directly impact the conclusions of the paper.

The number of neighbor solutions is 10 × 20 = 200. The length of the Tabu list is 20. The 5 best molecules obtained from 100 TS trials are shown in Table 10. The optimal solution, with an objective function of 1.286401, is obtained in 40 trials and the second optimal solution is located the remaining 60% of the time. The difference between the optimal solution and the 10th sub-optimal solution (shown below):

Cl NH O Mo O Mo NH CH2 Cl

is smaller than 1%. (The value of its objective function is 1.297251.) In addition, more than 60 near-optimal solutions are obtained for this case study, which shows that TS can locate a large number of near-optimal solutions for further experimental verification.

Table 10
Solutions of TS for case 2

No.   Structure                         NGrp   Objective   Probability (%)
0     Cl Mo O Mo O CH2 CH2 Cl           8      1.286401    40
1     Cl O Mo O Mo NH CH2 Cl            8      1.288636    60
2     Cl CH2 O Mo O Mo NH Cl            8      1.288698    <1
3     Cl O Mo O Mo NH CH2 CH2 Cl        9      1.288911    <1
4     Cl O Mo O Mo NH Cl                7      1.288913    <1

In comparison, employing outer approximation via the DICOPT solver in GAMS, only an integer feasible solution was found after 20 min of CPU time on a Sun Ultra 10 workstation. The structure (see Fig. 4) resulted in an objective function of 5.55. While the deterministic algorithm may be able to find and guarantee the global optimum to this problem after expending a large amount of CPU time, we see that TS can generate a list of near-optimal solutions within a short amount of time. This is very useful to a researcher searching for novel alternatives to current catalysts, since the list of near-optimal solutions provides options which can be narrowed down by employing other factors such as cost, ease of synthesis, and estimated values of physical and chemical properties not included in the original design.

Fig. 4. Integer solution for case 2 found with OA algorithm.

6. Conclusions

A detailed implementation of the TS algorithm to CAMD problems is presented in this paper. Although other optimization approaches have been applied to CAMD with properties predicted using group contribution techniques, the TS algorithm implemented with novel neighbor-generating operators and combined with property prediction via connectivity index-based correlations provides a powerful technique for generating lists of near-optimal molecular candidates for a given application. In addition, the Tabu lists help TS search the solution space both in a diversified way, to cover the entire search space, and in an intensified manner, to locate the final solution precisely. Moreover, TS is able to locate a large number of near-optimal solutions within a short time, as shown in the two case studies.

Acknowledgements

This project was supported by the National Science Foundation through grant CTS-0224887. B. Lin acknowledges additional support from the Department of Chemical Engineering of Rose-Hulman Institute of Technology.

References

Allen, D., & Shonnard, D. (2002). Green Engineering: Environmentally Conscious Design of Chemical Processes. Upper Saddle River, New Jersey: Prentice-Hall.

Camarda, K. V., & Maranas, C. D. (1999). Optimization in polymer design using connectivity indices. Industrial and Engineering Chemistry Research, 38, 1884–1892.

Churi, N., & Achenie, L. E. K. (1996). A novel mathematical programming model for computer aided molecular design. Industrial and Engineering Chemistry Research, 35(10), 3788–3794.

Constaninou, L., & Gani, R. (1994). New group contribution method for estimating properties of pure components. AIChE Journal, 40(10), 1697–1710.

Constantinou, L., Gani, R., & O’Connell, R. J. P. (1995). Estimation of the acentric factor and the liquid molar volume at 298 K through a new group contribution method. Fluid Phase Equilibria, 103, 11–22.

Friedler, F., Fan, L. T., Katotai, L., & Dallos, A. (1998). A combinatorial approach for generating candidate molecules with desired properties based on group contribution. Computers and Chemical Engineering, 22(6), 809–817.

Gani, R., & Brignole, E. A. (1983). Molecular design of solvents for liquid extraction based on UNIFAC. Fluid Phase Equilibria, 13, 331–340.

Gani, R., Nielsen, B., & Fredenslund, A. (1991). A group contribution approach to computer-aided molecular design. AIChE Journal, 37(9), 1318–1332.

Glover, F., & Laguna, M. (1997). Tabu Search. Boston: Kluwer Academic Publishers.

Gordeeva, E. V., Molchanova, M. S., & Zefirov, N. S. (1990). General methodology and computer program for the exhaustive restoring of chemical structures by molecular connectivity indexes. Solution of the inverse problem in QSAR/QSPR. Tetrahedron Computer Methodology, 3, 389.

Hairston, D. W. (1998). New molecules get on the fast track. Chemical Engineering, 30–33.

Harper, P. M., & Gani, R. (2000). A multi-step and multi-level approach for computer aided molecular design. Computers and Chemical Engineering, 24(2–7), 677–683.

Harper, P. M., Hostrup, M., & Gani, R. (2003). A hybrid CAMD method in computer aided molecular design. In L. E. K. Achenie, R. Gani, & V. Venkatasubramanian (Eds.), Theory and Practice: 12 (pp. 122–169). Amsterdam: Elsevier.

Joback, K. G., & Stephanopoulos, G. (1989). Designing molecules possessing desired physical property values. In J. J. Siirola, I. Grossmann, & Geo. Stephanopoulos (Eds.), Proceedings of the Foundations of Computer-Aided Process Design (FOCAPD) (pp. 363–387).

Johnson, K., Lin, B., Miller, D. C., & Camarda, K. V. (2002). Molecular design of polymer coatings using optimization techniques. In Poster 246b, AIChE Annual Meeting.

Kier, L. B., & Hall, L. H. (1976). Molecular Connectivity in Chemistry and Drug Research. New York: Academic Press.

Kier, L. B., Lowell, H. H., & Frazer, J. F. (1993). Design of molecules from Quantitative Structure-Activity Relationship Model. 1. Information Transfer between Path and Vertex Degree Counts. Journal of Chemical Information and Computer Sciences, 33, 142.

Kim, U.-R., Min, K.-S., Lee, M.-J., Kim, S.-H., & Jeong, B.-J. (1992). The study of physical properties for the organic compounds and their binary mixture according to molecular connectivity method. Journal of the Korean Chemical Society, 36(4), 485–495.

Lin, B., Chavali, S., Camarda, K., & Miller, D. C. (2003). Using Tabu search to solve MINLP problems for PSE. In B. Chen & A. W. Westerberg (Eds.), Proceedings of the Eighth International Symposium on Process Systems Engineering, 2003 (pp. 541–546). Elsevier.

Lin, B., & Miller, D. C. (2004a). Solving heat exchanger network synthesis problems with Tabu search. Computers and Chemical Engineering, 28(8), 1451–1464.

Lin, B., & Miller, D. C. (2004b). Tabu search algorithm for chemical process optimization. Computers and Chemical Engineering, 28(11), 2287–2306.

Maranas, C. D. (1996). Optimal computer-aided molecular design: a polymer design case study. Industrial and Engineering Chemistry Research, 35, 3403–3414.

Marcoulaki, E. C., & Kokossis, A. C. (1998). Molecular design synthesis using stochastic optimization as a tool for scoping and screening. Computers and Chemical Engineering, 22, 11–18.

Marrero, J., & Gani, R. (2001). Group contribution based estimation of pure component properties. Fluid Phase Equilibria, 183, 183–208.

Meniai, A. H., Newsham, D. M. T., & Khalfaoui, B. (1998). Solvent design for liquid extraction using calculated molecular interaction parameters. Transactions of IchemE, 76(A11), 942–950.

Ostrovsky, G. M., Achenie, L. E. K., & Sinha, M. (2003). A reduced dimension branch-and-bound algorithm for molecular design. Computers and Chemical Engineering, 27(4), 551–567.

Raman, V. S., & Maranas, C. D. (1998). Optimization in product design with properties correlated with topological indices. Computers and Chemical Engineering, 22(6), 747–763.

Randic, M. (1975). On characterization of molecular branching. Journal of the American Chemical Society, 97, 6609–6615.

Sahinidis, N. V., & Tawarmalani, M. (2000). Applications of global optimization to process and molecular design. Computers and Chemical Engineering, 24, 2157–2169.

Sahinidis, N. V., Tawarmalani, M., & Yu, M. (2003). Design of alternative refrigerants via global optimization. AIChE Journal, 49(7), 1761–1774.

Siddhaye, S., Camarda, K. V., Topp, E., & Southard, M. (2000). Design of novel pharmaceutical product via combinatorial optimization. Computers and Chemical Engineering, 24, 701–704.

Syracuse Research Corporation. (2000). User’s Guide for the Physprop Database. New York: Syracuse Research Corp., North Syracuse.

Trinajstic, N. (1983). Chemical Graph Theory. CRC Press.

Vaidyanathan, R., & El-Halwagi, M. M. (1996). Computer-Aided Synthesis of Polymers and Blends with Target Properties. Industrial and Engineering Chemistry Research, 35(2), 627–634.

Venkatasubramanian, V., Chan, K., & Caruthers, J. M. (1994). Computer-aided molecular design using genetic algorithm. Computers and Chemical Engineering, 18(9), 833–844.

Wang, S., & Milne, G. W. A. (1994). Graph theory and group contributions in the estimation of boiling points. Journal of Chemical Information and Computer Science, 34, 1242–1250.

Wang, Y., & Achenie, L. E. K. (2002). A hybrid global optimization approach for solvent design. Computers and Chemical Engineering, 26, 1415–1425.