
Computer-Aided Molecular Design Using Tabu Search

B. Lin,1 S. Chavali,2 K. Camarda2 and D. C. Miller1,*

1Department of Chemical Engineering, Rose-Hulman Institute of Technology, 5500 Wabash Avenue, Terre Haute, IN 47803, USA 2Department of Chemical and Petroleum Engineering, University of Kansas, 1530 W. 15th, 4132 Learned Hall, Lawrence, KS 66045, USA

Abstract

A detailed implementation of the Tabu Search (TS) algorithm for computer-aided

molecular design (CAMD) of transition metal catalysts is presented in this paper.

Previous CAMD research has applied deterministic methods or genetic algorithms to the

solution of the optimization problems which arise from the search for a molecule

satisfying a set of property targets. In this work, properties are estimated using

correlations based on connectivity indices, which allows the TS algorithm to use several

novel operators to generate neighbors, such as swap and move, which would have no

effect with a traditional group contribution-based approach. In addition, the formulation

of the neighbor generation process guarantees that molecular valency and connectivity

constraints are met, resulting in a complete molecular structure. Results on two case

studies using TS are compared with a deterministic approach and show that TS is able to

provide a list of good candidate molecules while using a much smaller amount of

computation time.

Introduction

Computer-aided molecular design (CAMD) has the potential to greatly decrease the time

and effort required to develop new molecular entities by reducing the need for costly and

time-consuming trial-and-error experiments. Instead, a slate of candidate molecules

which are predicted to have the desired properties can be used as a starting point for more

focused experimental synthesis. This general method is applicable to such varied

products as catalysts, polymers, solvents, and detergents. Hairston (1998) recently

reported that a computational algorithm has been successfully used to design a new

cancer-fighting pharmaceutical.

The CAMD methodology consists of solving both a forward and a backward

problem (Venkatasubramanian et al., 1994). The forward problem predicts properties

based on molecule structure; the backward step identifies a structure to obtain a molecule

with a given set of target properties. Property prediction is usually based on either group

contribution methods or topological indices. The group contribution approach has been

most widely reported (Gani et al., 1991; Venkatasubramanian et al., 1994; Maranas,

1996; Vaidyanathan & El-Halwagi, 1996; Harper & Gani, 2000; Sahinidis &

Tawarmalani, 2000; Harper et al., 2003; Sahinidis et al., 2003). Constantinou & Gani

(1994) and Constantinou et al. (1995) described a two level group contribution method

which utilizes molecular structure information to estimate the physical and

thermodynamic properties of pure components. A three-level group contribution method

proposed by Marrero & Gani (2001) exhibits improved accuracy and applicability to deal

with bio-chemically and environmentally-related compounds. Because most group

contribution methods cannot adequately account for steric effects (Wang & Milne, 1994),

several researchers have begun using topological indices in an effort to obtain more

accurate property prediction (Kier & Hall, 1976; Raman & Maranas, 1998; Camarda &

Maranas, 1999; Siddhaye et al., 2000; Harper et al., 2003). Whichever approach is

chosen, existing structure-property information is regressed to form an empirical model

relating the molecule structure to the properties of interest. The inverse problem is

essentially a mixed integer optimization problem, whether it is solved explicitly and

deterministically as in Maranas (1996), stochastically via a genetic algorithm-based

approach as in Venkatasubramanian et al. (1994), or via a generate-and-test approach

such as those of Gordeeva et al. (1990), Friedler et al. (1998), Harper & Gani (2000) and

Harper et al. (2003).

Many researchers have reported solutions to the backward problem to determine

the molecular structure. Combinatorial and heuristic-based enumeration approaches have

been reported (Gani & Brignole, 1983; Joback & Stephanopoulos, 1989). Kier et al.

(1993) used a graph reconstruction approach to determine feasible molecular structures

with bounded physical property values, while Vaidyanathan & El-Halwagi (1996)

described an interval analysis approach for the computer-aided synthesis of polymers and

blends. CAMD problems have recently been formulated as mixed-integer linear/nonlinear

programming (MILP/MINLP) problems. Maranas (1996) transformed the nonconvex

MINLP formulation into a tractable MILP model by expressing integer variables as a

linear combination of binary variables and replacing the products of continuous and

binary variables with linear inequality constraints. Using connectivity indices, Camarda

& Maranas (1999) described a convex MINLP representation for solving several polymer

design problems. Churi & Achenie (1996) solved a refrigerant design problem with an

augmented penalty-outer approximation (AP/OA) algorithm. A reduced dimension

branch-and-bound algorithm was presented by Ostrovsky et al. (2003) to design optimal

solvents used as cleaning agents in the printing industry. Sahinidis et al. (2000, 2003)

reported a branch-and-reduce algorithm for identifying a replacement of Freon.

Stochastic optimization approaches have also been developed as alternatives to

rigorous deterministic methods. Venkatasubramanian et al. (1994) employed a genetic

algorithm (GA) for polymer design in which properties are estimated via group

contribution methods. Marcoulaki & Kokossis (1998) described a simulated annealing

(SA) approach for the design of refrigerants and liquid-liquid extraction solvents. Wang

& Achenie (2002) presented a hybrid global optimization approach that combines the OA

and SA algorithms for several solvent design problems.

Tabu search (TS) is a heuristic approach for solving combinatorial optimization

problems by using a guided, local search procedure to explore the entire solution space

without becoming easily trapped in local optima. It differs from other stochastic

optimization techniques by maintaining lists of previous solutions (usually termed

“memory”) that help guide the search process. These lists are useful for CAMD since

they provide a direct method to track near-optimal solutions. The ability of TS to

efficiently find a set of near-optimal solutions is particularly useful since all property

prediction algorithms have limited accuracy and problem formulations normally do not

include all relevant properties. Thus, the determination of the global optimum is not as

critical as finding a set of near-optimal solutions. Deterministic approaches can generate

such a list by multiple MINLP solves and application of integer cuts, but for large

problems this becomes prohibitively computationally expensive. Other stochastic

approaches, as well as generate and test strategies, can also generate such a near-optimal

candidate list. By identifying a range of potential target molecules, TS avoids missing

potentially useful molecules and allows the use of other criteria (such as ease of

synthesis) to perform a final ranking of the candidates.

Transition metal catalysts are an important class of molecule for creating other

molecules efficiently with minimal impact on the environment; however, the catalysts

themselves are often extremely toxic and harmful to the environment. For example, most

propylene oxide is produced using homogeneous catalysts containing molybdenum, which

are effective but also harmful to the environment. Since in many cases a significant

amount of such catalyst is lost to the environment (Allen & Shonnard, 2002), new

materials which show improved catalytic activity and less toxicity are highly desired.

This paper describes a framework for using TS to design transition-metal catalysts.

Property Prediction via Connectivity Indices

In order for a molecular design algorithm to be successful, physical properties of interest,

such as density, toxicity, solubility, or reactivity within a given system must be estimated

with a reasonable accuracy using only a very small computational effort. Once the

properties of a transition metal catalyst can be predicted from structure, the problem of

designing a new catalyst can be formulated as an optimization problem. We employ

connectivity indices, which are numerical values that describe the electronic structure of

a molecule, to characterize the molecule and to correlate its internal structure with

physical properties of interest. These indices connect the molecular structure to the

properties of interest by providing a mathematical description of the molecule as a whole.

Molecular connectivity indices were first introduced by Randić (1975). Kier &

Hall (1976) reported correlations between connectivity indices and many key properties

of organic compounds, such as density, solubility, and toxicity. Furthermore, these

indices can be computed with a minimum of computational effort. Many of the

applications of connectivity indices have been reviewed by Trinajstic (1983). Kier et al.

(1993) and Gordeeva et al. (1990) reported the use of connectivity indices within a

molecular design context, while Raman & Maranas (1998) first incorporated connectivity

indices into an optimization framework. Camarda and Maranas (1999) used connectivity

indices as property descriptors to design polymers that have pre-specified property

values. Siddhaye et al. (2000) employed connectivity indices to predict the physical

properties for the design of novel pharmaceutical products via combinatorial

optimization.

In earlier molecular design work, group contribution methods were used to

estimate the values of physical properties, as in Gani et al. (1991), Venkatasubramanian

et al. (1994), and Maranas (1996). The works of Constantinou & Gani (1994) and

Marrero & Gani (2001) extend the basic group contribution approach by considering

combinations of functional groups and thus take into account second-order effects when

used to predict physical properties. Connectivity indices, however, take into account the

entire molecular structure of a compound. By using higher-order connectivity indices,

third-order and higher structural effects can be included in structure property relations.

Harper & Gani (2000) and Harper et al. (2003) have combined group contribution with

connectivity indices to include these higher-order effects. Raman & Maranas (1998) and

Harper et al. (2003) have shown that these higher order effects are able to give more

accurate property descriptions. Furthermore, when a molecular design problem is solved

using these indices, a complete molecular structure is obtained, and no secondary

problem must be solved to recover the final molecular structure. Harper & Gani (2000)

and Meniai et al. (1998) have also reported the design of complete molecular structures.

Connectivity indices also provide a method to compute the properties of mixtures, since

the connectivity indices of individual compounds can be combined to estimate mixture

properties in certain cases (Kim et al., 1992; Johnson et al., 2002).

Computational property estimation algorithms first decompose a molecule into

smaller units. Then, a set of basic groups that can potentially be part of the candidate

molecules are defined. A basic group is defined as a single non-hydrogen atom in a given

valence state, bonded to some number of hydrogen atoms. Atomic connectivity indices

are defined over each basic group.

Basic groups of the molecule CHCl2F are shown in Table 1. The $\delta$ values are the simple atomic connectivity indices that refer to the number of bonds which can be formed with other groups. The $\delta^v$ values are atomic valence connectivity indices that describe the electronic structure of each basic group, including lone-pair electrons and electronegativity. For transition metals, which can assume multiple valence states, the definition of $\delta^v$ is based on the number of electrons participating in the bonding, instead of those present in the outer shell. Once the atomic connectivity indices of basic groups are defined, the molecular connectivity indices can be computed for the entire molecule.

Molecule structure is expressed with a hydrogen-suppressed graph (as shown in

Figure 1). The zeroth-order molecular connectivity indices ${}^0\chi$ and ${}^0\chi^v$ are sums over each

basic group (the sum over all vertices), which describe the identity of the groups in a

given molecule (see Table 2).

Higher order connectivity indices can also be defined and have been used to give

a more precise description of molecular structure (Kier & Hall, 1976). In this work, we

have used the second-order connectivity indices, ${}^2\chi$ and ${}^2\chi^v$, to give added accuracy to

correlations. These two indices are sums over each of the triplets in the molecule, that is,

over each possible combination of three bonded groups (see Figure 1).
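As an illustration, the indices can be computed directly from a hydrogen-suppressed graph. The sketch below assumes the standard Kier & Hall definitions, which this section does not spell out: each vertex, bond, or bonded triplet contributes the reciprocal square root of the product of its atomic indices. The function name `connectivity_indices` and the vertex labels are hypothetical, and the atomic indices used for CHCl2F are simply the vertex degrees of its graph, not values taken from Table 1.

```python
from itertools import combinations
from math import sqrt


def connectivity_indices(bonds, delta):
    """Return (chi0, chi1, chi2) for a hydrogen-suppressed graph.

    bonds: list of (group, group) pairs; delta: atomic index per group.
    Assumes the standard Kier & Hall reciprocal-square-root definitions.
    """
    neighbors = {g: set() for g in delta}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Zeroth order: one term per basic group (vertex).
    chi0 = sum(1 / sqrt(d) for d in delta.values())
    # First order: one term per bond (edge).
    chi1 = sum(1 / sqrt(delta[a] * delta[b]) for a, b in bonds)
    # Second order: one term per triplet a-mid-c of bonded groups.
    chi2 = sum(
        1 / sqrt(delta[a] * delta[mid] * delta[c])
        for mid, nbrs in neighbors.items()
        for a, c in combinations(sorted(nbrs), 2)
    )
    return chi0, chi1, chi2


# CHCl2F: a CH group bonded to two Cl atoms and one F atom.
bonds = [("C", "Cl1"), ("C", "Cl2"), ("C", "F")]
delta = {"C": 3, "Cl1": 1, "Cl2": 1, "F": 1}
chi0, chi1, chi2 = connectivity_indices(bonds, delta)
```

Passing valence indices in place of the simple ones would give the valence connectivity indices in the same way.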

Once the equations defining the (molecular) connectivity indices are in place,

these indices can be used in empirical correlations to predict the physical properties of

novel transition-metal catalysts. Table 3 shows the basic groups employed in the design

of a molybdenum catalyst for an epoxidation reaction. (To limit the search space, we set

an upper bound on the maximum number of groups in a molecule.) The correlation

derived in this work for density is:

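[Equation (1): density correlation expressed in terms of the connectivity indices ${}^0\chi$, ${}^0\chi^v$, ${}^1\chi$, ${}^1\chi^v$, ${}^2\chi$, and ${}^2\chi^v$, including squared terms. Coefficient magnitudes as they appear in the source, in order: 55351, 75800, 7663, 40901, 1784, 72046, 607, 24695, 649, 12271, 65.4, 1793, 8.9, 72323. The signs and the assignment of coefficients to terms are not recoverable.] (1)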

This correlation was developed from regressing 23 Mo-centered complexes and resulted

in a correlation coefficient of 0.962 and a sum of squared error of 0.944. Using this

correlation, an optimization problem has been formulated whose optimal solutions are

molecules which most closely match a set of target property values.

Problem Formulation

When zeroth, first and second order connectivity indices are employed for

property estimation, a molecule is represented mathematically using sets of binary

variables. First, a vector of binary variables, $z_i$, is defined. An element of this vector

equals one if the ith group exists in the molecule; otherwise, it equals zero. Second, a

partitioned adjacency matrix with elements $f_{i,j,k}$ is determined. When basic groups i and

j are bonded with a kth-multiplicity bond, $f_{i,j,k} = 1$; otherwise, $f_{i,j,k} = 0$. These sets of

variables provide sufficient information to compute the connectivity indices and, thus,

estimate molecular properties. Along with these definitions, property correlations using

the connectivity indices are included in the overall formulation.

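The objective is to determine the molecule that minimizes the scaled difference between

the estimated and target property values. This can be written as:

$$\min\ \mathrm{Obj} = \sum_{m \in R} \frac{1}{P_m^{\mathrm{scale}}} \left| P_m - P_m^{\mathrm{target}} \right| \qquad (2)$$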

where $R$ is the set of all targeted properties, $P_m$ is the estimated value of property $m$,

$P_m^{\mathrm{scale}}$ is a scale factor used to weight the importance of one property relative to another,

and $P_m^{\mathrm{target}}$ is the target value for property $m$.
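A minimal sketch of how such an objective could be evaluated is given below. It assumes the objective is a sum of scaled absolute deviations over the targeted properties; the function name `objective` and the dictionary layout are hypothetical, and using the target value itself as the scale factor is an assumption made only for this example.

```python
def objective(estimated, target, scale):
    """Sum of scaled absolute deviations over all targeted properties."""
    return sum(abs(estimated[m] - target[m]) / scale[m] for m in target)


# Single-property example using the density figures quoted in Case 1.
# Setting the scale factor equal to the target is an assumption.
obj = objective(
    estimated={"density": 4172.46},
    target={"density": 4172.0},
    scale={"density": 4172.0},
)
```

Under these assumptions the value is roughly 1.1 × 10⁻⁴, the same order of magnitude as the objective values reported for Case 1.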

Structural constraints are added to ensure the obtained molecules are fully

connected and satisfy valency by being connected with the appropriate types of bonds.

Other constraints include bounds on the variables and property values. The resulting

formulation is a large, nonconvex MINLP model for CAMD.

Table 3 shows the set of 8 basic group types used for designing a molybdenum

catalyst for an epoxidation reaction. The total number of available groups is

$N = \sum_i N_{\mathrm{Max},i} = 22$. Since only single bonds are involved, the size of the adjacency matrix $f$

is $22 \times 22 \times 1 = 484$ elements. Each element, $f_{i,j,k}$, can be either 0 or 1, and the total

number of independent variables is $22 + \sum_{i=1}^{22-1} i = 253$. In order to apply TS effectively,

the original MINLP model of Siddhaye et al. (2000) has been altered to handle basic

groups and the connectivity between them rather than dealing with the elements of the

adjacency matrix directly.
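The counts quoted above (484 matrix elements and 253 independent variables) can be checked with a few lines of arithmetic. The interpretation in the comments, one existence variable per group plus one variable per unordered pair of distinct groups, is an assumed reading of the count rather than something the text states explicitly.

```python
n_groups = 22      # total number of available groups quoted in the text
n_bond_types = 1   # only single bonds are involved in this case

# Full adjacency matrix: one entry per ordered pair and bond type.
adjacency_elements = n_groups * n_groups * n_bond_types

# Assumed reading: the matrix is symmetric, so only unordered pairs of
# distinct groups are independent, plus one existence variable per group.
pair_variables = n_groups * (n_groups - 1) // 2
independent_variables = n_groups + pair_variables
```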

TS Implementation

TS begins by determining an initial solution. Additional solutions (termed

neighbors) are generated by modifying the existing solution through a sequence of

moves. The best new neighbor (x’) is used as the starting point for the next iteration

unless it is on a tabu list. Thus, even if no neighbor solutions are better than the initial

solution, one of these is still chosen as the starting point for the next iteration. A record of

the best solution ever found (x*) is separately maintained. In addition, the tabu lists

provide an adaptive memory that guides the search by taking advantage of historical

information. This memory enables TS to make strategic choices and achieve responsive

exploration.
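The basic loop can be sketched on a toy problem. The code below is a generic, textbook-style tabu search and not the implementation used in this work: it marks every visited solution as tabu (the scheme described later adds only non-improving best neighbors), and the names `tabu_search`, `COST`, and `neighbors` are hypothetical.

```python
from collections import deque


def tabu_search(x0, objective, neighbors, iterations=50, tenure=5):
    """Generic tabu search with a recency list and improved-best aspiration."""
    current = x0
    best, best_val = x0, objective(x0)
    tabu = deque(maxlen=tenure)  # oldest entries expire automatically
    for _ in range(iterations):
        candidates = []
        for n in neighbors(current):
            val = objective(n)
            # A tabu neighbor is admitted only if it beats the best ever found.
            if n in tabu and val >= best_val:
                continue
            candidates.append((val, n))
        if not candidates:
            break
        # Move to the best admissible neighbor even if it is worse.
        val, nxt = min(candidates)
        tabu.append(current)
        current = nxt
        if val < best_val:
            best, best_val = nxt, val
    return best, best_val


# Toy landscape with a local minimum at x = 1 and the global minimum at x = 7.
COST = [9, 4, 6, 8, 7, 3, 1, 0, 2, 5]


def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(COST)]


best, best_val = tabu_search(0, lambda x: COST[x], neighbors)
```

Because the move to the best neighbor is made even when it is worse than the current solution, the search climbs out of the local minimum at x = 1 and reaches x = 7.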

The standard TS algorithm (Glover & Laguna, 1997; Lin et al., 2003; Lin &

Miller, 2004b) is adapted to represent the molecule efficiently. Each solution is a

molecule consisting of a series of fully connected groups. Therefore, the initial solution is

constructed by connecting basic groups together. In generating neighbor solutions,

operators are also designed to handle basic groups. The procedure of building an initial

solution is described using the basic groups listed in Table 3. The progression of the

building process is shown in Table 4. In the CAMD framework, a mainchain is defined as

the list within a molecule with the largest number of groups. It is determined during the

process of constructing a molecule. The length of sidechains (list of groups connecting to

the mainchain) is constrained to be less than that of the mainchain. For catalyst design

test cases, the number of Mo groups is constrained to be either 1 or 2.

Building an initial solution

Step 1. One of the basic groups is selected as a root group. For simplicity, a group with

only one bond, –NH2, is chosen.

Step 2. Another group is added to the root group. If the second group has only one bond,

e.g., –OH, there are no empty bonds in the molecule. Since the molecule is fully

connected, the initial solution, hydroxylamine (H2N–OH), would be obtained;

however, this violates the constraint requiring at least one Mo atom. Thus, another

group must be chosen. In this example, this step adds basic group #1 from Table

3, resulting in the molecule fragment for Step 2 shown in Table 4.

Step 3. Now, the molecule consists of two groups. Additional groups are needed to

connect to the empty bonds. Since the length of the mainchain is 2, the length of

the sidechains can be either 1 or 2. When the length is 1, another group with a

single bond, –OH, is attached as shown in Table 4. As sidechains are added to the

molecule, the mainchain identity and length are updated so that it is always the

longest chain in the molecule.

Step 4. The length of the second sidechain is 2, resulting in a branch that consists of two

groups. Thus, the groups –O– and –Cl are added.

Step 5. Another sidechain of length 1, –CH3, is added. One free bond still remains in

the molecule.

Step 6. If a group with only one bond is selected, e.g., –OH, the molecule is fully

connected, and the initial solution is obtained as shown in Table 4. If the selected

group has more than one bond, steps 2-5 are repeated until the molecule is fully

connected.
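The steps above can be sketched as a loop that keeps attaching groups until no free bonds remain. This is a simplified illustration under stated assumptions: the group set and bond counts in `GROUPS` are hypothetical stand-ins for Table 3, the mainchain and sidechain length rules are omitted, and the Mo group is attached to the root at the outset so that the constraint of at least one Mo atom is always met.

```python
import random

# Hypothetical group set: name -> number of single bonds the group can form.
GROUPS = {"Mo": 5, "O": 2, "CH2": 2, "NH2": 1, "OH": 1, "Cl": 1, "CH3": 1}


def build_molecule(rng, max_groups=10):
    """Return (groups, bonds) for a fully connected molecule with one Mo."""
    groups = ["NH2", "Mo"]  # root group plus the required metal centre
    bonds = [(0, 1)]
    # Each entry is the index of a group that still has one free bond.
    open_sites = [1] * (GROUPS["Mo"] - 1)
    while open_sites:
        parent = open_sites.pop()
        # Groups still affordable once every other free bond gets a cap.
        room = max_groups - len(groups) - len(open_sites)
        # Near the size limit, only single-bond groups may be added.
        pool = [g for g, v in GROUPS.items()
                if g != "Mo" and (v == 1 or room > 1)]
        name = rng.choice(pool)
        groups.append(name)
        child = len(groups) - 1
        bonds.append((parent, child))
        open_sites.extend([child] * (GROUPS[name] - 1))
    return groups, bonds


groups, bonds = build_molecule(random.Random(0))
```

Because a group is only ever attached to a free bond of an existing group, the result is connected by construction, and the loop ends only when every bond is filled, so valency is satisfied as well.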

A set of candidate solutions will be generated and evaluated at each iteration. These

candidates (or neighbor solutions) will be constructed by modifying the current molecule.

Neighbor generation operators

Two sets of operators are defined: local and global. Local operators maintain the

main backbone of the current molecule, and are implemented for the purpose of localized

search; global operators generate neighbor solutions that are significantly different from

the current one in order to diversify the solutions at each iteration. There are six local

operators:

1. Replace

This operator replaces a group in the current molecule with another available basic

group. If –OH is chosen, it can be replaced by –Cl, –OH, or –CH3. If the group –O–

is replaced by –CH2–, a new molecule will be obtained (see Table 5).

2. Insert

A basic group that is available will be inserted in front of the selected group in the

current molecule. If the group –CH3 is chosen, a group with two single

bonds can be inserted on the sidechain. The new molecule obtained by inserting –CH2– is

shown in Table 5.

3. Delete

This operator deletes the selected group in the current solution. The

operator cannot be applied to a group with only a single bond, since doing so would result in

an infeasible molecule. In this example, only the molybdenum or the oxygen group

can be deleted. The molecule in Table 5 shows the result of the deletion of –O–.

4. Swap

Two groups of the current molecule are exchanged by applying the swap operator.

If the group –CH3 is selected, another group with only a single bond will be

chosen. The molecule after swapping –Cl with –CH3 is shown in Table 5. Since

molecules obtained from swapping two groups connecting to the same “parent

group” are equivalent, groups to be swapped are required to be different and not

connected to the same “parent”.

5. Move

This operator moves the selected group within the existing molecule. Suppose the

group –O– is selected to be moved after the group –NH2. A new molecule is

obtained as shown in Table 5.

6. Combination

This operator combines the 5 operators above to form a new one. A random number $n$

between 1 and 5 is first generated; $n$ of the 5 operators are then applied sequentially to

the current molecule to generate a neighbor solution. For example, if $n = 3$, three

operators such as Move, Insert, and Swap are applied.
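Two of the local operators, Replace and Swap, can be sketched on a simple tree representation in which each group stores the index of its parent group. The representation, the `VALENCE` table, and the function names are hypothetical; the requirement that replaced or swapped groups have the same number of bonds is an added assumption that keeps valency satisfied without rebuilding the molecule.

```python
import random

# Hypothetical bond counts for the groups used in the example.
VALENCE = {"Mo": 5, "O": 2, "CH2": 2, "NH2": 1, "OH": 1, "Cl": 1, "CH3": 1}


def replace(groups, index, rng):
    """Replace one group by a different group with the same bond count."""
    old = groups[index]
    pool = [g for g, v in VALENCE.items() if v == VALENCE[old] and g != old]
    if not pool:
        return None
    new = list(groups)
    new[index] = rng.choice(pool)
    return new


def swap(groups, parent, i, j):
    """Exchange groups i and j, or return None if the swap is pointless."""
    # Identical groups, or groups on the same parent, give the same molecule.
    if groups[i] == groups[j] or parent[i] == parent[j]:
        return None
    if VALENCE[groups[i]] != VALENCE[groups[j]]:
        return None
    new = list(groups)
    new[i], new[j] = new[j], new[i]
    return new


# Mo centre carrying -NH2, -OH, -O-Cl, -CH3 and -OH (parent -1 marks the root).
groups = ["Mo", "NH2", "OH", "O", "Cl", "CH3", "OH"]
parent = [-1, 0, 0, 0, 3, 0, 0]

swapped = swap(groups, parent, 4, 5)       # -Cl and -CH3 trade places
rejected = swap(groups, parent, 1, 5)      # same parent, so no new molecule
replaced = replace(groups, 3, random.Random(0))  # -O- becomes -CH2-
```

Swapping changes which groups are adjacent, so the first- and second-order connectivity indices change even though the group counts do not, which is why such a move is meaningful here.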

The following three global operators help TS perform a diversified search to investigate

the whole solution space sufficiently:

7. SideChain_Rebuild

If the selected group is located on a sidechain of the molecule, the sidechain will be

replaced by a newly-generated one.

8. MainChain_Rebuild

If a mainchain group is selected, all groups that are located after it will be deleted

and the rest of the molecule is reconstructed.

9. Total_Rebuild

If the first or the last group of the molecule is selected, the whole molecule will be

discarded and replaced by a new molecule that is built following the steps of

generating an initial solution.

The two sets of operators are selected based on the stage of locating a solution. Global

operators are more frequently selected at the starting stage to favor a diversified search

and local operators are selected more frequently at the end of the search process to locate

the final solution precisely.

After obtaining a molecule, all elements of the adjacency matrix can be determined

according to the connectivity among the groups. Then, with the provided simple atomic

indices and valence indices, the molecular connectivity indices can be calculated, and the

properties of this molecule are estimated with the property-structure correlations. At each

iteration, certain solutions are classified as tabu (forbidden) and added to tabu lists. At the

same time, the tabu property of other solutions will expire, and they will be removed

from the tabu lists. In this way, tabu lists are updated continuously and adapt to the

current state of the search. The use of adaptive memory enables TS to exhibit learning

and creates a more flexible and effective search.

Tabu lists

In the TS algorithm, each molecule consists of a set of fully connected groups,

which are classified into mainchain and sidechain groups. Figure 2 shows a solution

located by TS based on basic groups listed in Table 3; the mainchain is identified as:

Cl–Mo–CH2–Cl

with 3 sidechains that are all of length 1, the group –OH. If both the mainchain and

sidechain groups, as well as connectivity relations are included in tabu lists, the update

and maintenance of the lists will be time-consuming. Therefore, both recency-based and

frequency-based tabu lists only keep track of mainchain groups.

The recency-based tabu list records recently visited solutions, and is called short-

term memory. At each iteration, if the best neighbor is not better than the current

solution, it is classified as tabu and added to the recency-based tabu list. The tabu

property remains active throughout the tabu tenure, which is empirically set equal to the

total number of basic groups available for building the molecule (22 for this case).

A segment of the recency-based tabu list is shown in Table 6. The first column

shows the number of iterations during which a solution will be kept on the recency-based

tabu list. The first solution on the recency-based tabu list, with an objective function

value of 0.0036, has the following mainchain:

MoNH2 CH2 CH3

It will be released from the list after 22 iterations, unless a best neighbor with the same

mainchain is located within this period. In this case, the tabu property is overridden by

the improved-best aspiration criterion (Glover & Laguna, 1997).
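A recency-based list keyed on the mainchain can be sketched as a dictionary of remaining tenures. The class and method names are hypothetical, and treating a chain and its reverse as the same key is an added assumption, not something stated in the text.

```python
class RecencyTabuList:
    """Short-term memory: mainchain -> iterations left on the list."""

    def __init__(self, tenure):
        self.tenure = tenure
        self.remaining = {}

    @staticmethod
    def _key(mainchain):
        # Assumption: a chain read in either direction is the same chain.
        chain = tuple(mainchain)
        return min(chain, chain[::-1])

    def add(self, mainchain):
        self.remaining[self._key(mainchain)] = self.tenure

    def is_tabu(self, mainchain):
        return self._key(mainchain) in self.remaining

    def tick(self):
        """Advance one iteration and release entries whose tenure has ended."""
        self.remaining = {
            key: left - 1 for key, left in self.remaining.items() if left > 1
        }


tabu = RecencyTabuList(tenure=22)
tabu.add(["Cl", "Mo", "CH2", "Cl"])
for _ in range(21):
    tabu.tick()
still_tabu = tabu.is_tabu(["Cl", "CH2", "Mo", "Cl"])  # reversed chain
tabu.tick()
released = not tabu.is_tabu(["Cl", "Mo", "CH2", "Cl"])
```

With a tenure of 22, the entry is still on the list after 21 iterations and is released on the 22nd.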

The frequency-based tabu list provides long-term memory and keeps track of the

solutions that have been visited most frequently. The frequency tabu property is denoted

with an index. When a previous solution is revisited, its frequency index will be

incremented; otherwise, the frequency index decreases. This allows TS to determine

whether the search process has become trapped in a specific area for a long time.

Table 7 shows six solutions on the frequency-based tabu list. The 1st solution with

an objective function value of 0.4848 has the lowest frequency index. This solution will

be replaced by new solutions if the frequency-based tabu list is full. The 4th solution with

the following mainchain structure:

MoOH O CH3

has been visited most frequently, since its frequency index (12.75) is much larger than

that of the other solutions.
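The frequency-based bookkeeping can be sketched in the same spirit. The increment and decay amounts (`gain` and `decay`) are hypothetical, since the text does not give them; only the behaviour is taken from the description above: the index rises on a revisit, falls otherwise, and the entry with the lowest index is replaced when the list is full.

```python
class FrequencyTabuList:
    """Long-term memory: mainchain -> frequency index."""

    def __init__(self, capacity, gain=1.0, decay=0.25):
        self.capacity = capacity
        self.gain = gain      # hypothetical increment on a visit
        self.decay = decay    # hypothetical decrement otherwise
        self.index = {}

    def visit(self, mainchain):
        key = tuple(mainchain)
        # Every other recorded solution loses a little weight.
        for other in self.index:
            if other != key:
                self.index[other] = max(0.0, self.index[other] - self.decay)
        if key in self.index:
            self.index[key] += self.gain
            return
        if len(self.index) >= self.capacity:
            # The least frequently visited entry makes room for the new one.
            lowest = min(self.index, key=self.index.get)
            del self.index[lowest]
        self.index[key] = self.gain

    def max_index(self):
        return max(self.index.values(), default=0.0)


freq = FrequencyTabuList(capacity=2)
for chain in (["A"], ["A"], ["B"], ["C"]):
    freq.visit(chain)
```

Comparing `max_index()` against a threshold is one way the search could detect that it has lingered around a single solution for too long.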

Because the tabu lists track the occurrence of mainchain groups, the danger of auto

isomorphism negatively impacting the algorithm is reduced. Even without considering

auto isomorphism, it is possible that some neighbor solutions will be the same as others.

To avoid becoming stuck with a best solution always being a rearranged version of the

same molecule, the tabu list will recognize the mainchain and force the selection of a new

starting point for the next generation of neighbors.

Intensification and diversification

Based on knowledge of the current search status as provided through the tabu

lists, intensification and diversification strategies are used to control the search area.

Intensification is carried out by generating neighbor solutions employing the local

operators. Global operators increase diversification and are implemented based on

frequency-based tabu lists. If the maximum frequency index of solutions on the

frequency-based tabu list is larger than a predefined threshold (in this case, the threshold

is empirically set to 22), TS is deemed to have been searching around a specific solution

too often. Thus, the current solution will be replaced by a new randomly generated

solution by applying the global operators, and the search process will be restarted. This

diversification strategy is similar to the restart mechanism used in other stochastic

optimization approaches; however, the restart is not purely random. Instead, it is guided

by historical information.

In some cases, TS cannot locate improved solutions after the restart operation and

additional procedures will be required. For example, suppose the current best solution has an

objective function value of 0.0094 when TS initiates a restart. If the best neighbor found after

several iterations has an objective function value no less than 0.0094, another intensification

strategy pulls the search back to the formerly promising area by assigning the current solution

to the best neighbor recorded before the restart operation.

Aspiration Criterion

An aspiration criterion based on a sigmoid function is employed to override the

tabu property in certain cases and helps to maintain an appropriate balance between

diversification and intensification (Lin & Miller, 2004a,b). Since an intensified search is

favored at the end, and a broadly diversified search, at the beginning, the aspiration

criterion allows the tabu property to be overridden more towards the end of the search.

This helps to ensure that the tabu lists properly encourage a broad search early on.
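The exact form of the sigmoid is given in the cited references, not here. Purely as an illustration, the sketch below assumes a logistic function of search progress, so that overriding the tabu property is unlikely early on and likely near the end; the function name, the `steepness` parameter, and the midpoint at half of the run are all hypothetical.

```python
import math


def override_probability(iteration, total_iterations, steepness=10.0):
    """Assumed logistic schedule for overriding the tabu property."""
    progress = iteration / total_iterations
    return 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))


# With the 200 iterations used in the case studies:
early = override_probability(10, 200)
late = override_probability(190, 200)
```

A tabu neighbor would then be accepted when a uniform random draw falls below this probability, which keeps the tabu lists strict during the diversified early phase.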

Constraint Handling

In this application, constraints are handled in a way that guarantees the feasibility

of the molecule. Since TS operates on basic groups directly, both valence and

connectivity constraints are handled during the process of neighbor generation as

previously described. For example, both local and global operators ensure that groups are

connected with the proper type of bond (single, double, or triple). In addition, the number of empty bonds

is checked to guarantee that valency will be satisfied in the current solution. Groups that

would cause a violation of either connectivity or valency are not allowed. Otherwise all

groups can interconnect.

Case Studies

The effectiveness of the TS algorithm for CAMD is demonstrated with the

following two case studies. The parameters for each case are as follows:

1. The number of neighbor solutions generated at each iteration is equal to the

product of the number of basic groups and the maximal number of groups in a

molecule;

2. The number of iterations is 200;

3. The length of the tabu list is defined to be the same as the maximal number of

groups in a molecule.

The TS algorithm is compiled using the gcc compiler on a PC with a 1.0 GHz Pentium III CPU

and 1024 MB of memory running Red Hat Linux 7.1.

Case 1

The first case is the design of a molybdenum catalyst for an epoxidation reaction. Eight

basic groups (see Table 3) are used to define the search space. The purpose of this case is

to evaluate the effectiveness of the formulation and to compare TS with a deterministic

solution method. Thus, only a single target property, density, is used. While density is an

important property of a homogeneous catalyst, it does not define the catalytic activity or

any other critical properties and, thus, cannot alone define an effective catalyst. However,

density is closely related to solubility for these systems, and solubility defines the amount

of potential uptake by humans. Therefore, a combination of density and toxicity would be

needed to assess exposure risk. The target density value is 4172 kg/m3, which is the

density of a commonly used homogeneous transition-metal catalyst. The correlation for

this property is given by eq. (1).

The number of neighbor solutions is 8 * 22 = 176. The 5 best molecules and the

probability that a molecule is found are obtained from 100 trials. Suboptimal solutions

are either the final output of a trial or an intermediate result of a certain iteration. If the

suboptimal solution is the final result of a trial, the probability is incremented by 1%.

Since intermediate solutions of some trials may be even better than the final result of

other trials, the probability of such suboptimal solutions is denoted as < 1% (see Table 8)

to indicate that it was located, but was never a final solution. The best solution, with the

objective function value of 0.000111, corresponds to a density of 4172.46 kg/m3 and is

found only once in 100 trials. The third-best solution (No. 2 in Table 8, 0.000238) has 7 basic groups and is found 80% of

the time. The corresponding density value of 4172.99 kg/m3 is very close to the best

solution. Since the difference between the best solution and the 5th best solution is less

than 0.2%, this shows that TS can successfully determine several promising catalyst

molecules for further experimental verification.
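The reported objective values are consistent with a scaled absolute deviation from the target density, using the target itself as the scale factor. This is an assumption inferred from the numbers quoted above, not a statement of the exact objective in the paper. A minimal sketch with a hypothetical function name:

```python
TARGET_DENSITY = 4172.0  # kg/m3, target value stated in the text


def scaled_deviation(value, target=TARGET_DENSITY, scale=None):
    """Absolute deviation from the target, divided by a scale factor.

    Assumption: the scale factor defaults to the target value itself.
    """
    scale = target if scale is None else scale
    return abs(value - target) / scale


best = scaled_deviation(4172.46)   # density reported for the best solution
third = scaled_deviation(4172.99)  # density reported for the 80% solution
print(f"{best:.6f} {third:.6f}")
```

Both values agree with the objective values in Table 8 to within the rounding of the reported densities.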

It is especially useful to identify and record near-optimal solutions since the

density correlation has an error of approximately 4%. Thus, near-optimal solutions are

almost as likely to be strong candidates for synthesis as the optimal one. Furthermore,

many other factors, such as ease of synthesis, have not been taken into account within the

optimization formulation. Thus, a user would like to have multiple options to choose

from.

TS provided the results in Table 8 in 90 seconds on a Pentium III PC. In

comparison, using Outer Approximation via the DICOPT solver in GAMS, only an

integer feasible solution was found after 20 minutes on a Sun Ultra 10 workstation. The

structure (see Figure 3) resulted in an objective function value of 5.67.

Case 2

The second case study involves 10 basic groups (see Table 9). The properties to be

optimized are electronegativity, oxidation state and toxicity. The electronegativity is

calculated as follows:

$$Elec = \sum_{i=1}^{N_{Mo}} \sum_{j>i}^{N_{Tot}} \sum_{l=1}^{N_{Bond}} f_{i,j,l} \cdot \left| Elec(i) - Elec(j) \right| \qquad (3)$$

where $Elec(i)$ is the electronegativity of group i, $f_{i,j,l}$ is an element of the adjacency matrix,

$N_{Mo}$ is the number of rows allotted to all the molybdenum groups, $N_{Tot}$ is the total

number of basic groups (20 in this example), and $N_{Bond}$ is the number of bond types of the basic groups

(2 in this example, single and double). The lower bound is 0 and the upper bound is 3.84

eV. Oxidation state is the sum of the oxidation states of all the molybdenum groups:

$$Oxid = \sum_{i=1}^{N_{Mo}} z_i \cdot Oxid(i) \qquad (4)$$

where $Oxid(i)$ is the oxidation state of the ith group and $z_i$ is an element of the existence

vector. The target value of the oxidation state is the set {8, 7, 6}. Toxicity is

determined with the following correlation for the LC50:

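[Equation (5): log(LC50) expressed as the constant 27.8 plus terms in the zeroth-, first- and second-order simple and valence molecular connectivity indices. The coefficients, in order of appearance, are 5.49, 10.6, 6.65, 2.93, 4.40, 0.368, 0.710, 5.59, 0.0761, 2.10, 0.516 and 0.0295. The signs and the index attached to each coefficient did not survive extraction.]

(5)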

The LC50 correlation was developed using 34 data points obtained from Syracuse Research

Corporation (2000). The resulting correlation has a correlation coefficient of 0.953 and a

sum of squared errors of 2.083.

In this example, the objective function is the weighted summation of these

properties:

$$Obj = 0.6 \, LC_{50} + 0.3 \, Elec + 0.1 \, Oxid \qquad (6)$$

The weights on the properties are freely adjustable and could be based on the preference

of the designer. For this case study, these weights are simply examples and do not

directly impact the conclusions of the paper.
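Equations (4) and (6) can be sketched in a few lines. The sketch assumes six molybdenum rows (each of the three Mo groups of Table 9 may appear at most twice) ordered by group, and it treats the three inputs of the objective as already-scaled property terms, since the scaling is not stated in this section. All names are hypothetical.

```python
# Oxidation state of each molybdenum row (Table 9: 2, 3 and 4, NMax = 2 each).
# The row ordering is an assumption made for this illustration.
MO_ROW_OXID = [2, 2, 3, 3, 4, 4]
OXID_TARGETS = {6, 7, 8}  # target set given in the text


def oxidation_state(z, row_oxid=MO_ROW_OXID):
    """Eq. (4): sum of z_i * Oxid(i) over the molybdenum rows."""
    return sum(zi * ox for zi, ox in zip(z, row_oxid))


def weighted_objective(lc50_term, elec_term, oxid_term,
                       weights=(0.6, 0.3, 0.1)):
    """Eq. (6): weighted sum of the three (assumed pre-scaled) property terms."""
    w_lc50, w_elec, w_oxid = weights
    return w_lc50 * lc50_term + w_elec * elec_term + w_oxid * oxid_term


# One Mo group with oxidation state 2 and one with oxidation state 4 exist.
z = [1, 0, 0, 0, 1, 0]
total = oxidation_state(z)
print(total, total in OXID_TARGETS)
print(weighted_objective(1.0, 2.0, 3.0))
```

Passing a different `weights` tuple reflects the statement that the weights are freely adjustable by the designer.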

The number of neighbor solutions is 10 * 20 = 200. The length of the tabu list is

20. The 5 best molecules obtained from 100 TS trials are shown in Table 10. The optimal

solution with an objective function value of 1.286401 is obtained in 40 trials and the second-best

solution is located the remaining 60% of the time. The difference between the

optimal solution and the 10th sub-optimal solution (shown below):

Cl ONH ClMo MoO CH2NH

is smaller than 1%. (The value of its objective function is 1.297251.) In addition, more

than 60 near-optimal solutions are obtained for this case study, which shows that TS can

locate a large number of near-optimal solutions for further experimental verification.
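As a check on the stated gap, the relative difference between the two objective values quoted above works out as follows:

$$\frac{1.297251 - 1.286401}{1.286401} = \frac{0.010850}{1.286401} \approx 0.0084 = 0.84\% < 1\%$$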

In comparison, employing Outer Approximation via the DICOPT solver in

GAMS, only an integer feasible solution was found after 20 minutes of CPU time on a

Sun Ultra 10 workstation. The structure (see Figure 4) resulted in an objective function of

5.55. While the deterministic algorithm may be able to find and guarantee the global

optimum to this problem after expending a large amount of CPU time, we see that TS can

generate a list of near-optimal solutions within a short amount of time. This is very useful

to a researcher searching for novel alternatives to current catalysts, since the list of near-

optimal solutions provides options which can be narrowed down by employing other

factors such as cost, ease of synthesis, and estimated values of physical and chemical

properties not included in the original design.

Conclusions

A detailed implementation of the TS algorithm for CAMD problems is presented

in this paper. Although other optimization approaches have been applied to CAMD with

properties predicted using group contribution techniques, the TS algorithm implemented

with novel neighbor-generating operators and combined with property prediction via

connectivity index-based correlations provides a powerful technique for generating lists

of near-optimal molecular candidates for a given application. In addition, the tabu lists

help TS search the solution space both in a diversified way, to cover the entire search

space, and in an intensified manner, to locate the final solution precisely. Moreover, TS is

able to locate a large number of near-optimal solutions within a short time as shown in

the two case studies.

Acknowledgements

This project was supported by the National Science Foundation through grant CTS-

0224887. B. Lin acknowledges additional support from the Department of Chemical

Engineering of Rose-Hulman Institute of Technology.

Nomenclature

$\delta$ simple atomic connectivity indices that refer to the number of bonds which can be formed with other groups

$\delta^v$ atomic valence connectivity indices that describe the electronic structure of each basic group

$^0\chi$ zero-order molecular connectivity indices

$^0\chi^v$ zero-order molecular valence connectivity indices

$^2\chi$ second-order molecular connectivity indices

$^2\chi^v$ second-order molecular valence connectivity indices

$z_i$ an existence vector showing whether the ith group exists in the molecule

$f_{i,j,k}$ a partitioned binary adjacency matrix showing when basic groups i and j are bonded with a kth-multiplicity bond

$R$ set of all targeted properties

$P_m$ the estimated value of property m

$P_m^{scale}$ a scale factor used to weight the importance of one property relative to another

$P_m^{target}$ the target value for property m

$x'$ best new neighbor solution

$x^*$ the best solution ever found

$Elec(i)$ electronegativity of group i in electronvolts (eV)

$N_{Mo}$ the number of rows allotted to all the molybdenum groups

$N_{Tot}$ total number of basic groups

$N_{Bond}$ type of bonds of the basic groups (1 = single, 2 = double)

$Oxid$ sum of the oxidation states of all the Mo groups

$Oxid(i)$ the oxidation state of the ith group

$LC_{50}$ lethal concentration killing 50% of the test population, a measure of toxicity

References

Allen, D. & D. Shonnard (2002). Green Engineering: Environmentally Conscious Design

of Chemical Processes. Upper Saddle River, New Jersey: Prentice-Hall.

Camarda, K.V. & C.D. Maranas (1999). Optimization in Polymer Design using

Connectivity Indices. Industrial & Engineering Chemistry Research 38: 1884-

1892.

Churi, N. & L.E.K. Achenie (1996). A Novel Mathematical Programming Model for

Computer Aided Molecular Design. Ind. Eng. Chem. Res. 35(10): 3788-3794.

Constantinou, L. & R. Gani (1994). New Group Contribution Method for Estimating

Properties of Pure Components. AIChE Journal 40(10): 1697-1710.

Constantinou, L., R. Gani & J.P. O'Connell (1995). Estimation of the Acentric Factor and

the Liquid Molar Volume at 298 K Through a New Group Contribution Method.

Fluid Phase Equilibria 103: 11-22.

Friedler, F., L.T. Fan, L. Kalotai & A. Dallos (1998). A Combinatorial Approach for

Generating Candidate Molecules with Desired Properties Based on Group

Contribution. Computers & Chemical Engineering 22(6): 809-817.

Gani, R. & E.A. Brignole (1983). Molecular Design of Solvents for Liquid Extraction

based on UNIFAC. Fluid Phase Equilibria 13: 331-340.

Gani, R., B. Nielsen & A. Fredenslund (1991). A Group Contribution Approach to

Computer-Aided Molecular Design. AIChE Journal 37(9): 1318-1332.

Glover, F. & M. Laguna (1997). Tabu Search. Boston: Kluwer Academic Publishers.

Gordeeva, E.V., M.S. Molchanova & N.S. Zefirov (1990). General Methodology and

Computer Program for the Exhaustive Restoring of Chemical Structures by

Molecular Connectivity Indexes. Solution of the Inverse Problem in

QSAR/QSPR. Tetrahedron Computer Methodology 3: 389.

Hairston, D.W. (1998). New Molecules Get on the Fast Track. Chemical Engineering:

30-33.

Harper, P.M. & R. Gani (2000). A Multi-Step and Multi-Level Approach for Computer

Aided Molecular Design. Computers & Chemical Engineering 24(2-7): 677-683.

Harper, P.M., M. Hostrup & R. Gani (2003). A Hybrid CAMD Method. In Computer

Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani & V.

Venkatasubramanian Eds. Amsterdam: Elsevier. 12: 122-169.

Joback, K.G. & G. Stephanopoulos (1989). Designing Molecules Possessing Desired

Physical Property Values. In Proceedings of the Foundations of Computer-Aided

Process Design (FOCAPD), Snowmass, CO (July 12-14). J.J. Siirola, I.

Grossmann & Geo. Stephanopoulos, Eds. CACHE-Elsevier, 363-387.

Johnson, K., B. Lin, D.C. Miller & K.V. Camarda (2002). Molecular Design of Polymer

Coatings using Optimization Techniques. Poster 246b, AIChE Annual Meeting,

Indianapolis, Indiana.

Kier, L.B. & L.H. Hall (1976). Molecular Connectivity in Chemistry and Drug Research.

New York: Academic Press.

Kier, L.B., L.H. Hall & J. F. Frazer (1993). Design of Molecules from Quantitative

Structure-Activity Relationship Model. 1. Information Transfer between Path and

Vertex Degree Counts. J. Chem. Inf. Comput. Sci. 33: 142.

Kim, U.-R., K.-S. Min, M.-J. Lee, S.-H. Kim & B.-J. Jeong (1992). The Study of

Physical Properties for the Organic Compounds and their Binary Mixture

according to Molecular Connectivity Method. Journal of the Korean Chemical

Society 36(4): 485-495.

Lin, B., S. Chavali, K. Camarda & D.C. Miller (2003). Using Tabu Search to Solve

MINLP Problems for PSE. In Process Systems Engineering 2003 (Proceedings of the

8th International Symposium on Process Systems Engineering, KunMing, China)

B. Chen & A.W. Westerberg, Eds. Elsevier, 541-546.

Lin, B. & D.C. Miller (2004a). Solving Heat Exchanger Network Synthesis Problems

with Tabu Search. Computers & Chemical Engineering 28(8): 1451-1464.

Lin, B. & D.C. Miller (2004b). Tabu search algorithm for chemical process optimization.

Computers & Chemical Engineering, In press.

Maranas, C.D. (1996). Optimal Computer-aided Molecular Design: A Polymer Design

Case Study. Ind. Eng. Chem. Res. 35: 3403-3414.

Marcoulaki, E.C. & A.C. Kokossis (1998). Molecular Design Synthesis Using Stochastic

Optimization as a Tool for Scoping and Screening. Computers & Chemical

Engineering 22(Suppl.): S11-18.

Marrero, J. & R. Gani (2001). Group Contribution Based Estimation of Pure Component

Properties. Fluid Phase Equilibria 183: 183-208.

Meniai, A.H., D.M.T. Newsham & B. Khalfaoui (1998). Solvent Design for Liquid

Extraction Using Calculated Molecular Interaction Parameters. Trans. IChemE

76(A11): 942-950.

Ostrovsky, G.M., L.E.K. Achenie & M. Sinha (2003). A Reduced Dimension Branch-

and-Bound Algorithm for Molecular Design. Computers & Chemical Engineering

27(4): 551-567.

Raman, V.S. & C.D. Maranas (1998). Optimization in Product Design with Properties

Correlated with Topological Indices. Computers & Chemical Engineering 22(6):

747-763.

Randic, M. (1975). On Characterization of Molecular Branching. J. Am. Chem. Soc. 97:

6609-6615.

Sahinidis, N.V. & M. Tawarmalani (2000). Applications of Global Optimization to

Process and Molecular Design. Computers & Chemical Engineering 24: 2157-

2169.

Sahinidis, N.V., M. Tawarmalani & M. Yu (2003). Design of Alternative Refrigerants via

Global Optimization. AIChE Journal 49(7): 1761-1774.

Siddhaye, S., K.V. Camarda, E. Topp & M. Southard (2000). Design of Novel

Pharmaceutical Product via Combinatorial Optimization. Computers & Chemical

Engineering 24: 701-704.

Syracuse Research Corporation (2000). User's Guide for the Physprop Database.

Syracuse Research Corp., North Syracuse, New York.

Trinajstic, N. (1983). Chemical Graph Theory, CRC Press.

Vaidyanathan, R. & M.M. El-Halwagi (1996). Computer-Aided Synthesis of Polymers

and Blends with Target Properties. Ind. Eng. Chem. Res. 35(2): 627-634.

Venkatasubramanian, V., K. Chan & J.M. Caruthers (1994). Computer-aided Molecular

Design Using Genetic Algorithm. Computers & Chemical Engineering 18(9):

833-844.

Wang, S. & G.W.A. Milne (1994). Graph Theory and Group Contributions in the

Estimation of Boiling Points. J. Chem. Inf. Comput. Sci 34: 1242-1250.

Wang, Y. & L.E.K. Achenie (2002). A Hybrid Global Optimization Approach for

Solvent Design. Computers & Chemical Engineering 26: 1415-1425.

Table 1. Atomic connectivity values of groups in CHCl2F

No.          1       2       3
Group        >CH-    -F      -Cl
$\delta$     3       1       1
$\delta^v$   3       7       0.77778

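Table 2. Calculation of molecular connectivity indices

Zeroth order: $^0\chi = \sum_{i \in V} \delta_i^{-1/2}$ and $^0\chi^v = \sum_{i \in V} (\delta_i^v)^{-1/2}$, where V is the set of all vertices

First order: $^1\chi = \sum_{(i,j) \in E} (\delta_i \delta_j)^{-1/2}$ and $^1\chi^v = \sum_{(i,j) \in E} (\delta_i^v \delta_j^v)^{-1/2}$, where E is the set of all edges

Second order: $^2\chi = \sum_{(i,j,k) \in T} (\delta_i \delta_j \delta_k)^{-1/2}$ and $^2\chi^v = \sum_{(i,j,k) \in T} (\delta_i^v \delta_j^v \delta_k^v)^{-1/2}$, where T is the set of all triplets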
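The formulas of Table 2 can be evaluated for the CHCl2F example using the atomic values of Table 1. The following is a minimal sketch; the vertex numbering and the function name are illustrative, and a triplet is taken to be two edges sharing a central vertex.

```python
from itertools import combinations


def connectivity_indices(delta, edges):
    """Return the zeroth-, first- and second-order indices of a molecular graph.

    delta: atomic (simple or valence) connectivity value of each vertex.
    edges: list of (i, j) vertex pairs of the hydrogen-suppressed graph.
    """
    neighbors = {i: set() for i in range(len(delta))}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    chi0 = sum(d ** -0.5 for d in delta)
    chi1 = sum((delta[i] * delta[j]) ** -0.5 for i, j in edges)
    # A triplet (i, j, k) is a path of two edges through the central vertex j.
    chi2 = sum((delta[i] * delta[j] * delta[k]) ** -0.5
               for j in neighbors
               for i, k in combinations(sorted(neighbors[j]), 2))
    return chi0, chi1, chi2


# CHCl2F: vertex 0 is the carbon group, 1 is F, 2 and 3 are Cl.
edges = [(0, 1), (0, 2), (0, 3)]
simple = connectivity_indices([3, 1, 1, 1], edges)
valence = connectivity_indices([3, 7, 0.77778, 0.77778], edges)
print(simple)
print(valence)
```

Because the same function accepts either the simple or the valence atomic values, one call per row of Table 1 yields both families of indices.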

Table 3. Basic groups and atomic connectivity values of Case 1

No.          1                 2                 3      4         5      6      7       8
Group        >Mo< (5 bonds)    >Mo< (6 bonds)    -OH    -Cl       -O-    -CH3   -CH2-   -NH2
$\delta$     5                 6                 1      1         2      1      2       1
$\delta^v$   0.13889           0.17143           5      0.77778   6      1      2       3
NMax         2                 2                 3      3         3      3      3       3

NMax denotes the maximal number of each basic group allowed in a molecule.

Table 4. Procedures for building an initial solution

Step   Groups in the partial structure
1      NH2
2      Mo, NH2
3      Mo, NH2, OH
4      Mo, NH2, OH, O, Cl
5      Mo, NH2, OH, O, Cl, CH3
6      Mo, NH2, OH, O, Cl, CH3, OH

Table 5. Neighbor generation operators

[Structure drawings of the solutions produced by the Replace, Insert, Delete, Swap and Move operators, built from the groups Mo, NH2, OH, O, CH2, Cl and CH3.]

Table 6. Recency-based tabu list

No   Tabu tenure   Objective function   Group list
1    22            0.0036               Mo, NH2, CH2, CH3
2    20            0.0699               Mo, NH2, O, Cl
3    15            0.5752               Mo, OH, CH3
4    14            0.5666               Mo, NH2, Cl
5    10            0.5041               Mo, OH, Cl
6    9             0.0094               Mo, OH, O, CH3

Table 7. Frequency-based tabu list

No   Frequency index   Objective function   Group list
1    0.025             0.4848               Mo, OH, CH3
2    0.103             0.1405               Mo, OH, CH2, NH2
3    0.085             0.0648               Mo, OH, O, OH
4    12.75             0.0094               Mo, OH, O, CH3
5    0.590             0.4879               Mo, OH, Cl
6    3.600             0.5666               Mo, NH2, Cl

Table 8. Results of case 1 with TS approach

No   Structure (groups)                                   N    Obj. Value   Prob.
0    2 Mo, 3 O, 3 OH, 1 Cl, 3 CH3, 3 NH2, 3 CH2           18   0.000111     1%
1    2 Mo, 1 O, 2 OH, 2 Cl, 2 CH3, 3 NH2, 1 CH2           13   0.000115     <1%
2    1 Mo, 3 OH, 2 Cl, 1 CH2                              7    0.000238     80%
3    3 Mo, 2 O, 3 OH, 3 Cl, 3 CH3, 3 NH2, 3 CH2           20   0.000571     <1%
4    2 Mo, 1 O, 3 OH, 3 Cl, 3 CH3, 1 NH2, 3 CH2           16   0.000659     <1%
5    1 Mo, 1 O, 2 OH, 1 Cl, 1 CH3, 1 NH2                  7    0.001361     19%

Table 9. Basic groups of Case 2

No.          1         2         3         4         5      6      7      8      9       10
Group        -Mo-      >Mo-      >Mo=      S=        O=     -O-    -NH2   -NH-   -CH2-   -Cl
$\delta$     2         3         3         1         1      2      1      2      2       1
$\delta^v$   0.05128   0.07895   0.10811   0.66667   6      6      3      4      2       0.77778
Elec. (eV)   2.16      2.19      2.24      2.58      3.44   3.04   3.04   2.55   2.55    3.16
Oxid.        2         3         4         __        __     __     __     __     __      __
NMax         2         2         2         2         2      2      2      2      2       2

Table 10. Solutions of TS to Case 2

No   Structure (groups)                 NGrp   Obj.       Prob.
0    2 Mo, 2 O, 2 Cl, 2 CH2             8      1.286401   40%
1    2 Mo, 2 O, 2 Cl, 1 NH, 1 CH2       8      1.288636   60%
2    2 Mo, 2 O, 2 Cl, 1 NH, 1 CH2       8      1.288698   < 1%
3    2 Mo, 2 O, 2 Cl, 1 NH, 2 CH2       9      1.288911   < 1%
4    2 Mo, 2 O, 2 Cl, 1 NH              7      1.288913   < 1%

Figure 1. Hydrogen-suppressed graph of CHCl2F [graph with vertices CH, Cl, Cl and F; a vertex, an edge and a triplet are labeled]

Figure 2. A sample molecule for case 1 located by TS [groups: Mo, OH, Cl, OH, CH2, OH, Cl]

Figure 3. Integer solution for case 1 found with OA algorithm [groups: Mo, OH, Cl, O, NH2, Mo, OH, CH3, CH3, OH, NH2, Cl]

Figure 4. Integer solution for case 2 found with OA algorithm [groups: NH2, CH2, Mo, Mo, NH2]